DETAILED ACTION

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-2, 8, 13, 17 and 24-25 are rejected under 35 U.S.C. 103 as being unpatentable over Aoyama et al. (US 2010/0332229) in view of Katz et al. (US 2018/0310159).

Claim 1,
Aoyama teaches a method for visual speech recognition, the method comprising: receiving a video comprising a plurality of video frames, wherein each video frame depicts a pair of lips; processing the video using a visual speech recognition in accordance with current values of visual speech recognition parameters to generate, for each output position in an output sequence, a respective ([Fig. 1] [0018] [0020] [0024] [0050-0063] [0073-0074] an image acquisition that acquires a temporal sequence of frames of image data, a detecting unit that detects a lip area and a lip image from each of the frames of the image data, a recognition unit that recognizes a word based on the detected lip images of the lip areas; recognizes a word based on detected lip images of lip areas of the particular face; stores a plurality of visemes, each associated with a particular phoneme, and the recognition unit recognizes a word by comparing the detected lip images of the lip areas to the plurality of visemes stored in the memory; the viseme classifier 31 calculates a K-dimensional score vector corresponding to the lip image input from the lip image generating unit 43 during the utterance period informed by the utterance period detecting unit 44 and outputs the result to the time series feature amount generating unit 45; the K-dimensional score vector is an index indicating which of K (K=19 in this case) kinds of visemes the input lip image corresponds to, and formed with a K-dimensional score representing a probability of corresponding to K kinds of each viseme).
The difference between the prior art and the claimed invention is that Aoyama does not explicitly teach wherein the visual speech recognition neural network comprises one or more volumetric convolutional neural network layers and one or more time-aggregation neural network layers.
Katz teaches wherein the visual speech recognition neural network comprises one or more volumetric convolutional neural network layers and one or more time-aggregation neural network layers ([0095] a Deep Neural Network (DNN) which converts each of these acoustic patterns into a probability distribution over a set of speech sound classes (e.g. 20 sound classes); each "hidden" layer is an intermediate representation discovered by the DNN during its training to convert the filter bank inputs to sound classes; two neural networks may be used--one for initial detection and another as a secondary checker; the output of the acoustic model provides a distribution of scores over phonetic classes for every frame).
Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the teachings of Aoyama with the teachings of Katz by modifying the visual lip shape recognition as taught by Aoyama to include a visual speech recognition neural network comprising one or more volumetric convolutional neural network layers and one or more time-aggregation neural network layers as taught by Katz, for the benefit of using powerful and complicated statistical modeling systems which use probability and mathematical functions (e.g. Hidden Markov Model and neural networks) to determine the most likely outcome (Katz [0093]).
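
For illustration only, the per-frame K-dimensional score vector cited from Aoyama [0073-0074] above can be sketched as follows. The sketch assumes that raw classifier scores are normalized with a softmax, which the reference does not state; the function names and the example scores are hypothetical and are not taken from any cited reference.

```python
import math

K = 19  # number of viseme classes stated in Aoyama [0073-0074]

def normalize_scores(raw_scores):
    """Turn K raw scores into values that sum to 1.

    Assumption: softmax normalization; the reference only says each
    score represents a probability of the corresponding viseme.
    """
    if len(raw_scores) != K:
        raise ValueError(f"expected {K} scores, got {len(raw_scores)}")
    peak = max(raw_scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in raw_scores]
    total = sum(exps)
    return [e / total for e in exps]

def most_likely_viseme(raw_scores):
    """Return the index of the viseme with the highest normalized score."""
    probs = normalize_scores(raw_scores)
    return max(range(K), key=lambda k: probs[k])

# Hypothetical raw scores for one lip image: class 4 stands out.
frame_scores = [0.0] * K
frame_scores[4] = 3.0
best = most_likely_viseme(frame_scores)
```

One such vector is produced per lip image during the utterance period, so a video yields a time series of score vectors.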

Claims 24-25 contain subject matter similar to claim 1, and thus are rejected under a similar rationale.

Claim 2,
Aoyama further teaches the method of claim 1 wherein determining the sequence of words comprises predicting a sequence of phoneme distributions and providing the sequence of phoneme distributions to a decoder to produce the sequence of words ([Fig. 1] [0024] [0167] image processing unit includes a decoder for recognizing, by the recognition unit, a word by comparing the detected lip images of the lip areas to the plurality of visemes; a plurality of visemes, each associated with a particular phoneme).

Claim 8,
Aoyama further teaches wherein determining the sequence of words expressed by the pair of lips depicted in the video using the output scores comprises processing the output scores using a ([Fig. 1] [0024] [0073-0074] [0167] image processing unit includes a decoder for recognizing, by the recognition unit, a word by comparing the detected lip images of the lip areas to the plurality of visemes; a plurality of visemes, each associated with a particular phoneme; the viseme classifier 31 calculates a K-dimensional score vector corresponding to the lip image input from the lip image generating unit 43 during the utterance period informed by the utterance period detecting unit 44 and outputs the result to the time series feature amount generating unit 45).

Claim 13,
Aoyama further teaches the method of claim 1, further comprising training the visual speech recognition neural network, the training comprising: generating training data comprising a plurality of training examples, each training example comprising: (i) a training video comprising a plurality of training video frames, and (ii) a sequence of phonemes from a vocabulary of possible phonemes, the generating comprising, for each training video: obtaining a raw video comprising a plurality of raw video frames and corresponding audio data; determining the sequence of phonemes from the vocabulary of possible phonemes using the audio data; and determining each training video frame based on a face depicted in a respective raw video frame; training the visual speech recognition neural network on the generated training data, comprising determining trained values of visual speech recognition neural network parameters from initial values of visual speech recognition neural network parameters ([Fig. 1] [0025] [0050-0063] learning system 11; a learning function that includes an image separating unit configured to receive an utterance moving image with voice, separate the utterance moving image with voice into an utterance moving image and an utterance voice, and output the utterance moving image and the utterance voice; a face area detecting unit configured to receive the utterance moving image from the image separating unit, split the utterance moving image into frames, detect a face area from each of the frames, and output position information of the detected face area together with one frame of the utterance moving image; a lip area detecting unit configured to receive the position information of the detected face area together with the one frame of the utterance moving image from the face area detecting unit, detect a lip area from the face area of the one frame, and output the position information of the lip area together with the one frame of the utterance moving image; a lip 
image generating unit configured to receive the position information of the lip area from the lip area detecting unit together with the one frame of the utterance moving image, perform rotation correction for the one frame of the utterance moving image, generate a lip image, and output the lip image to a viseme label adding unit; a phoneme label assigning unit configured to receive the utterance voice from the image separating unit, assign a phoneme label indicating a phoneme to the utterance voice, and output the label; a viseme label converting unit configured to receive the label from the phoneme label assigning unit, convert the phoneme label assigned to the utterance voice for learning into a viseme label indicating the shape of the lip during uttering, and output the viseme label; a viseme label adding unit configured to receive the lip image output from the lip image generating unit and the viseme label output from the viseme label converting unit, add the viseme label to the lip image, and output the lip image added with the viseme label; a learning sample storing unit configured to receive and store the lip image added with the viseme label from the viseme label adding unit, wherein the recognition unit is configured to recognize a word by comparing the detected position of the lip areas from each of the frames of the image data to the data stored by the learning sample storing unit).
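
For illustration only, the labeling path cited above (phoneme label assigning unit, viseme label converting unit, viseme label adding unit, learning sample storing unit) can be sketched as follows. The phoneme-to-viseme table is hypothetical and is not Aoyama's table; the sketch also assumes one phoneme label per lip image, which the reference does not state.

```python
# Hypothetical many-to-one table: phonemes that look alike on the lips
# share one viseme label. Not reproduced from any cited reference.
PHONEME_TO_VISEME = {
    "p": "bilabial", "b": "bilabial", "m": "bilabial",
    "f": "labiodental", "v": "labiodental",
    "a": "open",
}

def label_lip_images(lip_images, phoneme_labels):
    """Attach a viseme label to each lip image.

    Assumption: lip_images and phoneme_labels are already time-aligned,
    one phoneme label per lip image.
    """
    if len(lip_images) != len(phoneme_labels):
        raise ValueError("need exactly one phoneme label per lip image")
    samples = []
    for image, phoneme in zip(lip_images, phoneme_labels):
        viseme = PHONEME_TO_VISEME[phoneme]  # viseme label conversion
        samples.append((image, viseme))      # viseme label addition
    return samples

# Placeholder strings stand in for lip images.
samples = label_lip_images(["frame0", "frame1", "frame2"], ["m", "a", "p"])
```

The resulting list plays the role of the stored learning samples: each entry pairs a lip image with the viseme label derived from the audio.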

Claim 17,
Aoyama further teaches the method of claim 13, wherein determining a training video frame based on a face depicted in a respective raw video frame comprises: detecting the face in the raw video ([0058-0059] the face area detecting unit 22 detects a face area including the face of a person in each frame as shown in FIG. 2A, and outputs position information of the face area of each frame to the lip area detecting unit 23 together with the utterance moving image for learning; the lip area detecting unit 23 detects a lip area including the edge points of the corners of the mouth at the lips from the face area of each frame of the utterance moving image for learning as shown in FIG. 2B, and outputs position information of the lip area of each frame to the lip image generating unit 24 together with the utterance moving image for learning).

Claims 3 and 7 are rejected under 35 U.S.C. 103 as being unpatentable over Aoyama et al. (US 2010/0332229) in view of Katz et al. (US 2018/0310159) and further in view of Zhang et al. (US 2016/0350649).

Claim 3,
Aoyama and Katz teach all the limitations in claim 1. The difference between the prior art and the claimed invention is that neither Aoyama nor Katz teaches wherein the volumetric convolutional neural network layers include a plurality of three-dimensional filters.
Zhang teaches wherein the volumetric convolutional neural network layers include a plurality of three-dimensional filters ([0004] [0064] deep convolutional neural networks using three-dimensional filters).
Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the teachings of Aoyama and Katz with the teachings of Zhang by modifying the two-neural-network system as taught by Katz to include a neural network with three-dimensional filters as taught by Zhang (Zhang [0004]).
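
For illustration only, the operation of a single three-dimensional filter can be sketched as follows. The sketch assumes the three dimensions are time, height, and width of a stack of video frames, with no padding and a stride of one; these are assumptions about how the limitation applies to video and are not statements of Zhang's disclosure. All names are hypothetical.

```python
def conv3d_valid(volume, kernel):
    """Slide one 3D filter over a volume (no padding, stride 1).

    volume[t][y][x] is a stack of frames; kernel[dt][dy][dx] is the filter.
    """
    T, H, W = len(volume), len(volume[0]), len(volume[0][0])
    kt, kh, kw = len(kernel), len(kernel[0]), len(kernel[0][0])
    out = []
    for t in range(T - kt + 1):
        plane = []
        for y in range(H - kh + 1):
            row = []
            for x in range(W - kw + 1):
                acc = 0.0
                # Weighted sum over a block spanning frames, rows, columns.
                for dt in range(kt):
                    for dy in range(kh):
                        for dx in range(kw):
                            acc += volume[t + dt][y + dy][x + dx] * kernel[dt][dy][dx]
                row.append(acc)
            plane.append(row)
        out.append(plane)
    return out

# Three 3x3 frames of ones and a 2x2x2 averaging filter (hypothetical data).
video = [[[1.0] * 3 for _ in range(3)] for _ in range(3)]
box = [[[0.125] * 2 for _ in range(2)] for _ in range(2)]
features = conv3d_valid(video, box)
```

Because the filter spans two frames at a time, each output value mixes information across time as well as across space.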

Claim 7,
Aoyama and Katz teach all the limitations in claim 1. The difference between the prior art and the claimed invention is that neither Aoyama nor Katz teaches wherein the visual speech recognition neural network comprises one or more group normalization layers.
Zhang teaches wherein the visual speech recognition neural network comprises one or more group normalization layers ([0004] [0029] neural network can be described as stacks of layers, interlaced with normalization layers).
Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the teachings of Aoyama and Katz with the teachings of Zhang by modifying the neural network system as taught by Katz to include one or more group normalization layers as taught by Zhang for the benefit of significantly boosting performance over traditional computational methods (Zhang [0004]).

Claims 4-6 are rejected under 35 U.S.C. 103 as being unpatentable over Aoyama et al. (US 2010/0332229) in view of Katz et al. (US 2018/0310159) and further in view of Cao et al. (US 2019/0130628).

Claim 4,
Aoyama and Katz teach all the limitations in claim 1. The difference between the prior art and the claimed invention is that neither Aoyama nor Katz teaches wherein the time-aggregation neural network layers comprise one or more recurrent neural network layers.
Cao teaches wherein the time-aggregation neural network layers comprise one or more recurrent neural network layers ([0099] recurrent neural network).
Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the teachings of Aoyama and Katz with the teachings of Cao by modifying the two-neural-network system as taught by Katz to include a neural network comprising recurrent neural network layers as taught by Cao for the benefit of allowing forward and backward connections between neurons. BLSTMs are well-suited for the classification, processing, and prediction of time series, given time lags of unknown size and duration between events (Cao [0099]).

Claim 5,
Cao further teaches the method of claim 4, wherein the recurrent neural network layers comprise one or more long short-term memory neural network layers ([0099] bi-directional long-short term memory).

Claim 6,
Cao further teaches the method of claim 5, wherein one or more of the long short-term memory neural network layers are bi-directional long short-term memory neural network layers ([0099] bi-directional long-short term memory).

Claim 12 is rejected under 35 U.S.C. 103 as being unpatentable over Aoyama et al. (US 2010/0332229) in view of Katz et al. (US 2018/0310159) and further in view of Cortez et al. (US 2011/0131041).

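Claim 12,
The difference between the prior art and the claimed invention is that neither Aoyama nor Katz teaches wherein the visual speech recognition neural network includes at least five volumetric convolutional neural network layers.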
Cortez teaches wherein the visual speech recognition neural network includes at least five volumetric convolutional neural network layers ([0093] neural network with N number of layers).
Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the teachings of Aoyama and Katz with the teachings of Cortez by modifying the neural networks as taught by Katz to include at least five volumetric convolutional neural network layers as taught by Cortez for the benefit of improving their robustness and computational effort in order to make them usable in devices (Cortez [0066]).

Claim 9 is rejected under 35 U.S.C. 103 as being unpatentable over Aoyama et al. (US 2010/0332229) in view of Katz et al. (US 2018/0310159) and further in view of Hofer et al. (US 2016/0098986).

Claim 9,
Aoyama and Katz teach all the limitations in claim 8. The difference between the prior art and the claimed invention is that neither Aoyama nor Katz teaches wherein the decoder comprises a finite state transducer.
Hofer teaches wherein the decoder comprises a finite state transducer ([0025] decoder is a weighted finite state transducer).
Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the teachings of Aoyama and Katz with the teachings of Hofer by modifying the visual lip shape recognition as taught by Aoyama to include a decoder comprising a finite state transducer as taught by Hofer (Hofer [0025]).
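
For illustration only, a finite state transducer that maps a phoneme sequence to a word sequence can be sketched as follows. The sketch is unweighted, whereas Hofer [0025] describes a weighted finite state transducer; the states, phoneme symbols, and two-word vocabulary are hypothetical.

```python
# Hypothetical transition table:
# (state, input phoneme) -> (next state, output word or None)
TRANSITIONS = {
    (0, "h"): (1, None),
    (1, "ay"): (0, "hi"),
    (0, "n"): (2, None),
    (2, "ow"): (0, "no"),
}
START_STATE = 0
FINAL_STATES = {0}

def transduce(phonemes):
    """Return the word sequence for the phonemes, or None if rejected."""
    state = START_STATE
    words = []
    for phoneme in phonemes:
        key = (state, phoneme)
        if key not in TRANSITIONS:
            return None  # no arc accepts this phoneme in this state
        state, output = TRANSITIONS[key]
        if output is not None:
            words.append(output)  # emit a word when an arc carries one
    return words if state in FINAL_STATES else None

decoded = transduce(["h", "ay", "n", "ow"])
```

A weighted transducer would additionally attach a cost to each arc and return the lowest-cost accepting path.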

Claims 10-11 are rejected under 35 U.S.C. 103 as being unpatentable over Aoyama et al. (US 2010/0332229) in view of Katz et al. (US 2018/0310159) and further in view of Kabel et al. (US 6,076,039).

Claim 10,
Aoyama teaches mapping phonemes to words, comprising using a language model ([0091] the registration system 12 of the utterance recognition device 10 generates time series feature amounts corresponding to the utterance moving image for registration by executing a registration process, performs modeling using the HMM, and registers models of the time series feature amounts by associating the amounts with utterance words for registration in the learning database 48).
The difference between the prior art and the claimed invention is that neither Aoyama nor Katz teaches removing duplicate phonemes and blanks.
Kabel teaches removing duplicate phonemes and blanks ([Abstract] eliminating spaces, vowels, one consonant of a pair of double consonants, one vowel of a pair of double vowels, and/or one or more special characters).
Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the teachings of Aoyama and Katz with the teachings of Kabel by modifying the decoder of the visual lip shape recognition as taught by Aoyama to include removing duplicate phonemes and blanks as taught by Kabel for the benefit of reducing the geographic name to a desired number of characters (Kabel [Abstract]).
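
For illustration only, removing duplicate phonemes and blanks from a per-frame label sequence can be sketched as follows. The sketch assumes the common convention of merging consecutive repeats first and then dropping blanks, so that a blank between two identical phonemes keeps them distinct; this ordering and the blank symbol are assumptions and are not taken from any cited reference.

```python
from itertools import groupby

BLANK = "_"  # hypothetical blank symbol

def collapse(frame_labels):
    """Merge consecutive duplicate labels, then remove blanks."""
    merged = [label for label, _ in groupby(frame_labels)]
    return [label for label in merged if label != BLANK]

# Hypothetical per-frame labels for a short utterance.
phonemes = collapse(["h", "h", "_", "ay", "ay", "_", "_", "ay"])
```

In this example the two runs of "ay" are separated by blanks, so both survive, while the repeated "h" frames merge into one phoneme.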


Claim 11,
Aoyama further teaches the method of claim 10, wherein the language model is an n-gram language model with backoff ([0091] HMM).
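
For illustration only, an n-gram language model with backoff can be sketched as a bigram model that falls back to unigram counts when a bigram was never observed. The sketch uses a simple fixed-discount scheme (often called "stupid backoff") chosen for brevity; it is not taken from any cited reference, its scores are not normalized probabilities, and the corpus and the discount of 0.4 are hypothetical.

```python
from collections import Counter

def train(sentences):
    """Count unigrams and bigrams in a list of tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for words in sentences:
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams

def score(prev, word, unigrams, bigrams, alpha=0.4):
    """Score `word` following `prev`, backing off to the unigram estimate."""
    if bigrams[(prev, word)] > 0:
        return bigrams[(prev, word)] / unigrams[prev]
    total = sum(unigrams.values())
    return alpha * unigrams[word] / total  # discounted lower-order estimate

# Hypothetical training corpus.
corpus = [["see", "the", "lips"], ["read", "the", "lips"], ["see", "the", "face"]]
uni, bi = train(corpus)
seen = score("the", "lips", uni, bi)    # bigram was observed
unseen = score("see", "lips", uni, bi)  # bigram missing, so back off
```

The observed bigram is scored from its own count, while the unseen one receives a smaller, discounted score instead of zero.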

Claims 14-16 are rejected under 35 U.S.C. 103 as being unpatentable over Aoyama et al. (US 2010/0332229) in view of Katz et al. (US 2018/0310159) and further in view of Basu et al. (US 2003/0018475).

Claim 14,
Aoyama and Katz teach all the limitations in claim 13. The difference between the prior art and the claimed invention is that neither Aoyama nor Katz teaches obtaining a transcript of the raw video; determining an alignment of the transcript and the audio data using a trained automatic speech recognition algorithm; and determining the sequence of phonemes from the aligned transcript.
Basu teaches obtaining a transcript of the raw video; determining an alignment of the transcript and the audio data using a trained automatic speech recognition algorithm; and determining the sequence of phonemes from the aligned transcript ([0099-0100] the uttered speech to be verified may be decoded by classical speech recognition techniques so that a decoded script and associated time alignments are available; this is accomplished using the feature data from the acoustic feature extractor 14; in step 824, the visual speech feature vectors from the visual feature extractor 22 are used to produce a visual phonemes (visemes) sequence; in step 826, the script is aligned with the visemes; in step 828, a likelihood on the alignment is computed to determine how well the script aligns to the visual data; the results of the likelihood are then used, in step 830, to decide whether an actual speech event occurred or is occurring and whether the information in the paths needs to be recognized).
(Basu [0003]).

Claim 15,
Katz further teaches the method of claim 14, further comprising determining the transcript is expressed in a specific natural language ([0092] identifying the appropriate language for ASR).

Claim 16,
Aoyama further teaches the method of claim 13, further comprising determining that a quality measure of the raw video exceeds a minimum threshold ([0071] the pixel difference feature can be obtained by calculating the difference in pixel values (luminance values) I1 and I2 (I1-I2) of two pixels on an image (a lip image in this case); in a binary classification weak classifier h(x) corresponding to each combination of the two pixels, as shown in Formula (1) shown below, true (+1) or false (-1) is determined by the pixel difference feature I1-I2 and a threshold value Th).
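
For illustration only, the binary weak classifier h(x) cited from Aoyama [0071] above can be sketched as follows. The reference states that true (+1) or false (-1) is determined by the pixel difference feature I1-I2 and a threshold value Th; the direction of the comparison used here (+1 when the difference exceeds Th) is an assumption, and the function name and image values are hypothetical.

```python
def weak_classifier(lip_image, pixel_a, pixel_b, threshold):
    """Return +1 or -1 from the luminance difference of two pixels.

    lip_image is a 2D list of luminance values; pixel_a and pixel_b
    are (row, column) positions.
    """
    (row_a, col_a), (row_b, col_b) = pixel_a, pixel_b
    difference = lip_image[row_a][col_a] - lip_image[row_b][col_b]  # I1 - I2
    # Assumption: "true" means the difference is greater than Th.
    return 1 if difference > threshold else -1

# Hypothetical 2x2 lip image.
image = [[200, 50],
         [120, 110]]
strong_edge = weak_classifier(image, (0, 0), (0, 1), threshold=100)
weak_edge = weak_classifier(image, (1, 0), (1, 1), threshold=100)
```

Each combination of two pixels gives one such weak classifier, and their outputs are combined to form the score for a viseme.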

Claim 18 is rejected under 35 U.S.C. 103 as being unpatentable over Aoyama et al. (US 2010/0332229) in view of Katz et al. (US 2018/0310159) and further in view of Cohen et al. (US 2018/0308276).


Claim 18,
Aoyama and Katz teach all the limitations in claim 17. The difference between the prior art and the claimed invention is that neither Aoyama nor Katz teaches smoothing the plurality of landmarks on the face.
Cohen teaches smoothing the plurality of landmarks on the face ([0052] smoothing a plurality of facial landmarks).
Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the teachings of Aoyama and Katz with the teachings of Cohen by modifying the visual lip shape recognition as taught by Aoyama to include smoothing a plurality of landmarks on the face as taught by Cohen for the benefit of providing an efficient way to create or animate a photorealistic three-dimensional character from a two-dimensional image (Cohen [0003]).
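
For illustration only, smoothing a plurality of facial landmarks over time can be sketched as follows. The sketch uses a centered moving average across neighboring frames, which is a technique chosen for brevity and is not stated in Cohen [0052]; the window size and landmark coordinates are hypothetical.

```python
def smooth_landmarks(frames, window=3):
    """Average each landmark position over a window of neighboring frames.

    frames[t] is a list of (x, y) landmark positions for frame t.
    The window shrinks at the start and end of the sequence.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd number")
    half = window // 2
    smoothed = []
    for t in range(len(frames)):
        neighbors = frames[max(0, t - half):min(len(frames), t + half + 1)]
        frame = []
        for i in range(len(frames[t])):
            x = sum(f[i][0] for f in neighbors) / len(neighbors)
            y = sum(f[i][1] for f in neighbors) / len(neighbors)
            frame.append((x, y))
        smoothed.append(frame)
    return smoothed

# One landmark that jitters sideways in the middle frame (hypothetical).
track = [[(0.0, 0.0)], [(3.0, 0.0)], [(0.0, 0.0)]]
result = smooth_landmarks(track)
```

The jitter of 3.0 in the middle frame is reduced to 1.0, which gives steadier crops when frames are later cut around the landmarks.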

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to SHREYANS A PATEL whose telephone number is (571)270-0689. The examiner can normally be reached Monday-Friday 8am-5pm PST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Bhavesh Mehta, can be reached at 571-272-7453. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov.

SHREYANS A. PATEL
Examiner
Art Unit 2657



/SHREYANS A PATEL/Examiner, Art Unit 2656