DETAILED ACTION

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Amendment
Applicant's amendments with respect to the 35 U.S.C. 103 rejection of claims 1 and 24-25 have been considered and are found persuasive, and the rejection has been withdrawn. See the detailed reasons for allowance below.
See the Examiner’s Interview for the Examiner’s Amendment, which is set out in detail below.

EXAMINER’S AMENDMENT
An examiner’s amendment to the record appears below. Should the changes and/or additions be unacceptable to applicant, an amendment may be filed as provided by 37 CFR 1.312. To ensure consideration of such an amendment, it MUST be submitted no later than the payment of the issue fee.
Authorization for this examiner’s amendment was given in an interview with John Treilhard on 4/18/2022.
The application has been amended as follows: 

Claim 1,	 
A method for visual speech recognition, the method comprising:
	receiving a video comprising a plurality of video frames, wherein each video frame depicts a pair of lips;
	processing the video using a visual speech recognition neural network in accordance with current values of visual speech recognition neural network parameters to generate, for each output position in an output sequence, a respective output score for each token in a vocabulary of possible tokens,
	wherein the visual speech recognition neural network comprises: (i) a three-dimensional (3D) convolutional subnetwork comprising a sequence of multiple volumetric convolutional neural network layers, and (ii) a temporal subnetwork; 
	wherein the 3D convolutional subnetwork processes the plurality of video frames depicting the pair of lips using a plurality of three-dimensional (3D) convolutional filters of the sequence of multiple volumetric convolutional neural network layers to generate a respective spatio-temporal feature tensor for each video frame of the plurality of video frames depicting the pair of lips;
	wherein the temporal subnetwork processes the spatio-temporal feature tensors corresponding to the video frames depicting the pair of lips to generate, for each output position in the output sequence, the respective output score for each token in the vocabulary of possible tokens;
	wherein the vocabulary of possible tokens comprises a plurality of phonemes; and
	determining a sequence of words expressed by the pair of lips depicted in the video using, for each output position in the output sequence, the respective output score[[s]] for each token in the vocabulary of possible tokens.
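
For illustration only, the 3D convolutional subnetwork recited in claim 1 can be sketched as below. This is a minimal sketch under stated assumptions, not the applicant's implementation: it assumes single-channel (grayscale) lip crops, odd temporal kernel sizes, zero padding along the time axis so that every input frame yields one feature tensor, and a ReLU nonlinearity. The names `conv3d_same_time` and `conv3d_subnetwork` are hypothetical.

```python
from typing import List

# A video is indexed [time][height][width]; one grayscale lip crop per frame.
Video = List[List[List[float]]]


def conv3d_same_time(video: Video, kernel: Video) -> Video:
    """Apply one 3D (volumetric) filter, followed by ReLU.

    Assumption: "valid" convolution in space and zero padding in time, so the
    output keeps one feature map per input frame. The temporal kernel size is
    assumed to be odd.
    """
    T, H, W = len(video), len(video[0]), len(video[0][0])
    kt, kh, kw = len(kernel), len(kernel[0]), len(kernel[0][0])
    pad = kt // 2
    out: Video = []
    for t in range(T):
        fmap = []
        for y in range(H - kh + 1):
            row = []
            for x in range(W - kw + 1):
                acc = 0.0
                for dt in range(kt):
                    ts = t + dt - pad
                    if ts < 0 or ts >= T:
                        continue  # zero padding along the time axis
                    for dy in range(kh):
                        for dx in range(kw):
                            acc += kernel[dt][dy][dx] * video[ts][y + dy][x + dx]
                row.append(max(acc, 0.0))  # ReLU
            fmap.append(row)
        out.append(fmap)
    return out


def conv3d_subnetwork(video: Video, kernels: List[Video]) -> Video:
    """Run a sequence of volumetric layers (one filter per layer in this sketch)."""
    features = video
    for kernel in kernels:
        features = conv3d_same_time(features, kernel)
    return features


# Usage: 5 frames of 6x6 pixels through two 3x3x3 layers.
frames = [[[1.0] * 6 for _ in range(6)] for _ in range(5)]
box = [[[1.0 / 27] * 3 for _ in range(3)] for _ in range(3)]
features = conv3d_subnetwork(frames, [box, box])
print(len(features), len(features[0]), len(features[0][0]))
```

Because each filter spans neighboring frames as well as neighboring pixels, the feature tensor produced for a given frame mixes spatial and temporal context, which is the sense in which it is "spatio-temporal". A practical network would use many filters and channels per layer.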

Claim 8,
The method of claim 1, wherein determining the sequence of words expressed by the pair of lips depicted in the video , for each output position in the output sequence, the respective output score for each token in the vocabulary of possible tokens 

Claim 24,
A system comprising one or more computers and one or more storage devices storing instructions that when executed by the one or more computers cause the one or more computers to perform operations for visual speech recognition, the operations comprising: 
	receiving a video comprising a plurality of video frames, wherein each video frame depicts a pair of lips;
	processing the video using a visual speech recognition neural network in accordance with current values of visual speech recognition neural network parameters to generate, for each output position in an output sequence, a respective output score for each token in a vocabulary of possible tokens,
	wherein the visual speech recognition neural network comprises: (i) a three-dimensional (3D) convolutional subnetwork comprising a sequence of multiple volumetric convolutional neural network layers, and (ii) a temporal subnetwork; 
	wherein the 3D convolutional subnetwork processes the plurality of video frames depicting the pair of lips using a plurality of three-dimensional (3D) convolutional filters of the sequence of multiple volumetric convolutional neural network layers to generate a respective spatio-temporal feature tensor for each video frame of the plurality of video frames depicting the pair of lips;
	wherein the temporal subnetwork processes the spatio-temporal feature tensors corresponding to the video frames depicting the pair of lips to generate, for each output position in the output sequence, the respective output score for each token in the vocabulary of possible tokens;
	wherein the vocabulary of possible tokens comprises a plurality of phonemes; and
	determining a sequence of words expressed by the pair of lips depicted in the video using, for each output position in the output sequence, the respective output score[[s]] for each token in the vocabulary of possible tokens.

Claim 25,
One or more non-transitory computer storage media storing instructions that when executed by one or more computers cause the one or more computers to perform operations for visual speech recognition, the operations comprising: 
	receiving a video comprising a plurality of video frames, wherein each video frame depicts a pair of lips;
	processing the video using a visual speech recognition neural network in accordance with current values of visual speech recognition neural network parameters to generate, for each output position in an output sequence, a respective output score for each token in a vocabulary of possible tokens,
	wherein the visual speech recognition neural network comprises: (i) a three-dimensional (3D) convolutional subnetwork comprising a sequence of multiple volumetric convolutional neural network layers, and (ii) a temporal subnetwork; 
	wherein the 3D convolutional subnetwork processes the plurality of video frames depicting the pair of lips using a plurality of three-dimensional (3D) convolutional filters of the sequence of multiple volumetric convolutional neural network layers to generate a respective spatio-temporal feature tensor for each video frame of the plurality of video frames depicting the pair of lips;
	wherein the temporal subnetwork processes the spatio-temporal feature tensors corresponding to the video frames depicting the pair of lips to generate, for each output position in the output sequence, the respective output score for each token in the vocabulary of possible tokens;
	wherein the vocabulary of possible tokens comprises a plurality of phonemes; and
	determining a sequence of words expressed by the pair of lips depicted in the video using, for each output position in the output sequence, the respective output score[[s]] for each token in the vocabulary of possible tokens.

Allowable Subject Matter
Claims 1-2, 4-18 and 24-26 are allowed.
The following is an examiner’s statement of reasons for allowance: Aoyama et al. (US 2010/0332229) ([Fig. 1] [0018] [0020] [0024] [0050-0063] [0073-0074]) in view of Chung et al. (“Lip Reading in the Wild”) ([3.1 Architecture]) in view of Torfi et al. (“3D Convolutional Neural Networks for Cross Audio-Visual Matching Recognition”) ([V. Architecture] [A. Visual Network]) teach receiving a video comprising a plurality of video frames, wherein each video frame depicts a pair of lips; processing the video using a visual speech recognition neural network in accordance with current values of visual speech recognition neural network parameters to generate, for each output position in an output sequence, a respective output score for each token in a vocabulary of possible tokens, wherein the visual speech recognition neural network comprises: (i) a three-dimensional (3D) convolutional subnetwork comprising a sequence of multiple volumetric convolutional neural network layers, and (ii) a temporal subnetwork; wherein the 3D convolutional subnetwork processes the plurality of video frames depicting the pair of lips using a plurality of three-dimensional (3D) convolutional filters of the sequence of multiple volumetric convolutional neural network layers to generate a respective spatio-temporal feature tensor for each video frame of the plurality of video frames depicting the pair of lips; wherein the vocabulary of possible tokens comprises a plurality of phonemes.
The difference between the prior art and the claimed invention is that neither Aoyama, Chung, nor Torfi explicitly teaches wherein the temporal subnetwork processes the spatio-temporal feature tensors corresponding to the video frames depicting the pair of lips to generate, for each output position in the output sequence, the respective output score for each token in the vocabulary of possible tokens; and determining a sequence of words expressed by the pair of lips depicted in the video using, for each output position in the output sequence, the respective output score for each token in the vocabulary of possible tokens.
Therefore, it would not have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the teachings of Aoyama, Chung and Torfi to include the features wherein the temporal subnetwork processes the spatio-temporal feature tensors corresponding to the video frames depicting the pair of lips to generate, for each output position in the output sequence, the respective output score for each token in the vocabulary of possible tokens; and determining a sequence of words expressed by the pair of lips depicted in the video using, for each output position in the output sequence, the respective output score for each token in the vocabulary of possible tokens. Accordingly, the claimed invention is deemed novel.
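
For illustration only, the step of determining a sequence of words from per-position output scores over a phoneme vocabulary can be sketched as below. This is a minimal sketch under stated assumptions and is not taken from the application or the cited references: it assumes the scores have already been produced by a temporal subnetwork, a CTC-style blank token, greedy (argmax) decoding, and a toy pronunciation lexicon. The names `BLANK`, `greedy_phonemes`, and `phonemes_to_words` are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

BLANK = "_"  # assumed "no phoneme" token (CTC-style); not stated in the record


def greedy_phonemes(scores: Sequence[Sequence[float]], vocab: Sequence[str]) -> List[str]:
    """Pick the top-scoring token at each output position, then merge
    consecutive repeats and drop blanks."""
    phonemes: List[str] = []
    prev = None
    for row in scores:
        best = vocab[max(range(len(row)), key=row.__getitem__)]
        if best != prev and best != BLANK:
            phonemes.append(best)
        prev = best
    return phonemes


def phonemes_to_words(phonemes: Sequence[str], lexicon: Dict[Tuple[str, ...], str]) -> List[str]:
    """Greedy longest-match lookup of phoneme runs in a pronunciation lexicon."""
    longest = max((len(key) for key in lexicon), default=0)
    words: List[str] = []
    i = 0
    while i < len(phonemes):
        for n in range(min(longest, len(phonemes) - i), 0, -1):
            word = lexicon.get(tuple(phonemes[i:i + n]))
            if word is not None:
                words.append(word)
                i += n
                break
        else:
            i += 1  # skip a phoneme that starts no known word
    return words


# Usage with a toy vocabulary and lexicon (both invented for this example).
vocab = ["_", "HH", "AY", "DH", "EH", "R"]
spoken = ["HH", "HH", "AY", "_", "DH", "EH", "EH", "R"]
scores = [[1.0 if tok == s else 0.0 for tok in vocab] for s in spoken]
lexicon = {("HH", "AY"): "hi", ("DH", "EH", "R"): "there"}
print(phonemes_to_words(greedy_phonemes(scores, vocab), lexicon))  # ['hi', 'there']
```

A practical decoder would more likely combine the scores with a language model and a search procedure rather than taking the argmax at each position; the greedy version is shown only because it makes the data flow from scores to phonemes to words easy to follow.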
Any comments considered necessary by applicant must be submitted no later than the payment of the issue fee and, to avoid processing delays, should preferably accompany the issue fee.  Such submissions should be clearly labeled “Comments on Statement of Reasons for Allowance.”

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
Sutton et al. (US 6,539,354) – Methods and Devices for Producing and Using Synthetic Visual Speech Based on Natural Coarticulation
Any inquiry concerning this communication or earlier communications from the examiner should be directed to SHREYANS A PATEL whose telephone number is (571)270-0689. The examiner can normally be reached Monday-Friday 8am-5pm PST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Bhavesh Mehta, can be reached at 571-272-7453. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

SHREYANS A. PATEL
Examiner
Art Unit 2657



/SHREYANS A PATEL/Examiner, Art Unit 2656