Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Drawings
The drawing submitted on 10/02/2020 is considered by the examiner.
Claim Objections
Claim 1 is objected to because of the following informalities: Claim 1, line 1 recites "A singe…", which should be corrected to "A single…". Appropriate correction is required. For examination purposes, the examiner has read "singe" as "single".

Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
(a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.



Claim(s) 1-20 are rejected under 35 U.S.C. 102(a)(2) as being anticipated by Holm (US 2021/0065712 A1).

Regarding Claims 1 and 11, Holm teaches: A single audio-visual automated speech recognition (AV-ASR) (speech processing module) model for transcribing speech from audio-visual data, the AV-ASR model comprising (Abstract: Systems and methods for processing speech are described. Certain examples use visual information to improve speech processing. This visual information may be image data obtained from within a vehicle. In examples, the image data features a person within the vehicle. Certain examples use the image data to obtain a speaker feature vector for use by an adapted speech processing module. The speech processing module may be configured to use the speaker feature vector to process audio data featuring an utterance. The audio data may be audio data derived from an audio capture device within the vehicle. Certain examples use neural network architectures to provide acoustic models to process the audio data and the speaker feature vector.): an encoder frontend comprising an attention mechanism configured to ([0083] The neural network architecture 522 outputs at least one speaker feature vector 525, where the speaker feature vector 525 may be derived and/or used as described in any of the other examples. FIG. 5 shows a case where the image data 545 includes a plurality of frames, e.g., from a video camera, wherein the frames feature a facial area of a person. Accordingly, a plurality of speaker feature vectors 525 may be computed using the neural network architecture 522, e.g., one for each input frame of image data. In other embodiments, there may be a many-to-one relationship between frames of input data and a speaker feature vector. It should be noted that using recurrent neural network systems, samples of the input image data 545 and the output speaker feature vectors 525 need not be temporally synchronized, e.g., a recurrent neural network architecture may act as an encoder (or integrator) over time. 
In one embodiment, the neural network architecture 522 is configured to generate an x-vector as described above. In another embodiment, an x-vector generator is configured to receive image data 545, to process the image data 545 using a convolutional neural network architecture and then to combine the output of the convolutional neural network architecture with an audio-based x-vector. In another embodiment, known x-vector configurations are extended to receive image data as well as audio data and to generate a single speaker feature vector that embodies information from both modal pathways. [0102] In FIG. 9, the speech processing module 930 receives the vector portions 926 from the feature retrieval component 922 and the vector portions 928 from the lip feature extractor 924 as inputs. In one embodiment, the speaker preprocessing module 920 may combine the vector portions 926 and 928 into a single speaker feature vector. In another embodiment, the speech processing module 930 may receive the vector portions 926 and 928 separately yet treat the vector portions as different portions of a speaker feature vector. The vector portions 926 and 928 may be combined into a single speaker feature vector by one or more of the speaker preprocessing module 920 and the speech processing module 920 using, for example, concatenation or more complex attention-based mechanisms.): receive an audio track of the audio-visual data and a video portion of the audio-visual data, the video portion of the audio-visual data comprising a plurality of video face tracks, each video face track of the plurality of video face tracks is associated with a face of a respective person ([0008] In accordance with the various aspects of the invention, a speaker feature vector is obtained using image data that features a facial area of a talking person. 
This speaker feature vector is provided as an input to a neural network architecture of an acoustic model, wherein the acoustic model is configured to use this input as well as audio data featuring the utterance. In this manner, the acoustic model is provided with additional vision-derived information that the neural network architecture may use to improve the parsing of the utterance, e.g., to compensate for the detrimental acoustic and noise properties within a vehicle. For example, configuring an acoustic model based on a particular person, and/or the mouth area of that person, as determined from image data, may improve the determination of ambiguous phonemes, e.g., that without the additional information may be erroneously transcribed based on vehicle conditions. [0062] In the example of FIG. 3, the speaker preprocessing module 320 receives image data 345 that features a facial area of a person. The person includes a driver or passenger in a vehicle as described above. The face recognition module 370 performs facial recognition on the image data 345 to identify the person, e.g., the driver or passenger within the vehicle. The face recognition module 370 includes any combination of hardware and software to perform the facial recognition.[0065] In another case, the face recognition module 370 may be configured to receive multiple images from multiple image capture devices, where each image includes an associated flag to indicate whether it is to be used to identify a currently speaking person or user. In this manner, the speech processing apparatus 300 of FIG. 
3 may be used to identify a speaker from a plurality of people within a vehicle and configure the speech processing module 330 to the specific characteristics of that speaker.); and for each video face track of the plurality of video face tracks, determine a confidence score indicating a likelihood that the face of the respective person associated with the video face track comprises a speaking face of the audio track ([0068] In accordance with various aspects and embodiments, the vehicle includes multiple image capture devices and multiple audio capture devices. As such, the speaker preprocessing module 320 provides further functionality to determine an appropriate facial area from one or more captured images. In another case, the face recognition module 370 may be configured to receive multiple images from multiple image capture devices, where each image includes an associated flag to indicate whether it is to be used to identify a currently speaking person or user. In this manner, the speech processing apparatus 300 of FIG. 3 may be used to identify a speaker from a plurality of people within a vehicle and configure the speech processing module 330 to the specific characteristics of that speaker. [0089] In one embodiments, the speaker feature vector 625 includes a classification of a person within a vehicle. For example, the speaker feature vector 625 is derived from the user identifier 376 output by the face recognition module 370 in FIG. 3. In another case, the speaker feature vector 625 includes a classification and/or set of probabilities output by a neural speaker preprocessing module such as module 520 in FIG. 5. In the latter case, the neural speaker preprocessing module includes a SoftMax layer that outputs “probabilities” for a set of potential users (including a classification for “unrecognized”). In this case, one or more frames of input image data 545 may result in a single speaker feature vector 525. 
[0100] In one embodiment, the feature retrieval component 922 receives the first set of image data 962 and outputs a vector portion 926 that includes of one or more of an i*-vector and an x-vector (e.g., as described above). In accordance with one aspect, the feature retrieval component 922 receives a single image per utterance. The lip feature extractor 924 and the speech processing module 930 receive a plurality of frames over the time of the utterance. In one case, if a facial recognition performed by the feature retrieval component 922 has a confidence value that is below a threshold, the first set of image data 962 may be updated (e.g., by using another/current frame of video) and the facial recognition reapplied until a confidence value meets a threshold (or a predefined number of attempts is exceeded). [0102] In FIG. 9, the speech processing module 930 receives the vector portions 926 from the feature retrieval component 922 and the vector portions 928 from the lip feature extractor 924 as inputs. In one embodiment, the speaker preprocessing module 920 may combine the vector portions 926 and 928 into a single speaker feature vector. In another embodiment, the speech processing module 930 may receive the vector portions 926 and 928 separately yet treat the vector portions as different portions of a speaker feature vector. The vector portions 926 and 928 may be combined into a single speaker feature vector by one or more of the speaker preprocessing module 920 and the speech processing module 920 using, for example, concatenation or more complex attention-based mechanisms. ); and a decoder (speech processing module 130) configured to process the audio track and the video face track of the plurality of video face tracks associated with the highest confidence score to determine a speech recognition result of the audio track ([0045] FIG. 1B is a schematic illustration of the speech processing apparatus 120 shown in FIG. 1A. In FIG. 
1B, the speech processing apparatus 120 includes a speech processing module 130, an image interface 140 and an audio interface 150. The image interface 140 is configured to receive image data 145. The image data 145 includes image data captured by the image capture device 110 in FIG. 1A. The audio interface 150 is configured to receive audio data 155. The audio data 155 includes audio data captured by the audio capture device 116 in FIG. 1A. The speech processing module 130 is in communication with both the image interface 140 and the audio interface 150. The speech processing module 130 is configured to process the image data 145 and the audio data 155 to generate a set of linguistic features 160 that are useable to parse an utterance of the person 102. The linguistic features 160 includes phonemes, word portions (e.g., stems or proto-words), and words (including text features such as pauses that are mapped to punctuation), as well as probabilities and other values that relate to these linguistic units. In one case, the linguistic features 160 may be used to generate a text output that represents the utterance. [0060] The arrangement of FIG. 2 allows the speech processing module 230 to be configured or adapted based on speaker features determined based on the image data 245. This provides additional information to the speech processing module 230 such that it may select linguistic features that are consistent with a particular speaker, e.g., by exploiting correlations between appearance and acoustic characteristics.).

Regarding Claims 2 and 12, Holm teaches: The AV-ASR model of claim 1, wherein the single AV-ASR model comprises a sequence-to-sequence model (See rejection of claim 1 and [0058] In accordance with one embodiment, the speech processing module 230 includes an acoustic model and the acoustic model includes a neural network architecture. For example, the acoustic model includes one or more of: a Deep Neural Network (DNN) architecture with a plurality of hidden layers; a hybrid model comprising a neural network architecture and one or more of a Gaussian Mixture Model (GMM) and a Hidden Markov Model (HMM); and a Connectionist Temporal Classification (CTC) model, e.g., comprising one or more recurrent neural networks that operates over sequences of inputs and generates sequences of linguistic features as an output. [0101] The lip feature extractor 924 may output a vector portion for each input frame of image data 964 and/or may encode features over time steps using a recurrent neural network architecture (e.g., using a Long Short Term Memory—LSTM—or Gated Recurrent Unit—GRU) or a “transformer” architecture. In the latter case, an output of the lip feature extractor 924 includes one or more of a hidden state of a recurrent neural network and an output of the recurrent neural network. Note: A sequence-to-sequence (seq2seq) model takes an input sequence (e.g., a sentence or sentences) and generates an output sequence of words. It does so by use of a recurrent neural network (RNN).).

Regarding Claims 3 and 13, Holm teaches: The AV-ASR model of claim 1, wherein the single AV-ASR model comprises an Audio-Visual Recurrent Neural Network Transducer (RNN-T) model (See rejection of claim 2 and [0080] In accordance with other embodiments, the speech processing module 400 of FIG. 4 includes one or more recurrent connections. In one embodiment, the acoustic model includes recurrent models, e.g. LSTMs. In another embodiment, there may be feedback between modules. In FIG. 4 there is a dashed line indicating a first recurrent coupling between the utterance parser 436 and the language model 434 and a dashed line indicating a second recurrent coupling between the language model 434 and the acoustic model 432. In this embodiment, a current state of the utterance parser 436 may be used to configure a future prediction of the language model 434 and a current state of the language model 434 may be used to configure a future prediction of the acoustic model 432. The recurrent coupling is omitted in certain embodiments to simplify the processing pipeline and allow for easier training. In one case, the recurrent coupling is used to compute an attention or weighting vector that is applied at a next time step. Note: An RNN-T is an LSTM-based RNN.).

Regarding Claims 4 and 14, Holm teaches: The AV-ASR model of claim 1, wherein the single AV-ASR model does not include a separate face selection system for hard-selecting which video face track of the plurality of video face tracks comprises the speaking face of the audio track (See rejection of claim 1 and [0089] In one embodiments, the speaker feature vector 625 includes a classification of a person within a vehicle. For example, the speaker feature vector 625 is derived from the user identifier 376 output by the face recognition module 370 in FIG. 3. In another case, the speaker feature vector 625 includes a classification and/or set of probabilities output by a neural speaker preprocessing module such as module 520 in FIG. 5. In the latter case, the neural speaker preprocessing module includes a SoftMax layer that outputs “probabilities” for a set of potential users (including a classification for “unrecognized”). In this case, one or more frames of input image data 545 may result in a single speaker feature vector 525. [0102] In FIG. 9, the speech processing module 930 receives the vector portions 926 from the feature retrieval component 922 and the vector portions 928 from the lip feature extractor 924 as inputs. In one embodiment, the speaker preprocessing module 920 may combine the vector portions 926 and 928 into a single speaker feature vector. The vector portions 926 and 928 may be combined into a single speaker feature vector by one or more of the speaker preprocessing module 920 and the speech processing module 920 using, for example, concatenation or more complex attention-based mechanisms.).

Regarding Claims 5 and 15, Holm teaches: The AV-ASR model of claim 1, wherein the attention mechanism is configured to generate as output an attention-weighted visual feature vector for the plurality of video face tracks, the attention-weighted visual feature vector representing a soft-selection of the video face track of the plurality of video face tracks that includes the face of the respective person with the highest likelihood of comprising the speaking face of the audio track (See rejection of claim 4 and [0056] For example, the speaker preprocessing module 220 may compute a compressed or dense numeric representation of salient information within the image data 245. In accordance with one aspect, the computation is determined based on a set of parameters, such as a set of weights, biases and/or probability coefficients. [0100] In one embodiment, the feature retrieval component 922 receives the first set of image data 962 and outputs a vector portion 926 that includes of one or more of an i*-vector and an x-vector (e.g., as described above). In one case, if a facial recognition performed by the feature retrieval component 922 has a confidence value that is below a threshold, the first set of image data 962 may be updated (e.g., by using another/current frame of video) and the facial recognition reapplied until a confidence value meets a threshold (or a predefined number of attempts is exceeded). [0101] The lip feature extractor 924 receives the second set of image data 964. The second set of image data 964 includes cropped frames of image data that focus on a mouth or lip area. The lip feature extractor 924 may receive the second set of image data 964 at a frame rate of an image capture device and/or at a subsampled frame rate (e.g., every 2 frames). The lip feature extractor 924 outputs a set of vector portions 928. These vector portions 928 include an output of an encoder that includes a neural network architecture. 
The lip feature extractor 924 may output a vector portion for each input frame of image data 964 and/or may encode features over time steps using a recurrent neural network architecture (e.g., using a Long Short Term Memory—LSTM—or Gated Recurrent Unit—GRU) or a “transformer” architecture. In the latter case, an output of the lip feature extractor 924 includes one or more of a hidden state of a recurrent neural network and an output of the recurrent neural network. [0102] In FIG. 9, the speech processing module 930 receives the vector portions 926 from the feature retrieval component 922 and the vector portions 928 from the lip feature extractor 924 as inputs. In one embodiment, the speaker preprocessing module 920 may combine the vector portions 926 and 928 into a single speaker feature vector. In another embodiment, the speech processing module 930 may receive the vector portions 926 and 928 separately yet treat the vector portions as different portions of a speaker feature vector. The vector portions 926 and 928 may be combined into a single speaker feature vector by one or more of the speaker preprocessing module 920 and the speech processing module 920 using, for example, concatenation or more complex attention-based mechanisms.).
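
For illustration only, the claimed "attention-weighted visual feature vector" can be pictured as a weighted sum of per-track feature vectors, with the weights obtained by normalizing per-track scores. The following minimal Python sketch is not taken from Holm or from the application; the function names, the feature vectors, and the scores are hypothetical and chosen only to show the soft-selection idea.

```python
import math

def attention_weights(scores):
    """Normalize raw per-track scores into weights that sum to 1 (softmax)."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weighted_feature(track_features, scores):
    """Return the weighted sum of the per-track feature vectors."""
    weights = attention_weights(scores)
    dim = len(track_features[0])
    return [
        sum(w * feat[d] for w, feat in zip(weights, track_features))
        for d in range(dim)
    ]

# Hypothetical example: three face tracks, each with a 2-dimensional feature.
track_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
scores = [0.1, 4.0, 0.2]  # the second track scores highest (assumed values)
fused = attention_weighted_feature(track_features, scores)
```

In this sketch the fused vector is dominated by the second track's features without any track being discarded outright, which is the sense in which the selection is "soft".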

Regarding Claims 6 and 16, Holm teaches: The AV-ASR model of claim 1, wherein the attention mechanism comprises a SoftMax layer having an inverse temperature parameter configured to cause the attention mechanism to converge to a hard-decision rule of selecting the video face track of the plurality of video face tracks associated with the highest confidence score as the speaking face of the audio track (See rejection of claim 5 and [0071] For example, the vector generator 372 includes one or more Deep Neural Network layers that are configured to receive one or more frames of audio data 355 and output a fixed length vector output (e.g., one vector per language). One or more pooling, non-linear functions and SoftMax layers may also be provided. [0089] In one embodiments, the speaker feature vector 625 includes a classification of a person within a vehicle. For example, the speaker feature vector 625 is derived from the user identifier 376 output by the face recognition module 370 in FIG. 3. In another case, the speaker feature vector 625 includes a classification and/or set of probabilities output by a neural speaker preprocessing module such as module 520 in FIG. 5. In the latter case, the neural speaker preprocessing module includes a SoftMax layer that outputs “probabilities” for a set of potential users (including a classification for “unrecognized”). In this case, one or more frames of input image data 545 may result in a single speaker feature vector 525. [0102] In FIG. 9, the speech processing module 930 receives the vector portions 926 from the feature retrieval component 922 and the vector portions 928 from the lip feature extractor 924 as inputs. In one embodiment, the speaker preprocessing module 920 may combine the vector portions 926 and 928 into a single speaker feature vector. 
In another embodiment, the speech processing module 930 may receive the vector portions 926 and 928 separately yet treat the vector portions as different portions of a speaker feature vector. The vector portions 926 and 928 may be combined into a single speaker feature vector by one or more of the speaker preprocessing module 920 and the speech processing module 920 using, for example, concatenation or more complex attention-based mechanisms. Note: An inverse temperature parameter is inherent in a SoftMax layer, since the SoftMax function transforms estimates of the value of different options into choice probabilities, and the inverse temperature determines the extent to which differences in the value of different options are scaled.).
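
For illustration only, the role of an inverse temperature in a SoftMax function can be sketched as follows. This Python sketch is not taken from Holm or from the application; the name `beta` for the inverse temperature and the example scores are assumptions made for the illustration. It shows that a small `beta` yields a soft distribution over the options, while a large `beta` concentrates nearly all probability on the highest-scoring option, approximating a hard decision.

```python
import math

def softmax(scores, beta=1.0):
    """SoftMax with inverse temperature beta (beta = 1 / temperature)."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(beta * (s - m)) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for three options (e.g., three candidate face tracks).
scores = [1.0, 2.0, 1.5]

soft = softmax(scores, beta=1.0)   # probability is spread across the options
hard = softmax(scores, beta=50.0)  # probability concentrates on the maximum
```

With `beta=1.0` the highest-scoring option receives roughly half of the probability mass; with `beta=50.0` it receives essentially all of it.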

Regarding Claims 7 and 17, Holm teaches: The AV-ASR model of claim 1, wherein the encoder frontend is trained on a training data set comprising: a training audio track comprising one or more spoken utterances; a first training video face track comprising a correct speaking face of the one or more spoken utterances of the training audio track; and one or more second training video face tracks, each second training video face track comprising an incorrect speaking face of the one or more spoken utterances of the training audio track (See rejection of claim 1 and [0056] The speaker preprocessing module 220 in this case may implement an information bottleneck to compute the speaker feature vector 225. In accordance with one aspect, the computation is determined based on a set of parameters, such as a set of weights, biases and/or probability coefficients. Values for these parameters may be determined via a training phase that uses a set of training data. [0085] In the example of FIG. 5, the neural network architectures of the neural speaker preprocessing module 520 and the neural speech processing module 530 may be jointly trained. In this case, a training set includes frames of image data 545, frames of audio data 555 and ground truth linguistic features (e.g., ground truth phoneme sequences, text transcriptions or voice command classifications and command parameter values). Both the neural speaker preprocessing module 520 and the neural speech processing module 530 may be training in an end-to-end manner using this training set. Parameters for both neural network architectures may then be determined using gradient descent approaches. 
In this manner, the neural network architecture 522 of the neural speaker preprocessing module 520 may “learn” parameter values (such as values for weights and/or biases for one or more neural network layers) that generate one or more speaker feature vectors 525 that improve at least acoustic processing in an in-vehicle environment, where the neural speaker preprocessing module 520 learns to extract features from the facial area of a person that improves the accuracy of the output linguistic features. [0089] In one embodiments, the speaker feature vector 625 includes a classification of a person within a vehicle. For example, the speaker feature vector 625 is derived from the user identifier 376 output by the face recognition module 370 in FIG. 3. In another case, the speaker feature vector 625 includes a classification and/or set of probabilities output by a neural speaker preprocessing module such as module 520 in FIG. 5. In the latter case, the neural speaker preprocessing module includes a SoftMax layer that outputs “probabilities” for a set of potential users (including a classification for “unrecognized”).).

Regarding Claims 8 and 18, Holm teaches: The AV-ASR model of claim 7, wherein, during training, the attention mechanism is configured to learn how to gate the first training video face track as the correct speaking face of the one or more spoken utterances of the training audio track (See rejection of claim 7).

Regarding Claims 9 and 19, Holm teaches: The AV-ASR model of claim 7, wherein the attention mechanism is trained with cross entropy loss (See rejection of claim 7 and [0102] In FIG. 9, the speech processing module 930 receives the vector portions 926 from the feature retrieval component 922 and the vector portions 928 from the lip feature extractor 924 as inputs. In one embodiment, the speaker preprocessing module 920 may combine the vector portions 926 and 928 into a single speaker feature vector. In another embodiment, the speech processing module 930 may receive the vector portions 926 and 928 separately yet treat the vector portions as different portions of a speaker feature vector. The vector portions 926 and 928 may be combined into a single speaker feature vector by one or more of the speaker preprocessing module 920 and the speech processing module 920 using, for example, concatenation or more complex attention-based mechanisms. If the sample rates of one or more of the vector portions 926, the vector portions 928 and the frames of audio data 955 differ then a common sample rate may be implemented by, for example, a receive-and-hold architecture (where values that more vary more slower are held constant at a given value until a new sample values are received), a recurrent temporal encoding (e.g., using LSTMs or GRUs as above) or an attention-based system where an attention weighting vector changes per time step. Note: Training with cross entropy loss is inherent in a neural network architecture with an attention mechanism.).

Regarding Claims 10 and 20, Holm teaches: The AV-ASR model of claim 1, wherein the decoder is configured to emit the speech recognition result of the audio track in real time to provide a streaming transcription of the audio track (See rejection of claim 1 and [0045] In one case, the linguistic features 160 may be used to generate a text output that represents the utterance.).

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. The prior art of record, Kim et al. (US 2018/0268812 A1), teaches: Synchronized video data and audio data are received. A sequence of frames of the video data that includes images corresponding to lip movement on a face is determined. The audio data is endpointed based on first audio data that corresponds to a first frame of the sequence of frames and second audio data that corresponds to a last frame of the sequence of frames. A transcription of the endpointed audio data is generated by an automated speech recognizer. The generated transcription is then provided for output.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to MOHAMMAD K ISLAM whose telephone number is (571)270-5878. The examiner can normally be reached Monday-Friday, EST (IFP).
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Bhavesh Mehta, can be reached at 571-272-7453. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/MOHAMMAD K ISLAM/Primary Examiner, Art Unit 2656