Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Information Disclosure Statement
The information disclosure statement (IDS) was submitted on January 28, 2021. The submission is in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statement is being considered by the examiner.
Response to Arguments and Amendments
The amendment filed on July 14, 2022 has been entered. Claims 1-5, 8-19, and 21 are pending in the application. 
The applicant argues that Divakaran does not specifically teach the limitation of “a tonal language”. According to the applicant, Divakaran discloses “intonation” rather than a “tonal language”, in which tone conveys a different meaning to the word. The examiner respectfully disagrees. The mapping can be found in [0079]: “the interpretation 418 component can determine a person’s current input state from…tone or manner of speaking”. This can be interpreted as a tonal language, as the tone component can convey a different meaning to the word. Furthermore, in [0096], Divakaran discloses “The multi-modal input synthesizer 510 can analyze the audio, video, and/or tactile input 502 to determine the content and/or meaning of the input”, which implies that the system is used to determine the meaning of the audio input, and this can be interpreted as the tonal language.
Hence, the applicant’s arguments are not persuasive.
Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.

(a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.

Claims 1, 3, 8, and 21 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Divakaran (U.S. Publication No. 20170160813).
Regarding claim 1, Divakaran discloses a method of processing and/or recognizing tones in acoustic signals associated with a tonal language, in a computing device (Figure 18 – Audio Capture Device 1802 [0076] - Voice biometrics describe the characteristics of a person’s voice, such as frequency range, tonal range, Volume, accent, inflections, and so on), the method comprising:
applying a feature vector extractor to an input acoustic signal and outputting a sequence of feature vectors for the input acoustic signal ([0178] - Extracting features may include dividing an audio input signal into a number of temporal windows, generally but not always of equal duration. These temporal windows may also be referred to as frames. Acoustic characteristics such as frequency, pitch, tone, etc., can then be determined for each temporal window to identify particular linguistic features present within each temporal window)
and applying at least one runtime model of one or more neural networks to the sequence of feature vectors and producing a sequence of tones as output from the input acoustic signal ([0214] - the speech recognizer may include a neural network-based acoustic model 1816… the deep neural network can be used to associate a input sample 1830 with phonetic content. The deep neural network can produce bottleneck features 181);
wherein the sequence of tones are predicted as probabilities of each feature vector of the sequence of feature vectors representing a part of a tone of the sequence of tones ([0193] - N-gram classifiers can be applied to lexical content to produce a distribution of probabilities over a number of characteristics and emotional states).
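For illustration only, the two steps recited in claim 1 (framing an acoustic signal into feature vectors, then assigning each feature vector a probability for each tone) can be sketched as follows. This is a minimal sketch under stated assumptions and is not taken from the application or from Divakaran: the frame length, hop size, log-magnitude spectral features, five tone classes, and the single randomly initialized linear layer standing in for a trained neural network are all hypothetical.

```python
import numpy as np

def frame_signal(signal, frame_len, hop):
    """Split a 1-D signal into overlapping temporal windows (frames)."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    return np.stack([signal[i * hop : i * hop + frame_len] for i in range(n_frames)])

def tone_posteriors(features, weights, bias):
    """Map each feature vector to a probability distribution over tone classes."""
    logits = features @ weights + bias
    logits -= logits.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)  # softmax over tone classes

rng = np.random.default_rng(0)
signal = rng.standard_normal(16000)  # stand-in for one second of audio at 16 kHz (assumed)

# Feature vector extractor: one log-magnitude spectrum per frame (assumed feature type).
frames = frame_signal(signal, frame_len=400, hop=160)
features = np.log1p(np.abs(np.fft.rfft(frames, axis=1)))

# Hypothetical "runtime model": an untrained linear layer followed by softmax.
n_tones = 5
weights = rng.standard_normal((features.shape[1], n_tones)) * 0.01
bias = np.zeros(n_tones)
posteriors = tone_posteriors(features, weights, bias)
```

Each row of `posteriors` corresponds to one frame and sums to one, so the array as a whole is a frame-by-tone probability matrix of the kind the claim language describes.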
Regarding claim 3, Divakaran discloses the method, wherein the sequence of tones are combined with complimentary acoustic vectors obtained from a separate acoustic model ([0206] - A “joint” or “combined” speaker and content model models both person-specific and command-specific acoustic properties of a person's speech. The joint speaker and content model can be implemented using, for example, a phonetic model or a i-Vector).
Regarding claim 8, Divakaran discloses the method, further comprising:
mapping the sequence of feature vectors to the sequence of tones using one or more neural networks to learn at least one model to map the sequence of feature vectors to the sequence of tones ([0148] - the natural language grammars 1024 can help the virtual personal assistant platform 1014, or, more specifically, the interpreter 1016, map the person’s actual natural language dialog input to a user intent. [0178] - Extracting features may include dividing an audio input signal into a number of temporal windows, generally but not always of equal duration. These temporal windows may also be referred to as frames. Acoustic characteristics such as frequency, pitch, tone, etc., can then be determined for each temporal window to identify particular linguistic features present within each temporal window).
Regarding claim 21, Divakaran discloses a speech recognition system comprising ([0025] – speech recognition system):
an audio input device (Figure 18 – Audio Capture Device 1802);
a processor coupled to the audio input device ([0094] - The backend systems 452 can include, for example, computing resources, such as processors);
a memory coupled to the processor, the memory for estimating tones present in an input acoustic signal and outputting a sequence of feature vectors for the input acoustic signal by ([0178] - Extracting features may include dividing an audio input signal into a number of temporal windows, generally but not always of equal duration. These temporal windows may also be referred to as frames. Acoustic characteristics such as frequency, pitch, tone, etc., can then be determined for each temporal window to identify particular linguistic features present within each temporal window. [0181] - can be stored, for example, in a database-type structure, including Software modules stored in volatile memory and/or software modules embodied in non-volatile hardware storage systems):
applying a feature vector extractor to an input acoustic signal ([0178] - Extracting features may include dividing an audio input signal into a number of temporal windows, generally but not always of equal duration. These temporal windows may also be referred to as frames. Acoustic characteristics such as frequency, pitch, tone, etc., can then be determined for each temporal window to identify particular linguistic features present within each temporal window)
and outputting a sequence of feature vectors for the input acoustic signal ([0178] - Extracting features may include dividing an audio input signal into a number of temporal windows, generally but not always of equal duration. These temporal windows may also be referred to as frames. Acoustic characteristics such as frequency, pitch, tone, etc., can then be determined for each temporal window to identify particular linguistic features present within each temporal window);
and applying at least one runtime model of one or more networks to the sequence of feature vectors and producing a sequence of tones as output from the input acoustic signal ([0214] - the speech recognizer may include a neural network-based acoustic model 1816… the deep neural network can be used to associate a input sample 1830 with phonetic content. The deep neural network can produce bottleneck features 181);
wherein the sequence of tones are predicted as probabilities of each feature vector of the sequence of feature vectors representing a part of a tone of the sequence of tones ([0193] - N-gram classifiers can be applied to lexical content to produce a distribution of probabilities over a number of characteristics and emotional states).
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically taught as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 2 and 4 are rejected under 35 U.S.C. 103 as being unpatentable over Divakaran (U.S. Publication No. 20170160813) in view of Blandin (U.S. Publication No. 20170169816).
Regarding claim 2, Divakaran discloses all of the limitations as in claim 1, above.
However, Divakaran does not disclose the method wherein the sequence of tones define a tone posteriorgram.
Blandin does teach the method wherein the sequence of tones define a tone posteriorgram ([0040] - intermediate representations such as phonetic posteriorgrams, lattice representation, and the like can be used for matching phoneme sequences).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Divakaran to incorporate the teachings of Blandin in order to implement the method wherein the sequence of tones define a tone posteriorgram. Doing so allows phoneme sequences to be matched (Blandin [0040]).
Regarding claim 4, Divakaran discloses all of the limitations as in claim 3, above.
However, Divakaran does not disclose the method wherein the complimentary acoustic vectors are speech feature vectors or a phoneme posteriorgram.
Blandin does teach the method wherein the complimentary acoustic vectors are speech feature vectors or a phoneme posteriorgram ([0040] - intermediate representations such as phonetic posteriorgrams, lattice representation, and the like can be used for matching phoneme sequences).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Divakaran to incorporate the teachings of Blandin in order to implement the method wherein the complimentary acoustic vectors are speech feature vectors or a phoneme posteriorgram. Doing so allows phoneme sequences to be matched (Blandin [0040]).
Claim 5 is rejected under 35 U.S.C. 103 as being unpatentable over Divakaran (U.S. Publication No. 20170160813) in view of Blandin (U.S. Publication No. 20170169816), and further in view of Lou (U.S. Publication No. 20160240210).
Regarding claim 5, Divakaran in view of Blandin teaches all of the limitations as in claim 4, above.
However, Divakaran in view of Blandin does not teach the method wherein the speech feature vectors are provided by one of a Mel-frequency cepstral coefficients (MFCC), a filterbank features (FBANK) technique, or a perceptual linear predictive (PLP) technique.
Lou does teach the method wherein the speech feature vectors are provided by one of a Mel-frequency cepstral coefficients (MFCC), a filterbank features (FBANK) technique, or a perceptual linear predictive (PLP) technique ([0038] - An acoustic feature set can comprise Mel-frequency cepstrum coefficients (MFCC) and/or related characterizations of the speech and/or cleaned speech signals. A set can comprise Perceptual Linear Prediction (PLP) coefficients and/or any other known and/or convenient features).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Divakaran in view of Blandin to incorporate the teachings of Lou in order to implement the method wherein the speech feature vectors are provided by one of a Mel-frequency cepstral coefficients (MFCC), a filterbank features (FBANK) technique, or a perceptual linear predictive (PLP) technique. Doing so enables operation of an acoustic feature pattern matching engine within ASR (Lou [0038]).
Claim 9 is rejected under 35 U.S.C. 103 as being unpatentable over Divakaran (U.S. Publication No. 20170160813) in view of Lou (U.S. Publication No. 20160240210).
Regarding claim 9, Divakaran discloses all of the limitations as in claim 1, above.
However, Divakaran does not disclose the method, wherein the feature vector extractor comprises one or more of a multi-layer perceptron (MLP), a convolutional neural network (CNN), a recurrent neural network (RNN), a cepstrogram, a spectrogram, a Mel-filtered cepstrum coefficients (MFCC), or a filterbank coefficient (FBANK).
Lou does teach the method, wherein the feature vector extractor comprises one or more of a multi-layer perceptron (MLP), a convolutional neural network (CNN), a recurrent neural network (RNN), a cepstrogram, a spectrogram, a Mel-filtered cepstrum coefficients (MFCC), or a filterbank coefficient (FBANK) ([0038] - An acoustic feature set can comprise Mel-frequency cepstrum coefficients (MFCC) and/or related characterizations of the speech and/or cleaned speech signals. A set can comprise Perceptual Linear Prediction (PLP) coefficients and/or any other known and/or convenient features).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Divakaran to incorporate the teachings of Lou in order to implement the method, wherein the feature vector extractor comprises one or more of a multi-layer perceptron (MLP), a convolutional neural network (CNN), a recurrent neural network (RNN), a cepstrogram, a spectrogram, a Mel-filtered cepstrum coefficients (MFCC), or a filterbank coefficient (FBANK). Doing so enables operation of an acoustic feature pattern matching engine within ASR (Lou [0038]).
Claims 10 and 11 are rejected under 35 U.S.C. 103 as being unpatentable over Divakaran (U.S. Publication No. 20170160813) in view of Lou (U.S. Publication No. 20160240210), and further in view of Hall (U.S. Publication No. 20180114522).
Regarding claim 10, Divakaran in view of Lou teaches all of the limitations as in claim 9, above.
However, Divakaran in view of Lou does not teach the method, wherein the neural network is a sequence-to-sequence network.
Hall does teach the method, wherein the neural network is a sequence-to-sequence network ([0056] - a more efficient neural network sequence-to-sequence synthesizer may be implemented).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Divakaran in view of Lou to incorporate the teachings of Hall in order to implement the method, wherein the neural network is a sequence-to-sequence network. Doing so allows for increased efficiency in the neural network (Hall [0056]).
Regarding claim 11, Divakaran in view of Lou in view of Hall teaches all of the limitations as in claim 10, above.
However, Divakaran in view of Lou does not teach the method wherein the sequence-to-sequence network comprises one or more of an MLP, a CNN, or an RNN, trained using a loss function appropriate to connectionist temporal classification (CTC) training, encoder-decoder training, or attention training.
Hall does teach the method wherein the sequence-to-sequence network comprises one or more of an MLP, a CNN, or an RNN, trained using a loss function appropriate to connectionist temporal classification (CTC) training, encoder-decoder training, or attention training ([0048] - An alignment vector and neural network recurrent state are then provided to a standard neural network).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Divakaran in view of Lou to incorporate the teachings of Hall in order to implement the method wherein the sequence-to-sequence network comprises one or more of an MLP, a CNN, or an RNN, trained using a loss function appropriate to connectionist temporal classification (CTC) training, encoder-decoder training, or attention training. Doing so allows an alignment vector and neural network recurrent state to be provided to a standard neural network, which allows for improved accuracy in pronunciation of TTS systems (Hall [0003], [0048]).
Claims 12-14 are rejected under 35 U.S.C. 103 as being unpatentable over Divakaran (U.S. Publication No. 20170160813) in view of Lou (U.S. Publication No. 20160240210), in view of Hall (U.S. Publication No. 20180114522), and further in view of Hannun (U.S. Publication No. 20160171974).
Regarding claim 12, Divakaran in view of Lou in view of Hall teaches all of the limitations as in claim 11, above.
However, Divakaran in view of Lou in view of Hall does not teach the method wherein the sequence-to-sequence network has one or more uni-directional or bi-directional recurrent layers.
Hannun does teach the method wherein the sequence-to-sequence network has one or more uni-directional or bi-directional recurrent layers ([0041] - the fourth layer is a bi-directional recurrent network).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Divakaran in view of Lou in view of Hall to incorporate the teachings of Hannun in order to implement the method wherein the sequence-to-sequence network has one or more uni-directional or bi-directional recurrent layers. Doing so allows for increased speed in computation and allows for the processing of large speech datasets and scalable RNN training (Hannun [0031]).
Regarding claim 13, Divakaran in view of Lou in view of Hall teaches all of the limitations as in claim 11, above.
However, Divakaran in view of Lou in view of Hall does not teach the method wherein when the sequence-to-sequence network is an RNN, the RNN has recurrent units such as long-short term memory (LSTM) or gated recurrent units (GRU).
Hannun does teach the method wherein when the sequence-to-sequence network is an RNN, the RNN has recurrent units such as long-short term memory (LSTM) or gated recurrent units (GRU) ([0115] - produced by RNNs and, with Long Short-Term Memory (LSTM) networks).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Divakaran in view of Lou in view of Hall to incorporate the teachings of Hannun in order to implement the method wherein when the sequence-to-sequence network is an RNN, the RNN has recurrent units such as long-short term memory (LSTM) or gated recurrent units (GRU). Doing so allows for increased speed in computation and allows for the processing of large speech datasets and scalable RNN training (Hannun [0031]).
Regarding claim 14, Divakaran in view of Lou in view of Hall in view of Hannun teaches all of the limitations as in claim 13, above.
However, Divakaran in view of Lou in view of Hall does not teach the method, where the RNN is implemented using one or more of uni-directional or bi-directional LSTM or GRU units.
Hannun does teach the method, where the RNN is implemented using one or more of uni-directional or bi-directional LSTM or GRU units ([0041] - the fourth layer is a bi-directional recurrent network [0115] - produced by RNNs and, with Long Short-Term Memory (LSTM) networks).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Divakaran in view of Lou in view of Hall to incorporate the teachings of Hannun in order to implement the method, where the RNN is implemented using one or more of uni-directional or bi-directional LSTM or GRU units. Doing so allows for increased speed in computation and allows for the processing of large speech datasets and scalable RNN training (Hannun [0031]).
Claim 15 is rejected under 35 U.S.C. 103 as being unpatentable over Divakaran (U.S. Publication No. 20170160813) in view of Lunner (U.S. Publication No. 2020005770).
Regarding claim 15, Divakaran discloses all of the limitations as in claim 1, above.
However, Divakaran does not disclose the method further comprising a preprocessing network for computing frames using a Hamming window providing to define a cepstrogram input representation.
Lunner does teach the method further comprising a preprocessing network for computing frames using a Hamming window providing to define a cepstrogram input representation ([0079] - Cepstrograms are calculated, that is, cepstral coefficients of the sound signals and the EEG signals are respectively calculated over short time frames [0094] - a windowing function such as a Hanning window).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Divakaran to incorporate the teachings of Lunner in order to implement the method further comprising a preprocessing network for computing frames using a Hamming window providing to define a cepstrogram input representation. Doing so allows for the neural network to better understand and learn the responses of the user (Lunner [0079]).
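For illustration only, the preprocessing discussed for claim 15 (windowing short frames and computing cepstral coefficients per frame) can be sketched as below. This is a minimal sketch under stated assumptions, not the applicant's or Lunner's implementation: it uses the real cepstrum (inverse FFT of the log-magnitude spectrum), and the frame length, hop size, number of retained coefficients, and the function name `cepstrogram` are hypothetical choices.

```python
import numpy as np

def cepstrogram(signal, frame_len=400, hop=160, n_coeffs=40):
    """Return a (frames x coefficients) array of real cepstra from Hamming-windowed frames."""
    window = np.hamming(frame_len)  # Hamming window applied to every frame
    n_frames = 1 + (len(signal) - frame_len) // hop
    rows = []
    for i in range(n_frames):
        frame = signal[i * hop : i * hop + frame_len] * window
        # Log-magnitude spectrum; the small constant avoids log(0).
        log_mag = np.log(np.abs(np.fft.rfft(frame)) + 1e-10)
        # Real cepstrum: inverse transform of the log-magnitude spectrum.
        cepstrum = np.fft.irfft(log_mag)
        rows.append(cepstrum[:n_coeffs])  # keep only the low-quefrency coefficients
    return np.array(rows)

rng = np.random.default_rng(0)
signal = rng.standard_normal(16000)  # stand-in for one second of audio at 16 kHz (assumed)
ceps = cepstrogram(signal)
```

The resulting array has one row per frame, so it can serve as a two-dimensional input representation for a later network stage.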
Claims 16-18 are rejected under 35 U.S.C. 103 as being unpatentable over Divakaran (U.S. Publication No. 20170160813) in view of Lunner (U.S. Publication No. 2020005770), and further in view of Penn (U.S. Publication No. 20140288928).
Regarding claim 16, Divakaran in view of Lunner teaches all of the limitations as in claim 15, above.
However, Divakaran in view of Lunner does not teach the method further comprising a convolutional neural network for performing nxm convolutions on the cepstrogram and then pooling prior to application of an activation layer.
Penn does teach the method further comprising a convolutional neural network for performing nxm convolutions on the cepstrogram and then pooling prior to application of an activation layer ([0029] - The pooling layer activations may be divided into M bands… by applying this pooling operation every n convolution layer bands).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Divakaran in view of Lunner to incorporate the teachings of Penn in order to implement the method further comprising a convolutional neural network for performing nxm convolutions on the cepstrogram and then pooling prior to application of an activation layer. Doing so allows a smaller number of bands to be obtained in the pooling layer, which provides lower frequency resolution features that may contain more useful information that may be further processed by higher layers in the CNN hierarchy (Penn [0029]).
Regarding claim 17, Divakaran in view of Lunner in view of Penn teaches all of the limitations as in claim 16, above.
However, Divakaran in view of Lunner does not teach the method wherein n = 2, 3, or 4 and m = 3 or 4.
Penn does teach the method wherein n = 2, 3, or 4 and m = 3 or 4 ([0030] - The example shown in FIG. 3 has a pooling layer with a Sub-sampling factor of 2 and a pooling size of 3).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Divakaran in view of Lunner to incorporate the teachings of Penn in order to implement the method wherein n = 2, 3, or 4 and m = 3 or 4. Doing so allows a smaller number of bands to be obtained in the pooling layer, which provides lower frequency resolution features that may contain more useful information that may be further processed by higher layers in the CNN hierarchy (Penn [0029]).
Regarding claim 18, Divakaran in view of Lunner in view of Penn teaches all of the limitations as in claim 16, above.
However, Divakaran in view of Lunner does not teach the method wherein pooling comprises 2x2 pooling, average pooling or l2-norm pooling.
Penn does teach the method wherein pooling comprises 2x2 pooling, average pooling or l2-norm pooling ([0019] - The pooling function may be an average or a maximum function or any other function that aggregates multiple values into a single value).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Divakaran in view of Lunner to incorporate the teachings of Penn in order to implement the method wherein pooling comprises 2x2 pooling, average pooling or l2-norm pooling. Doing so allows a smaller number of bands to be obtained in the pooling layer, which provides lower frequency resolution features that may contain more useful information that may be further processed by higher layers in the CNN hierarchy (Penn [0029]).
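For illustration only, the pooling alternatives named in claim 18 can be sketched as functions that aggregate each non-overlapping 2x2 block of a feature map into a single value. This is a minimal sketch under stated assumptions: the norm-based variant is assumed to be the L2 norm, and the function name `pool2x2` and the `mode` parameter are hypothetical and do not come from the application or from Penn.

```python
import numpy as np

def pool2x2(x, mode="average"):
    """Aggregate non-overlapping 2x2 blocks of a 2-D array into single values."""
    h, w = x.shape
    x = x[: h - h % 2, : w - w % 2]           # drop a trailing odd row or column
    blocks = x.reshape(h // 2, 2, w // 2, 2)  # axes 1 and 3 index within each block
    if mode == "average":
        return blocks.mean(axis=(1, 3))
    if mode == "l2":
        return np.sqrt((blocks ** 2).sum(axis=(1, 3)))
    raise ValueError(f"unknown pooling mode: {mode}")

feature_map = np.arange(16, dtype=float).reshape(4, 4)
avg = pool2x2(feature_map, mode="average")
l2 = pool2x2(feature_map, mode="l2")
```

Both variants halve each dimension of the feature map; they differ only in the aggregation function applied within each block.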
Claim 19 is rejected under 35 U.S.C. 103 as being unpatentable over Divakaran (U.S. Publication No. 20170160813) in view of Lunner (U.S. Publication No. 2020005770), in view of Penn (U.S. Publication No. 20140288928), and further in view of Hannun (U.S. Publication No. 20160171974).
Regarding claim 19, Divakaran in view of Lunner in view of Penn teaches all of the limitations as in claim 16, above.
However, Divakaran in view of Lunner in view of Penn does not teach the method wherein activation layers of the one or more neural networks is one of a rectified linear unit (ReLU) activation function using a three-layer network, a sigmoid layer or a tanh layer.
Hannun does teach the method wherein activation layers of the one or more neural networks is one of a rectified linear unit (ReLU) activation function using a three-layer network, a sigmoid layer or a tanh layer ([0040] - the clipped rectified-linear (ReLu) activation function [0047] - three layers (first layer 210, second layer 215, and third layer 220) are non-recurrent layers).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Divakaran in view of Lunner in view of Penn to incorporate the teachings of Hannun in order to implement the method wherein activation layers of the one or more neural networks is one of a rectified linear unit (ReLU) activation function using a three-layer network, a sigmoid layer or a tanh layer. Doing so allows for increased speed in computation and allows for the processing of large speech datasets and scalable RNN training (Hannun [0031]).
Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. 
Diamos (U.S. Publication No. 20180061439) teaches automatic audio captioning. Newendorp (U.S. Patent No. 9721566) teaches competing devices responding to voice triggers. Naik (U.S. Patent No. 9697822) teaches a system and method for updating an adaptive speech recognition model.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to ETHAN DANIEL KIM whose telephone number is (571) 272-1405.  The examiner can normally be reached on Monday - Friday 9:00 - 5:00.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Richemond Dorvil, can be reached at (571) 272-7602. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/ETHAN DANIEL KIM/Examiner, Art Unit 2658

/RICHEMOND DORVIL/Supervisory Patent Examiner, Art Unit 2658