DETAILED ACTION

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Introduction
This office action is in response to communications filed on 05/16/2022. Claims 9-28 are pending, and likewise Claims 9-28 have been examined.

Response to Amendment
The amendment filed 05/16/2022 has been considered by the examiner. The amendments to claims 9, 16 and 23 addressing the claim objections have been considered, and the objections have been withdrawn.

Response to Arguments
Applicant’s arguments, see Remarks, filed 05/16/2022, with respect to the rejection(s) of claim(s) 9-28 under 102(a)(2) have been fully considered and are persuasive.  Therefore, the rejection has been withdrawn.  However, upon further consideration, a new ground(s) of rejection is made in view of Hoffmeister (US 10332508 B1) and further in view of Piersol et al. (US 10192546 B1).

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claim(s) 9-28 is/are rejected under 35 U.S.C. 103 as being unpatentable over Hoffmeister (US 10332508 B1), and further in view of Piersol et al. (US 10192546 B1).

Regarding Claim 9:
Hoffmeister teaches a computer-implemented method for generating aspects of utterance (Fig 18, shows computer with processor, memory, storage, IO),
the method comprising: receiving an input speech (Col 3, Ln 9-10, receiving a spoken user query),
the input speech comprising a speech uttered by a speaker and a noise (Col 3, Ln 9-10, receiving a spoken user query. Col 5, Ln 36-38, speech from background noise. Therefore, noise is included in input);
detecting, based the input speech, an utterance, the utterance corresponding to the speech uttered by the speaker (Col 5, Ln 27-32, determine whether speech is present in audio input);
extracting an acoustic feature from the uttered speech (Col 5, Ln 30-35, signal to noise ratio. Also Col 15, Ln 19-23, For ASR processing ... acoustic features ... LFBE ... MFCC or other features, are determined and used);
generating a set of speech recognition results with recognition scores based on the uttered speech (Col 6, Ln 49-59, utterance assigned ... confidence score);
generating a set of speech-recognition-result word vector expressions and a set of speech-recognition-result part-of-speech vector expressions based on the set of speech recognition results with recognition scores (Col 15, Ln 34-45, input ... the one hot is augmented ... including but not limited to ... word embeddings, POS tagger. Input is shown to be based on the set of results and scores as shown above in the Col 6, Ln 49-59 citation);
generating a target utterance estimation model based on the extracted acoustic feature, the generated set of speech recognition results with recognition scores, the generated set of speech-recognition-result word vector expressions, an utterance time length of the uttered speech, and the generated set of speech-recognition-result part-of-speech vector expressions (Col 15, Ln 34-45, input ... the one hot is augmented ... including but not limited to ... word embeddings, POS tagger. Col 6, Ln 59-64, each interpretation of the utterance is associated with a confidence score, ASR process outputs the most likely, the model is based on the ASR output. Col 5, Ln 29-35, signal to noise ratio is used to detect speech in input. Col 15, Ln 19-23, For ASR processing ... acoustic features ... LFBE ... MFCC or other features, are determined and used. The model is based on the ASR output which uses these features. Also, Col 14, Ln 10-15, ASR component may be configured to output data calculated by the ASR component during processing ... use such data to confirm results of ASR. Col 20, Ln 13-16, feature vector 1134 is based on data describing characteristics. Col 20, Ln 21-24, data describing characteristics ... include the ... duration (in time or number of audio frames). Also Col 20, Ln 66 – Col 21, Ln 2, classifier G ... trained on (and process during runtime) additional inputs ... (e.g., time data ...));
providing, by the generated target utterance estimation model, a probability of the uttered speech detected from the input speech being an utterance suitable for a predetermined purpose (Col 16, Ln 28-29, the above techniques may be used to assign a confidence score to an ASR result. Col 4, Ln 32-33, executes command associated with NLU results. Reference provides examples like Col 16, Ln 50-55, confirm wakeword detection),
wherein the generated target utterance estimate model predicts ... based at least on a combination of the generated set of speech-recognition-result word vector expressions and the utterance time length of the uttered speech (Col 15, Ln 34-45, input ... the one hot is augmented ... including but not limited to ... word embeddings, POS tagger. Col 6, Ln 59-64, each interpretation of the utterance is associated with a confidence score, ASR process outputs the most likely, the model is based on the ASR output. Col 20, Ln 13-16, feature vector 1134 is based on data describing characteristics. Col 20, Ln 21-24, data describing characteristics ... include the ... duration (in time or number of audio frames). Also Col 20, Ln 66 – Col 21, Ln 2, classifier G ... trained on (and process during runtime) additional inputs ... (e.g., time data ...)).
Hoffmeister does not teach wherein the generated target utterance estimate model predicts the uttered speech as a sudden noise ... wherein the predetermined purpose excludes the sudden noise; and causing, based on the probability of the uttered speech detected from the input speech being the utterance suitable for the predetermined purpose, removal of the uttered speech as the sudden noise.
U.S. Patent Application Serial No. 16/968,126; Amendment dated May 16, 2022; Reply to Office Action of February 15, 2022
In the same field of ASR, Piersol teaches wherein the generated target utterance estimate model predicts the uttered speech as a sudden noise (Col 14, Ln 61-64, In such embodiments, high-energy audio inputs of relatively short duration, which may correspond to sudden noises that are relatively unlikely to include speech, may be ignored),
wherein the predetermined purpose excludes the sudden noise (Col 14, Ln 61-64, In such embodiments, high-energy audio inputs of relatively short duration, which may correspond to sudden noises that are relatively unlikely to include speech, may be ignored);
and causing, based on the probability of the uttered speech detected from the input speech being the utterance suitable for the predetermined purpose, removal of the uttered speech as the sudden noise (Col 14, Ln 61-64, In such embodiments, high-energy audio inputs of relatively short duration, which may correspond to sudden noises that are relatively unlikely to include speech, may be ignored).
It would have been obvious for one skilled in the art, at the effective time of filing, to modify Hoffmeister with the sudden noise detection of Piersol, as it prevents malfunctions caused by non-speech sounds being detected as input speech (Piersol, Col 14, Ln 57-64).
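For illustration only, the screening described in the cited Piersol passage (ignoring high-energy audio inputs of relatively short duration) can be sketched as below. The function name, the per-frame energy representation, the 10 ms frame length, and both threshold values are assumptions made for this sketch; none of them is taken from Hoffmeister or Piersol.

```python
from typing import Sequence


def is_sudden_noise(frame_energies: Sequence[float],
                    frame_ms: float = 10.0,
                    energy_threshold: float = 0.5,
                    max_duration_ms: float = 150.0) -> bool:
    """Return True when a detected segment is loud but short.

    All parameters are hypothetical: frame_energies holds one normalized
    energy value per audio frame, and the thresholds are placeholders.
    """
    if not frame_energies:
        return False  # nothing detected, so nothing to ignore
    duration_ms = len(frame_energies) * frame_ms
    mean_energy = sum(frame_energies) / len(frame_energies)
    # High energy AND short duration: treated as a sudden noise to ignore.
    return mean_energy >= energy_threshold and duration_ms <= max_duration_ms
```

A segment flagged by such a check would simply be dropped before further processing; how either reference actually implements the check is not specified in the passages cited above.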

Regarding Claim 10:
The combination of Hoffmeister and Piersol teaches the method of claim 9, and Hoffmeister teaches wherein the target utterance estimation model is based at least on a sequence of a combination of a recognition score of a word based at least on the set of speech recognition results with recognition scores (Col 14, Ln 10-15, data calculated by the ASR component during processing, such data may include probabilities associated with certain words. Col 6, Ln 59-64, each interpretation of the utterance is associated with a confidence score, ASR process outputs the most likely, the model is based on the ASR output. Also, Col 14, Ln 10-15, ASR component may be configured to output data calculated by the ASR component during processing ... use such data to confirm results of ASR),
a word vector of the word based at least on the set of speech-recognition-result word vector expressions (Col 15, Ln 34-45, for NLU processing ... one hot vector augmented ... word embeddings that represent how individual words are used in a text corpus),
a part-of-speech vector of the word based on the set of speech-recognition-result part-of-speech vector expressions (Col 15, Ln 34-45, for NLU processing ... one hot vector augmented ... labels from a tagger e.g. part-of-speech),
and an acoustic feature of the word based on the acoustic feature of the uttered speech (Col 15, Ln 19-33, For ASR processing ... acoustic features ... LFBE ... MFCC or other features, are determined and used ... Alignments can be provided at the level of senones, phones, or any other level suitable. Since NLU uses the ASR output which uses the acoustic feature, the model is based on it. Also, Col 14, Ln 10-15, ASR component may be configured to output data calculated by the ASR component during processing ... use such data to confirm results of ASR).
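As a reading aid, the "sequence of a combination" recited in claim 10 can be pictured as one concatenated feature vector per recognized word. The sketch below is a minimal illustration under that assumption; the function name, argument names, and vector sizes are hypothetical and are not drawn from the claims or from Hoffmeister.

```python
from typing import List, Sequence


def build_word_features(scores: Sequence[float],
                        word_vecs: Sequence[Sequence[float]],
                        pos_vecs: Sequence[Sequence[float]],
                        acoustic_vecs: Sequence[Sequence[float]]) -> List[List[float]]:
    """Concatenate, per word, [score | word vector | POS vector | acoustic feature]."""
    lengths = {len(scores), len(word_vecs), len(pos_vecs), len(acoustic_vecs)}
    if len(lengths) != 1:
        raise ValueError("all inputs must describe the same number of words")
    return [
        [score] + list(word) + list(pos) + list(acoustic)
        for score, word, pos, acoustic in zip(scores, word_vecs, pos_vecs, acoustic_vecs)
    ]
```

The resulting list has one entry per word, in utterance order, which is the form a sequence model would typically consume.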

Regarding Claim 11:
The combination of Hoffmeister and Piersol teaches the method of claim 9, and Hoffmeister teaches the method further comprising: rejecting the input speech as a background noise based on the probability of the uttered speech from the input speech being the utterance suitable for a predetermined purpose (Col 16, Ln 43-45, if result is not correct or has confidence score below the threshold, the system requests the user to restate),
wherein the predetermined purpose includes a spoken dialogue (Col 4, Ln 32-33, executes command associated with NLU results).
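The threshold behavior cited for claim 11 (a confidence score below a threshold leads the system to request a restatement) can be sketched as a simple decision rule. The function name, the returned labels, and the default threshold of 0.5 are assumptions for illustration only.

```python
def handle_utterance(probability: float, threshold: float = 0.5) -> str:
    """Decide what to do with an utterance given its estimated suitability.

    The labels and the threshold value are hypothetical placeholders.
    """
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    # At or above the threshold the result is acted on; below it, it is rejected.
    return "execute_command" if probability >= threshold else "request_restatement"
```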

Regarding Claim 12:
The combination of Hoffmeister and Piersol teaches the method of claim 9, and Hoffmeister teaches wherein the target utterance estimation model is a model learned by a neural network (Abstract, Ln 6-8, the feature vector may be used with a trained classifier to confirm ASR results or assign confidence score. Col 16, Ln 2-4, to configure a classifier to operate on encoded data a DNN may be used),
the neural network processing time-series data (Col 15, Ln 34-35, for NLU processing the base input is typically text in the form of word sequences).

Regarding Claim 13:
The combination of Hoffmeister and Piersol teaches the method of claim 9, and Hoffmeister teaches the method further comprising: receiving, by the target utterance estimation model, a correct answer of the input speech for training the target utterance estimation model (Col 20, Ln 64-66, The classifier and encoders may be trained using samples of acoustic data with the annotated correct word sequence),
the correct answer being the utterance in a spoken dialogue (Col 4, Ln 32-33, executes command associated with NLU results).

Regarding Claim 14:
The combination of Hoffmeister and Piersol teaches the method of claim 9, and Hoffmeister teaches wherein each of the recognition scores comprises a numerical value based on one or more of a confidence score of speech recognition, an acoustic score indicating a similarity between the acoustic feature of the input speech and a feature based on the acoustic model, and a language score indicating a degree of matching between the speech recognition results and a language model (Col 6, Ln 53-59, the confidence score may be based on a number of factors including for example, the similarity of the sound to models for language sounds, and the likelihood that a particular word which matches the sounds would be included at the specific location).

Regarding Claim 15:
The combination of Hoffmeister and Piersol teaches the method of claim 9, and Hoffmeister teaches wherein the set of speech-recognition-result word vector expressions comprises a vector generated for each word in the set of speech recognition results with a space between adjacent words based on a morphological analysis (Col 15, Ln 34-45, For NLU processing the input ... word sequence ... represented by series of one hot vectors ... word embeddings that represent how individual words are used in a text corpus. Col 14, Ln 25-28, encoding to project data points into a vector space ... to determine how they relate to each other),
and wherein the set of speech-recognition-result part-of-speech vector expressions comprises a vector generated for each part-of-speech for words in the set of speech recognition results (Col 15, Ln 34-45, For NLU processing the input ... word sequence ... represented by series of one hot vectors ... augmented with ... labels from a tagger e.g. part-of-speech tagger).
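To make the per-word vector expressions concrete, the sketch below encodes each word and its part-of-speech tag as a one-hot vector. The vocabulary, the tag set, and the function names are assumptions chosen for the example; the cited passage also mentions word embeddings, which this sketch does not attempt to reproduce.

```python
from typing import List, Sequence, Tuple


def one_hot(index: int, size: int) -> List[int]:
    """Return a vector of zeros with a single 1 at the given index."""
    vec = [0] * size
    vec[index] = 1
    return vec


def encode(tokens: Sequence[str], tags: Sequence[str],
           vocab: Sequence[str], tagset: Sequence[str]) -> List[Tuple[List[int], List[int]]]:
    """Pair a one-hot word vector with a one-hot POS vector for each token."""
    if len(tokens) != len(tags):
        raise ValueError("each token needs exactly one POS tag")
    return [
        (one_hot(vocab.index(tok), len(vocab)), one_hot(tagset.index(tag), len(tagset)))
        for tok, tag in zip(tokens, tags)
    ]
```

For example, with the hypothetical vocabulary `["music", "play"]` and tag set `["NOUN", "VERB"]`, the tokens "play music" tagged VERB, NOUN yield one (word vector, POS vector) pair per word.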

Regarding Claim 16:
Hoffmeister teaches a system comprising: a processor; and a memory storing computer executable instructions that when executed by the processor cause the system to (Fig 18, shows computer with processor, memory, storage. Col 30, Ln 17, computer-readable instructions):
receive an input speech (Col 3, Ln 9-10, receiving a spoken user query),
the input speech comprising a speech uttered by a speaker and a noise (Col 3, Ln 9-10, receiving a spoken user query. Col 5, Ln 36-38, speech from background noise. Therefore, noise is included in input);
detect, based the input speech, an utterance, the utterance corresponding to the speech uttered by the speaker (Col 5, Ln 27-32, determine whether speech is present in audio input);
extract an acoustic feature from the uttered speech (Col 5, Ln 30-35, signal to noise ratio. Also Col 15, Ln 19-23, For ASR processing ... acoustic features ... LFBE ... MFCC or other features, are determined and used);
generate a set of speech recognition results with recognition scores based on the uttered speech (Col 6, Ln 49-59, utterance assigned ... confidence score);
generate a set of speech-recognition-result word vector expressions and a set of speech-recognition-result part-of-speech vector expressions based on the set of speech recognition results with recognition scores (Col 15, Ln 34-45, input ... the one hot is augmented ... including but not limited to ... word embeddings, POS tagger. Input is shown to be based on the set of results and scores as shown above in the Col 6, Ln 49-59 citation);
generate a target utterance estimation model based on the extracted acoustic feature, the generated set of speech recognition results with recognition scores, the generated set of speech-recognition-result word vector expressions, an utterance time length of the uttered speech, and the generated set of speech-recognition-result part-of-speech vector expressions (Col 15, Ln 34-45, input ... the one hot is augmented ... including but not limited to ... word embeddings, POS tagger. Col 6, Ln 59-64, each interpretation of the utterance is associated with a confidence score, ASR process outputs the most likely, the model is based on the ASR output. Col 5, Ln 29-35, signal to noise ratio is used to detect speech in input. Col 15, Ln 19-23, For ASR processing ... acoustic features ... LFBE ... MFCC or other features, are determined and used. The model is based on the ASR output which uses these features. Also, Col 14, Ln 10-15, ASR component may be configured to output data calculated by the ASR component during processing ... use such data to confirm results of ASR. Col 20, Ln 13-16, feature vector 1134 is based on data describing characteristics. Col 20, Ln 21-24, data describing characteristics ... include the ... duration (in time or number of audio frames). Also Col 20, Ln 66 – Col 21, Ln 2, classifier G ... trained on (and process during runtime) additional inputs ... (e.g., time data ...));
provide, by the generated target utterance estimation model, a probability of the uttered speech detected from the input speech being an utterance suitable for a predetermined purpose (Col 16, Ln 28-29, the above techniques may be used to assign a confidence score to an ASR result. Col 4, Ln 32-33, executes command associated with NLU results. Reference provides examples like Col 16, Ln 50-55, confirm wakeword detection),
wherein the generated target utterance estimate model predicts ... based at least on a combination of the generated set of speech-recognition-result word vector expressions and the utterance time length of the uttered speech (Col 15, Ln 34-45, input ... the one hot is augmented ... including but not limited to ... word embeddings, POS tagger. Col 6, Ln 59-64, each interpretation of the utterance is associated with a confidence score, ASR process outputs the most likely, the model is based on the ASR output. Col 20, Ln 13-16, feature vector 1134 is based on data describing characteristics. Col 20, Ln 21-24, data describing characteristics ... include the ... duration (in time or number of audio frames). Also Col 20, Ln 66 – Col 21, Ln 2, classifier G ... trained on (and process during runtime) additional inputs ... (e.g., time data ...)).
Hoffmeister does not teach wherein the generated target utterance estimate model predicts the uttered speech as a sudden noise ... wherein the predetermined purpose excludes the sudden noise; and causing, based on the probability of the uttered speech detected from the input speech being the utterance suitable for the predetermined purpose, removal of the uttered speech as the sudden noise.
In the same field of ASR, Piersol teaches wherein the generated target utterance estimate model predicts the uttered speech as a sudden noise (Col 14, Ln 61-64, In such embodiments, high-energy audio inputs of relatively short duration, which may correspond to sudden noises that are relatively unlikely to include speech, may be ignored),
wherein the predetermined purpose excludes the sudden noise (Col 14, Ln 61-64, In such embodiments, high-energy audio inputs of relatively short duration, which may correspond to sudden noises that are relatively unlikely to include speech, may be ignored);
and causing, based on the probability of the uttered speech detected from the input speech being the utterance suitable for the predetermined purpose, removal of the uttered speech as the sudden noise (Col 14, Ln 61-64, In such embodiments, high-energy audio inputs of relatively short duration, which may correspond to sudden noises that are relatively unlikely to include speech, may be ignored).
It would have been obvious for one skilled in the art, at the effective time of filing, to modify Hoffmeister with the sudden noise detection of Piersol, as it prevents malfunctions caused by non-speech sounds being detected as input speech (Piersol, Col 14, Ln 57-64).

Regarding Claim 17:
Claim 17 contains similar limitations as Claim 10, and is therefore rejected for the same reasons.

Regarding Claim 18:
Claim 18 contains similar limitations as Claim 11, and is therefore rejected for the same reasons.

Regarding Claim 19:
Claim 19 contains similar limitations as Claim 12, and is therefore rejected for the same reasons.

Regarding Claim 20:
Claim 20 contains similar limitations as Claim 13, and is therefore rejected for the same reasons.

Regarding Claim 21:
Claim 21 contains similar limitations as Claim 14, and is therefore rejected for the same reasons.

Regarding Claim 22:
Claim 22 contains similar limitations as Claim 15, and is therefore rejected for the same reasons.

Regarding Claim 23:
Hoffmeister teaches a computer-readable non-transitory recording medium storing computer-executable instructions that when executed by a processor cause a computer system to (Col 30, Ln 17, computer-readable instructions. Col 30, Ln 24-27, may include RAM, ROM, MRAM. Fig 18, shows computer with processor, memory, storage):
receive an input speech (Col 3, Ln 9-10, receiving a spoken user query),
the input speech comprising a speech uttered by a speaker and a noise (Col 3, Ln 9-10, receiving a spoken user query. Col 5, Ln 36-38, speech from background noise. Therefore, noise is included in input);
detect, based the input speech, an utterance, the utterance corresponding to the speech uttered by the speaker (Col 5, Ln 27-32, determine whether speech is present in audio input);
extract an acoustic feature from the uttered speech (Col 5, Ln 30-35, signal to noise ratio. Also Col 15, Ln 19-23, For ASR processing ... acoustic features ... LFBE ... MFCC or other features, are determined and used);
generate a set of speech recognition results with recognition scores based on the uttered speech (Col 6, Ln 49-59, utterance assigned ... confidence score);
generate a set of speech-recognition-result word vector expressions and a set of speech-recognition-result part-of-speech vector expressions based on the set of speech recognition results with recognition scores (Col 15, Ln 34-45, input ... the one hot is augmented ... including but not limited to ... word embeddings, POS tagger. Input is shown to be based on the set of results and scores as shown above in the Col 6, Ln 49-59 citation);
generate a target utterance estimation model based on the extracted acoustic feature, the generated set of speech recognition results with recognition scores, the generated set of speech-recognition-result word vector expressions, an utterance time length of the uttered speech, and the generated set of speech-recognition-result part-of-speech vector expressions (Col 15, Ln 34-45, input ... the one hot is augmented ... including but not limited to ... word embeddings, POS tagger. Col 6, Ln 59-64, each interpretation of the utterance is associated with a confidence score, ASR process outputs the most likely, the model is based on the ASR output. Col 5, Ln 29-35, signal to noise ratio is used to detect speech in input. Col 15, Ln 19-23, For ASR processing ... acoustic features ... LFBE ... MFCC or other features, are determined and used. The model is based on the ASR output which uses these features. Also, Col 14, Ln 10-15, ASR component may be configured to output data calculated by the ASR component during processing ... use such data to confirm results of ASR. Col 20, Ln 13-16, feature vector 1134 is based on data describing characteristics. Col 20, Ln 21-24, data describing characteristics ... include the ... duration (in time or number of audio frames). Also Col 20, Ln 66 – Col 21, Ln 2, classifier G ... trained on (and process during runtime) additional inputs ... (e.g., time data ...));
provide, by the generated target utterance estimation model, a probability of the uttered speech detected from the input speech being an utterance suitable for a predetermined purpose (Col 16, Ln 28-29, the above techniques may be used to assign a confidence score to an ASR result. Col 4, Ln 32-33, executes command associated with NLU results. Reference provides examples like Col 16, Ln 50-55, confirm wakeword detection),
wherein the generated target utterance estimate model predicts ... based at least on a combination of the generated set of speech-recognition-result word vector expressions and the utterance time length of the uttered speech (Col 15, Ln 34-45, input ... the one hot is augmented ... including but not limited to ... word embeddings, POS tagger. Col 6, Ln 59-64, each interpretation of the utterance is associated with a confidence score, ASR process outputs the most likely, the model is based on the ASR output. Col 20, Ln 13-16, feature vector 1134 is based on data describing characteristics. Col 20, Ln 21-24, data describing characteristics ... include the ... duration (in time or number of audio frames). Also Col 20, Ln 66 – Col 21, Ln 2, classifier G ... trained on (and process during runtime) additional inputs ... (e.g., time data ...)).
Hoffmeister does not teach wherein the generated target utterance estimate model predicts the uttered speech as a sudden noise ... wherein the predetermined purpose excludes the sudden noise; and causing, based on the probability of the uttered speech detected from the input speech being the utterance suitable for the predetermined purpose, removal of the uttered speech as the sudden noise.
In the same field of ASR, Piersol teaches wherein the generated target utterance estimate model predicts the uttered speech as a sudden noise (Col 14, Ln 61-64, In such embodiments, high-energy audio inputs of relatively short duration, which may correspond to sudden noises that are relatively unlikely to include speech, may be ignored),
wherein the predetermined purpose excludes the sudden noise (Col 14, Ln 61-64, In such embodiments, high-energy audio inputs of relatively short duration, which may correspond to sudden noises that are relatively unlikely to include speech, may be ignored);
and causing, based on the probability of the uttered speech detected from the input speech being the utterance suitable for the predetermined purpose, removal of the uttered speech as the sudden noise (Col 14, Ln 61-64, In such embodiments, high-energy audio inputs of relatively short duration, which may correspond to sudden noises that are relatively unlikely to include speech, may be ignored).
It would have been obvious for one skilled in the art, at the effective time of filing, to modify Hoffmeister with the sudden noise detection of Piersol, as it prevents malfunctions caused by non-speech sounds being detected as input speech (Piersol, Col 14, Ln 57-64).

Regarding Claim 24:
Claim 24 contains similar limitations as Claim 10, and is therefore rejected for the same reasons.

Regarding Claim 25:
Claim 25 contains similar limitations as Claim 11, and is therefore rejected for the same reasons.

Regarding Claim 26:
Claim 26 contains similar limitations as Claim 12, and is therefore rejected for the same reasons.

Regarding Claim 27:
Claim 27 contains similar limitations as Claim 13, and is therefore rejected for the same reasons.

Regarding Claim 28:
Claim 28 contains similar limitations as Claim 15, and is therefore rejected for the same reasons.

Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to ALEXANDER G MARLOW whose telephone number is (571)272-4536. The examiner can normally be reached Monday - Thursday 10:00 am - 8:00 pm EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Richemond Dorvil, can be reached on (571)272-7602. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/ALEXANDER G MARLOW/Assistant Examiner, Art Unit 2658

/RICHEMOND DORVIL/Supervisory Patent Examiner, Art Unit 2658