Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
All objections/rejections not mentioned in this Office Action have been withdrawn by the Examiner.

Response to Amendments 
Applicant’s amendment filed on April 9, 2021 has been entered. 
In view of the amendment to the specification, the objections to specification have been withdrawn. 
In view of the amendment to the claims, the amendment of claims 1, 8-10, and 13 and the cancellation of claims 5 and 12 have been acknowledged and entered.  
In view of the amendment to claim 8, the objection to claim 8 is withdrawn.
In view of the amendment to claims 8-10, and 13, the interpretation of claims 8-14 under 35 U.S.C. §112(f) is withdrawn.
In view of the cancellation of claims 5 and 12, the rejection of claims 5 and 12 under 35 U.S.C. §103 is withdrawn.
In light of the amendments to claims 1-4, 6-11, and 13-14, new grounds for rejection under 35 U.S.C. §103 are provided in the response below. 

Response to Arguments
Applicant’s arguments regarding the prior art rejections under 35 U.S.C. §103, see pages 10-13 of the Response to Non-Final Office Action dated January 13, 2021, which was received on April 9, 2021 (hereinafter Response and Office Action
With respect to the rejection(s) of amended claim(s) 1-4, 6-11, and 13-14 under 35 U.S.C. §103 as being unpatentable over Kim et al. (US 2015/0302855, hereinafter Kim) in view of Sun et al. (US 9,600,231, hereinafter Sun), Applicant’s arguments in light of the amendments have been fully considered and are persuasive.  Therefore, the rejection has been withdrawn.  However, upon further consideration, a new ground(s) of rejection is made in view of Kim, Parthasarathi (U.S. Pat. App. Pub. No. 2017/0270919, hereinafter Parthasarathi), and Sun.
The Applicant has not provided any further statement and therefore, the Examiner directs the Applicant to the below rationale.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA  35 U.S.C. §102 and §103 (or as subject to pre-AIA  35 U.S.C. §102 and §103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.


Claims 1-4, 6, 8-11, and 13 are rejected under 35 U.S.C. §103 as being unpatentable over Kim in view of Parthasarathi.

Regarding claim 1, Kim discloses a voice detection method, suitable for providing a detected voice signal to a voice-to-text module, comprising: (“voice assistant application 130” which “may be configured to perform any suitable number of functions …each of which may be associated with a speech command”; Kim, ¶ [0025]) starting recording when a keyword audio signal in a first audio signal is detected ("Upon detecting the activation keyword (in the input sound stream), the DSP 250 may … start buffering (recording) the received input sound stream in the buffer memory 254."; Kim, ¶ [0046]); obtaining a plurality of keyword features in the keyword audio signal, (The method discloses obtaining "sound characteristics such as sound features and/or audio fingerprints may be extracted from the activation keyword and the speech in the buffered portions of the input sound stream," where the sound characteristics are the keyword features; Kim, ¶ [0048]) wherein the keyword features comprise an ending feature ("the DSP 250 may start buffering the input sound stream 610 in the buffer unit 320 upon detecting the end of the voice activation keyword at time T.sub.2." where detecting the end includes "determining a plurality of keyword scores for the buffered portion of the input sound stream 610 in the buffer unit 310 and comparing the keyword scores with an end threshold score," where the time frame after the maximum keyword score and the end threshold score is the ending feature; Kim, ¶¶ [0058], [0066]); [and] ending the recording according to the ending feature so as to obtain a second audio signal ("Upon detecting the activation keyword, the DSP 250 may stop buffering the input sound stream in the buffer unit 310," which ends the recording of the first audio signal including the keyword audio signal in the buffer unit 310 according to the ending feature. "The DSP 250 may start buffering the input sound stream 610 in the buffer unit 320 upon detecting the beginning of the speech command 630 at time T.sub.3," where the buffered input sound stream after time T.sub.3 is the second audio signal; Kim, ¶¶ [0042], [0059]). However, Kim fails to expressly recite transmitting the keyword audio signal to the voice-to-text module; obtaining a voice recognition feature in the keyword features; comparing the voice recognition feature with features of the second audio signal to determine whether the second audio signal and the first audio signal are provided by the same user or not; and transmitting the second audio signal to the voice-to-text module when the second audio signal and the first audio signal are provided by the same user.

Parthasarathi teaches systems and methods distinguishing between desired and undesired speech. (Parthasarathi, ¶ [0030]). Regarding claim 1, Parthasarathi discloses transmitting the keyword audio signal to the voice-to-text module ("The device 110 may convert the audio 11 into audio data 111 and send the audio data to the server(s) 120. A server(s) 120 may then receive (130) the audio data 111 corresponding to the spoken command via the network 199," where the server 120 comprises “an ASR module 250 [which] may convert the audio data 111 into text {voice-to-text module}.”; Parthasarathi, ¶¶ [0032], [0047]); obtaining a voice recognition feature in the keyword features ("The server 120 determines (132) reference audio data corresponding to the desired speaker of the input audio data 111. The reference audio data may be a first portion of the input audio data 111" where "the server 120 encodes (134) the reference audio data to obtain encoded reference audio data," and where "the reference audio data (including feature vectors...) may be encoded by an encoder to result in encoded reference audio data.... [which] may then be used for speech detection and/or speech recognition." Thus, the system obtains feature vectors (voice recognition features) in the encoded reference audio data (keyword features); Parthasarathi, ¶¶ [0032], [0107]); comparing the voice recognition feature with features of the second audio signal to determine whether the second audio signal and the first audio signal are provided by the same user or not ("The server 120 then processes (136) further input audio data (such as audio feature vectors corresponding to further audio frames) using the encoded reference audio data" and "the server 120 may use a classifier or other trained machine learning model to determine if the incoming audio feature vectors represent speech from the same speaker as the speech in the reference audio data by using the encoded reference audio data," thus determining if the reference audio data (the first audio signal) and the speech of the input audio data (the second audio signal) are provided by the same speaker (same user) by determining, through the classifier (comparing), correspondence between the incoming audio feature vectors (features of the second audio signal) and the encoded reference audio data (voice recognition features); Parthasarathi, ¶ [0032]); and transmitting the second audio signal to the voice-to-text module when the second audio signal and the first audio signal are provided by the same user (The system uses the determination of desired and undesired speech "to perform voice activity detection (VAD)... [and] may thus consider whether the audio feature vector is labeled as desired speech or undesired speech in whether or not to declare that voice activity is detected... if input audio corresponds to speech, but not necessarily to desired speech, the VAD module 222 may be configured to not declare speech detected so as not to cause the system to process undesired speech," where desired speech is defined as "speech from the same speaker as the reference audio data." Since the VAD 222 is located in the device 110, only the desired speech (second audio signal which corresponds to the first audio signal, thus provided by the same user) is treated as speech and forwarded to the server 120 (the system declares that speech is not detected for undesired speech, and the system does not further process undesired speech) and the “ASR module 250 [to] convert the audio data 111 into text {voice-to-text module}”; Parthasarathi, ¶¶ [0128], [0113], [0047], FIGS. 1 and 16A).

It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the voice assistant application of Kim to incorporate the teachings of Parthasarathi to include transmitting the keyword audio signal to the voice-to-text module; obtaining a voice recognition feature in the keyword features; comparing the voice recognition feature with features of the second audio signal to determine whether the second audio signal and the first audio signal are provided by the same user or not; and transmitting the second audio signal to the voice-to-text module when the second audio signal and the first audio signal are provided by the same user. The system and methods described in Parthasarathi “improves the ability of the system to identify speech from a desired user during a command interaction with a user in a manner that does not significantly impact latency yet still allows the system to distinguish desired speech from undesired speech.” (Parthasarathi, ¶ [0030]).
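
For illustration only, the same-user gating recited in claim 1 might be sketched as follows. The sketch assumes a cosine-similarity comparison with a fixed threshold in place of the trained classifier described in Parthasarathi; the function names, the `send` callback, and the threshold value of 0.8 are hypothetical and are not drawn from Kim, Parthasarathi, or the application.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def same_speaker(keyword_features, second_features, threshold=0.8):
    # Assumption: a similarity threshold stands in for the classifier
    # or other trained model that the cited reference describes.
    return cosine_similarity(keyword_features, second_features) >= threshold


def forward_to_voice_to_text(keyword_audio, second_audio,
                             keyword_features, second_features, send):
    """Send the keyword audio, then the second audio only if same speaker."""
    send(keyword_audio)
    if same_speaker(keyword_features, second_features):
        send(second_audio)
        return True
    return False
```

In this sketch the keyword audio signal is always transmitted, while the second audio signal is transmitted only when the comparison indicates the same user, mirroring the order of the claimed steps.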

Regarding claim 2, the rejection of claim 1 is incorporated. Kim further discloses wherein the step of starting recording when the keyword audio signal in the first audio signal is detected comprises: starting recording when a volume of the keyword audio signal is detected to be greater than or equal to a preset value ("if the received input sound stream 610 is determined to include sound exceeding the predetermined sound intensity, the duty cycle function of the sound sensor 210 may be disabled… [and] the DSP 250 may buffer the received input sound stream 610 in the buffer unit 310 of the buffer memory 254"; Kim, ¶¶ [0055], [0056]).
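
As a non-limiting illustration of the volume condition of claim 2, the following sketch starts recording at the first frame whose volume is greater than or equal to a preset value. Measuring volume as root-mean-square amplitude over a frame is an assumption made here for illustration; the names `rms` and `first_recording_frame` are hypothetical.

```python
import math


def rms(frame):
    """Root-mean-square amplitude of one frame of samples."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def first_recording_frame(frames, preset_value):
    """Index of the first frame whose volume meets the preset value, else None."""
    for index, frame in enumerate(frames):
        if rms(frame) >= preset_value:
            return index
    return None
```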

Regarding claim 3, the rejection of claim 1 is incorporated. Kim further discloses wherein the step of obtaining the keyword features in the keyword audio signal, wherein the keyword features comprise the ending feature, comprises: performing keyword processing on the keyword audio signal so as to obtain the keyword features in the keyword audio signal (The system "may sequentially extract a plurality of sound features" from the input sound stream (which includes the activation keyword) for the keyword detection score by processing the input sound stream to produce “audio fingerprints or MFCC (Mel-frequency cepstral coefficients) vectors,” which are keyword features obtained by performing keyword processing (e.g., MFCC); Kim, ¶ [0062]).

Regarding claim 4, the rejection of claim 3 is incorporated. Kim further discloses the keyword processing is at least one of sampling frequency comparison processing, short term power processing, zero-crossing processing, processing of mel scaled frequencies, cepstral coefficient processing, pitch processing, voice activity detection, fast Fourier transform or beamforming (keyword processing, as described in Kim, includes the production of “audio fingerprints or MFCC (Mel-frequency cepstral coefficients) vectors”; Kim, ¶ [0062]).

Regarding claim 6, the rejection of claim 1 is incorporated. Kim further discloses wherein the step of ending the recording according to the ending feature so as to obtain the second audio signal comprises: obtaining a plurality of recording features in the recording process ("The digital signal processor (DSP) 250 may detect the end of the activation keyword based on one or more of keyword scores determined for the input sound stream buffered in the buffer unit 310," where the keyword scores are the plurality of recording features; Kim, ¶ [0064]); comparing the ending feature with the recording features, so as to judge whether at least one of the recording features in the recording process conforms to the ending feature or not ("after determining the maximum keyword score, the DSP 250 may detect the end of the activation keyword by comparing the subsequently determined keyword scores with a predetermined end threshold score."; Kim, ¶ [0066]); and ending the recording when at least one of the recording features is judged to conform to the ending feature ("Upon detecting the activation keyword, the DSP 250 may stop buffering the input sound stream in the buffer unit 310" where detection is established "when an end of the activation keyword is received." Thus, the system ends the recording of the recording features at buffer unit 310 when the keyword scores conform to the end threshold during a time frame after the maximum keyword score; Kim, ¶¶ [0042], [0063]).
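
The end-of-keyword comparison mapped above might be sketched, for illustration only, as follows. The sketch locates the maximum keyword score and then returns the index of the first subsequent score that falls below an end threshold. Treating "conforms to the ending feature" as a score dropping below the threshold is an assumption, since the quoted passage states only that later scores are compared with the threshold; the name `detect_keyword_end` is hypothetical.

```python
def detect_keyword_end(keyword_scores, end_threshold):
    """Index of the first score after the peak that is below end_threshold.

    Returns None if the list is empty or no later score falls below it.
    """
    if not keyword_scores:
        return None
    # Position of the maximum keyword score (first occurrence on ties).
    peak = max(range(len(keyword_scores)), key=keyword_scores.__getitem__)
    # Assumption: the end is signaled by a score dropping below the threshold.
    for index in range(peak + 1, len(keyword_scores)):
        if keyword_scores[index] < end_threshold:
            return index
    return None
```

Under these assumptions, recording into the first buffer would stop at the returned index, and the audio that follows would form the second audio signal.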

Regarding claim 8, Kim discloses a voice detection device, suitable for performing voice detection on an audio signal and also suitable for being in communication with a voice-to-text module, comprising (“voice assistant application 130” which “may be configured to perform any suitable number of functions …each of which may be associated with a speech command”; Kim, ¶ [0025]): a keyword detector, used for detecting whether a first audio signal comprises a keyword audio signal or not (The systems and methods of Kim can be in the form of “logical blocks, modules, circuits, and algorithm steps,” thus disclosing modules. The DSP 250 can "detect the activation keyword (in the input sound stream)," thus detecting whether the first audio signal comprises a keyword audio signal or not; Kim, ¶¶ [0098], [0046]); a keyword processing circuit, coupled to the keyword detector, and used for obtaining a plurality of keyword features in the keyword audio signal (The system discloses obtaining "sound characteristics such as sound features and/or audio fingerprints” which “may be extracted from the activation keyword and the speech in the buffered portions of the input sound stream," where this function can be performed by interconnected modules; Kim, ¶¶ [0048], [0098]); and a recorder, coupled to the keyword detector and the keyword processing circuit, wherein the recorder is used for recording when the keyword detector detects the keyword audio signal in the first audio signal ("Upon detecting the activation keyword (in the input sound stream), the DSP 250 may … start buffering (recording) the received input sound stream in the buffer memory 254," where the DSP 250 receives the input sound stream from the interconnected modules, thus, is coupled to the keyword detector and keyword processing circuit, and where “the portion of the input sound stream 810 buffered in the buffer unit 320,” thus the recording module of Kim receives the input sound stream, including activation keyword and the speech, to then buffer (record) said input sound stream; Kim, ¶¶ [0046], [0098], [0074]); [and] ending the recording according to an ending feature of the keyword features obtained by the keyword processing circuit during the recording, so as to obtain a second audio signal ("the DSP 250 may start buffering the input sound stream 610 in the buffer unit 320 upon detecting the end of the voice activation keyword at time T.sub.2." where detecting the end includes "determining a plurality of keyword scores for the buffered portion of the input sound stream 610 in the buffer unit 310 and comparing the keyword scores with an end threshold score," where the time frame after the maximum keyword score and the end threshold score is the ending feature, and "Upon detecting the activation keyword, the DSP 250 may stop buffering the input sound stream in the buffer unit 310"; Kim, ¶¶ [0058], [0066], [0042], [0059]). However, Kim fails to expressly recite transmitting the keyword audio signal to the voice-to-text module, obtaining a voice recognition feature in the keyword features, comparing the voice recognition feature with features of the second audio signal to determine whether the second audio signal and the first audio signal are provided by the same user or not, and transmitting the second audio signal to the voice-to-text module when the second audio signal and the first audio signal are provided by the same user.

The relevance of Parthasarathi is described above with relation to claim 1. Regarding claim 8, Parthasarathi discloses transmitting the keyword audio signal to the voice-to-text module ("The device 110 may convert the audio 11 into audio data 111 and send the audio data to the server(s) 120. A server(s) 120 may then receive (130) the audio data 111 corresponding to the spoken command via the network 199," where the server 120 comprises “an ASR module 250 [which] may convert the audio data 111 into text {voice-to-text module}.”; Parthasarathi, ¶¶ [0032], [0047]); obtaining a voice recognition feature in the keyword features ("The server 120 determines (132) reference audio data corresponding to the desired speaker of the input audio data 111. The reference audio data may be a first portion of the input audio data 111" where "the server 120 encodes (134) the reference audio data to obtain encoded reference audio data," and where "the reference audio data (including feature vectors...) may be encoded by an encoder to result in encoded reference audio data.... [which] may then be used for speech detection and/or speech recognition." Thus, the system obtains feature vectors (voice recognition features) in the encoded reference audio data (keyword features); Parthasarathi, ¶¶ [0032], [0107]); comparing the voice recognition feature with features of the second audio signal to determine whether the second audio signal and the first audio signal are provided by the same user or not ("The server 120 then processes (136) further input audio data (such as audio feature vectors corresponding to further audio frames) using the encoded reference audio data" and "the server 120 may use a classifier or other trained machine learning model to determine if the incoming audio feature vectors represent speech from the same speaker as the speech in the reference audio data by using the encoded reference audio data," thus determining if the reference audio data (the first audio signal) and the speech of the input audio data (the second audio signal) are provided by the same speaker (same user) by determining, through the classifier (comparing), correspondence between the incoming audio feature vectors (features of the second audio signal) and the encoded reference audio data (voice recognition features); Parthasarathi, ¶ [0032]); and transmitting the second audio signal to the voice-to-text module when the second audio signal and the first audio signal are provided by the same user (The system uses the determination of desired and undesired speech "to perform voice activity detection (VAD)... [and] may thus consider whether the audio feature vector is labeled as desired speech or undesired speech in whether or not to declare that voice activity is detected... if input audio corresponds to speech, but not necessarily to desired speech, the VAD module 222 may be configured to not declare speech detected so as not to cause the system to process undesired speech," where desired speech is defined as "speech from the same speaker as the reference audio data." Since the VAD 222 is located in the device 110, only the desired speech (second audio signal which corresponds to the first audio signal, thus provided by the same user) is treated as speech and forwarded to the server 120 (the system declares that speech is not detected for undesired speech, and the system does not further process undesired speech) and the “ASR module 250 [to] convert the audio data 111 into text {voice-to-text module}”; Parthasarathi, ¶¶ [0128], [0113], [0047], FIGS. 1 and 16A).

It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the voice assistant application of Kim to incorporate the teachings of Parthasarathi to include transmitting the keyword audio signal to the voice-to-text module; obtaining a voice recognition feature in the keyword features; comparing the voice recognition feature with features of the second audio signal to determine whether the second audio signal and the first audio signal are provided by the same user or not; and transmitting the second audio signal to the voice-to-text module when the second audio signal and the first audio signal are provided by the same user. The system and methods described in Parthasarathi “improves the ability of the system to identify speech from a desired user during a command interaction with a user in a manner that does not significantly impact latency yet still allows the system to distinguish desired speech from undesired speech.” (Parthasarathi, ¶ [0030]).

Regarding claim 9, the rejection of claim 8 is incorporated. Claim 9 is substantially the same as claim 2 and is therefore rejected under the same rationale as above.

Regarding claim 10, the rejection of claim 8 is incorporated. Claim 10 is substantially the same as claim 3 and is therefore rejected under the same rationale as above.

Regarding claim 11, the rejection of claim 10 is incorporated. Claim 11 is substantially the same as claim 4 and is therefore rejected under the same rationale as above.

Regarding claim 13, the rejection of claim 8 is incorporated. Claim 13 is substantially the same as claim 6 and is therefore rejected under the same rationale as above.

Claims 7 and 14 are rejected under 35 U.S.C. §103 as being unpatentable over Kim and Parthasarathi as applied to claims 1 and 8, and in further view of Sun.

Regarding claim 7, the rejection of claim 1 is incorporated. Kim and Parthasarathi disclose all of the elements of the current invention as stated above. However, Kim and Parthasarathi fail to expressly disclose wherein the step of transmitting the keyword audio signal and the second audio signal to the voice-to-text module comprises: converting a voice message corresponding to the second audio signal to a text message; and providing the keyword features into a database of the voice-to-text module, wherein the keyword features are used for enhancing voice recognition.

Sun teaches improvements to systems and methods for keyword spotting. (Sun, Col. 2, line 64 – Col. 3, line 6). Regarding claim 7, Sun teaches wherein the step of transmitting the keyword audio signal and the second audio signal to the voice-to-text module comprises: converting a voice message corresponding to the second audio signal to a text message ("Following detection of a wakeword, the device sends audio data 111 corresponding to the utterance, to an ASR module 250…" which "converts the audio data 111 into text"; Sun, Col. 4, lines 25-27); and providing the keyword features into a database of the voice-to-text module (“A spoken utterance in the audio data is input (provided to) to a processor configured to perform ASR which then interprets the utterance based on the similarity between the utterance and pre-established language models 254 stored in an ASR model knowledge base (a database of the voice-to-text module).”; Sun, Col. 4, lines 55-60), wherein the keyword features are used for enhancing voice recognition (The feature vector including keyword features can be used to help the keyword detection module 220 "classify the individual audio data segment more efficiently," which enhances voice recognition; Sun, Col. 11, lines 40-50, Col. 14, lines 52-64).

It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the voice assistant application of Kim, as modified by Parthasarathi, to incorporate the teachings of Sun to include wherein the step of transmitting the keyword audio signal and the second audio signal to the voice-to-text module comprises: converting a voice message corresponding to the second audio signal to a text message; and providing the keyword features into a database of the voice-to-text module, wherein the keyword features are used for enhancing voice recognition. The system and methods described in Sun can be applied to “reduce the complexity of the classifier without sufficiently impacting performance quality, thus resulting in more efficient key word spotting.” (Sun, Col. 3, lines 6-9).

Regarding claim 14, the rejection of claim 8 is incorporated. Claim 14 is substantially the same as claim 7 and is therefore rejected under the same rationale as above.

Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Sean E. Serraguard whose telephone number is (313)446-6627.  The examiner can normally be reached on 07:00-17:00 M-F.

If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Daniel C. Washburn, can be reached on (571) 272-5551. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.






/SES/Patent Examiner, Art Unit 2657
/Paras D Shah/Primary Examiner, Art Unit 2659

05/26/2021