DETAILED ACTION
Notice of Pre-AIA  or AIA  Status
1.	The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Response to Arguments
2.	 Applicant’s arguments and amendments in the Amendment, with respect to the rejections of claims 1, 10, and 16, and claims depending therefrom, under 35 U.S.C. 103 have been fully considered and are persuasive in part, as detailed below.  Therefore, the rejection has been withdrawn.  However, upon further consideration, new grounds of rejection are made in view of Krishnan et al., U.S. Patent App. Pub. No. 20200242465. Original Claims 1, 10, and 16 are amended.  Amended independent Claims 1, 10, and 16 have been considered as discussed below.  
3.	Applicant argues in the Amendment that Zhao does not describe a “speaker signature including a vector component” as now recited in amended independent Claims 1, 10, and 16.  Krishnan et al., U.S. Patent App. Pub. No. 20200242465 is cited as teaching these features, as discussed below. 
4.	Applicant also states “the office action simply cites to pages as a whole rather than particularly pointing out the actual elements of the prior art that it relies on.”  The only example provided is the statement “For example, the Rejection never discloses which element of the prior art it is establishing as the ‘speaker signature.’”  The instant specification is then quoted with respect to details of the “speaker signature.”
5.	Initially, it is respectfully noted that the specific interpretation of “speaker signature” was in fact included in the Office Communication dated January 21, 2022 in the paragraph at the top of page 5.  (“The output of the LSTM layer is cited as “a speaker signature.””)  It is believed that all elements of the claim were properly mapped in the Office Communication of January 21, 2022.  Further, details from the specification are not generally read into the claims.  See MPEP 2111.01 (II).  Accordingly, the previous interpretation of “speaker signature” is believed to be proper.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.


6.	Claims 1-20 are rejected under 35 U.S.C. 103 as being unpatentable over Zhao et al. (“Exploring Deep Spectrum Representations via Attention-Based Recurrent and Convolutional Neural Networks for Speech Emotion Recognition,” hereinafter “Zhao”) in view of Luo et al. (“Conv-TasNet: Surpassing Ideal Time-Frequency Magnitude Masking for Speech Separation,” hereinafter “Luo”), U.S. Patent Pub. No. 20170032806 (Konjeti et al., hereinafter “Konjeti”), and U.S. Patent Pub. No. 20200242465 (Krishnan et al., hereinafter “Krishnan”).
	With regard to Claim 1, Zhao describes:
“A voice recognition system, comprising:
… identify the user utilizing a first encoder that includes a first convolutional neural network to output a speaker signature; (The LSTM layer in Zhao, described at page 97517)
output a matrix representative of the environmental noise and the one or more spoken dialogue commands; (The output of the FCN structure in Zhao, described at page 97519)
extract speech data from a mixture of the one or more spoken dialogue commands and the environmental noise utilizing a residual convolution neural network that includes one or more layers and utilizing the speaker signature.” (The output layer in Zhao, described at page 97517)


[Image: Figure 1 of Zhao]


Figure 1 of Zhao (shown above) describes a system including a pair of Bi-LSTM (LSTM layer cited as “a first encoder”) that feed into an attention layer.  The output of the LSTM layer is cited as “a speaker signature.”  The speaker signature then feeds into an output layer (cited as “a residual CNN”).
Zhao does not explicitly describe:
“a microphone configured to receive one or more spoken dialogue commands from a user and environmental noise; 
a speaker signature including a vector component; and
a processor in communication with the microphone, wherein the processor is configured to:
receive one or more spoken dialogue commands and the environmental noise from the microphone … a speaker signature derived from a time domain signal associated with the spoken dialogue commands.”
Luo describes “receive one or more spoken dialogue commands and the environmental noise from the microphone … a speaker signature derived from a time domain signal associated with the spoken dialogue commands.”  Figure 1 of Luo shows an input time domain signal including spoken dialogue commands and environmental noise.
It would have been obvious before the effective filing date of the claimed invention to include the time domain input of Luo into the system of Zhao to enable faster training and better convergence, as described on page 3 of Luo.

[Image: media_image2.png]

Zhao in view of Luo does not describe:
“a microphone configured to receive one or more spoken dialogue commands from a user and environmental noise; and
a processor in communication with the microphone” and 
“a speaker signature including a vector component.”
However, paragraph 22 of Konjeti describes a system 10 that includes a microphone 18 and controllers 14 and 16 (cited as “a processor” as described in paragraph 19 of Konjeti).
 It would have been obvious before the effective filing date of the claimed invention to include the microphone and processor of Konjeti into the system of Zhao in view of Luo to directly provide audio data, as described in paragraph 22 of Konjeti.
Zhao in view of Luo and Konjeti does not explicitly describe “a speaker signature including a vector component.”
However, paragraph 22 of Krishnan describes a semantic signature that is the output of an LSTM and that is indicative of a probability that the item belongs to one or more item categories.  A probability for each of a plurality of categories would be structured as a vector.
It would have been obvious before the effective filing date of the claimed invention to include the signature with a vector structure of Krishnan into the system of Zhao in view of Luo and Konjeti to understand the semantic meaning of pieces of unstructured data separately, as described in paragraph 25 of Krishnan.
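For illustration only, the following minimal sketch shows why a set of per-category probabilities is naturally structured as a vector.  It is not taken from Krishnan; the softmax normalization, the function name, and the score values are assumptions chosen solely to make the point concrete.

```python
import math

def softmax(scores):
    """Convert raw per-category scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical final-layer scores for three item categories (illustrative values only).
category_scores = [2.0, 1.0, 0.1]

# One probability per category: an ordered list of numbers, i.e. a vector.
signature = softmax(category_scores)
print(signature)
```

Each entry of the resulting list corresponds to one category, so the signature has one component per category.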
With regard to Claim 2, Zhao describes “the audio data indicating the spoken dialogue commands contains no environmental noise.”  Page 97519 of Zhao describes an attention layer that helps the network pay more attention to specific time-frequency regions of the input spectrogram.  Thus, this layer can be programmed to remove the environmental noise.
With regard to Claim 3, Zhao describes “the audio data indicating the spoken dialogue commands contains mitigated environmental noise.” Page 97519 of Zhao describes an attention layer that helps the network pay more attention to specific time-frequency regions of the input spectrogram.  Thus, this layer can be programmed to mitigate the environmental noise. 
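For illustration only, the attention weighting referred to with regard to Claims 2 and 3 can be sketched as follows.  This is a generic attention-pooling example, not Zhao's implementation; the function name, the frame values, and the scores are hypothetical, and in a trained network the scores would be learned rather than fixed by hand.

```python
import math

def attention_pool(frames, scores):
    """Weight each frame by softmax(scores); return (weights, weighted sum of frames)."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(frames[0])
    # Weighted sum across frames, computed separately for each feature dimension.
    pooled = [sum(w * f[k] for w, f in zip(weights, frames)) for k in range(dim)]
    return weights, pooled

# Three hypothetical feature frames; the third receives a much higher score.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
scores = [0.0, 0.0, 5.0]
weights, pooled = attention_pool(frames, scores)
print(weights, pooled)
```

The sketch shows only that frames with higher scores contribute more to the pooled output while low-scored frames contribute little.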
With regard to Claim 4, Zhao describes “the first encoder includes a multi-layer long short-term memory network.”  Figure 1 of Zhao shows that the first encoder includes a Bi-LSTM.
With regard to Claim 5, Luo describes “the audio data includes the spoken dialogue commands.”  Figure 1 of Luo shows an input time domain signal including spoken dialogue commands and environmental noise.
It would have been obvious before the effective filing date of the claimed invention to include the time domain input of Luo into the system of Zhao to enable faster training and better convergence, as described on page 3 of Luo.
With regard to Claim 6, Zhao describes “the residual convolution neural network includes multiple layers.”  Figure 1 of Zhao shows 2 Bi-LSTM layers.
With regard to Claim 7, Luo describes “the one or more layers of the residual convolution neural network includes two or more dilation segments.”  Page 4 of Luo describes:

[Image: media_image3.png, excerpt from page 4 of Luo]

It would have been obvious before the effective filing date of the claimed invention to include the dilation scheme of Luo into the system of Zhao to provide a sufficiently large temporal context window to take advantage of the long term dependencies of the speech signal, as described on page 4 of Luo.
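For illustration only, the relationship between dilation and the size of the temporal context window can be sketched with the standard receptive-field calculation for stacked, stride-1, dilated one-dimensional convolutions.  The kernel size, the number of layers, and the dilation values below are assumptions and are not taken from Luo.

```python
def receptive_field(kernel_size, dilations):
    """Input samples seen by one output of stacked stride-1 dilated 1-D convolutions."""
    # A layer with dilation d widens the window by (kernel_size - 1) * d samples.
    return 1 + sum((kernel_size - 1) * d for d in dilations)

# Hypothetical settings: kernel size 3 and four layers.
doubling = [2 ** i for i in range(4)]  # dilations 1, 2, 4, 8
constant = [1, 1, 1, 1]                # same depth without increasing dilation

wide = receptive_field(3, doubling)
narrow = receptive_field(3, constant)
print(wide, narrow)  # 31 9
```

Under these assumed settings, the layers with increasing dilation cover 31 input samples while the same number of undilated layers covers 9, which is the sense in which dilation enlarges the temporal context window.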
With regard to Claim 8, Luo describes “the two or more dilation segments include different time periods.”  Page 4 of Luo describes:

[Image: media_image3.png, excerpt from page 4 of Luo]

It would have been obvious before the effective filing date of the claimed invention to include the dilation scheme of Luo into the system of Zhao to provide a sufficiently large temporal context window to take advantage of the long term dependencies of the speech signal, as described on page 4 of Luo.
With regard to Claim 9, Zhao describes “the processor is further configured to ignore the speech data when it is not associated with the speaker signature.”  Page 97519 of Zhao describes an attention layer that helps the network pay more attention to specific time-frequency regions of the input spectrogram.  Thus, this layer can be programmed to ignore speech data when it is not associated with the speaker signature.
	With regard to Claim 10, Zhao describes:
“A voice recognition system, comprising:
… identify a user utilizing a first encoder that includes a convolutional neural network to output a speaker signature (The LSTM layer in Zhao, described at page 97517) and output a matrix representative of the environmental noise and the one or more spoken dialogue commands;  (The output of the FCN structure in Zhao, described at page 97519)
extract speech data from the mixture utilizing a residual convolution neural network (CNN) that includes one or more layers and utilizing the speaker signature.”  (The output layer in Zhao, described at page 97517)

[Image: Figure 1 of Zhao]


Figure 1 of Zhao (shown above) describes a system including a pair of Bi-LSTM (LSTM layer cited as “a first encoder”) that feed into an attention layer.  The output of the LSTM layer is cited as “a speaker signature.”  The speaker signature then feeds into an output layer (cited as “a residual CNN”).
Zhao does not describe:
“a controller configured to:
receive one or more spoken dialogue commands and environmental noise from a microphone;
receive a mixture that includes the one or more spoken dialogue commands and the environmental noise;
in response to the speech data being associated with the speaker signature, output audio data including the spoken dialogue commands” and
“a speaker signature including a vector component.”
Luo describes “receive one or more spoken dialogue commands and the environmental noise” and “receive a mixture that includes the one or more spoken dialogue commands and the environmental noise.”  Figure 1 of Luo shows an input time domain signal including spoken dialogue commands and environmental noise.
Luo also describes “in response to the speech data being associated with the speaker signature, output audio data including the spoken dialogue commands.”  Figure 1 of Luo also shows the output of one of the separated sources (cited as the “spoken dialogue commands”).
It would have been obvious before the effective filing date of the claimed invention to include the time domain input and separated output of Luo into the system of Zhao to enable faster training and better convergence, as described on page 3 of Luo.

[Image: media_image2.png]

Zhao in view of Luo does not describe: “a controller configured to:” and “receive one or more spoken dialogue commands and environmental noise from a microphone” and “a speaker signature including a vector component.”
However, paragraph 22 of Konjeti describes a system 10 that includes controllers 14 and 16 and microphone 18.
 It would have been obvious before the effective filing date of the claimed invention to include the controller and microphone of Konjeti into the system of Zhao in view of Luo to receive audio data and control the audio data analysis, as described in paragraph 22 of Konjeti.
Zhao in view of Luo and Konjeti does not explicitly describe “a speaker signature including a vector component.”
However, paragraph 22 of Krishnan describes a semantic signature that is the output of an LSTM and that is indicative of a probability that the item belongs to one or more item categories.  A probability for each of a plurality of categories would be structured as a vector.
It would have been obvious before the effective filing date of the claimed invention to include the signature with a vector structure of Krishnan into the system of Zhao in view of Luo and Konjeti to understand the semantic meaning of pieces of unstructured data separately, as described in paragraph 25 of Krishnan.
With regard to Claim 11, Zhao describes “the audio data indicating the spoken dialogue commands contains no environmental noise.”  Page 97519 of Zhao describes an attention layer that helps the network pay more attention to specific time-frequency regions of the input spectrogram.  Thus, this layer can be programmed to remove the environmental noise.
With regard to Claim 12, Zhao describes “the audio data indicating the spoken dialogue commands contains mitigated environmental noise.” Page 97519 of Zhao describes an attention layer that helps the network pay more attention to specific time-frequency regions of the input spectrogram.  Thus, this layer can be programmed to mitigate the environmental noise.
With regard to Claim 13, Zhao does not explicitly describe “the speaker signature is derived from a time domain signal associated with the spoken dialogue commands.”  However, Figure 1 of Luo shows an input time domain signal including spoken dialogue commands and environmental noise.
It would have been obvious before the effective filing date of the claimed invention to include the time domain input of Luo into the system of Zhao to enable faster training and better convergence, as described on page 3 of Luo.
With regard to Claim 14, Zhao in view of Luo does not describe “the voice recognition system is a smart speaker.”  However, paragraph 23 of Konjeti describes a system 10 that includes smart speakers 20a-20n.
 It would have been obvious before the effective filing date of the claimed invention to include the smart speakers of Konjeti into the system of Zhao in view of Luo to provide audio feedback to a user, as described in paragraph 23 of Konjeti.
With regard to Claim 15, Zhao in view of Luo does not describe “the voice recognition system is a vehicle multimedia system.”  However, paragraphs 22 and 23 of Konjeti describe a vehicle multimedia system 10.
It would have been obvious before the effective filing date of the claimed invention to include the vehicle multimedia system of Konjeti into the system of Zhao in view of Luo to receive audio commands from a vehicle occupant and provide audio feedback, as described in paragraphs 22 and 23 of Konjeti.
With regard to Claim 16, Zhao describes:
“A voice recognition system comprising:
identify a user utilizing a first encoder that includes a convolutional neural network to output a speaker signature (The LSTM layer in Zhao, described at page 97517) and output a matrix representative of the environmental noise and the one or more spoken dialogue commands; (The output of the FCN structure in Zhao, described at page 97519)
extract speech data from a mixture including the environmental noise and one or more spoken dialogue commands utilizing a residual convolution neural network (CNN) that includes one or more layers and utilizing the speaker signature” (The output layer in Zhao, described at page 97517)

[Image: Figure 1 of Zhao]

Figure 1 of Zhao (shown above) describes a system including a pair of Bi-LSTM (LSTM layer cited as “a first encoder”) that feed into an attention layer.  The output of the LSTM layer is cited as “a speaker signature.”  The speaker signature then feeds into an output layer (cited as “a residual CNN”).
Zhao does not describe:
“a computer readable medium storing instructions that, when executed by a processor, cause the processor to:
receive one or more spoken dialogue commands and environmental noise from a microphone and
in response to the speech data being associated with the speaker signature, output audio data including the spoken dialogue commands” and
“a speaker signature including a vector component.”
Luo describes “receive one or more spoken dialogue commands and environmental noise.”  Figure 1 of Luo shows an input time domain signal including spoken dialogue commands and environmental noise.
Luo also describes “in response to the speech data being associated with the speaker signature, output audio data including the spoken dialogue commands.”  Figure 1 of Luo also shows the output of one of the separated sources (cited as the “spoken dialogue commands”).
It would have been obvious before the effective filing date of the claimed invention to include the time domain input and separated output of Luo into the system of Zhao to enable faster training and better convergence, as described on page 3 of Luo.
Zhao in view of Luo does not describe: “a computer readable medium storing instructions that, when executed by a processor, cause the processor to:” and “receive one or more spoken dialogue commands and environmental noise from a microphone” and “a speaker signature including a vector component.”
However, paragraph 22 of Konjeti describes a system 10 that includes controllers 14 and 16 and microphone 18.  Paragraph 19 of Konjeti describes that the controllers can include processors, and computer readable media.
 It would have been obvious before the effective filing date of the claimed invention to include the controller and microphone of Konjeti into the system of Zhao in view of Luo to receive audio data and control the audio data analysis, as described in paragraph 22 of Konjeti.
Zhao in view of Luo and Konjeti does not explicitly describe “a speaker signature including a vector component.”
However, paragraph 22 of Krishnan describes a semantic signature that is the output of an LSTM and that is indicative of a probability that the item belongs to one or more item categories.  A probability for each of a plurality of categories would be structured as a vector.
It would have been obvious before the effective filing date of the claimed invention to include the signature with a vector structure of Krishnan into the system of Zhao in view of Luo and Konjeti to understand the semantic meaning of pieces of unstructured data separately, as described in paragraph 25 of Krishnan.
With regard to Claim 17, Zhao describes “the audio data indicating the spoken dialogue commands contains no environmental noise.”  Page 97519 of Zhao describes an attention layer that helps the network pay more attention to specific time-frequency regions of the input spectrogram.  Thus, this layer can be programmed to remove the environmental noise.
With regard to Claim 18, Zhao describes “the audio data indicating the spoken dialogue commands contains mitigated environmental noise.” Page 97519 of Zhao describes an attention layer that helps the network pay more attention to specific time-frequency regions of the input spectrogram.  Thus, this layer can be programmed to mitigate the environmental noise.
With regard to Claim 19, Zhao does not explicitly describe “the speaker signature is derived from a time domain signal associated with the spoken dialogue commands.”  However, Figure 1 of Luo shows an input time domain signal including spoken dialogue commands and environmental noise.
It would have been obvious before the effective filing date of the claimed invention to include the time domain input of Luo into the system of Zhao to enable faster training and better convergence, as described on page 3 of Luo.
With regard to Claim 20, Zhao in view of Luo does not describe “the voice recognition system is a smart speaker.”  However, paragraph 23 of Konjeti describes a system 10 that includes smart speakers 20a-20n.
 It would have been obvious before the effective filing date of the claimed invention to include the smart speakers of Konjeti into the system of Zhao in view of Luo to provide audio feedback to a user, as described in paragraph 23 of Konjeti.



Conclusion
7.	The prior art made of record and not relied upon is considered pertinent to applicant’s disclosure.
U.S. Pat. App. Pub. No. 20210165954 (Iyer et al.) also describes a signature that includes a vector.
8.	Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to EDWARD TRACY whose telephone number is (571)272-8332. The examiner can normally be reached Monday-Friday, 9 AM-5 PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Bhavesh Mehta can be reached at 571-272-7453.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/EDWARD TRACY JR./           Examiner, Art Unit 2656     

/BHAVESH M MEHTA/           Supervisory Patent Examiner, Art Unit 2656