DETAILED ACTION

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection.  Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114.  Applicant's submission filed on 02/17/2022 has been entered.
 
Response to Arguments
Applicant's arguments filed 02/17/2022 have been fully considered but they are not persuasive. Regarding arguments on pages 11-12 of the Remarks, Examiner notes that the amendments to the claims do not appear to substantially change the scope, and that the interpretation presented in the interview is still valid. The weighted vector of Fan from col. 2 line 53 – col. 3 line 6 includes weighted portions of the utterance corresponding to speech from different speakers. Therefore, the portions of the first speaker’s speech would correspond to the first weighted vector and the portions of the second speaker’s speech would correspond to the second weighted vector. Therefore, the weighted vector of Fan is interpreted as a combination of the two weighted vectors corresponding to the two speakers.
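For illustration only, the interpretation above can be sketched as follows. The sketch assumes that the weighted vector is an attention-weighted sum of per-frame feature vectors and that each frame is attributable to a single speaker. The helper names, the dot-product similarity, the softmax normalization, and the toy values are hypothetical and are not taken from Fan or from the claims.

```python
import math

def dot(a, b):
    """Dot-product similarity between two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    """Normalize raw scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_sum(weights, vectors):
    """Element-wise sum of the vectors, each scaled by its weight."""
    dim = len(vectors[0])
    return [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)]

# Hypothetical per-frame feature vectors of one utterance and the
# (assumed) speaker responsible for each frame.
frames = [[1.0, 0.0], [0.8, 0.2], [0.0, 1.0], [0.1, 0.9]]
speaker = [0, 0, 1, 1]
query = [0.6, 0.4]  # hypothetical reference vector used to score similarity

# One set of weights over all frames, and the single combined weighted vector.
weights = softmax([dot(query, f) for f in frames])
combined = weighted_sum(weights, frames)

def partial(spk):
    """Weighted sum restricted to the frames of one speaker."""
    idx = [i for i, s in enumerate(speaker) if s == spk]
    return weighted_sum([weights[i] for i in idx], [frames[i] for i in idx])

first_part = partial(0)   # portion attributable to the first speaker
second_part = partial(1)  # portion attributable to the second speaker
recombined = [a + b for a, b in zip(first_part, second_part)]
```

Under these assumptions, `recombined` equals `combined` up to floating-point rounding, which is the sense in which a single weighted vector may be read as a combination of two per-speaker weighted vectors.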

Claim Objections
Claim 5 is objected to because of the following informalities: line 7 reads “each of the first”, which should read “the one of the first” to align with similar amendments to claims 12 and 17. Similarly, the third-to-last paragraphs of claims 5 and 12 and the last 5 lines of the third-to-last paragraph of claim 17 appear to have been amended differently from one another, and it is unclear whether this is intentional. Appropriate correction is required.
Claim 12 is objected to because of the following informalities: line 11 reads “between the multi-dimensional vector”, which should read “between the second multi-dimensional vector” to align with similar claims 5 and 17. Appropriate correction is required.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1, 4-6, 8, 11-13, 15, and 17 are rejected under 35 U.S.C. 103 as being unpatentable over Fan et al. (US 10,923,111 B1), hereinafter referred to as Fan, in view of Le Roux et al. (US 2019/0318725 A1), hereinafter referred to as Le Roux.

Regarding claim 1, Fan teaches:
A system comprising: 
a processing unit (Fig. 11 element 1104, col. 28 lines 6-16, where a processor is used); and 
a memory storage device including program code (Fig. 11 element 1106, col. 28 lines 6-16, where memory is used) that when executed by the processing unit enables the system to: 
determine a first plurality of multi-dimensional vectors, each of the first plurality of multi-dimensional vectors representing a respective frame of speech of a target speaker (col. 22 lines 17-39, where audio feature vectors of the beginning of an utterance are included, and col. 2 line 53 - col. 3 line 6, where the first portion corresponds to the target speaker); 
determine a second plurality of multi-dimensional vectors, each of the second plurality of multi-dimensional vectors representing a respective frame of speech of a competing speaker (col. 22 lines 17-39, where audio feature vectors of an utterance are included, and col. 2 line 53 - col. 3 line 6, where the other portions correspond to other speakers); 
determine a multi-dimensional vector representing a frame of a speech signal of at least the target speaker and the competing speaker (col. 13 line 63 - col. 14 line 16, where the second portion contains speech from a first and second speaker); 
determine, for each one of the first plurality of multi-dimensional vectors, a respective similarity between the multi-dimensional vector representing the frame of the speech signal and the one of the first plurality of multi-dimensional vectors (col. 2 line 53 - col. 3 line 6, where similarity is determined between the first and second portions of audio using the feature vectors); 
determine a weighted vector representing speech of the target speaker based on the determined respective similarities between the multi-dimensional vector representing the frame of the speech signal and each one of the first plurality of multi-dimensional vectors (col. 2 line 53 - col. 3 line 6, col. 3 line 66 - col. 4 line 31, and col. 17 lines 1-18, where an attention mechanism applies weights based on similarity to an encoded feature vector based on the first and second portions of speech); and 
determine, for each one of the second plurality of multi-dimensional vectors,  a respective similarity between the multi-dimensional vector representing the frame of the speech signal and the one of the second plurality of multi-dimensional vectors (col. 2 line 53 - col. 3 line 6, where similarity is determined between the first and second portions of audio using the feature vectors); and 
determine a second weighted vector representing speech of the competing speaker based on the determined respective similarities between the multi-dimensional vector representing the frame of the speech signal and each one of the second plurality of multi-dimensional vectors (col. 2 line 53 - col. 3 line 6, col. 3 line 66 - col. 4 line 31, and col. 17 lines 1-18, where an attention mechanism applies weights based on similarity to an encoded feature vector based on the first and second portions of speech), 
extract a frame of speech of the target speaker from the frame of the speech signal based on the weighted vector, the second weighted vector, and the frame of the speech signal (col. 2 line 53 - col. 3 line 6, where the weighted output is used to determine output data including speech from the desired speaker without other speech/noise).  
Fan does not teach that the speech signals of two or more speakers are within the same frame.
Le Roux teaches that multiple speakers are speaking simultaneously (para [0061], [0082], where the input mixture includes at least two speakers speaking simultaneously).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the system of Fan by using the input mixture of Le Roux (Le Roux para [0061]), in which the speakers speak simultaneously, as the second portion of the input of Fan (Fan col. 13 line 63 - col. 14 line 16), in order to dramatically improve technology for real-world human-machine interaction (Le Roux para [0045]).
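For illustration only, the sequence of steps recited in claim 1 can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation of the application, Fan, or Le Roux: the dot-product similarity, the softmax weighting, the function names `attend` and `extract_frame`, the toy values, and in particular the scalar-gain extraction step are hypothetical placeholders.

```python
import math

def dot(a, b):
    """Dot-product similarity between two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def attend(query, memory):
    """Return (weights, weighted_vector) for one speaker.

    One similarity is computed between the query and each stored vector,
    the similarities are normalized with a softmax, and the stored vectors
    are summed using those weights.
    """
    scores = [dot(query, m) for m in memory]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(memory[0])
    pooled = [sum(w * m[d] for w, m in zip(weights, memory)) for d in range(dim)]
    return weights, pooled

def extract_frame(mix_frame, mix_embedding, target_ctx, competing_ctx):
    """Hypothetical stand-in for extraction: scale the mixed frame by a soft
    gain in (0, 1) that grows as the mixture embedding agrees more with the
    target weighted vector than with the competing weighted vector."""
    t = math.exp(dot(mix_embedding, target_ctx))
    c = math.exp(dot(mix_embedding, competing_ctx))
    gain = t / (t + c)
    return [gain * x for x in mix_frame]

# Hypothetical toy data.
target_vectors = [[1.0, 0.0], [0.9, 0.1]]      # first plurality (target speaker)
competing_vectors = [[0.0, 1.0], [0.2, 0.8]]   # second plurality (competing speaker)
mix_embedding = [0.7, 0.3]                     # vector representing the mixed frame
mix_frame = [0.5, -0.25, 0.1]                  # the mixed frame itself

target_weights, target_ctx = attend(mix_embedding, target_vectors)
competing_weights, competing_ctx = attend(mix_embedding, competing_vectors)
target_frame = extract_frame(mix_frame, mix_embedding, target_ctx, competing_ctx)
```

Here `target_ctx` and `competing_ctx` play the roles of the weighted vector and the second weighted vector, and `target_frame` plays the role of the extracted frame of speech of the target speaker.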

Regarding claim 4, Fan in view of Le Roux teaches:
The system of claim 1, wherein the extracted frame of speech of the target speaker is determined based on the weighted vector, the second weighted vector, the multi-dimensional vector representing a frame of a speech signal, and the frame of the speech signal (Fan col. 2 line 53 - col. 3 line 6, where the weighted output is used to determine output data including speech from the desired speaker without other speech/noise).

Regarding claim 5, Fan in view of Le Roux teaches:
The system of claim 4, the program code when executed by the processing unit enables the system to: 
determine a second multi-dimensional vector representing a second frame of the speech signal of at least the target speaker and the competing speaker (Fan col. 13 line 63 - col. 14 line 16, where the second portion contains speech from a first and second speaker, col. 2 line 53 - col. 3 line 6, where multiple frames are in each portion, and Le Roux para [0061], [0082], where the input mixture includes at least two speakers speaking simultaneously); 
determine, for each one of the first plurality of multi-dimensional vectors,  a respective similarity between the second multi-dimensional vector representing the second frame of the speech signal and each of the first plurality of multi-dimensional vectors (Fan col. 2 line 53 - col. 3 line 6, where similarity is determined between the first and second portions of audio using the feature vectors); 
determine a third weighted vector representing speech of the target speaker based on the determined respective similarities between the second multi-dimensional vector representing the second frame of the speech signal and each one of the first plurality of multi-dimensional vectors (Fan col. 2 line 53 - col. 3 line 6, col. 3 line 66 - col. 4 line 31, and col. 17 lines 1-18, where an attention mechanism applies weights based on similarity to an encoded feature vector based on the first and second portions of speech); 
determine, for each one of the second plurality of multi-dimensional vectors,  a respective similarity between the second multi-dimensional vector representing the second frame of the speech signal and each one of the second plurality of multi-dimensional vectors (Fan col. 2 line 53 - col. 3 line 6, where similarity is determined between the first and second portions of audio using the feature vectors); 
determine a fourth weighted vector representing speech of the competing speaker based on the determined respective similarities between the second multi-dimensional vector representing the second frame of the speech signal and each one of the second plurality of multi-dimensional vectors, and on the second plurality of multi-dimensional vectors (Fan col. 2 line 53 - col. 3 line 6, col. 3 line 66 - col. 4 line 31, and col. 17 lines 1-18, where an attention mechanism applies weights based on similarity to an encoded feature vector based on the first and second portions of speech); and 
extract a frame of speech of the target speaker from the second frame of the speech signal based on the third weighted vector, the fourth weighted vector, and the frame of the speech signal of two or more speakers (Fan col. 2 line 53 - col. 3 line 6, where the weighted output is used to determine output data including speech from the desired speaker without other speech/noise), 
wherein the third weighted vector is different from the weighted vector and the fourth weighted vector is different from the second weighted vector (Fan col. 2 line 53 - col. 3 line 6, where multiple frames are in each portion, and calculations are performed on a per-frame basis).

Regarding claim 6, Fan in view of Le Roux teaches:
The system of claim 1, wherein a contribution of one of the first plurality of multi-dimensional vectors to the weighted vector is directly proportional to the similarity of the one of the first plurality of multi-dimensional vectors to the multi-dimensional vector representing the frame of the speech signal (Fan col. 2 line 53 - col. 3 line 6, where weights are assigned based on similarity, and col. 20 lines 17-21, where the weights indicating similarity are directly proportional to the encoded features).
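For illustration only, one way a contribution can be directly proportional to a similarity is sketched below. The sketch assumes non-negative similarities that are normalized by their sum, so that each vector's weight is its similarity divided by a common constant. The function name and the toy values are hypothetical, and this is not asserted to be the weighting used in Fan or in the application.

```python
def proportional_weights(similarities):
    """Divide each non-negative similarity by the sum of all similarities,
    so every weight is the similarity times the same constant."""
    total = sum(similarities)
    return [s / total for s in similarities]

# Hypothetical similarities between the mixed-frame vector and each of
# three stored target-speaker vectors.
similarities = [0.2, 0.6, 0.2]
vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

weights = proportional_weights(similarities)

# Contribution of each stored vector to the weighted vector.
contributions = [[w * x for x in v] for w, v in zip(weights, vectors)]
weighted_vector = [sum(c[d] for c in contributions) for d in range(2)]

# A vector with three times the similarity receives three times the weight.
ratio = weights[1] / weights[0]
```

Under this assumption, the ratio of any two weights equals the ratio of the corresponding similarities.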

Regarding claim 8, Fan teaches:
A computer-implemented method comprising: 
determining a first plurality of multi-dimensional vectors, each of the first plurality of multi-dimensional vectors representing speech of a target speaker (col. 22 lines 17-39, where audio feature vectors of the beginning of an utterance are included, and col. 2 line 53 - col. 3 line 6, where the first portion corresponds to the target speaker); 
determining a second plurality of multi-dimensional vectors, each of the second plurality of multi-dimensional vectors representing speech of a competing speaker (col. 22 lines 17-39, where audio feature vectors of an utterance are included, and col. 2 line 53 - col. 3 line 6, where the other portions correspond to other speakers); and 
determining a multi-dimensional vector representing a speech signal of at least the target speaker and the competing speaker (col. 13 line 63 - col. 14 line 16, where the second portion contains speech from a first and second speaker); 
determining, for each one of the first plurality of multi-dimensional vectors, a respective similarity between the multi-dimensional vector representing the speech signal and the one of the first plurality of multi-dimensional vectors (col. 2 line 53 - col. 3 line 6, where similarity is determined between the first and second portions of audio using the feature vectors); 
determining a weighted vector representing speech of the target speaker based on the determined respective similarities between the multi-dimensional vector representing the speech signal and each one of the first plurality of multi-dimensional vectors (col. 2 line 53 - col. 3 line 6, col. 3 line 66 - col. 4 line 31, and col. 17 lines 1-18, where an attention mechanism applies weights based on similarity to an encoded feature vector based on the first and second portions of speech, where similarity is determined between the first and second portions of audio using the feature vectors); and 
determining, for each one of the second plurality of multi-dimensional vectors,  a respective similarity between the multi-dimensional vector representing the speech signal and the one of the second plurality of multi-dimensional vectors (col. 2 line 53 - col. 3 line 6, where similarity is determined between the first and second portions of audio using the feature vectors); and 
determining a second weighted vector representing speech of the competing speaker based on the determined respective similarities between the multi-dimensional vector representing the speech signal and each one of the second plurality of multi-dimensional vectors (Fan col. 2 line 53 - col. 3 line 6, col. 3 line 66 - col. 4 line 31, and col. 17 lines 1-18, where an attention mechanism applies weights based on similarity to an encoded feature vector based on the first and second portions of speech, where similarity is determined between the first and second portions of audio using the feature vectors), 
extracting speech of the target speaker from the speech signal based on the weighted vector, the second weighted vector and the speech signal (col. 2 line 53 - col. 3 line 6, where the weighted output is used to determine output data including speech from the desired speaker without other speech/noise).
Fan does not teach that the speech signals of two or more speakers are within the same frame.
Le Roux teaches that multiple speakers are speaking simultaneously (para [0061], [0082], where the input mixture includes at least two speakers speaking simultaneously).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the system of Fan by using the input mixture of Le Roux (Le Roux para [0061]), in which the speakers speak simultaneously, as the second portion of the input of Fan (Fan col. 13 line 63 - col. 14 line 16), in order to dramatically improve technology for real-world human-machine interaction (Le Roux para [0045]).

Regarding claim 11, Fan in view of Le Roux teaches:
The method of claim 8, wherein the speech of the target speaker is extracted based on the weighted vector, the second weighted vector, the multi-dimensional vector representing the speech signal, and the speech signal (Fan col. 2 line 53 - col. 3 line 6, where the weighted output is used to determine output data including speech from the desired speaker without other speech/noise).  

Regarding claim 12, Fan in view of Le Roux teaches:
The method of claim 11, further comprising: 
determining a second multi-dimensional vector representing the speech signal (Fan col. 13 line 63 - col. 14 line 16, where the second portion contains speech from a first and second speaker, col. 2 line 53 - col. 3 line 6, where multiple frames are in each portion, and Le Roux para [0061], [0082], where the input mixture includes at least two speakers speaking simultaneously); 
determining, for each one of the first plurality of multi-dimensional vectors, a respective similarity between the second multi-dimensional vector representing the speech signal and the one of the first plurality of multi-dimensional vectors (Fan col. 2 line 53 - col. 3 line 6, where similarity is determined between the first and second portions of audio using the feature vectors); 
determining a third weighted vector representing speech of the target speaker based on the determined respective similarities between the second multi-dimensional vector representing the speech signal and each one of the first plurality of multi-dimensional vectors (Fan col. 2 line 53 - col. 3 line 6, col. 3 line 66 - col. 4 line 31, and col. 17 lines 1-18, where an attention mechanism applies weights based on similarity to an encoded feature vector based on the first and second portions of speech, where similarity is determined between the first and second portions of audio using the feature vectors); 
determining, for each one of the second plurality of multi-dimensional vectors,  a respective similarity between the multi-dimensional vector representing the speech signal and the one of the second plurality of multi-dimensional vectors (Fan col. 2 line 53 - col. 3 line 6, where similarity is determined between the first and second portions of audio using the feature vectors); 
determining a fourth weighted vector representing speech of the competing speaker based on the determined respective similarities between the second multi-dimensional vector representing the speech signal and each one of the second plurality of multi-dimensional vectors (Fan col. 2 line 53 - col. 3 line 6, col. 3 line 66 - col. 4 line 31, and col. 17 lines 1-18, where an attention mechanism applies weights based on similarity to an encoded feature vector based on the first and second portions of speech, where similarity is determined between the first and second portions of audio using the feature vectors); and 
extracting second speech of the target speaker from the speech signal based on the third weighted vector, the fourth weighted vector, and the speech signal (Fan col. 2 line 53 - col. 3 line 6, where the weighted output is used to determine output data including speech from the desired speaker without other speech/noise), 
wherein the third weighted vector is different from the weighted vector and the fourth weighted vector is different from the second weighted vector (Fan col. 2 line 53 - col. 3 line 6, where multiple frames are in each portion, and calculations are performed on a per-frame basis).

Regarding claim 13, Fan in view of Le Roux teaches:
The method of claim 8, wherein a contribution of one of the first plurality of multi-dimensional vectors to the weighted vector is directly proportional to the similarity of the one of the first plurality of multi-dimensional vectors to the multi-dimensional vector representing the speech signal (Fan col. 2 line 53 - col. 3 line 6, where weights are assigned based on similarity, and col. 20 lines 17-21, where the weights indicating similarity are directly proportional to the encoded features).

Regarding claim 15, Fan teaches:
A non-transient, computer-readable medium storing program code to be executed by a processing unit to provide: 
an embedder network to determine a first plurality of multi-dimensional vectors based on respective frames of speech of a target speaker, to determine a second plurality of multi-dimensional vectors based on respective frames of speech of a competing speaker (col. 22 lines 17-39, where audio feature vectors of an utterance are included, and col. 2 line 53 - col. 3 line 6, where the other portions correspond to other speakers), and to determine a multi-dimensional vector representing a frame of a speech signal of at least the target speaker and the competing speaker (col. 22 lines 17-39, where audio feature vectors of the beginning of an utterance are included, and col. 2 line 53 - col. 3 line 6, where the first portion corresponds to the target speaker, and col. 13 line 63 - col. 14 line 16, where the second portion contains speech from a first and second speaker); 
an attention network to determine, for each one of the first plurality of multi-dimensional vectors, a respective similarity between the multi-dimensional vector representing the frame of the speech signal and the one of the first plurality of multi-dimensional vectors, to determine a weighted vector representing speech of the target speaker based on the determined respective similarities between the multi-dimensional vector representing the frame of the speech signal and each one of the first plurality of multi-dimensional vectors (col. 2 line 53 - col. 3 line 6, col. 3 line 66 - col. 4 line 31, and col. 17 lines 1-18, where an attention mechanism applies weights based on similarity to an encoded feature vector based on the first and second portions of speech, where similarity is determined between the first and second portions of audio using the feature vectors), to determine, for each one of the second plurality of multi-dimensional vectors, a respective similarity between the multi-dimensional vector representing the frame of the speech signal and the one of the second plurality of multi-dimensional vectors, and to determine a second weighted vector representing speech of the competing speaker based on the determined respective similarities between the multi-dimensional vector representing the frame of the speech signal and each one of second plurality of multi-dimensional vectors (Fan col. 2 line 53 - col. 3 line 6, col. 3 line 66 - col. 4 line 31, and col. 17 lines 1-18, where an attention mechanism applies weights based on similarity to an encoded feature vector based on the first and second portions of speech, where similarity is determined between the first and second portions of audio using the feature vectors); and 
an extraction network to extract a frame of speech of the target speaker from the frame of the speech signal based on the weighted vector, the second weighted vector, and the frame of the speech signal (col. 2 line 53 - col. 3 line 6, where the weighted output is used to determine output data including speech from the desired speaker without other speech/noise).
Fan does not teach that the speech signals of two or more speakers are within the same frame.
Le Roux teaches that multiple speakers are speaking simultaneously (para [0061], [0082], where the input mixture includes at least two speakers speaking simultaneously).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the system of Fan by using the input mixture of Le Roux (Le Roux para [0061]), in which the speakers speak simultaneously, as the second portion of the input of Fan (Fan col. 13 line 63 - col. 14 line 16), in order to dramatically improve technology for real-world human-machine interaction (Le Roux para [0045]).

Regarding claim 17, Fan in view of Le Roux teaches:
The medium of claim 15, 
the embedder network to determine a second multi-dimensional vector representing a second frame of the speech signal of at least the target speaker and the competing speaker (Fan col. 13 line 63 - col. 14 line 16, where the second portion contains speech from a first and second speaker, col. 2 line 53 - col. 3 line 6, where multiple frames are in each portion, and Le Roux para [0061], [0082], where the input mixture includes at least two speakers speaking simultaneously), 
the attention network to determine, for each one of the first plurality of multi-dimensional vectors, a respective similarity between the second multi-dimensional vector representing the second frame of the speech signal and the one of the first plurality of multi-dimensional vectors, to determine a third weighted vector representing speech of the target speaker based on the determined respective similarities between the second multi-dimensional vector representing the second frame of the speech signal and each one of the first plurality of multi-dimensional vectors, and on the first plurality of multi-dimensional vectors, to determine, for each one of the second plurality of multi-dimensional vectors a respective similarity between the second multi-dimensional vector representing the second frame of the speech signal and the one of the second plurality of multi-dimensional vectors, and to determine a fourth weighted vector representing speech of the competing speaker based on the determined respective similarities between the second multi-dimensional vector representing the second frame of the speech signal and each one of the second plurality of multi-dimensional vectors (Fan col. 2 line 53 - col. 3 line 6, col. 3 line 66 - col. 4 line 31, and col. 17 lines 1-18, where an attention mechanism applies weights based on similarity to an encoded feature vector based on the first and second portions of speech, where similarity is determined between the first and second portions of audio using the feature vectors), and 
the extraction network to extract a second frame of speech of the target speaker from the frame of the speech signal based on the third weighted vector, the fourth weighted vector, and the frame of the speech signal (Fan col. 2 line 53 - col. 3 line 6, where the weighted output is used to determine output data including speech from the desired speaker without other speech/noise), 
wherein the third weighted vector is different from the weighted vector and the fourth weighted vector is different from the second weighted vector (Fan col. 2 line 53 - col. 3 line 6, where multiple frames are in each portion, and calculations are performed on a per-frame basis).

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. US 2020/0125820 A1 para [0025]-[0027] teaches speaker recognition using feature vector extraction and vector similarity.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to BRYAN S BLANKENAGEL whose telephone number is (571)270-0685. The examiner can normally be reached 8:00am-5:30pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Richemond Dorvil, can be reached at 571-272-7602. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/BRYAN S BLANKENAGEL/
Primary Examiner, Art Unit 2658