DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the
first inventor to file provisions of the AIA.

Response to Amendment
The amendment filed on July 5th, 2022 has been entered. Claims 1-8 and 10-18 are now
pending in the application. 

Response to Arguments
Applicant’s arguments filed on July 5th, 2022 have been fully considered and are agreed
with as to the references as cited in the Non-final Office action mailed on April 25th, 2022; however, NG does disclose features of the sentence embeddings corresponding to the voice signal, see paras. 46 and 56. In any event, the arguments are moot because the amendments change the scope of the claims, and the added limitations necessitate consideration of new art. Please see the factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 below.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35
U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness
rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.


The factual inquiries for establishing a background for determining obviousness under
35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.

Claims 1, 3-4, 8, 10-11, and 15-18 are rejected under 35 U.S.C. 103 as being
unpatentable over Visser (US 20150301796 A1), in view of NG et al. (US 2019/0355366 A1) hereinafter NG and further in view of Audio-Linguistic Embeddings for Spoken Sentences by Albert Haque hereinafter Haque and further in view of Delcroix Marc (JP 2020134567) hereinafter Marc.
Regarding claim 1, Visser teaches a method of generating a speaker identification neural
network (Para. 462, the speech validator 3206 may determine the reverberation time based
on a model, e.g., a Gaussian mixture model (GMM), a deep neural network (DNN), or another
model; furthermore, a mobile station modem device uses classifier models such as a Deep Neural Network, see para. 504; where a memory 4224 may include instructions 4260 executable by processor 4206, see para. 502; in addition, see para. 512 regarding various modifications to the aspects and the application of the disclosed principles to other aspects, as relied upon hereinafter), the method comprising:
generating a first neural network that is trained to identify a first speaker with respect
to a first voice signal in a first environment in which a first signal-to-noise (SNR) value of the
first voice signal is greater than or equal to a threshold value (Para. 400, speaker verification to
access the application i.e. identify number of speakers in this case a first speaker; Para. 406,
First speaker model generated, 3292 taught to have a first signal-to-noise ratio; Para. 462 The
speech validator 3206 may determine characteristics based on a model such as a GMM, DNN,
or other model; The threshold value for the neural networks is interpreted as arbitrary, and
the voice signals can be characterized differently when compared in magnitude to a specified
threshold value, see para. 123, where the SNR exceeds or does not exceed a threshold and speaker
models are chosen according to their training);
generating a second neural network for identifying a second speaker with respect to a
second voice signal in a second environment in which a second SNR value of the second voice
signal is less than the threshold value (Para. 400, speaker verification to access the application
i.e. identify number of speakers in this case a second speaker; Para. 406, A second speaker
model generated, 3294 where it has a second signal-to-noise ratio; Para. 462 The speech
validator 3206 may determine characteristics based on a model such as a GMM, DNN, or other model; The threshold value for the neural networks is interpreted as arbitrary, and the voice signals can be characterized differently when compared in magnitude to a specified threshold value, see para. 123, where the SNR exceeds or does not exceed a threshold and speaker models are chosen according to their training).
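For illustration only, the SNR comparison recited in the two limitations above can be sketched as follows. This is a minimal sketch and is not drawn from Visser or any other cited reference; the function names, the power values, and the 10 dB threshold are hypothetical.

```python
import math


def snr_db(signal_power: float, noise_power: float) -> float:
    """Signal-to-noise ratio in decibels: 10 * log10(P_signal / P_noise)."""
    return 10.0 * math.log10(signal_power / noise_power)


def select_network(snr: float, threshold_db: float) -> str:
    """Route a voice signal to a (hypothetical) network by its SNR.

    SNR >= threshold -> first network (cleaner environment)
    SNR <  threshold -> second network (noisier environment)
    """
    return "first_network" if snr >= threshold_db else "second_network"


# Hypothetical power values chosen only to exercise both branches.
clean = snr_db(signal_power=100.0, noise_power=1.0)
noisy = snr_db(signal_power=2.0, noise_power=1.0)

THRESHOLD_DB = 10.0  # assumed value; the claims do not fix a number
clean_choice = select_network(clean, THRESHOLD_DB)
noisy_choice = select_network(noisy, THRESHOLD_DB)
```

A signal whose SNR equals the threshold falls in the first environment, matching the "greater than or equal to" wording of the first limitation.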
However, while Visser teaches the use of i-vectors in embeddings for the neural
networks, see para. 484, Visser fails to explicitly disclose:
wherein the first neural network is configured to identify the first speaker based on a first sentence embedding vector representing a weighted sum of first embedding vectors that are output from a last hidden layer of the first neural network that is provided immediately before an output layer of the first neural network, and that correspond to sentence frames of the first voice signal;
wherein the second neural network is configured to identify the second speaker  based on a second sentence embedding vector indicating a weighted sum of second embedding vectors that are output from a last hidden layer of the second neural network that is provided immediately before an output layer of the second neural network, and that correspond to sentence frames of the second voice signal;
generating the speaker identification neural network by training the second neural
network based on a teacher-student training model in which the first neural network is set to a
teacher neural network and the second neural network is set to a student neural network, 
	wherein the speaker identification neural network comprises an attention layer to adjust the initial weights of the second neural network such that a relatively high weight is assigned to an embedding vector of a period in which a voice signal exists, and a relative low weight is assigned to an embedding vector of a period in which a noise signal exists and any voice signal does not exist, among the second embedding vectors.
In a related field of endeavor (knowledge distillation in recognition of speech, see para.
12 on pg. 6: “Referring to FIG. 5, there is shown an example of a method 500 for performing speaker recognition. The method 500 may be performed by one or more components of a speaker recognition system such as the speaker recognition system 100 described above. For example, the method 500 may be performed by the controller 120. In an example, at least one portion of the method 500 is implemented by executable code, stored on a non-transitory storage medium, that includes instructions, that when executed by at least one processor, causes the at least one processor to perform the at least one portion of the method 500 described herein”; this is the same for apparatus 600 as well, see para. 60), NG discloses that the sentence embedding vectors may be generated by the second ANN, e.g. during training of the second ANN to perform speaker recognition. The sentence embedding vectors as per para. 56 may be generated by an intermediate layer of the second ANN. In other words, the sentence embedding vectors may be generated by a layer other than the final layer of the second ANN. In some examples, the embedding vectors have a higher dimensionality than the output of the final layer of the second ANN. The sentence embedding vectors generated by the second ANN may be used as an input for training the first ANN, in order to train the first ANN to emulate the results of the second ANN, as will be described in more detail below, see para. 46, i.e., there is a first and a second sentence embedding vector, one from each neural network, generated intermediate to the output layer, i.e., immediately before the output layer and output from the last hidden layer. Furthermore, various measures (for example systems, methods, computer programs and computer-readable media) are provided in which a first neural network is trained to be used in speaker recognition. 
An embedding vector is extracted from the first neural network, the sentence embedding vector being an intermediate output of the first neural network, see para. 107. NG discloses that teacher-student neural network training may be used in a speaker recognition system. The second ANN may be a relatively large-scale ANN compared to the first ANN. The first ANN may be a relatively small-scale ANN compared to the second ANN. In some examples, the first ANN is 60-95% smaller than the second ANN. In some examples, the first ANN is obtained by compressing the second ANN using a knowledge distillation technique. The first ANN may be considered to be a “compressed” version of the second ANN, i.e., “first” and “second” are used merely to distinguish the networks; in this case, the student neural network model, referred to as the first model, is trained based on the teacher-student training model, in which the teacher neural network model is referred to as the second model, and the student neural network model is the model that is trained after the teacher model, see para. 37. These sentence embeddings from the second and first neural networks as per para. 56 correspond to sentence frames of the first and second voice signals as per the input, see para. 29. Finally, data reflect read and spontaneous speech from a large number of speakers with various acoustic channel conditions. The training data may be representative of real speech and may be sufficiently diverse to prevent overtraining and/or overfitting. In some cases, the data set used for training is modified to simulate babble noise, music noise, additive noise and/or reverberation. Additionally or alternatively, the training data may contain real noise, reverberation, intra-speaker variability and/or compression artefacts. The training data set may comprise out-of-domain data and/or in-domain data, i.e., the SNR is representative of the various conditions given, see para. 54. 
Modifying Visser to include the features of NG further discloses:
wherein the first neural network is configured to identify the first speaker  based on a first sentence embedding vector that are output from a last hidden layer of the first
neural network that is provided immediately before an output layer of the first neural network, and that correspond to sentence frames of the first voice signal (e.g. Visser’s speaker verification neural network where the first neural network initially used i-vector embeddings for speaker identification, see para. 484, now wherein the first neural network is configured to identify the first speaker  based on a first sentence embedding vector that are output from a last hidden layer of the first neural network and intermediate to the output layer i.e. immediately before, see para. 29, 46, 56, and 107, as taught by NG);
wherein the second neural network is configured to identify the second speaker based on a second sentence embedding vector that are output from a last hidden layer of the
second neural network that is provided immediately before an output layer of the second neural network, and that correspond to sentence frames of the second voice signal (e.g. Visser’s speaker verification neural network where the second neural network initially used i-vector embeddings for speaker identification, see para. 484, now wherein the second neural network is configured to identify the second speaker  based on a second sentence embedding vector that are output from a last hidden layer of the second neural network and intermediate to the output layer i.e. immediately before, see paras. 29, 46, 56, and 107, as taught by NG);
generating the speaker identification neural network by training the second neural
network based on a teacher-student training model in which the first neural network is set to a
teacher neural network and the second neural network is set to a student neural network (e.g.
Visser’s speaker verification neural network method of two models, now including the feature where the second model is trained on a teacher-student training model where the first neural network is the teacher and the second neural network is the student, see para. 37, as taught by NG).
It would have been obvious to one of ordinary skill in the art at the time the invention
was filed to apply the teachings of NG to the disclosure of Visser. Doing so would have been
predictable to one of ordinary skill in the art given the similar nature between the two
disclosures, for example, speaker verification using multiple neural networks. Further, doing so
would have provided the users of Visser with added benefits, as recent research has proposed the use of artificial neural networks (ANNs) to perform speaker recognition. In some speaker recognition scenarios, trained ANNs have been shown to offer similar or improved accuracy relative to I-vector systems, see para. 7. Furthermore, embedding vectors extracted from an intermediate layer of the second ANN may be a more reliable training target for the first ANN than an output from the final layer of the second ANN. As such, using embedding vectors as training targets for the first ANN may result in the first ANN having a greater speaker recognition accuracy than using other data as training targets as taught by NG, see para. 93. Furthermore, compressing the first speaker recognition model using a knowledge distillation compression technique enables speaker recognition to be performed with a reduced requirement of processing, storage, latency and/or power whilst maintaining a sufficient level of accuracy as taught by NG, see para. 104; moreover, a teacher-student-trained artificial neural network may perform text-independent speaker recognition more accurately and/or with a smaller footprint in terms of power, storage, latency and/or processing requirements, compared to other speaker recognition systems as taught by NG, see para. 109. Furthermore, having sentence-level embeddings that correspond to the sentence frames of the first voice signal and the second voice signal, which may be of a short duration (10 seconds, for example), may reduce a latency associated with speaker recognition, and may consequently facilitate more natural user interactions; it may further enable speaker enrolment and/or recognition to be performed without the speaker having to recite lengthy statements or dialogues, see para. 29.
	Visser in view of NG does not disclose:
wherein the first neural network is configured to identify the first speaker based on a first sentence embedding vector representing a weighted sum of first embedding vectors that are output from a last hidden layer of the first neural network that is provided immediately before an output layer of the first neural network, and that correspond to sentence frames of the first voice signal;
wherein the second neural network is configured to identify the second speaker  based on a second sentence embedding vector indicating a weighted sum of second embedding vectors that are output from a last hidden layer of the second neural network that is provided immediately before an output layer of the second neural network, and that correspond to sentence frames of the second voice signal;
In a related field of endeavor (e.g., audio-linguistic embeddings, see abstract), Haque teaches a “Uniform Average” in which intermediate embeddings are converted to a sentence embedding by computing an element-wise sum of the intermediate embeddings at each word or phoneme position and dividing by the number of words in a sentence, i.e., a weighted sum, see section 3.2.2, first item “Uniform Average”; therefore, the first sentence embedding vector is a weighted sum of the first embedding vectors and the second sentence embedding vector is a weighted sum of the second embedding vectors. 
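For illustration only, the arithmetic relationship relied upon above (an element-wise sum divided by the number of positions N equals a weighted sum in which every weight is 1/N) can be sketched as follows. This is a minimal sketch, not code from Haque; the function names and the sample vectors are hypothetical.

```python
def weighted_sum(vectors, weights):
    """Element-wise weighted sum of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]


def uniform_average(vectors):
    """Element-wise sum divided by the number of vectors.

    Implemented as a weighted sum with equal weights 1/N to make the
    equivalence explicit.
    """
    n = len(vectors)
    return weighted_sum(vectors, [1.0 / n] * n)


# Hypothetical intermediate embeddings, one per word or frame position.
intermediate = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
sentence_embedding = uniform_average(intermediate)

# The same result computed literally as "sum, then divide by N".
direct = [sum(v[i] for v in intermediate) / len(intermediate) for i in range(2)]
```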
Modifying Visser to include the features of NG, and further in view of Haque, discloses:
wherein the first neural network is configured to identify the first speaker based on a first sentence embedding vector representing a weighted sum of first embedding vectors that are output from a last hidden layer of the first neural network that is provided immediately before an output layer of the first neural network, and that correspond to sentence frames of the first voice signal (e.g. Visser’s speaker verification neural network where the first neural network initially used i-vector embeddings for speaker identification, see para. 484, now wherein the first neural network is configured to identify the first speaker  based on a first sentence embedding vector representing a weighted sum of first embedding vectors that are output from a last hidden layer of the first neural network and intermediate to the output layer i.e. immediately before, see para. 29, 46, 56, and 107, as taught by NG, and further in view of Haque’s teaching of the sentence embedding vector representing a weighted sum of first embedding vectors, see section 3.2.2. first item “Uniform Average”);
wherein the second neural network is configured to identify the second speaker  based on a second sentence embedding vector indicating a weighted sum of second embedding vectors that are output from a last hidden layer of the second neural network that is provided immediately before an output layer of the second neural network, and that correspond to sentence frames of the second voice signal (e.g. Visser’s speaker verification neural network where the second neural network initially used i-vector embeddings for speaker identification, see para. 484, now wherein the second neural network is configured to identify the second speaker based on a second sentence embedding vector indicating a weighted sum of second embedding vectors that are output from a last hidden layer of the second neural network and intermediate to the output layer i.e. immediately before, see paras. 29, 46, 56, and 107, as taught by NG, and further in view of Haque’s teaching of the sentence embedding vector representing a weighted sum of second embedding vectors, see section 3.2.2. first item “Uniform Average”);
It would have been obvious to one of ordinary skill in the art at the time the invention
was filed to apply the teachings of Haque to the disclosure of Visser in view of NG. Doing so would have been predictable to one of ordinary skill in the art given the similar nature between the three disclosures, for example, speech recognition. Further, doing so would have provided the users of Visser in view of NG with the added benefit of learning long-term dependencies by modeling speech at the sentence level; furthermore, results have shown that the spoken sentence embeddings outperform phoneme and word-level baselines on speech recognition and emotion recognition tasks, i.e., they capture more information and are able to compress it statistically, see abstract. 
	Visser in view of NG and further in view of Haque fails to disclose:
	wherein the speaker identification neural network comprises an attention layer to adjust the initial weights of the second neural network such that a relatively high weight is assigned to an embedding vector of a period in which a voice signal exists, and a relative low weight is assigned to an embedding vector of a period in which a noise signal exists and any voice signal does not exist, among the second embedding vectors.
	In a related field of endeavor (e.g., speaker verification, see abstract), Marc teaches that a sequence summarizing network with an attention mechanism is the auxiliary network described above provided with an attention mechanism. In the above auxiliary network, when the auxiliary information λ is obtained, the frame-wise vectors extracted from each time frame are integrated with equal weights, but the weights can be adjusted by using the attention mechanism. For example, the attention mechanism is learned so that the weight of the frame-wise vector extracted from the time frame containing a lot of noise is small and the weight of the frame-wise vector extracted from the time frame with less noise is large. It is possible to appropriately obtain auxiliary information representing the characteristics of the voice signal of the target speaker. The operation of the sequence summarizing network with the attention mechanism will be described in detail in the description of the modification of the first embodiment, i.e., the speaker identification neural network uses an attention mechanism, i.e., a layer, to adjust the initial weights of the second neural network such that a relatively large weight is applied to voice and a low weight is applied to noise; if the noise is much larger than the voice signal, the voice signal might not exist, and this is according to the second neural network, see para. 3 on pg. 3 and para. 3 of pg. 5. 
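For illustration only, the contrast described above between equal-weight integration and attention-weighted integration of frame-wise vectors can be sketched as follows. This is a minimal sketch, not Marc's implementation; in a learned attention mechanism the per-frame scores would be produced by trained parameters, whereas here they are hard-coded, hypothetical values.

```python
import math


def softmax(scores):
    """Normalize raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(frame_vectors, scores):
    """Weighted sum of frame-wise vectors using softmax attention weights."""
    weights = softmax(scores)
    dim = len(frame_vectors[0])
    pooled = [
        sum(w * v[i] for w, v in zip(weights, frame_vectors)) for i in range(dim)
    ]
    return pooled, weights


# Hypothetical frame-wise vectors; the middle frame is assumed to be noisy.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
scores = [2.0, -2.0, 2.0]  # assumed scores: low score for the noisy frame

pooled, weights = attention_pool(frames, scores)
equal_weights = [1.0 / len(frames)] * len(frames)  # the equal-weight baseline
```

With these assumed scores, the noisy frame receives a weight well below the 1/3 it would receive under equal weighting, and the two less noisy frames receive correspondingly larger weights.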
Modifying Visser to include the features of NG, further in view of Haque and further in view of Marc, discloses:
wherein the speaker identification neural network comprises an attention layer to adjust the initial weights of the second neural network such that a relatively high weight is assigned to an embedding vector of a period in which a voice signal exists, and a relative low weight is assigned to an embedding vector of a period in which a noise signal exists and any voice signal does not exist, among the second embedding vectors (e.g. Visser’s speaker verification neural network where the second neural network initially used i-vector embeddings for speaker identification, see para. 484, modified by NG and Haque for the embedding vectors and teacher-student neural network setup, now also including the feature wherein the speaker identification neural network comprises an attention layer to adjust the initial weights of the second neural network such that a relatively high weight is assigned to an embedding vector of a period in which a voice signal exists, and a relative low weight is assigned to an embedding vector of a period in which a noise signal exists and any voice signal does not exist, among the second embedding vectors as taught by Marc, see para. 3 on pg. 3 and para. 3 of pg. 5).
It would have been obvious to one of ordinary skill in the art at the time the invention
was filed to apply the teachings of Marc to the disclosure of Visser in view of NG and further in view of Haque. Doing so would have been predictable to one of ordinary skill in the art given the similar nature between the four disclosures, for example, speech processing. Further, doing so would have provided the users of Visser in view of NG further in view of Haque with added benefits, as the increase in the number of parameters with the increase in the number of units is suppressed, so that the memory capacity for storing the trained model can be reduced according to the first embodiment; furthermore, not only the memory capacity but also the consumption of other computer resources such as processor time and disk IO can be reduced, see paras. 1-2 on pg. 5.

Regarding claim 3, the combination of Visser in view of NG further in view of Haque and Marc teaches the method of claim 1. Additionally, NG discloses, with the same motivation as set forth above for claim 1:
wherein the generating of the speaker identification neural network comprises training
the second neural network such that a distance between the first embedding vector and the second embedding vector derived from the first voice signal in the first environment and the second voice signal in the second environment that mutually correspond is minimized (Para. 56, The student network may be trained subject to a minimum mean square error between predicted targets generated by the student network and the embedding vectors output by the teacher network(s), e.g. H.sub.xv(x.sub.p) or H.sub.rv(x.sub.p). i.e. distance is calculated and minimized while the neural networks have their corresponding environments as discussed in independent claim 1 in relation to their SNR magnitude relative to an arbitrary threshold that defines their corresponding environment, see further in para. 54).	
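For illustration only, training a student so that the mean square error between its embedding and a fixed teacher embedding is minimized can be sketched as follows. This is a deliberately simplified sketch, not NG's training procedure: the student embedding is treated as directly trainable values rather than the output of a network, and the learning rate, step count, and vectors are hypothetical.

```python
def mse(a, b):
    """Mean square error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def distill(teacher_embedding, student_embedding, lr=0.1, steps=50):
    """Plain gradient descent on MSE(student, teacher).

    The gradient of the MSE with respect to student component s_i is
    2 * (s_i - t_i) / n, so each step moves the student toward the teacher.
    """
    s = list(student_embedding)
    n = len(s)
    for _ in range(steps):
        s = [si - lr * 2.0 * (si - ti) / n for si, ti in zip(s, teacher_embedding)]
    return s


# Hypothetical embeddings: the teacher sees the clean signal, the student
# starts from an embedding of the corresponding noisy signal.
teacher = [0.8, -0.2]
student_before = [0.1, 0.5]
student_after = distill(teacher, student_before)

loss_before = mse(student_before, teacher)
loss_after = mse(student_after, teacher)
```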
	Visser does not disclose:
wherein the generating of the speaker identification neural network comprises training
the second neural network such that a distance between the first sentence embedding vector and the second sentence embedding vector derived from the first voice signal in the first environment and the second voice signal in the second environment that mutually correspond is minimized;
In a related field of endeavor (knowledge distillation in recognition of speech, see para.
12 on pg. 6: “Referring to FIG. 5, there is shown an example of a method 500 for performing speaker recognition”), NG discloses that the sentence embedding vectors may be generated by the second ANN, e.g. during training of the second ANN to perform speaker recognition. The sentence embedding vectors as per para. 56 may be generated by an intermediate layer of the second ANN. In other words, the sentence embedding vectors may be generated by a layer other than the final layer of the second ANN. In some examples, the embedding vectors have a higher dimensionality than the output of the final layer of the second ANN. The sentence embedding vectors generated by the second ANN may be used as an input for training the first ANN, in order to train the first ANN to emulate the results of the second ANN, as will be described in more detail below, see para. 46, i.e., there is a first and a second sentence embedding vector, one from each neural network, generated intermediate to the output layer, i.e., immediately before the output layer and output from the last hidden layer.
Modifying Visser to include the features disclosed by NG discloses:
wherein the generating of the speaker identification neural network comprises training
the second neural network such that a distance between the first sentence embedding vector and the second sentence embedding vector derived from the first voice signal in the first environment and the second voice signal in the second environment that mutually correspond is minimized (e.g. Visser’s speaker identification method wherein the generating of the speaker identification neural network comprises training the second neural network such that a distance between the first embedding vector and the second embedding vector derived from the first voice signal in the first environment and the second voice signal in the second environment that mutually correspond is minimized, now also including the feature wherein the first and second embedding vectors are sentence embedding vectors as taught by NG, see paras. 46 and 56);
It would have been obvious to one of ordinary skill in the art at the time the invention
was filed to apply the teachings of NG to the disclosure of Visser. Doing so would have been
predictable to one of ordinary skill in the art given the similar nature between the two
disclosures, for example, speaker verification using multiple neural networks. Further, doing so
would have provided the users of Visser with added benefits, in that having sentence-level embeddings that correspond to the sentence frames of the first voice signal and the second voice signal, which may be of a short duration (10 seconds, for example), may reduce a latency associated with speaker recognition, and may consequently facilitate more natural user interactions; it may further enable speaker enrolment and/or recognition to be performed without the speaker having to recite lengthy statements or dialogues, see para. 29.

Regarding claim 4, the combination of Visser in view of NG further in view of Haque and Marc teaches the method of claim 3. Additionally, NG discloses, with the same motivation as set forth above for claim 3:
wherein the distance between the first embedding vector and the second embedding vector is calculated based on at least one of a mean square error (MSE), a cosine similarity, and a Kullback-Leibler divergence (Para. 56, The student network may be trained subject to a minimum mean square error between predicted targets generated by the student network and the embedding vectors output by the teacher network(s), e.g. H.sub.xv(x.sub.p) or H.sub.rv(x.sub.p). i.e. distance is calculated and minimized). 
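For illustration only, the three measures named in the limitation above can be sketched as follows. This is a minimal sketch using the standard textbook definitions, not code from any cited reference; note that the Kullback-Leibler divergence is defined between probability distributions, so the sketch assumes its inputs are non-negative, sum to 1, and that the second distribution is non-zero wherever the first is.

```python
import math


def mse(a, b):
    """Mean square error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical values used only to exercise each measure.
mse_value = mse([1.0, 2.0], [1.0, 4.0])
cos_orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
cos_parallel = cosine_similarity([1.0, 2.0], [2.0, 4.0])
kl_same = kl_divergence([0.5, 0.5], [0.5, 0.5])
kl_diff = kl_divergence([0.5, 0.5], [0.9, 0.1])
```

MSE and KL divergence decrease as the two inputs become more alike, whereas cosine similarity increases, so a training objective built on cosine similarity would typically minimize one minus the similarity.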
Visser does not disclose:
wherein the distance between the first sentence embedding vector and the second sentence embedding vector is calculated based on at least one of a mean square error (MSE), a cosine similarity, and a Kullback-Leibler divergence.
In a related field of endeavor (knowledge distillation in recognition of speech, see para.
12 on pg. 6: “Referring to FIG. 5, there is shown an example of a method 500 for performing speaker recognition”), NG discloses that the sentence embedding vectors may be generated by the second ANN, e.g. during training of the second ANN to perform speaker recognition. The sentence embedding vectors as per para. 56 may be generated by an intermediate layer of the second ANN. In other words, the sentence embedding vectors may be generated by a layer other than the final layer of the second ANN. In some examples, the embedding vectors have a higher dimensionality than the output of the final layer of the second ANN. The sentence embedding vectors generated by the second ANN may be used as an input for training the first ANN, in order to train the first ANN to emulate the results of the second ANN, as will be described in more detail below, see para. 46, i.e., there is a first and a second sentence embedding vector, one from each neural network, generated intermediate to the output layer, i.e., immediately before the output layer and output from the last hidden layer.
Modifying Visser to include the features disclosed by NG discloses:
wherein the distance between the first sentence embedding vector and the second sentence embedding vector is calculated based on at least one of a mean square error (MSE), a cosine similarity, and a Kullback-Leibler divergence (Para. 56, The student network may be trained subject to a minimum mean square error between predicted targets generated by the student network and the embedding vectors output by the teacher network(s), e.g. H.sub.xv(x.sub.p) or H.sub.rv(x.sub.p). i.e. distance is calculated and minimized, now also including the feature wherein the first and second embedding vectors are sentence embedding vectors as taught by NG, see paras. 46 and 56);
It would have been obvious to one of ordinary skill in the art at the time the invention
was filed to apply the teachings of NG to the disclosure of Visser. Doing so would have been
predictable to one of ordinary skill in the art given the similar nature between the two
disclosures, for example, speaker verification using multiple neural networks. Further, doing so
would have provided the users of Visser with added benefits, in that having sentence-level embeddings that correspond to the sentence frames of the first voice signal and the second voice signal, which may be of a short duration (10 seconds, for example), may reduce a latency associated with speaker recognition, and may consequently facilitate more natural user interactions; it may further enable speaker enrolment and/or recognition to be performed without the speaker having to recite lengthy statements or dialogues, see para. 29.

Regarding claim 8, the claim is a system claim corresponding to the method claim
presented in claim 1 and is rejected on the same grounds stated above regarding claim 1.

Regarding claim 10, the claim is a system claim corresponding to the method claim
presented in claim 3 and is rejected on the same grounds stated above regarding claim 3.

Regarding claim 11, the claim is a system claim corresponding to the method claim
presented in claim 4 and is rejected on the same grounds stated above regarding claim 4.

Regarding claim 15, the claim is a system claim, with additional limitations, similar to
the method claim presented in claim 1 and is rejected on the same grounds stated above
regarding claim 1.
Visser teaches an apparatus for identifying a speaker with respect to an input voice
signal by utilizing a speaker identification neural network (Para. 502, device 4200 to perform
methods listed involving DNN for speaker verification), the apparatus comprising:
a memory storing at least one program (Para. 502, Where the memory 4224 contains
instructions 4260 for the device on figure 42); and
a processor (Para. 503, One or more components of the systems 100, 200, 2000, 2100, 2500, 2800, 2900, 3200, 3300, and/or 3400 may be implemented via dedicated hardware (e.g., circuitry), by a processor executing instructions to perform one or more tasks, or a combination thereof) configured to:
extract a feature from the input voice signal (Para. 507, Visser teaches the validation
module 202 may receive the audio command signal 132 of FIG. 1. The validation module 202
may extract features of the audio command signal 132 where “one or more devices configured to determine whether the input audio signal satisfies the speaker verification validation criterion (e.g., a processor executing instructions at a non-transitory computer readable storage medium), or any combination thereof”),
input the feature to the speaker identification neural network (the features in element 3316 are
inputted into the user’s model 3292 through the enrollment module, which is part of the
processor(s) 4210 as seen on figure 42),
wherein the first neural network is trained to identify a first speaker with respect to a first voice signal in a first environment in which a first signal-to-noise (SNR) value of the first voice signal is greater than or equal to a threshold value (Para. 400, speaker verification to access the application, i.e. identify a number of speakers, in this case a first speaker; Para. 406, a first speaker model 3292 is generated and taught to have a first signal-to-noise ratio; Para. 462, the speech validator 3206 may determine characteristics based on a model such as a GMM, DNN, or other model; the threshold value for the neural networks is interpreted as arbitrary, and the voice signals can be characterized differently when compared in magnitude to a specified threshold value, see para. 123, where the SNR exceeds or does not exceed a threshold and speaker models are chosen according to their training),
and wherein the second neural network is trained to identify a second speaker with respect to a second voice signal in a second environment in which a second SNR value of the second voice signal is less than the threshold value (Para. 400, speaker verification to access the application, i.e. identify a number of speakers, in this case a second speaker; Para. 406, a second speaker model 3294 is generated, where it has a second signal-to-noise ratio; Para. 462, the speech validator 3206 may determine characteristics based on a model such as a GMM, DNN, or other model; the threshold value for the neural networks is interpreted as arbitrary, and the voice signals can be characterized differently when compared in magnitude to a specified threshold value, see para. 123, where the SNR exceeds or does not exceed a threshold and speaker models are chosen according to their training),
extract the second embedding vector corresponding to the feature from the speaker identification neural network (Para. 469, Speaker verification method of selecting a speaker model trained in conditions corresponding to the predicted SNR, noise, or both where the identification used is an embedding vector i.e. embedding vector selected); and 
identify the speaker with respect to the input voice signal based on the second embedding vector (Para. 479, indicating whether the input audio signal satisfies a verification criterion with reference to figs 1 and 32 where, para. 127, indicates the speaker model 176 may be a model of speech associated with an authorized (e.g., enrolled) user and identification of such models is done through embedding vectors, in this case with the second).
	However, Visser fails to explicitly disclose:
that is generated by training a second neural network according to a teacher-student training model in which a first neural network is set to a teacher neural network and the second neural network is set to a student neural network
based on a first sentence embedding vector representing a weighted sum of first embedding vectors that are output from a last hidden layer of the first neural network that is provided immediately before an output layer of the first neural network, and that correspond to sentence frames of the first voice signal,
based on a second  sentence embedding vector indicating a weighted sum of second embedding vectors that are output from a last hidden layer of the second neural network that is provided immediately before an output layer of the second neural network, and that correspond to sentence frames of the second voice signal; 
extract the second sentence embedding vector corresponding to the feature from the speaker identification neural network;
identify the speaker with respect to the input voice signal based on the second
sentence embedding vector;
wherein the speaker identification neural network comprises an attention layer to adjust the initial weights of the second neural network such that a relatively high weight is assigned to an embedding vector of a period in which a voice signal exists, and a relative low weight is assigned to an embedding vector of a period in which a noise signal exists and any voice signal does not exist, among the second embedding vectors.
In a related field of endeavor (Knowledge distillation in recognition of speech, see para.
12 on pg. 6: referring to FIG. 5, there is shown an example of a method 500 for performing speaker recognition. The method 500 may be performed by one or more components of a speaker recognition system such as the speaker recognition system 100 described above. For example, the method 500 may be performed by the controller 120. In an example, at least one portion of the method 500 is implemented by executable code, stored on a non-transitory storage medium, that includes instructions, that when executed by at least one processor, causes the at least one processor to perform the at least one portion of the method 500 described herein; the same applies to apparatus 600, see para. 60), NG discloses that the sentence embedding vectors may be generated by the second ANN, e.g. during training of the second ANN to perform speaker recognition. The sentence embedding vectors as per para. 56 may be generated by an intermediate layer of the second ANN. In other words, the sentence embedding vectors may be generated by a layer other than the final layer of the second ANN. In some examples, the embedding vectors have a higher dimensionality than the output of the final layer of the second ANN. The sentence embedding vectors generated by the second ANN may be used as an input for training the first ANN, in order to train the first ANN to emulate the results of the second ANN, as will be described in more detail below, see para. 46; i.e. each of the two neural networks produces a sentence embedding vector (a first sentence embedding vector and a second sentence embedding vector, respectively) at a layer intermediate to the output layer, that is, output from the last hidden layer immediately before the output layer. Furthermore, various measures (for example systems, methods, computer programs and computer-readable media) are provided in which a first neural network is trained to be used in speaker recognition.
An embedding vector is extracted from the first neural network, the sentence embedding vector being an intermediate output of the first neural network, see para. 107. NG discloses that teacher-student neural network training may be used in a speaker recognition system. The second ANN may be a relatively large-scale ANN compared to the first ANN. The first ANN may be a relatively small-scale ANN compared to the second ANN. In some examples, the first ANN is 60-95% smaller than the second ANN. In some examples, the first ANN is obtained by compressing the second ANN using a knowledge distillation technique. The first ANN may be considered to be a “compressed” version of the second ANN; i.e. “first” and “second” are used only to distinguish the networks, and in this case the student neural network model, referred to as the first model, is trained based on the teacher-student training model in which the teacher neural network model is referred to as the second model, the student neural network model being the model trained after the teacher model, see para. 37. These sentence embeddings from the second and first neural networks as per para. 56 correspond to sentence frames of the first and second voice signals as per the input, see para. 29. Finally, the data reflect read and spontaneous speech from a large number of speakers with various acoustic channel conditions. The training data may be representative of real speech and may be sufficiently diverse to prevent overtraining and/or overfitting. In some cases, the data set used for training is modified to simulate babble noise, music noise, additive noise and/or reverberation. Additionally or alternatively, the training data may contain real noise, reverberation, intra-speaker variability and/or compression artefacts. The training data set may comprise out-of-domain data and/or in-domain data, i.e. the SNR is representative of the various conditions given, see para. 54.
Visser, as modified to include the features of NG, further discloses:
input the feature to the speaker identification neural network that is generated by training the second neural network based on a teacher-student training model in which the first neural network is set to a teacher neural network and the second neural network is set to a student neural network (e.g. Visser’s speaker verification neural network method of two models, now including the feature where the second model is trained on a teacher-student training model where the first neural network is the teacher and the second neural network is the student, see para. 37, as taught by NG).
wherein the first neural network is configured to identify the first speaker based on a first sentence embedding vector that is output from a last hidden layer of the first
neural network that is provided immediately before an output layer of the first neural network, and that corresponds to sentence frames of the first voice signal (e.g. Visser’s speaker verification neural network where the first neural network initially used i-vector embeddings for speaker identification, see para. 484, now wherein the first neural network is configured to identify the first speaker based on a first sentence embedding vector that is output from a last hidden layer of the first neural network and intermediate to the output layer, i.e. immediately before it, see paras. 29, 46, 56, and 107, as taught by NG);
wherein the second neural network is configured to identify the second speaker based on a second sentence embedding vector that is output from a last hidden layer of the
second neural network that is provided immediately before an output layer of the second neural network, and that corresponds to sentence frames of the second voice signal (e.g. Visser’s speaker verification neural network where the second neural network initially used i-vector embeddings for speaker identification, see para. 484, now wherein the second neural network is configured to identify the second speaker based on a second sentence embedding vector that is output from a last hidden layer of the second neural network and intermediate to the output layer, i.e. immediately before it, see paras. 29, 46, 56, and 107, as taught by NG);
extract the second sentence embedding vector corresponding to the feature from the speaker identification neural network (Para. 469, Speaker verification method of selecting a speaker model trained in conditions corresponding to the predicted SNR, noise, or both where the identification used is an embedding vector i.e. embedding vector selected, now also including the feature wherein the second embedding vector is a second sentence embedding vector as taught by NG, see paras. 46 and 56);
identify the speaker with respect to the input voice signal based on the second
sentence embedding vector (Para. 479, indicating whether the input audio signal satisfies a verification criterion with reference to figs 1 and 32 where, para. 127, indicates the speaker model 176 may be a model of speech associated with an authorized (e.g., enrolled) user and identification of such models is done through embedding vectors, in this case with the second, now also including the feature wherein the second embedding vector is a second sentence embedding vector as taught by NG, see paras. 46 and 56);
It would have been obvious to one of ordinary skill in the art before the effective filing
date of the claimed invention to apply the teachings of NG to the disclosure of Visser. Doing so
would have been predictable to one of ordinary skill in the art given the similar nature of the two
disclosures, for example, speaker verification using multiple neural networks. Further, doing so
would have provided the users of Visser with added benefits, as recent research has proposed the use of artificial neural networks (ANNs) to perform speaker recognition, and in some speaker recognition scenarios trained ANNs have been shown to offer similar or improved accuracy relative to I-vector systems, see para. 7. Furthermore, embedding vectors extracted from an intermediate layer of the second ANN may be a more reliable training target for the first ANN than an output from the final layer of the second ANN. As such, using embedding vectors as training targets for the first ANN may result in the first ANN having a greater speaker recognition accuracy than using other data as training targets, as taught by NG, see para. 93. Furthermore, compressing the first speaker recognition model using a knowledge distillation compression technique enables speaker recognition to be performed with a reduced requirement of processing, storage, latency and/or power whilst maintaining a sufficient level of accuracy, as taught by NG, see para. 104; moreover, a teacher-student-trained artificial neural network may perform text-independent speaker recognition more accurately and/or with a smaller footprint in terms of power, storage, latency and/or processing requirements, compared to other speaker recognition systems, as taught by NG, see para. 109. Furthermore, sentence-level embeddings that correspond to the sentence frames of the first voice signal and the second voice signal, which may be of short duration (10 seconds, for example), may reduce a latency associated with speaker recognition and may consequently facilitate more natural user interactions, and may further enable speaker enrolment and/or recognition to be performed without the speaker having to recite lengthy statements or dialogues, see para. 29.
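
For illustration only, the teacher-student arrangement discussed above, in which a student network is trained to reproduce the embedding vectors of an already trained teacher network under a mean square error criterion, can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation of NG or of the claimed invention: the student is reduced to a single linear layer, the teacher is represented only by one fixed embedding vector, and all names and values (`features`, `teacher_embedding`, `student_weights`, the learning rate) are hypothetical.

```python
def student_forward(weights, x):
    """Student embedding from one linear layer; `weights` is a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

def mse(a, b):
    """Mean square error between two equal-length vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)

def train_step(weights, x, target, lr):
    """One gradient-descent step on the MSE between student output and teacher embedding."""
    pred = student_forward(weights, x)
    n = len(pred)
    # d(MSE)/d(w[i][j]) = (2 / n) * (pred[i] - target[i]) * x[j]
    return [
        [w - lr * (2.0 / n) * (pred[i] - target[i]) * x[j] for j, w in enumerate(row)]
        for i, row in enumerate(weights)
    ]

# Hypothetical input feature and the embedding a frozen teacher produced for it.
features = [1.0, 0.5]
teacher_embedding = [0.8, -0.2]
student_weights = [[0.0, 0.0], [0.0, 0.0]]

initial_loss = mse(student_forward(student_weights, features), teacher_embedding)
for _ in range(200):
    student_weights = train_step(student_weights, features, teacher_embedding, lr=0.1)
final_loss = mse(student_forward(student_weights, features), teacher_embedding)

print(initial_loss, final_loss)
```

Only the student's weights are updated; the teacher embedding is held fixed and serves purely as the training target.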
	Visser in view of NG does not disclose:
wherein the first neural network is configured to identify the first speaker based on a first sentence embedding vector representing a weighted sum of first embedding vectors that are output from a last hidden layer of the first neural network that is provided immediately before an output layer of the first neural network, and that correspond to sentence frames of the first voice signal;
wherein the second neural network is configured to identify the second speaker  based on a second sentence embedding vector indicating a weighted sum of second embedding vectors that are output from a last hidden layer of the second neural network that is provided immediately before an output layer of the second neural network, and that correspond to sentence frames of the second voice signal;
In a related field of endeavor (e.g. audio-linguistic embeddings, see abstract), Haque teaches a uniform average in which intermediate embeddings are converted to a sentence embedding by computing an element-wise sum of the intermediate embeddings at each word or phoneme position and dividing by the number of words in the sentence, i.e. a weighted sum in which every embedding receives the same weight, see section 3.2.2, first item “Uniform Average”; therefore, the first sentence embedding vector is a weighted sum of the first embedding vectors and the second sentence embedding vector is a weighted sum of the second embedding vectors.
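
For illustration only, the reading of a uniform average as a weighted sum can be sketched as follows. This is a minimal sketch, not code from Haque: the function names and the example `frames` are hypothetical, and the point shown is simply that dividing an element-wise sum by the number of vectors N is the same as a weighted sum with every weight equal to 1/N.

```python
def weighted_sum(vectors, weights):
    """Element-wise weighted sum of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]

def uniform_average(vectors):
    """Uniform average expressed as a weighted sum with weights 1/N."""
    n = len(vectors)
    return weighted_sum(vectors, [1.0 / n] * n)

# Hypothetical per-position embedding vectors of one sentence.
frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
sentence_embedding = uniform_average(frames)
print(sentence_embedding)
```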
Visser, as modified to include the features of NG and further in view of Haque, discloses:
wherein the first neural network is configured to identify the first speaker based on a first sentence embedding vector representing a weighted sum of first embedding vectors that are output from a last hidden layer of the first neural network that is provided immediately before an output layer of the first neural network, and that correspond to sentence frames of the first voice signal (e.g. Visser’s speaker verification neural network where the first neural network initially used i-vector embeddings for speaker identification, see para. 484, now wherein the first neural network is configured to identify the first speaker  based on a first sentence embedding vector representing a weighted sum of first embedding vectors that are output from a last hidden layer of the first neural network and intermediate to the output layer i.e. immediately before, see para. 29, 46, 56, and 107, as taught by NG, and further in view of Haque’s teaching of the sentence embedding vector representing a weighted sum of first embedding vectors, see section 3.2.2. first item “Uniform Average”);
wherein the second neural network is configured to identify the second speaker  based on a second sentence embedding vector indicating a weighted sum of second embedding vectors that are output from a last hidden layer of the second neural network that is provided immediately before an output layer of the second neural network, and that correspond to sentence frames of the second voice signal (e.g. Visser’s speaker verification neural network where the second neural network initially used i-vector embeddings for speaker identification, see para. 484, now wherein the second neural network is configured to identify the second speaker based on a second sentence embedding vector indicating a weighted sum of second embedding vectors that are output from a last hidden layer of the second neural network and intermediate to the output layer i.e. immediately before, see paras. 29, 46, 56, and 107, as taught by NG, and further in view of Haque’s teaching of the sentence embedding vector representing a weighted sum of second embedding vectors, see section 3.2.2. first item “Uniform Average”);
It would have been obvious to one of ordinary skill in the art before the effective filing date
of the claimed invention to apply the teachings of Haque to the disclosure of Visser in view of NG. Doing so would have been predictable to one of ordinary skill in the art given the similar nature of the three disclosures, for example, speech recognition. Further, doing so would have provided the users of Visser in view of NG with the added benefit of learning long-term dependencies by modeling speech at the sentence level; furthermore, results have shown that the spoken sentence embeddings outperform phoneme and word-level baselines on speech recognition and emotion recognition tasks, i.e. they capture more information and are able to compress it statistically, see abstract.
	Visser in view of NG and further in view of Haque fails to disclose:
	wherein the speaker identification neural network comprises an attention layer to adjust the initial weights of the second neural network such that a relatively high weight is assigned to an embedding vector of a period in which a voice signal exists, and a relative low weight is assigned to an embedding vector of a period in which a noise signal exists and any voice signal does not exist, among the second embedding vectors.
	In a related field of endeavor (e.g. speaker verification, see abstract), Marc teaches that a sequence summarizing network with an attention mechanism is the auxiliary network described above with an attention mechanism. In the above auxiliary network, when the auxiliary information λ is obtained, the frame-wise vectors extracted from each time frame are integrated with equal weights, but the weights can be adjusted by using the attention mechanism. For example, the attention mechanism is learned so that the weight of the frame-wise vector extracted from a time frame containing a lot of noise is small and the weight of the frame-wise vector extracted from a time frame with less noise is large. It is possible to appropriately obtain auxiliary information representing the characteristics of the voice signal of the target speaker. The operation of the sequence summarizing network with the attention mechanism will be described in detail in the description of the modification of the first embodiment; i.e. the speaker identification neural network uses an attention mechanism, i.e. a layer, to adjust the initial weights of the second neural network such that a relatively large weight is applied to voice and a low weight is applied to noise, and where the noise is much larger than the voice signal, the voice signal might not exist; this is according to the second neural network, see para. 3 on pg. 3 and para. 3 on pg. 5.
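
For illustration only, attention-weighted integration of frame-wise vectors of the kind described above can be sketched as follows. This is a minimal sketch, not the network of Marc: in a trained system the per-frame scores would be produced by a learned attention layer, whereas here the `scores` are hypothetical hand-picked values, the softmax normalization is an assumption of this sketch, and the third frame merely stands in for a noise-only period.

```python
import math

def attention_weights(scores):
    """Softmax over per-frame scores so that the weights are positive and sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(frames, scores):
    """Weighted sum of frame-wise vectors using attention weights."""
    weights = attention_weights(scores)
    dim = len(frames[0])
    pooled = [sum(w * f[i] for w, f in zip(weights, frames)) for i in range(dim)]
    return pooled, weights

# Hypothetical frame-wise embedding vectors; the last frame stands in for noise only.
frames = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
# Hypothetical attention scores: high for voiced frames, low for the noise frame.
scores = [2.0, 2.0, -2.0]

pooled, weights = attention_pool(frames, scores)
print(weights)
print(pooled)
```

With equal scores the same function reduces to integration with equal weights, which is the behaviour attributed above to the auxiliary network without the attention mechanism.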
Visser, as modified to include the features of NG, further in view of Haque and further in view of Marc, discloses:
wherein the speaker identification neural network comprises an attention layer to adjust the initial weights of the second neural network such that a relatively high weight is assigned to an embedding vector of a period in which a voice signal exists, and a relative low weight is assigned to an embedding vector of a period in which a noise signal exists and any voice signal does not exist, among the second embedding vectors (e.g. Visser’s speaker verification neural network where the second neural network initially used i-vector embeddings for speaker identification, see para. 484, modified by NG and Haque for the embedding vectors and teacher-student neural network setup, now also including the feature wherein the speaker identification neural network comprises an attention layer to adjust the initial weights of the second neural network such that a relatively high weight is assigned to an embedding vector of a period in which a voice signal exists, and a relative low weight is assigned to an embedding vector of a period in which a noise signal exists and any voice signal does not exist, among the second embedding vectors as taught by Marc, see para. 3 on pg. 3 and para. 3 of pg. 5).
It would have been obvious to one of ordinary skill in the art before the effective filing date
of the claimed invention to apply the teachings of Marc to the disclosure of Visser in view of NG and further in view of Haque. Doing so would have been predictable to one of ordinary skill in the art given the similar nature of the four disclosures, for example, speech processing. Further, doing so would have provided the users of Visser in view of NG further in view of Haque with added benefits, as the increase in the number of parameters with the increase in the number of units is suppressed, so that the memory capacity for storing the trained model can be reduced according to the first embodiment; furthermore, not only the memory capacity but also the consumption of other computer resources such as processor time and disk IO can be reduced, see paras. 1-2 on pg. 5.

Regarding claim 16, the combination of Visser in view of NG further in view of Haque and Marc teaches the apparatus of claim 15. Additionally, Visser discloses:
wherein the feature comprises at least one of a fast Fourier transform (FFT) amplitude,
an FFT power, a log FFT amplitude, a log band energy, a mel band energy, and a mel frequency
cepstral coefficients (MFCC) of the input voice signal (Para. 419, The enrollment module 108
may receive the input audio signal 3230 from the first user 3252, as described with reference to FIG. 32. For example, the input audio signal 3230 may correspond to the enrollment phrase
audio signal 130 of FIG. 1. The enrollment module 108 may generate (or update) a speaker
model (e.g., the first speaker model 3292) based on the input audio signal 3230. For example,
the enrollment module 108 may extract first features 3316 (e.g., mel-frequency cepstrum
coefficients (MFCC)) corresponding to the input audio signal 3230).

	Regarding claim 17, the combination of Visser in view of NG further in view of Haque and Marc teaches the apparatus of claim 15. Additionally, Visser discloses wherein the processor is further configured to:
store the first embedding vector of the speaker identification neural network with respect to a voice signal of a specific speaker in the memory as a registration embedding vector (Para. 106 and figure 1, The enrollment module 108 as integrated with the processor may store an association between the user 152 corresponding to the authentication data and the speaker model 176 in the memory 122; furthermore, Speaker verifier 3220 corresponds to the processor 3203 and the speaker verification data 3280 corresponds to the memory 3222 as seen on figure 32); and
identify whether the input voice signal represents the specific speaker based on a
similarity between the second embedding vector of the input voice signal and the registration embedding vector (Para. 419 and figure 34, Utterance vector 3404 from the input voice signal to undergo enrollment testing module 108 and a first score is developed through the scoring module 3310 between test phrase signal 134 and the speaker model 3292; however, para. 407 specifies the speech validator 3206 may select a speaker model (e.g., the first speaker model 3292 or the second speaker model 3294), and the example illustrates the first embedding vector of the first neural network, but the second embedding vector of the second neural network may be used; furthermore, the enrollment module 108 and testing module 110 are processed by processor(s) 4210 and memory 4224 as seen on figure 42 and disclosed on lines 11-13 of paragraph 0429; the claim interpretation of the registration embedding vector is that if it is stored then it is registered).
	Visser does not disclose:
store the first sentence embedding vector of the speaker identification neural network with respect to a voice signal of a specific speaker in the memory as a registration embedding vector;
identify whether the input voice signal represents the specific speaker based on a
similarity between the second sentence embedding vector of the input voice signal and the registration embedding vector;
In a related field of endeavor (Knowledge distillation in recognition of speech, see para.
12 on pg. 6: referring to FIG. 5, there is shown an example of a method 500 for performing speaker recognition), NG discloses that the sentence embedding vectors may be generated by the second ANN, e.g. during training of the second ANN to perform speaker recognition. The sentence embedding vectors as per para. 56 may be generated by an intermediate layer of the second ANN. In other words, the sentence embedding vectors may be generated by a layer other than the final layer of the second ANN. In some examples, the embedding vectors have a higher dimensionality than the output of the final layer of the second ANN. The sentence embedding vectors generated by the second ANN may be used as an input for training the first ANN, in order to train the first ANN to emulate the results of the second ANN, as will be described in more detail below, see para. 46; i.e. each of the two neural networks produces a sentence embedding vector (a first sentence embedding vector and a second sentence embedding vector, respectively) at a layer intermediate to the output layer, that is, output from the last hidden layer immediately before the output layer.
Visser, as modified to include the features disclosed by NG, discloses:
store the first sentence embedding vector of the speaker identification neural network with respect to a voice signal of a specific speaker in the memory as a registration embedding vector (Para. 106 and figure 1, The enrollment module 108 as integrated with the processor may store an association between the user 152 corresponding to the authentication data and the speaker model 176 in the memory 122; furthermore, Speaker verifier 3220 corresponds to the processor 3203 and the speaker verification data 3280 corresponds to the memory 3222 as seen on figure 32, now also including the feature where the first embedding vector is a first sentence embedding vector as taught by NG, see paras. 46 and 56);
identify whether the input voice signal represents the specific speaker based on a
similarity between the second sentence embedding vector of the input voice signal and the registration embedding vector (Para. 419 and figure 34, Utterance vector 3404 from the input voice signal to undergo enrollment testing module 108 and a first score is developed through the scoring module 3310 between test phrase signal 134 and the speaker model 3292; however, para. 407 specifies the speech validator 3206 may select a speaker model (e.g., the first speaker model 3292 or the second speaker model 3294), and the example illustrates the first embedding vector of the first neural network, but the second embedding vector of the second neural network may be used; furthermore, the enrollment module 108 and testing module 110 are processed by processor(s) 4210 and memory 4224 as seen on figure 42 and disclosed on lines 11-13 of paragraph 0429; the claim interpretation of the registration embedding vector is that if it is stored then it is registered, now also including the feature where the second embedding vector is a second sentence embedding vector as taught by NG, see paras. 46 and 56);
It would have been obvious to one of ordinary skill in the art before the effective filing
date of the claimed invention to apply the teachings of NG to the disclosure of Visser. Doing so
would have been predictable to one of ordinary skill in the art given the similar nature of the two
disclosures, for example, speaker verification using multiple neural networks. Further, doing so
would have provided the users of Visser with added benefits: sentence-level embeddings that correspond to the sentence frames of the first voice signal and the second voice signal, which may be of short duration (10 seconds, for example), may reduce a latency associated with speaker recognition and may consequently facilitate more natural user interactions, and may further enable speaker enrolment and/or recognition to be performed without the speaker having to recite lengthy statements or dialogues, see para. 29.

	Regarding claim 18, the combination of Visser in view of NG further in view of Haque and Marc teaches the apparatus of claim 17. In addition, Visser discloses:
wherein the similarity between the second embedding vector with respect to the input voice signal and the registration embedding vector is calculated based on at least one of a cosine similarity and a probabilistic linear discriminant analysis (PLDA) (Para. 314, Teaches a score that may be determined by equation 7 which is a cosine similarity equation between a target i-vector, embedding vector as taught by NG, in this case from the speaker model and a test i-vector, embedding vector as taught by NG, from the input voice signal seen as the
dominant audio signal 2550 as seen through equation 7).
	Visser does not explicitly disclose:
wherein the similarity between the second sentence embedding vector with respect to the input voice signal and the registration embedding vector is calculated based on at least one of a cosine similarity and a probabilistic linear discriminant analysis (PLDA);
In a related field of endeavor (knowledge distillation in recognition of speech, see para.
12 on pg. 6), and referring to FIG. 5, which shows an example of a method 500 for performing speaker recognition, NG discloses that the sentence embedding vectors may be generated by the second ANN, e.g. during training of the second ANN to perform speaker recognition. The sentence embedding vectors, as per para. 56, may be generated by an intermediate layer of the second ANN. In other words, the sentence embedding vectors may be generated by a layer other than the final layer of the second ANN. In some examples, the embedding vectors have a higher dimensionality than the output of the final layer of the second ANN. The sentence embedding vectors generated by the second ANN may be used as an input for training the first ANN, in order to train the first ANN to emulate the results of the second ANN, see para. 46; i.e., there is a first and a second sentence embedding vector, one from each neural network (first sentence embedding vector and second sentence embedding vector respective to the neural networks), taken intermediate of the output layer, i.e. immediately before the output layer and outputted from the last hidden layer.
Modifying Visser to include the features disclosed by NG discloses:
wherein the similarity between the second sentence embedding vector with respect to the input voice signal and the registration embedding vector is calculated based on at least one of a cosine similarity and a probabilistic linear discriminant analysis (PLDA) (Para. 314, Teaches a score that may be determined by equation 7 which is a cosine similarity equation between a target i-vector, embedding vector as taught by NG, in this case from the speaker model and a test i-vector, embedding vector as taught by NG, from the input voice signal seen as the
dominant audio signal 2550 as seen through equation 7, now also including the feature wherein the second embedding vector is now a second sentence embedding vector as taught by NG, see paras. 46 and 56);
It would have been obvious to one of ordinary skill in the art at the time the invention
was filed to apply the teachings of NG to the disclosure of Visser. Doing so would have been
predictable to one of ordinary skill in the art given the similar nature between the two
disclosures, for example, speaker verification using multiple neural networks. Further, doing so
would have provided the users of Visser with added benefits: sentence-level embeddings that correspond to the sentence frames of the first voice signal and the second voice signal, which may be of short duration (10 seconds, for example), may reduce a latency associated with speaker recognition and may consequently facilitate more natural user interactions; they may further enable speaker enrolment and/or recognition to be performed without the speaker having to recite lengthy statements or dialogues, see para. 29.
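
For illustration only, the cosine-similarity scoring referenced in the rejection of claim 18 (a score between a stored registration embedding vector and an embedding vector obtained from an input voice signal) can be sketched as follows. This is a minimal sketch of the standard cosine similarity formula and is not a reproduction of equation 7 of Visser; the function names, the example vectors, and the 0.7 decision threshold are hypothetical and are not taken from any cited reference.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimensionality")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def is_same_speaker(registration_embedding, test_embedding, threshold=0.7):
    """Accept the speaker when the score meets a threshold.

    The threshold value is a hypothetical placeholder; a real system
    would tune it on held-out data.
    """
    return cosine_similarity(registration_embedding, test_embedding) >= threshold


# Hypothetical embeddings: one stored at registration, one from a new utterance.
registration = [0.20, 0.50, 0.10]
utterance = [0.25, 0.45, 0.12]
print(cosine_similarity(registration, utterance))
print(is_same_speaker(registration, utterance))
```

A PLDA-based score, the other alternative recited in the claim, would replace `cosine_similarity` with a trained probabilistic model and is not sketched here.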

Claims 5 and 12 are rejected under 35 U.S.C. 103 as being unpatentable over
Visser in view of NG further in view of Haque and Marc and further in view of Saon et al. (U.S. 2015/0161522 A1) herein after Saon.
	Regarding claim 5, the combination of Visser in view of NG further in view of Haque and Marc teaches the method of claim 1; however, while NG teaches cross entropy in the training of the neural networks, see para. 57, 
the combination of rejected claim 1 fails to explicitly disclose:
wherein the generating of the first neural network comprises generating the first neural
network by training the first neural network such that a first cross entropy between a first
training target of the first neural network and a third output of the first neural network is
minimized, and wherein the generating of the speaker identification neural network comprises
training the second neural network such that a second cross entropy between a second training
target of the second neural network and a fourth output of the second neural network is
minimized.
In a related field of endeavor (speech recognition using multiple neural networks, having
computer readable program instructions thereon for causing a processor to carry out aspects of
the present invention, see para. 70), Saon discloses an error function E, which measures the
classification or regression error of the hybrid model where E can be a mean squared error,
cross-entropy, or another suitable function that measures the discrepancy between the output
of the network and the target, see para. 29. Furthermore, Saon states that the “entire hybrid
network was trained by minimizing a cross-entropy objective function”, see para. 38; i.e., each
neural network has its respective training target and output, and minimizing the
cross-entropy serves as the objective function.
Modifying Visser in view of NG and further in view of Haque and Marc to include the features of Saon further discloses:
wherein the generating of the first neural network comprises generating the first neural
network by training the first neural network such that a first cross entropy between a first
training target of the first neural network and a third output of the first neural network is
minimized, and wherein the generating of the speaker identification neural network comprises
training the second neural network such that a second cross entropy between a second training
target of the second neural network and a fourth output of the second neural network is
minimized (e.g. Visser’s speaker verification neural network, trained as a student-teacher model as taught by NG and further in view of Haque and Marc, now including the features where the neural networks are trained such that a first cross entropy between the first training target of the teacher model and a third output of the teacher neural network is minimized and a second cross entropy between a second training target of the student model and a fourth output of the student neural network is minimized, see paras. 29 and 38, as taught by Saon).
It would have been obvious to one of ordinary skill in the art at the time the invention
was filed to apply the teachings of Saon to the disclosure of Visser in view of NG. Doing so would
have been predictable to one of ordinary skill in the art given the similar nature between the
two disclosures, for example, speech recognition using multiple neural networks. Further, doing
so would have provided the users of Visser in view of NG with the benefit of an indication of how
to update the weights of the models in accordance with a suitable function that measures the
discrepancy between the output of the network and the target, see para. 29, as taught by Saon.
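
For illustration only, the cross entropy between a training target and a network output, as discussed in the rejection of claim 5, can be sketched as follows. This is a minimal sketch of the standard cross-entropy formula and is not code from Saon or NG; the function name, the `eps` clamp, and the example distributions are hypothetical. Each network is given its own target and output, and the value that training would seek to minimize is computed separately for each.

```python
import math


def cross_entropy(target, output, eps=1e-12):
    """Cross entropy H(target, output) = -sum_i target_i * log(output_i).

    `target` and `output` are probability distributions over the same
    classes. `eps` (an assumption of this sketch) avoids log(0).
    """
    if len(target) != len(output):
        raise ValueError("target and output must have the same length")
    return -sum(t * math.log(max(o, eps)) for t, o in zip(target, output))


# Hypothetical first network (teacher): its own training target and output.
first_target = [0.0, 1.0, 0.0]
first_output = [0.05, 0.90, 0.05]
first_ce = cross_entropy(first_target, first_output)

# Hypothetical second network (student): its own training target and output.
second_target = [0.0, 1.0, 0.0]
second_output = [0.20, 0.60, 0.20]
second_ce = cross_entropy(second_target, second_output)

# An output closer to its target yields a smaller cross entropy,
# which is what minimizing the objective drives toward.
print(first_ce, second_ce)
```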

Claim 12 is directed to a system claim corresponding to the method claim
presented in claim 5 and is rejected under the same grounds stated above regarding claim 5.


Claims 2, 6-7, and 13-14 are rejected under 35 U.S.C. 103 as being unpatentable
over Visser in view of NG further in view of Haque and Marc and further in view of Li et al. (US Pub. No. US 2019/0051290 A1), hereinafter Li.
Regarding claim 2, the combination of Visser in view of NG and further in view of Haque and Marc teaches the method of claim 1. Additionally, NG discloses, using the same motivation given above for claim 1:
wherein the first neural network comprises a first embedding layer and the second neural network comprises a second embedding layer (Para. 46, the embedding vectors may be generated by the second ANN, e.g. during training of the second ANN to perform speaker recognition. The embedding vectors may be generated by an intermediate layer of the second ANN. In other words, the embedding vectors may be generated by a layer other than the final layer of the second ANN. In some examples, the embedding vectors have a higher dimensionality than the output of the final layer of the second ANN. The embedding vectors generated by the second ANN may be used as an input for training the first ANN, in order to train the first ANN to emulate the results of the second ANN, as will be described in more detail below, see para. 46 i.e. there is a first and second embedding vector from both neural networks (first embedding vector and second embedding vector respective to the neural networks) intermediate of the output layer i.e. immediately before and outputted from the last hidden layer i.e. first and second neural networks have a first and second embedding layer respectively where the embedding vector is resulted), 
However, Visser in view of NG and further in view of Haque and Marc fails to explicitly disclose:
and wherein a number of channels included in the first embedding layer of the first neural network is equal to a number of channels included in the second embedding layer of the second neural network to allow the second neural network to be trained according to the teacher-student training model.
In a related field of endeavor (speech recognition via teacher-student training and using
multiple neural networks, a processor and a memory storage device including instructions that
when executed by the processor enables the system to perform teacher-student training, see
para. 76-77), Li discloses that the initial student model 160 is a clone of the teacher model 150,
see para. 42, i.e. layout and weights included of input, hidden layers, and output layer, and
where the Neural Network comprises a series of neurons, such as Long Short-Term Memory
nodes, arranged in a network, see para. 35. The embedding layer is itself a hidden layer of the neural network and takes its name from the embedding vector that it produces. Therefore, because the student model is a clone of the teacher model, and the channels are interpreted as the series of neurons that form the connecting links/channels, the number of channels in the corresponding layers is equal; figure 2 depicts a flowchart showing general stages involved in an example method 200 for student/teacher training for speech recognition.
	Modifying Visser in view of NG and further in view of Haque and Marc to include the features of Li further discloses:
wherein the first neural network comprises a first embedding layer and the second neural network comprises a second embedding layer, and wherein a number of channels included in the first embedding layer of the first neural network is equal to a number of channels included in the second embedding layer of the second neural network to allow the second neural network to be trained according to the teacher-student training model (e.g. Visser’s speaker verification neural network where NG discloses the use of embedding layers in both neural networks as taught in para. 46 and further in view of Haque and Marc teachings, now also including the feature where the numbers of channels included in the first embedding layer of the first neural network is equal to a number of channels included in the second embedding layer of the second neural network to allow the second neural network to be trained according to the teacher-student training model as taught by Li, see figure 2 para. 42).
It would have been obvious to one of ordinary skill in the art at the time the invention
was filed to apply the teachings of Li to the disclosure of Visser in view of NG. Doing so would have
been predictable to one of ordinary skill in the art given the similar nature between the two
disclosures, for example, speech recognition via teacher-student training and using multiple
neural networks. Further, doing so would have provided the users of Visser in view of NG with
the benefit of more accurately recognizing speech in the domain for which it was adapted by
minimizing the divergence score as recognized by Li, see para. 42.

Regarding claim 6, the combination of Visser in view of NG and further in view of Haque and Marc teaches the method of claim 1; however, while both Visser and NG disclose the use of multiple neural networks,
the combination of claim 1 fails to explicitly disclose:
wherein the generating of the second neural network comprises generating the second
neural network to have the same number of layers and nodes as the number of layers and
nodes of the first neural network.
In a related field of endeavor (speech recognition via teacher-student training and using
multiple neural networks, a processor and a memory storage device including instructions that
when executed by the processor enables the system to perform teacher-student training, see
para. 76-77), Li discloses that the initial student model 160 is a clone of the teacher model 150,
see para. 42, i.e. layout and weights included of input, hidden layers, and output layer, and
where the Neural Network comprises a series of neurons, such as Long Short-Term Memory
nodes, arranged in a network, see para. 35.
Modifying Visser in view of NG further in view of Haque and Marc to include the features of Li further discloses:
wherein the generating of the second neural network comprises generating the second
neural network to have the same number of layers and nodes as the number of layers and
nodes of the first neural network (e.g. Visser’s speaker verification neural network where the
first neural network is set to the teacher model and the second neural network is set to the
student model as taught by NG, now including the feature where the generating of the student model is to have the same number of layers and nodes as the teacher model, see para. 35 and 42, as taught by Li).
It would have been obvious to one of ordinary skill in the art at the time the invention
was filed to apply the teachings of Li to the disclosure of Visser in view of NG and further in view of Haque and Marc. Doing so would have been predictable to one of ordinary skill in the art given the similar nature between the two disclosures, for example, speech recognition via teacher-student training and using multiple
neural networks. Further, doing so would have provided the users of Visser in view of NG further in view of Haque and Marc with the benefit of more accurately recognizing speech in the domain for which it was adapted by minimizing the divergence score as recognized by Li, see para. 42.

Regarding claim 7, the combination of Visser in view of NG further in view of Haque and Marc teaches the method of claim 1; however, the combination fails to explicitly disclose:
wherein the second neural network is generated using weights and biases of the first neural network of which training is completed, as initial weights and initial biases of the second neural network, and is trained to adjust the initial weights and initial biases of the second neural network in a direction in which the second sentence embedding vector output from the last hidden layer of the second neural network becomes closer to the first sentence embedding vector output from the last hidden layer of the first neural network.
In a related field of endeavor (speech recognition via teacher-student training and using
multiple neural networks, a processor and a memory storage device including instructions that
when executed by the processor enables the system to perform teacher-student training, see
para. 76-77), Li discloses that the initial student model 160 is a clone of the teacher model 150,
see para. 42, i.e. layout and weights included of input, hidden layers, and output layer, and
where the Neural Network comprises a series of neurons, such as Long Short-Term Memory
nodes, arranged in a network, see para. 35. The embedding layer is itself a hidden layer of the neural network and takes its name from the embedding vector that it produces. Therefore, because the student model is a clone of the teacher model, and the channels are interpreted as the series of neurons that form the connecting links/channels, the number of channels in the corresponding layers is equal; figure 2 depicts a flowchart showing general stages involved in an example method 200 for student/teacher training for speech recognition. Furthermore, the weights and biases are cloned, as seen in para. 42.
	Modifying Visser in view of NG and further in view of Haque and Marc to include the features of Li further discloses:
wherein the second neural network is generated using weights and biases of the first neural network of which training is completed, as initial weights and initial biases of the second neural network, and is trained to adjust the initial weights and initial biases of the second neural network (e.g. Visser’s speaker identification method now also including the feature wherein the second neural network is generated using weights and biases of the first neural network of which training is completed, as initial weights and initial biases of the second neural network, and is trained to adjust the initial weights and initial biases of the second neural network as taught by Li, see para. 76-77 and 42);
It would have been obvious to one of ordinary skill in the art at the time the invention
was filed to apply the teachings of Li to the disclosure of Visser in view of NG further in view of Haque and Marc. Doing so would have been predictable to one of ordinary skill in the art given the similar nature between the two disclosures, for example, speech recognition via teacher-student training and using multiple neural networks. Further, doing so would have provided the users of Visser in view of NG with the benefit that, during the course of method 200, those weightings and Neural Networks of the student model 160 will be modified from their initial values or layouts to more accurately recognize speech in the domain for which the student model 160 is adapted by minimizing the divergence score calculated between the posteriors generated by the teacher model 150 and the student model 160, as taught by Li, see para. 42. Because the final values of the teacher model are used for the initial training of the student model to determine convergence, in successive iterations of training the student model 160 the successive parallel batches will be fed to the teacher model 150 and the student model 160 to produce successive posteriors, which will be compared against one another until a maximum number of epochs is reached, the divergence score satisfies a convergence threshold, the divergence plateaus, or training is manually stopped; furthermore, as taught in para. 45, the student model may in some cases be more accurate than the teacher model at recognizing speech, but is judged based on the similarity of its results to the results of the teacher model.
	Visser in view of Li fails to explicitly disclose:
wherein the second neural network is generated using weights and biases of the first neural network of which training is completed, as initial weights and initial biases of the second neural network, and is trained to adjust the initial weights and initial biases of the second neural network in a direction in which the second sentence embedding vector output from the last hidden layer of the second neural network becomes closer to the first sentence embedding vector output from the last hidden layer of the first neural network.
In a related field of endeavor (knowledge distillation in recognition of speech, see para.
12 on pg. 6), and referring to FIG. 5, which shows an example of a method 500 for performing speaker recognition, NG discloses that the first ANN has been trained based on embedding vectors. The embedding vectors may be generated by the second ANN, e.g. during training of the second ANN to perform speaker recognition. The embedding vectors may be generated by an intermediate layer of the second ANN. In other words, the embedding vectors may be generated by a layer other than the final layer of the second ANN. In some examples, the embedding vectors have a higher dimensionality than the output of the final layer of the second ANN. The embedding vectors generated by the second ANN may be used as an input for training the first ANN, in order to train the first ANN to emulate the results of the second ANN. Embedding vectors extracted from an intermediate layer of the second ANN may be a more reliable training target for the first ANN than an output from the final layer of the second ANN. As such, using embedding vectors as training targets for the first ANN may result in the first ANN having a greater speaker recognition accuracy than using other data as training targets. Moreover, embedding vectors generated by an ANN may be re-used in other tasks, enabling multi-task learning and performance. For example, the generated embedding vectors may be applicable in tasks such as voice trigger, speech recognition, psychometric analysis, user profiling and emotion recognition, see para. 46.
	Modifying Visser to include the features disclosed by NG, Haque, Marc, and Li discloses:
wherein the second neural network is generated using weights and biases of the first neural network of which training is completed, as initial weights and initial biases of the second neural network, and is trained to adjust the initial weights and initial biases of the second neural network in a direction in which the second sentence embedding vector output from the last hidden layer of the second neural network becomes closer to the first sentence embedding vector output from the last hidden layer of the first neural network (e.g. Visser’s speaker identification method now also including the feature wherein the second neural network is generated using weights and biases of the first neural network of which training is completed, as initial weights and initial biases of the second neural network, and is trained to adjust the initial weights and initial biases of the second neural network as taught by Li, see para. 76-77 and 42, now also including the feature where in a direction in which the second sentence embedding vector output from the last hidden layer of the second neural network becomes closer to the first sentence embedding vector output from the last hidden layer of the first neural network, as taught by NG, see para. 46);
It would have been obvious to one of ordinary skill in the art at the time the invention
was filed to apply the teachings of NG to the disclosure of Visser. Doing so would have been
predictable to one of ordinary skill in the art given the similar nature between the two
disclosures, for example, speaker verification using multiple neural networks. Further, doing so
would have provided the users of Visser, with added benefits as recent research has proposed the use of artificial neural networks (ANNs) to perform speaker recognition. In some speaker recognition scenarios, trained ANNs have been shown to offer similar or improved accuracy relative to I-vector systems, see para. 7. Furthermore, embedding vectors extracted from an intermediate layer of the second ANN may be a more reliable training target for the first ANN than an output from the final layer of the second ANN. As such, using embedding vectors as training targets for the first ANN may result in the first ANN having a greater speaker recognition accuracy than using other data as training targets as taught by NG, see para. 93. Furthermore, Compressing the first speaker recognition model using a knowledge distillation compression technique enables speaker recognition to be performed with a reduced requirement of processing, storage, latency and/or power whilst maintaining a sufficient level of accuracy as taught by NG, see para. 104; moreover, A teacher-student-trained artificial neural network may perform text-independent speaker recognition more accurately and/or with a smaller footprint in terms of power, storage, latency and/or processing requirements, compared to other speaker recognition systems as taught by NG, see para. 109.
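
For illustration only, the limitation of claim 7 (initializing the second neural network with the trained weights and biases of the first neural network, then adjusting them so that the second embedding vector moves closer to the first) can be sketched as follows. This is a minimal sketch under stated assumptions and is not the method of Li, NG, or the application: a single linear layer stands in for the last hidden layer, squared Euclidean distance stands in for the measure of closeness, and the student is assumed to receive a perturbed version of the teacher input. All names and numeric values are hypothetical.

```python
import copy


def embed(weights, biases, x):
    """Output of a single linear layer, standing in for the last hidden layer."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, biases)]


def squared_distance(u, v):
    """Squared Euclidean distance between two embedding vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def student_step(weights, biases, x, target_embedding, lr=0.1):
    """One gradient-descent step on the squared distance to the target."""
    out = embed(weights, biases, x)
    for i, (o, t) in enumerate(zip(out, target_embedding)):
        grad = 2.0 * (o - t)  # derivative of (o - t)^2 with respect to o
        for j, xj in enumerate(x):
            weights[i][j] -= lr * grad * xj
        biases[i] -= lr * grad


# Hypothetical teacher parameters, assumed to be the result of completed training.
teacher_w = [[0.5, -0.2], [0.1, 0.3]]
teacher_b = [0.0, 0.1]

# The student starts from a copy of the teacher's final weights and biases.
student_w = copy.deepcopy(teacher_w)
student_b = list(teacher_b)

teacher_input = [1.0, 2.0]
student_input = [1.2, 1.7]  # assumption: a perturbed version of the teacher input

target = embed(teacher_w, teacher_b, teacher_input)
before = squared_distance(embed(student_w, student_b, student_input), target)
for _ in range(50):
    student_step(student_w, student_b, student_input, target)
after = squared_distance(embed(student_w, student_b, student_input), target)

# The student embedding ends closer to the teacher embedding than it started.
print(before, after)
```

Only the student parameters are updated; the teacher parameters are left untouched and serve solely to produce the target embedding.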

Claim 13 is directed to a system claim corresponding to the method claim
presented in claim 6 and is rejected under the same grounds stated above regarding claim 6.

Regarding claim 14, the combination of Visser in view of NG further in view of Haque and Marc teaches the system of claim 8; however, Visser in view of NG fails to explicitly disclose:
wherein the generating of the speaker identification neural network comprises training the second neural network by setting a final value of a first training parameter of the first neural network to an initial value of a second training parameter of the second neural network.
In a related field of endeavor (speech recognition via teacher-student training and using
multiple neural networks, a processor and a memory storage device including instructions that
when executed by the processor enables the system to perform teacher-student training, see
para. 76-77), Li discloses in para. 46, results from the teacher model 150 and student model 160 are back propagated to the student model 160 in light of the divergent results from the training as to then update the student model 160 in light of the results; furthermore, this is demonstrated in figure 2 where an already trained teacher model 150 associated with a dataset of source domain data 130 is selected. In various aspects, the teacher model 150 is selected based on a language, a dialect, an accent pattern, or the like… The source domain data 130 and the target domain data 140 are forward propagated to the teacher model 150 and the student model 160, respectively, at OPERATION 230. In some aspects, all of the target domain data 140 and associated source domain data 130 are forward propagated, while in other aspects a sub-set or batch of the target domain data 140 and associated source domain data 130 are forward propagated where Proceeding to DECISION 250, it is determined whether the behavior of the student model 160 converges with the behavior of the teacher model 150. In various aspects, the convergence is calculated as a Kullback-Leibler divergence as shown in FORMULA 1, as a modified Kullback-Leibler divergence as shown in FORMULA 2, or as another divergence score. When the divergence converges below a convergence threshold, it indicates that the student model 160 is able to recognize speech in its given domain almost as well as the teacher model 150 is able to recognize speech in its domain. When the divergence score does not satisfy the convergence threshold, it indicates that the student model 160 has not yet converged with the teacher model 150, and will require adjustment to its parameters as at OPERATION 260, the results from the teacher model 150 and the student model 160 are back propagated to the student model 160, to thereby update the parameters of the student model 160 in light of the divergent results. 
As will be appreciated, various machine learning techniques may be used to update the student model 160 in light of the results i.e. the final values of the first training parameters of the first neural network are set to an initial value of a second training parameter of the second neural network and this is done through convergence where it may also occur when a maximum number of training rounds have occurred, a divergence plateau is reached, or when a user manually terminates training early i.e. finalization as the updating of the student model is initialized in its second training parameters with finalized values for the first training parameter of the teacher model, see paras. 42-48 as description for figure 2 with emphasis on elements forward propagate parallel data to teacher and student models 230, calculate posterior weights 240, convergence 250, back propagate to update student model 260, and finalize student model 270.
Modifying Visser in view of NG to include the features of Li further discloses:
wherein the generating of the speaker identification neural network comprises training
the second neural network by setting a final value of a first training parameter of the first
neural network to an initial value of a second training parameter of the second neural network
(e.g. Visser’s speaker verification neural network, now including the feature where it comprises training the second neural network by setting a final value of a first training parameter of the first neural network to an initial value of a second training parameter of the second neural network as taught by Li, see fig. 2 elements forward propagate parallel data to teacher and student models 230, calculate posterior weights 240, convergence 250, back propagate to update student model 260, and finalize student model 270, and paras. 45-46).
It would have been obvious to one of ordinary skill in the art at the time the invention
was filed to apply the teachings of Li to the disclosure of Visser in view of NG. Doing so would have
been predictable to one of ordinary skill in the art given the similar nature between the two
disclosures, for example, speech recognition via teacher-student training and using multiple
neural networks. Further, doing so would have provided the users of Visser in view of NG with
the benefit that, during the course of method 200, those weightings and Neural Networks of the student model 160 will be modified from their initial values or layouts to more accurately recognize speech in the domain for which the student model 160 is adapted by minimizing the divergence score calculated between the posteriors generated by the teacher model 150 and the student model 160, as taught by Li, see para. 42. Because the final values of the teacher model are used for the initial training of the student model to determine convergence, in successive iterations of training the student model 160 the successive parallel batches will be fed to the teacher model 150 and the student model 160 to produce successive posteriors, which will be compared against one another until a maximum number of epochs is reached, the divergence score satisfies a convergence threshold, the divergence plateaus, or training is manually stopped; furthermore, as taught in para. 45, the student model may in some cases be more accurate than the teacher model at recognizing speech, but is judged based on the similarity of its results to the results of the teacher model.
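
For illustration only, the convergence loop described above with reference to figure 2 of Li (compare teacher and student posteriors, stop when a divergence score satisfies a threshold or a maximum number of epochs is reached, otherwise update the student) can be sketched as follows. This is a minimal sketch under stated assumptions: it uses the standard Kullback-Leibler divergence rather than FORMULA 1 or FORMULA 2 of Li, which are not reproduced in this action, and it updates the student's output logits directly in place of back-propagation through a full network. All names, the learning rate, and the threshold are hypothetical.

```python
import math


def softmax(logits):
    """Convert logits to a posterior distribution."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """Standard KL divergence KL(p || q) between two distributions."""
    return sum(pi * math.log(pi / max(qi, eps))
               for pi, qi in zip(p, q) if pi > 0.0)


def train_student(teacher_logits, student_logits,
                  threshold=1e-4, max_epochs=1000, lr=0.5):
    """Update the student until its posteriors converge with the teacher's.

    Returns the final student logits, the final divergence score, and the
    number of epochs used.
    """
    teacher_post = softmax(teacher_logits)
    logits = list(student_logits)
    for epoch in range(max_epochs):
        student_post = softmax(logits)
        divergence = kl_divergence(teacher_post, student_post)
        if divergence < threshold:  # convergence threshold satisfied
            return logits, divergence, epoch
        # Gradient of KL(p || softmax(z)) with respect to z is (q - p).
        logits = [z - lr * (q - p)
                  for z, q, p in zip(logits, student_post, teacher_post)]
    return logits, kl_divergence(teacher_post, softmax(logits)), max_epochs


final_logits, divergence, epochs = train_student([2.0, 1.0, 0.1], [0.0, 0.0, 0.0])
print(divergence, epochs)
```

The two remaining stopping conditions mentioned in the text (a divergence plateau and manual termination) are omitted from this sketch for brevity.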



Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant’s
disclosure. 
Lee et al. (US 2018/0190268 A1), hereinafter Lee, states that a speech recognizing
method and apparatus is provided. A speech recognizing method, implementing a speech recognizing model neural network for recognition of a speech, includes determining an attention weight based on an output value output by at least one layer of the speech recognizing model neural network at a previous time of the recognition of the speech, applying the determined attention weight to a speech signal corresponding to a current time of the recognition of the speech, and recognizing the speech signal to which the attention weight is applied, using the speech recognizing model neural network, see abstract. Furthermore, A weight on a signal of a predetermined frequency component may be increased, decreased, or maintained the same in the speech frame input to the speech recognizing model based on the attention weight. For example, in the neural network example, speech frame input may be provided to an input layer of the neural network after which respectively trained weights are applied to the speech frame input before or upon consideration by a next hierarchical layer of the neural network. This trained weight may thus be adjusted by the determined attention weight. An increasing of the weight by the attention weight may correspond to that signal of the frequency component being emphasized or given more consideration when the speech recognizing model estimates a recognition result of the speech frame. Conversely, the decreasing of the weight by the attention weight may correspond to that signal of the frequency component being deemphasized or given less consideration when the speech recognizing model estimates the recognition result of the speech frame. The attention weight may also apply a weight adjustment that can cause a select frequency component to not be considered when the speech recognizing model estimates the recognition result of the speech frame. 
In a further example, feature values for the different frequency components may have amplitudes represented by sizes of respective bins for the different frequency components, and respectively determined attention weight(s) may be applied to the feature values to selectively adjust the sizes of the respective bins for the different frequency components based on the applied determined attention weight, thereby implementing such maintaining or selective emphasizing of the respective frequency components. Thus, in an example, the attention weighting may perform a role of spectral masking, see para. 78.
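For illustration only, the per-frequency weighting summarized above can be sketched as an element-wise scaling of the bins of one speech frame. This sketch is not taken from Lee: the function name `apply_attention_weights`, the bin amplitudes, and the weight values are all hypothetical, and the weights are fixed constants here, whereas Lee determines them from a layer output at a previous time.

```python
def apply_attention_weights(bin_amplitudes, attention_weights):
    """Scale each frequency bin of one speech frame by its attention weight."""
    if len(bin_amplitudes) != len(attention_weights):
        raise ValueError("one attention weight is needed per frequency bin")
    return [a * w for a, w in zip(bin_amplitudes, attention_weights)]


# Hypothetical frame with four frequency bins (amplitudes are made-up values).
frame = [2.0, 4.0, 1.0, 3.0]

# Assumed weights: >1 emphasizes a bin, 1 maintains it,
# a value between 0 and 1 deemphasizes it, and 0 removes it from consideration.
weights = [1.5, 1.0, 0.5, 0.0]

masked_frame = apply_attention_weights(frame, weights)
print(masked_frame)  # [3.0, 4.0, 0.5, 0.0]
```

A weight of zero on the last bin corresponds to the case in which a select frequency component is not considered at all, which is why the weighting can act as a spectral mask.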

Applicant's amendment necessitated the new ground(s) of rejection presented in this
Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 

Any inquiry concerning this communication or earlier communications from the
examiner should be directed to JONATHAN E AMAYA HERNANDEZ whose telephone number is (571)272-2484. The examiner can normally be reached Monday - Friday 8:30 am - 4:30 pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Andrew Flanders, can be reached at 571-272-7516. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/J.E.A./             Examiner, Art Unit 2655 

/ANDREW C FLANDERS/             Supervisory Patent Examiner, Art Unit 2655