DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Arguments
Applicant’s arguments and amendments in the Amendment filed August 4, 2022 (herein “Amendment”), with respect to the objections to claims 1-2, 9-12, 14-15 and 17-20, and therefore all claims depending therefrom, have been fully considered and are persuasive. The objections to claims 1-2, 9-12, 14-15 and 17-20 and claims depending therefrom have been withdrawn.
Applicant's arguments and amendments in the Amendment regarding the rejection of claims 1 and 11, and claims depending therefrom under 35 U.S.C. 103, have been fully considered but they are only persuasive in part, as new citations to secondary reference Khoury are provided herein regarding the newly amended limitations.
First, it is noted that the amendments to claims 1 and 11, reciting “wherein the neural network as trained includes at least one of a disabled classification layer or fixed hyper-parameters,” are found to be disclosed within cited art Khoury, as set forth in detail below.
Second, Applicant’s remarks in distinguishing the cited art away from the claims focused exclusively on Cartwright. On pages 8-9 of the Amendment, Applicant argues that Cartwright does not teach “enrollment” because instead, Cartwright discloses “training a model” involving a user repeating utterances from different zones in a “training time” operation mode, and in a “run time” mode, the trained model is applied on the user. Applicant contends that neither of Cartwright’s “training time” or “run time” modes is an “enrollment.” However, at least Cartwright’s description of prompting a user to utter a wakeword in an intended zone, where the prompts are repeated (thus several prompts) for multiple zones, is a broadest reasonable interpretation of “enrollment” that is consistent with the specification (see MPEP §2111), given that Applicant provides in their originally filed specification that “During enrollment, an enrollee, such as an end-consumer of the call center system 110, provides several speech examples to the call analytics system 101. For example, the enrollee could respond to various interactive voice response (IVR) prompts of IVR software executed by a call center server 111.”
It is appreciated that the present Application features an enrollment process that uses a trained neural network to generate the enrollment vectors. In this way, Applicant’s amendments directed towards the trained neural network including at least one of a disabled classification layer or fixed hyper-parameters are helpful in distinguishing away from Cartwright, but subsequent claim limitations in the independent claims, such as “generating ... by applying the neural network,” do not refer back to the neural network as being “trained.” That is, the broadest reasonable interpretation of “generating ... by applying the neural network,” still includes an untrained neural network, as the claims do not recite “generating ... by applying the trained neural network.”
However, even if Cartwright were distinguished over with the neural network being recited as a “trained neural network” being applied for the “generating ... an enrollment vector,” and “generating a speaker vector” limitations, given that Khoury teaches for example in fig. 2B, that enrollment is conducted with a trained convolutional neural network, such further amendments may not overcome the combination of Cartwright and Khoury, or other combinations of other cited art of record. Therefore, despite this minor distinction between Cartwright and the claim limitations, the claims would not be in condition for allowance even with “trained neural network” being recited, thus necessitating this Final Action.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claims 1, 3-7, 10-11, 13-17 and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Cartwright et al., (US 2021/0035563 A1, herein “Cartwright”) in view of Khoury et al., (US 2018/0082692A1, herein “Khoury”).
Regarding claim 1, Cartwright teaches a computer-implemented method comprising (Cartwright para. 23, a system including a general purpose processor as a computer system configured to perform the method disclosed): 
training, by a computer (Cartwright para. 23, a system including a general purpose processor as a computer system configured to perform the method disclosed), a neural network comprising one or more in-network augmentation layers (Cartwright fig. 1B, para. 148, training of acoustic model including (thus in-network) an augmentation function which paras. 178-186 teach as multiple functions (layers), and paras. 87, 96 and 91 teach the model is implemented by a classifier implementing a neural network) by applying the neural network on a plurality of training audio signals (Cartwright fig. 1B, paras. 153 and 164, input training data is processed by the augmentation functions, then the model is trained from the data augmented by the augmentation functions (the model application and the augmentation functions together being an applying of the neural network));
generating (Cartwright fig. 3, paras. 66, 78, speech processing performed by the system of fig. 3, including the production (generating) of a vector of features), by the computer (Cartwright para. 23, a system including a general purpose processor as a computer system configured to perform the method disclosed), an enrollment vector for an enrollee (Cartwright para. 78, aggregate feature set is produced from a vector (enrollment vector) of features from audio input, for example in paras. 81-86 where the user (enrollee) is prompted to utter a wakeword multiple times) by applying the neural network on a plurality of enrollment audio signals of the enrollee (Cartwright fig. 3, paras. 66, 78, 81-86, classifier 207 with the model (the neural network) trained by the embodiment in fig. 1B, where the classifier (neural network) takes as input (applying) the aggregate feature set, the aggregate feature set determined from repeated utterances of a user (plurality of enrollment audio signals)); 
receiving, by the computer (Cartwright para. 23, a system including a general purpose processor as a computer system configured to perform the method disclosed), a test input audio signal of a speaker (Cartwright para. 78, microphone signals input (receiving) at a discrete time n from a user (speaker) upon which wakeword detection is made (thus the input being test input audio)); 
generating, by the computer (Cartwright para. 23, a system including a general purpose processor as a computer system configured to perform the method disclosed), a speaker vector for the speaker (Cartwright paras. 78 and 82-86, each wakeword detector produces a vector of features for every input it receives, thus also for the user utterances collected for a plurality of prompts, a kind of enrollment); and 
generating, by the computer (Cartwright para. 23, a system including a general purpose processor as a computer system configured to perform the method disclosed), a likelihood score for the speaker (Cartwright para. 80, classifier outputs signals indicative of probabilities (likelihood) that a user’s utterance is from one of a plurality of zones).
Cartwright does not explicitly teach the speaker vector is by applying the neural network on the test input audio signal or that the likelihood score is indicating a likelihood that the speaker is the enrollee based upon the speaker vector and the enrollment vector. Cartwright further does not explicitly teach wherein the neural network as trained includes at least one of a disabled classification layer or fixed hyper-parameters.
Khoury teaches the speaker vector is by applying the neural network on the test input audio signal (Khoury paras. 58-59, after the neural network is trained, it produces (applying) channel-compensated low-level features (speaker vector) from an input speech signal 212, where the features are relevant to discriminate between speakers).
Khoury further teaches the likelihood score is indicating a likelihood that the speaker is the enrollee based upon the speaker vector and the enrollment vector (Khoury para. 49, when the CNN is trained, the CNN provides channel-compensated features to the speaker recognition subsystem 20, where paras. 40, 44-45, fig. 1, teach that a user’s utterance to perform speaker identification is a “recognition speech signal” which is positively verified against an enrolled speech sample, resulting in a determination that the recognition speech signal is “genuine” or an “impostor”, where the “genuine” determination is a likelihood score indicating a likelihood that the user who uttered the recognition speech signal is the enrollee).
Khoury still further teaches wherein the neural network as trained includes at least one of a disabled classification layer or fixed hyper-parameters (Khoury para. 74, layers 650 and 660 of the neural network are used in training, but discarded (disabled) at test and enrollment times, where the fully connected layers 650 and output layer 660 are characterized as a back-end classifier (thus classification layers)).
Therefore, taking the teachings of Cartwright and Khoury together as a whole, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the speech analytics method of Cartwright with an application of the trained CNN that outputs channel-compensated features and use this compensated signal to determine the likelihood of a present user being genuine or an imposter, and to discard back-end classifier layers after training, as disclosed in Khoury at least because doing so would reduce verification/identification errors (see Khoury para. 48).
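For illustration only, the limitations discussed above regarding claim 1 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not an implementation taken from Cartwright, Khoury, or the present application: the toy weights, the names apply_network, enrollment_vector and likelihood_score, the averaging of embeddings, and the cosine-similarity scoring are all hypothetical choices made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy network: one embedding layer plus a classification layer
# that is used only while training.
W_embed = rng.standard_normal((8, 4))  # maps 8-dim input features to a 4-dim vector
W_class = rng.standard_normal((4, 3))  # classification layer over 3 training speakers


def apply_network(features, classification_enabled=False):
    """Apply the network; with the classification layer disabled, return the embedding."""
    embedding = np.tanh(features @ W_embed)
    if classification_enabled:
        return embedding @ W_class  # class logits, needed during training only
    return embedding


def enrollment_vector(enrollment_signals):
    """Average the embeddings of several enrollment signals (assumed aggregation)."""
    return np.mean([apply_network(s) for s in enrollment_signals], axis=0)


def likelihood_score(speaker_vec, enrollee_vec):
    """Cosine similarity between the two vectors (assumed scoring function)."""
    denom = np.linalg.norm(speaker_vec) * np.linalg.norm(enrollee_vec)
    return float(speaker_vec @ enrollee_vec / denom)


enroll_signals = [rng.standard_normal(8) for _ in range(3)]
test_signal = rng.standard_normal(8)

enrollee_vec = enrollment_vector(enroll_signals)
speaker_vec = apply_network(test_signal)
score = likelihood_score(speaker_vec, enrollee_vec)
```

In this sketch the classification layer exists in the trained network but is bypassed whenever a vector is generated, which is one possible reading of a "disabled classification layer."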
Regarding claims 3 and 13, Cartwright teaches wherein the one or more in-network augmentation layers include at least one of: a noise augmentation layer, a frequency augmentation layer, a duration augmentation layer, and an audio clipping layer (Cartwright paras. 178-183, examples of data augmentation types that are applied include fixed spectrum energy noise, variable spectrum semi-stationary noise, non-stationary noise, reverberation noise, simulated echo (another type of noise), microphone equalization and microphone cutoff frequency (frequency based data augmentation), and microphone level (audio clipping)).
Regarding claims 4 and 14, Cartwright does not explicitly teach the limitations of claims 4 and 14. Khoury teaches [disabling, by the computer, - claim 4 / wherein the computer is further configured to disable – claim 14] at least one of the in-network augmentation layers of the trained neural network during a deployment phase (Khoury para. 58, during test and enrollment (deployment phase), the acoustic channel simulator (in-network augmentation layer) is dormant (disabling)).
Therefore, taking the teachings of Cartwright and Khoury together as a whole, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the speech analytics method of Cartwright with an application of the trained CNN that outputs channel-compensated features and use this compensated signal to have a dormant acoustic channel simulator in a testing session as disclosed in Khoury at least because doing so would reduce computational and development costs (see Khoury para. 75).
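As an illustration of an augmentation layer that is active during training and dormant during deployment, a minimal Python sketch is given below. The class name AugmentationLayer, the enabled flag, and the use of additive Gaussian noise are assumptions made for the example; Khoury's acoustic channel simulator is not represented to operate this way.

```python
import numpy as np


class AugmentationLayer:
    """Hypothetical in-network augmentation layer that perturbs its input."""

    def __init__(self, noise_std=0.1, seed=0):
        self.noise_std = noise_std
        self.enabled = True  # active during training by default
        self.rng = np.random.default_rng(seed)

    def __call__(self, signal):
        if not self.enabled:
            # Dormant: the signal passes through unchanged.
            return signal
        return signal + self.rng.normal(0.0, self.noise_std, size=signal.shape)


layer = AugmentationLayer()
signal = np.ones(16)

augmented = layer(signal)    # training phase: augmentation applied
layer.enabled = False        # enrollment/test (deployment) phase
passthrough = layer(signal)  # augmentation skipped
```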
Regarding claims 5 and 15, Cartwright teaches [disabling, by the computer, - claim 5/ wherein the computer is further configured to disable – claim 15] a classification layer during at least one of an enrollment phase and a deployment phase (Cartwright paras. 81-87 and 94, when the user is being prompted to move around and speak/utter certain phrases as given by the system (thus an enrollment phase), unlabeled data can be used, and thus each utterance is not labeled (the labeling being a classification that is disabled), and rather, a clustering of the unlabeled data is performed).
Regarding claims 6 and 16, Cartwright teaches wherein the computer iteratively applies the neural network on the plurality of training signals during two or more stages of a training phase (Cartwright para. 161, the training loop is run for multiple iterations (epochs – using the same training signals), where fig. 1B illustrates the training 131B having multiple stages including augmenting 103B and predicting 105B).
Regarding claims 7 and 17, Cartwright teaches wherein the one or more in-network augmentation layers include a noise augmentation layer (Cartwright paras. 261 and 268, in the training loop for training a model are included augmentation steps, including augmenting clean input feature training data with noise), and wherein [applying the neural network further comprises: - claim 7 / the computer is further configured to: - claim 17] [obtaining, by the computer, - claim 7 / obtain – claim 17] one or more noise audio samples including one or more types of noise (Cartwright fig. 9, paras. 265-268, where 302 is a line designating separation of data preparation from the training loop that trains the model, where paras. 261 and 87 teach the model as part of a neural network, and where the training loop includes augmentation of both stationary and non-stationary noise (one or more noise audio samples) which are generated (obtaining)); and 
[generating, by the computer, - claim 7/ generate – claim 17] one or more simulated noise samples for an input signal (Cartwright paras. 261, 266-267, noise is generated for the input training data features to be augmented) by applying the noise augmentation layer on the one or more noise samples and the input signal (Cartwright paras. 261, 268-269, clean features which are the input features of training data are augmented by combining them with noises 304 and noise 305 to produce dirty features 306), wherein the input signal is one of a training audio signal, an enrollment audio signal, and the test input audio signal (Cartwright para. 261, input features are from the training data (training audio signal)), wherein a subsequent layer of the neural network is applied using the one or more simulated noise samples and the input signal (Cartwright para. 310, final augmented features which include the augmented noise are presented to the model to be trained).
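The noise augmentation limitation discussed above for claims 7 and 17 can be illustrated with a short Python sketch that combines a clean input signal with a noise sample. The function name mix_at_snr and the choice to scale the noise to a target signal-to-noise ratio are assumptions for the example; Cartwright is described above as combining clean features with generated noise, and this waveform-domain sketch is not asserted to match that disclosure.

```python
import numpy as np


def mix_at_snr(signal, noise, snr_db):
    """Scale the noise so the mixture has the requested SNR (in dB), then add it."""
    signal_power = np.mean(signal ** 2)
    noise_power = np.mean(noise ** 2)
    scale = np.sqrt(signal_power / (noise_power * 10 ** (snr_db / 10)))
    return signal + scale * noise


rng = np.random.default_rng(0)
clean = np.sin(np.linspace(0.0, 20.0 * np.pi, 1600))  # stand-in for an input signal
noise = rng.standard_normal(1600)                     # stand-in for a noise sample
noisy = mix_at_snr(clean, noise, snr_db=10.0)         # simulated noisy sample
```

The simulated noisy sample would then be passed to a subsequent layer in place of the clean input.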
Regarding claims 10 and 20, Cartwright teaches wherein the one or more in-network augmentation layers include an audio clipping layer (Cartwright paras. 185-186, 261 and 273, in the training loop for training a model are included augmentation steps, including augmenting clean input feature training data with leveling and cutoff to features described as “level and microphone cutoff... augmentation”), and wherein [applying the neural network further comprises: - claim 10 / the computer is further configured to: - claim 20] [selecting, by the computer, - claim 10/ select – claim 20] a segment of an input signal having a random duration and occurring at a random time of the input signal (Cartwright paras. 179, 225, in a data augmentation process applied to input features, randomly chosen times are chosen at which noise/distortion events will be inserted, including choosing random frames (random time) as well as random inter-event periods (random duration)), wherein the input signal is one of a training audio signal, an enrollment audio signal, and the test input audio signal (Cartwright para. 261, input features are from the training data (training audio signal)); and 
[generating, by the computer, - claim 10/generate – claim 20] a clipped segment by setting energy values of the segment at a highest energy value or a lowest energy value (Cartwright paras. 185-186, in microphone cutoff and level augmentation a random low frequency cutoff filter is applied, thereby setting the low frequency signal values below the cutoff to a lowest energy value (thus cutoff)), wherein a subsequent layer of the neural network is applied using the clipped segment (Park NPL Abstract, sections 2, 3.1, 5, the augmentation acts on the log mel spectrogram directly to help the neural network learn useful features (thus during training of the neural network) where the log mel spectrograms are passed in to the network during training, where section 5 discloses that the neural network is trained on augmented data (thus subsequent layers (the neural network) are applied using the frequency-masked augmented data)).
Although Cartwright discloses the randomly chosen times and random inter-event periods in which to apply data augmentation in the example given for augmenting non-stationary noise, and not explicitly for the microphone level and cutoff embodiment of data augmenting, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the microphone level and cutoff data augmentation to include the randomly chosen times and random inter-event periods disclosed for the non-stationary noise data augmentation in Cartwright at least because doing so would more accurately represent real-environment noise conditions from a microphone which Cartwright discloses in para. 12 as changing over time due to unpredictable (random) factors such as how the microphones age and how talkers will move within the acoustic environment (see also Cartwright paras. 10 and 13).
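For illustration, the audio clipping limitation of claims 10 and 20 (a segment of random duration at a random time, with its values set to a highest or lowest value) can be sketched in Python as follows. The function name clip_random_segment and the equal-probability choice between the maximum and minimum are assumptions for the example and are not drawn from Cartwright.

```python
import numpy as np


def clip_random_segment(signal, rng):
    """Set a randomly placed, randomly sized segment to the signal's max or min."""
    n = len(signal)
    duration = int(rng.integers(1, n + 1))          # random duration: 1..n samples
    start = int(rng.integers(0, n - duration + 1))  # random start so the segment fits
    clipped = signal.copy()
    # Assumed policy: pick the highest or lowest value with equal probability.
    level = signal.max() if rng.random() < 0.5 else signal.min()
    clipped[start:start + duration] = level
    return clipped, start, duration


rng = np.random.default_rng(0)
signal = np.sin(np.linspace(0.0, 4.0 * np.pi, 200))
clipped, start, duration = clip_random_segment(signal, rng)
```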
Regarding claim 11, Cartwright teaches a system comprising (Cartwright para. 23, a system including a general purpose processor as a computer system configured to perform the method disclosed): 
a computer comprising a processor and a non-transitory computer readable medium having instructions that when executed by the processor are configured to (Cartwright para. 23, a system including a general purpose processor as a computer system, including a computer readable medium implementing non-transitory storage of data as code for performing any embodiment/method as disclosed):
train a neural network comprising one or more in-network augmentation layers (Cartwright fig. 1B, para. 148, training of acoustic model including (thus in-network) an augmentation function which paras. 178-186 teach as multiple functions (layers), and paras. 87, 96 and 91 teach the model is implemented by a classifier implementing a neural network) by applying the neural network on a plurality of training audio signals (Cartwright fig. 1B, paras. 153 and 164, input training data is processed by the augmentation functions, then the model is trained from the data augmented by the augmentation functions (the model application and the augmentation functions together being an applying of the neural network));
generate (Cartwright fig. 3, paras. 66, 78, speech processing performed by the system of fig. 3, including the production (generating) of a vector of features), an enrollment vector for an enrollee (Cartwright para. 78, aggregate feature set is produced from a vector (enrollment vector) of features from audio input, for example in paras. 81-86 where the user (enrollee) is prompted to utter a wakeword multiple times) by applying the neural network on a plurality of enrollment audio signals of the enrollee (Cartwright fig. 3, paras. 66, 78, 81-86, classifier 207 with the model (the neural network) trained by the embodiment in fig. 1B, where the classifier (neural network) takes as input (applying) the aggregate feature set, the aggregate feature set determined from repeated utterances of a user (plurality of enrollment audio signals)); 
receive a test input audio signal of a speaker (Cartwright para. 78, microphone signals input (receiving) at a discrete time n from a user (speaker) upon which wakeword detection is made (thus the input being test input audio)); 
generate a speaker vector for the speaker (Cartwright paras. 78 and 82-86, each wakeword detector produces a vector of features for every input it receives, thus also for the user utterances collected for a plurality of prompts, a kind of enrollment); and 
generate a likelihood score for the speaker (Cartwright para. 80, classifier outputs signals indicative of probabilities (likelihood) that a user’s utterance is from one of a plurality of zones).
Cartwright does not explicitly teach the speaker vector is by applying the neural network on the test input audio signal or that the likelihood score is indicating a likelihood that the speaker is the enrollee based upon the speaker vector and the enrollment vector. Cartwright further does not explicitly teach wherein the neural network as trained includes at least one of a disabled classification layer or fixed hyper-parameters.
Khoury teaches the speaker vector is by applying the neural network on the test input audio signal (Khoury paras. 58-59, after the neural network is trained, it produces (applying) channel-compensated low-level features (speaker vector) from an input speech signal 212, where the features are relevant to discriminate between speakers).
Khoury further teaches the likelihood score is indicating a likelihood that the speaker is the enrollee based upon the speaker vector and the enrollment vector (Khoury para. 49, when the CNN is trained, the CNN provides channel-compensated features to the speaker recognition subsystem 20, where paras. 40, 44-45, fig. 1, teach that a user’s utterance to perform speaker identification is a “recognition speech signal” which is positively verified against an enrolled speech sample, resulting in a determination that the recognition speech signal is “genuine” or an “impostor”, where the “genuine” determination is a likelihood score indicating a likelihood that the user who uttered the recognition speech signal is the enrollee).
Khoury still further teaches wherein the neural network as trained includes at least one of a disabled classification layer or fixed hyper-parameters (Khoury para. 74, layers 650 and 660 of the neural network are used in training, but discarded (disabled) at test and enrollment times, where the fully connected layers 650 and output layer 660 are characterized as a back-end classifier (thus classification layers)).
Therefore, taking the teachings of Cartwright and Khoury together as a whole, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the speech analytics method of Cartwright with an application of the trained CNN that outputs channel-compensated features and use this compensated signal to determine the likelihood of a present user being genuine or an imposter, and to discard back-end classifier layers after training, as disclosed in Khoury at least because doing so would reduce verification/identification errors (see Khoury para. 48).
Claims 2 and 12 are rejected under 35 U.S.C. 103 as being unpatentable over Cartwright in view of Khoury, as set forth above regarding claim 1 from which claim 2 depends, and as set forth above regarding claim 11 from which claim 12 depends, further in view of Lesso (US 11,081,115 B2, herein “Lesso”).
Regarding claims 2 and 12, while Cartwright modified by Khoury teaches verifying a user as being either genuine or an imposter based on input speech that is compared against an enrolled sample, Cartwright in view of Khoury does not teach the claimed [identifying, by the computer, - claim 2/ the computer is further configured to identify - claim 12] the speaker is the enrollee in response to determining that the likelihood score satisfies a similarity threshold.
Lesso teaches identifying, by the computer/ the computer is further configured to identify (Lesso col. 15, ll. 19-42, methods disclosed therein embodied as processor control code), the speaker is the enrollee in response to determining that the likelihood score satisfies a similarity threshold (Lesso col. 8, ll. 4-55, weighted features from input speech to be verified are compared to the features of speech obtained during enrollment and a score is produced by comparing a distance metric with a threshold value, the score indicating verification of the user speech as being the same as the enrollment speech).
Therefore, taking the teachings of Cartwright as modified by Khoury and Lesso together as a whole, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the speech analytics method of Cartwright with a comparison between user speech and enrolled speech as a distance metric producing a score from a comparison to a threshold as disclosed in Lesso at least because doing so would provide an alternative metric for speaker recognition that can be performed with low power and low computational intensity (see Lesso col. 2, ll. 1-4).
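The threshold comparison discussed above for claims 2 and 12 can be illustrated by a small Python sketch. The function names distance and verify, the Euclidean distance metric, and the example threshold values are assumptions for the example; Lesso is characterized above only as comparing a distance metric with a threshold value.

```python
import math


def distance(test_features, enrolled_features):
    """Euclidean distance between two feature vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(test_features, enrolled_features)))


def verify(test_features, enrolled_features, threshold):
    """Identify the speaker as the enrollee when the distance is within the threshold."""
    return distance(test_features, enrolled_features) <= threshold


accepted = verify([0.0, 0.0], [3.0, 4.0], threshold=5.0)  # distance is 5.0
rejected = verify([0.0, 0.0], [3.0, 4.0], threshold=4.9)
```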
Claims 8-9, and 18-19 are rejected under 35 U.S.C. 103 as being unpatentable over Cartwright in view of Khoury, as set forth above regarding claims 1 and 11 from which claims 8-9 and 18-19 respectively depend, further in view of Park et al., “SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition,” July 23, 2019, arXiv:1904.08779v2 [eess.AS], https://doi.org/10.48550/arXiv.1904.08779 (herein “Park NPL”).
Regarding claims 8 and 18, Cartwright teaches wherein the one or more in-network augmentation layers include a frequency augmentation layer (Cartwright paras. 261 and 270, in the training loop for training a model are included augmentation steps, including augmenting clean input feature training data and applying equalization and cutoff filtering (thus frequency augmentation)), and [wherein applying the neural network further comprises: - claim 8/ wherein the computer is further configured to: - claim 18] [selecting, by the computer – claim 8/ select – claim 18], a band of frequencies from a frequency domain representing an input signal (Cartwright fig. 9, paras. 265-272, where 302 is a line designating separation of data preparation from the training loop that trains the model, where paras. 261 and 87 teach the model as part of a neural network, and where the training loop includes augmentation of both stationary and non-stationary noise, then applying microphone equalization and microphone cut-off augmentation, described further in paras. 201-204 as including frequencies for the equalization to be applied to the training data), wherein the input signal is one of a training audio signal, an enrollment audio signal, and the test input audio signal (Cartwright para. 261, input features are from the training data (training audio signal)). 
Cartwright does not explicitly teach [generating, by the computer, - claim 8/ generate – claim 18] frequency-masked data for the input signal by applying a mask on the input signal according to the band of frequencies, wherein a subsequent layer of the neural network is applied using the frequency-masked data for the input signal.
Park NPL teaches [generating, by the computer,/generate] frequency-masked data for the input signal by applying a mask on the input signal according to the band of frequencies (Park NPL, section 2, the augmentation policy including applying frequency masking to the base input (input signal), such that f consecutive mel frequency channels (band of frequencies) are masked), wherein a subsequent layer of the neural network is applied using the frequency-masked data for the input signal (Park NPL Abstract, sections 2, 3.1, 5, the augmentation acts on the log mel spectrogram directly to help the neural network learn useful features (thus during training of the neural network) where the log mel spectrograms are passed in to the network during training, where section 5 discloses that the neural network is trained on augmented data (thus subsequent layers (the neural network) are applied using the frequency-masked augmented data)).
Therefore, taking the teachings of Cartwright and Park NPL together as a whole, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the augmentation disclosed in Cartwright to include the frequency masking as disclosed in Park NPL at least because doing so would improve the word error rate in automatic speech recognition (Park NPL Abstract).
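For illustration, frequency masking of the kind attributed above to Park NPL section 2 (masking f consecutive mel frequency channels) can be sketched in Python as follows. The function name frequency_mask, the array layout (channels by time steps), and the mask value of zero are assumptions for the example.

```python
import numpy as np


def frequency_mask(spec, F, rng):
    """Mask f consecutive frequency channels, with f drawn uniformly from 0..F.

    spec has shape (num_channels, num_time_steps).
    """
    num_channels = spec.shape[0]
    f = int(rng.integers(0, F + 1))  # width of the masked band
    # Starting channel f0 is drawn so that the band [f0, f0 + f) fits.
    f0 = int(rng.integers(0, num_channels - f)) if num_channels - f > 0 else 0
    masked = spec.copy()
    masked[f0:f0 + f, :] = 0.0       # assumed mask value
    return masked, f0, f


rng = np.random.default_rng(0)
spec = rng.standard_normal((80, 100))  # stand-in for a log mel spectrogram
masked, f0, f = frequency_mask(spec, F=27, rng=rng)
```

The masked array would then be supplied to the subsequent layers of the network during training.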
Regarding claims 9 and 19, Cartwright teaches wherein the one or more in-network augmentation layers include a duration augmentation layer (Cartwright paras. 183, 261 and 270, in the training loop for training a model are included augmentation steps, including augmenting clean input feature training data with simulated echo residuals that gradually increase the magnitude of the added simulated echo residuals for the duration that the utterance is present in the unaugmented training vector), and [wherein applying the neural network further comprises: - claim 9/ and wherein the computer is further configured to: - claim 19] [selecting, by the computer, - claim 9/ select – claim 19] one or more speech segments of an input signal, each respective speech segment having a fixed duration and occurring at a random time in the input signal (Cartwright paras. 183 and 284, added simulated echo residuals based on the utterance energy (speech segments) for the duration that the utterance is present, where Listing 1B details the algorithm, and shows that the magnitude factor for the generated residual echo is randomly selected for each frame, thus the echo magnitude occurring randomly for each frame (time)), wherein the input signal is one of a training audio signal, an enrollment audio signal, and the test input audio signal (Cartwright para. 261, input features are from the training data (training audio signal)).
Cartwright does not explicitly teach for each speech segment, [generating, by the computer – claim 9/generate – claim 19] a time-masked segment by applying a mask to the audio signal according to the fixed duration and the random time of the respective speech segment, wherein a subsequent layer of the neural network is applied using the one or more time-masked segments.
Park NPL teaches for each speech segment, generating, by the computer/generate, a time-masked segment by applying a mask to the audio signal according to the fixed duration and the random time of the respective speech segment (Park NPL section 2, time masking is applied to time steps (time-masked segment of fixed duration), where t is chosen from a uniform distribution (random time), where the augmentation is applied directly to the feature inputs of the neural network for speech recognition (thus the time steps having respective speech segments)), wherein a subsequent layer of the neural network is applied using the one or more time-masked segments (Park NPL Abstract, sections 2, 3.1, 5, the augmentation acts on the log mel spectrogram directly to help the neural network learn useful features (thus during training of the neural network) where the log mel spectrograms are passed in to the network during training, where section 5 discloses that the neural network is trained on augmented data (thus subsequent layers (the neural network) are applied using the time-masked augmented data)).
Therefore, taking the teachings of Cartwright and Park NPL together as a whole, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the augmentation disclosed in Cartwright to include the time masking as disclosed in Park NPL at least because doing so would improve the word error rate in automatic speech recognition (Park NPL Abstract).
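For illustration only, the time masking relied upon from Park NPL section 2 may be sketched as follows. The sketch assumes a log mel spectrogram stored as a NumPy array of shape (mel bins, frames), a mask width drawn uniformly up to a maximum width, a uniformly drawn start frame, and masked frames set to zero. The function name time_mask and the parameter names are hypothetical and do not appear in Park NPL or Cartwright; this is not represented to be either reference's implementation.

```python
import numpy as np


def time_mask(spec, max_width, rng):
    """Zero out one randomly placed block of consecutive frames.

    spec      : 2-D array, shape (n_mels, n_frames) (assumed layout)
    max_width : largest mask width in frames (assumed parameter)
    rng       : numpy.random.Generator supplying the random draws
    Returns (masked copy, start frame, mask width).
    """
    n_frames = spec.shape[1]
    # Mask width drawn uniformly from 0..max_width, capped at the signal length.
    width = min(int(rng.integers(0, max_width + 1)), n_frames)
    # Start frame drawn uniformly so that the mask fits inside the signal.
    start = int(rng.integers(0, max(1, n_frames - width)))
    masked = spec.copy()  # leave the caller's array untouched
    masked[:, start:start + width] = 0.0
    return masked, start, width


rng = np.random.default_rng(0)
spec = np.ones((4, 50))
masked, start, width = time_mask(spec, max_width=10, rng=rng)
```

The masked array, rather than the original, would then be passed to the subsequent layers of the network during training, which corresponds to the cited statement that the network is trained on the augmented data.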

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure:
Arik et al., US 2018/0336880 A1, directed towards augmenting neural speech synthesis networks with low-dimensional trainable speaker embeddings.
Kisilev, US 2017/0200092 A1, directed towards a classification function trained with augmented data using noise features.
Applicant's amendment necessitated any new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to MICHELLE M KOETH whose telephone number is (571)272-5908. The examiner can normally be reached Monday-Friday, 09:30-18:30 EDT/EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Bhavesh Mehta, can be reached on 571-272-7453. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

MICHELLE M. KOETH
Primary Examiner
Art Unit 2656



/MICHELLE M KOETH/Primary Examiner, Art Unit 2656