DETAILED ACTION
Introduction
This office action is in response to Applicant’s submission filed on 2/18/2021. Claims 1-21 are pending in this application. As such, claims 1-21 have been examined.

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Rejections - 35 USC § 102
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.


Claim(s) 1-4, 6-12, and 14-20 is/are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Cartwright et al. (US 20210035563 A1) (hereinafter “Cartwright”).

Regarding Claim 1, Cartwright teaches a computer-implemented method for data augmentation, executed on a computing device, comprising: defining a model representative of a plurality of acoustic variations to a speech signal, thus defining a plurality of time-varying spectral modifications (Cartwright Paragraph 180 - Variable spectrum semi-stationary noise: For example, select a random SNR (as for the fixed spectrum stationary noise example), and also select a random stationary noise spectrum from a distribution (for example, a distribution of linear slope values in dB/octave, or a distribution over DCT values of the log mel spectrum (cepstral)). Then, apply the noise at the chosen level (determined by the selected SNR value) with the selected shape. In some embodiments, the shape of the noise is varied slowly over time by, for example, choosing a rate of change for each cepstral value per second and using that to modulate the shape of the noise being applied (e.g., during one epoch, or in between performance of successive epochs). An example of variable spectrum semi stationary noise augmentation will be described with reference to FIG. 6.);
and applying the plurality of time-varying spectral modifications to a reference signal using a filtering operation, thus generating a time-varying spectrally- augmented signal (Cartwright Paragraph 201 - Another embodiment of the invention, which includes microphone equalization augmentation, will be described with reference to FIG. 5. In the FIG. 5 example, training data (e.g., features 111B of FIG. 1B) are augmented (e.g., by function/unit 103B of FIG. 1B) by applying thereto, during each epoch of a training loop, a filter (e.g., a different filter for each different epoch) having a randomly chosen linear magnitude response. The characteristics of the filter (for each epoch) are determined from a randomly chosen microphone tilt (e.g., a tilt, in dB/octave, chosen from a normal distribution of microphone tilts),).
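
For illustration only, the microphone tilt augmentation quoted from Cartwright Paragraph 201 can be sketched as follows. This is a minimal sketch and not Cartwright's implementation. It assumes the features are per-band log powers in dB, so that a filter whose magnitude response is linear in dB per octave reduces to adding a per-band offset. The function names, the 1 kHz pivot frequency, and the tilt distribution (zero mean, 1 dB/octave standard deviation) are assumptions introduced here.

```python
import math
import random


def tilt_gains_db(band_centers_hz, tilt_db_per_octave, pivot_hz=1000.0):
    """Per-band gain in dB: tilt times the number of octaves from the pivot.

    pivot_hz is a hypothetical reference frequency at which the gain is 0 dB.
    """
    return [tilt_db_per_octave * math.log2(f / pivot_hz) for f in band_centers_hz]


def apply_tilt(frame_db, band_centers_hz, tilt_db_per_octave):
    """Apply the tilt to one frame of log band powers (dB) by adding the gains."""
    gains = tilt_gains_db(band_centers_hz, tilt_db_per_octave)
    return [p + g for p, g in zip(frame_db, gains)]


def augment_epoch(frames_db, band_centers_hz, rng, tilt_std_db=1.0):
    """Draw one random tilt for the epoch and apply it to every frame.

    The normal distribution parameters are assumed values, not from the reference.
    """
    tilt = rng.gauss(0.0, tilt_std_db)
    return [apply_tilt(f, band_centers_hz, tilt) for f in frames_db], tilt
```

Calling augment_epoch once per training epoch with a fresh draw yields a different filter for each epoch, which is the behavior the quoted passage describes.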

Regarding Claim 2, Cartwright teaches all of the limitations of claim 1. Cartwright also teaches defining a model representative of a plurality of acoustic variations to the speech signal associated with a change in a relative position of a speaker and a microphone (Cartwright Paragraphs 81-86 and 108 - "Typically, training data are gathered (e.g., for each zone) by having the user utter the wake word in the vicinity of the intended zone, for example at the center and extreme edges of a couch. Utterances may be repeated several times. The user then moves to the next zone and continues until all zones have been covered," para [0081]-[0086]; "The goal of predicting the acoustic zone (in which the user is located) may be to inform a microphone selection or adaptive beamforming scheme that attempts to pick up sound from the acoustic zone of the user more effectively, for example, in order to better recognize a command that follows the wake word," para [0108]),
and defining a model representative of a plurality of acoustic variations to the speech signal associated with adaptive beamforming (Cartwright Paragraphs 81-86 and 108 - "Typically, training data are gathered (e.g., for each zone) by having the user utter the wake word in the vicinity of the intended zone, for example at the center and extreme edges of a couch. Utterances may be repeated several times. The user then moves to the next zone and continues until all zones have been covered," para [0081]-[0086]; "The goal of predicting the acoustic zone (in which the user is located) may be to inform a microphone selection or adaptive beamforming scheme that attempts to pick up sound from the acoustic zone of the user more effectively, for example, in order to better recognize a command that follows the wake word," para [0108]).

Regarding Claim 3, Cartwright teaches all of the limitations of claim 1. Cartwright also teaches modeling the plurality of acoustic variations to the speech signal as a statistical distribution (Cartwright Paragraphs 178 and 179 - Examples of types of augmentations that may be applied (e.g., by augmentation function 103B of FIG. 1B on training data features) in accordance with embodiments of the invention include (but are not limited to) the following... Fixed spectrum stationary noise: For example, for each utterance in a training set (e.g., each utterance of or indicated by training set 110, and thus each utterance of or indicated by feature set 111B), select a random SNR from a distribution (e.g., normal distribution with mean 45 dB, and standard deviation 10 dB) of SNR values);
modeling the plurality of acoustic variations to the speech signal as a mathematical model representative of the plurality of acoustic variations associated with a particular use-case scenario (Cartwright Paragraphs 178 and 179 - Examples of types of augmentations that may be applied (e.g., by augmentation function 103B of FIG. 1B on training data features) in accordance with embodiments of the invention include (but are not limited to) the following... Fixed spectrum stationary noise: For example, for each utterance in a training set (e.g., each utterance of or indicated by training set 110, and thus each utterance of or indicated by feature set 111B), select a random SNR from a distribution (e.g., normal distribution with mean 45 dB, and standard deviation 10 dB) of SNR values);
and generating, via a machine learning model, a mapping of the plurality of acoustic variations to one or more feature coefficients of a target domain (Cartwright Paragraphs 178 and 179 - Examples of types of augmentations that may be applied (e.g., by augmentation function 103B of FIG. 1B on training data features) in accordance with embodiments of the invention include (but are not limited to) the following... Fixed spectrum stationary noise: For example, for each utterance in a training set (e.g., each utterance of or indicated by training set 110, and thus each utterance of or indicated by feature set 111B), select a random SNR from a distribution (e.g., normal distribution with mean 45 dB, and standard deviation 10 dB) of SNR values).
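
For illustration only, the fixed spectrum stationary noise example quoted from Cartwright Paragraph 179 (select a random SNR from a normal distribution with mean 45 dB and standard deviation 10 dB) can be sketched as follows. This is a minimal sketch under stated assumptions and not Cartwright's implementation: it operates on time-domain samples rather than features, uses white Gaussian noise as the fixed noise spectrum, and the function names are hypothetical.

```python
import math
import random


def mean_power(x):
    """Average power of a sequence of samples."""
    return sum(v * v for v in x) / len(x)


def noise_gain_for_snr(signal, noise, snr_db):
    """Gain g such that 10*log10(P_signal / (g^2 * P_noise)) equals snr_db."""
    return math.sqrt(mean_power(signal) / (mean_power(noise) * 10 ** (snr_db / 10)))


def add_noise_at_random_snr(signal, rng, mean_db=45.0, std_db=10.0):
    """Draw one SNR per utterance from N(mean_db, std_db) and mix in noise.

    White Gaussian noise is an assumed stand-in for the fixed noise spectrum.
    """
    snr_db = rng.gauss(mean_db, std_db)
    noise = [rng.gauss(0.0, 1.0) for _ in signal]
    g = noise_gain_for_snr(signal, noise, snr_db)
    return [s + g * n for s, n in zip(signal, noise)], snr_db
```

Because the SNR is drawn independently for each utterance, the set of augmented utterances follows the stated distribution of SNR values.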

Regarding Claim 4, Cartwright teaches all of the limitations of claim 1. Cartwright also teaches that defining the model representative of the plurality of acoustic variations to the speech signal includes receiving one or more inputs associated with one or more of speaker location and speaker orientation (Cartwright Paragraph 81 - Typically, training data are gathered (e.g., for each zone) by having the user utter the wake word in the vicinity of the intended zone, for example at the center and extreme edges of a couch. Utterances may be repeated several times. The user then moves to the next zone and continues until all zones have been covered).

Regarding Claim 6, Cartwright teaches all of the limitations of claim 4. Cartwright also teaches training a speech processing system using the time-varying spectrally- augmented signal and the one or more inputs associated with one or more of speaker location and speaker orientation (Cartwright Paragraphs 81 and 129 - 104A: Feature extraction function/unit. This function (or unit) takes as input the augmented training data 112A (e.g., time domain PCM audio data) and extracts therefrom features 113A (e.g., Mel Frequency Cepstral Coefficients (MFCC), "logmelspec" (logarithm of powers of bands spaced to occupy equal or substantially equal parts of the Mel spectrum) coefficients, coefficients which are indicative of powers of bands spaced to occupy at least roughly equal parts of the log spectrum, and/or Perceptual Linear Predictor (PLP) coefficients) for training the model 114A. The PLP helps figure out the speaker’s location and orientation.).

Regarding Claim 7, Cartwright teaches all of the limitations of claim 1. Cartwright also teaches training a speech processing system using the time-varying spectrally- augmented signal, thus defining a trained speech processing system (Cartwright Paragraphs 81 and 129 - 104A: Feature extraction function/unit. This function (or unit) takes as input the augmented training data 112A (e.g., time domain PCM audio data) and extracts therefrom features 113A (e.g., Mel Frequency Cepstral Coefficients (MFCC), "logmelspec" (logarithm of powers of bands spaced to occupy equal or substantially equal parts of the Mel spectrum) coefficients, coefficients which are indicative of powers of bands spaced to occupy at least roughly equal parts of the log spectrum, and/or Perceptual Linear Predictor (PLP) coefficients) for training the model 114A. This system is using a time-varying spectrally-augmented signal).

Regarding Claim 8, Cartwright teaches all of the limitations of claim 7. Cartwright also teaches performing speech processing via the trained speech processing system, wherein the trained speech processing system is executed on at least one computing device (Cartwright Paragraph 116 and Figure 3A - FIG. 3A is a block diagram that shows examples of components of an apparatus (5) that may be configured to perform at least some of the methods disclosed herein. In some examples, apparatus 5 may be or may include a personal computer, a desktop computer, a graphics processing unit (GPU), or another local device that is configured to provide audio processing).

Regarding Claim 9, Cartwright teaches a computer program product residing on a non-transitory computer readable medium having a plurality of instructions stored thereon which, when executed by a processor, cause the processor to perform operations comprising (Cartwright Paragraph 23 - Aspects of the invention include a system configured (e.g., programmed) to perform any embodiment of the inventive method or steps thereof, and a tangible, non-transitory, computer readable medium which implements non-transitory storage of data (for example, a disc or other tangible storage medium) which stores code for performing (e.g., code executable to perform) any embodiment of the inventive method or steps thereof.): 
defining a model representative of a plurality of acoustic variations to a speech signal, thus defining a plurality of time-varying spectral modifications (Cartwright Paragraph 180 - Variable spectrum semi-stationary noise: For example, select a random SNR (as for the fixed spectrum stationary noise example), and also select a random stationary noise spectrum from a distribution (for example, a distribution of linear slope values in dB/octave, or a distribution over DCT values of the log mel spectrum (cepstral)). Then, apply the noise at the chosen level (determined by the selected SNR value) with the selected shape. In some embodiments, the shape of the noise is varied slowly over time by, for example, choosing a rate of change for each cepstral value per second and using that to modulate the shape of the noise being applied (e.g., during one epoch, or in between performance of successive epochs). An example of variable spectrum semi stationary noise augmentation will be described with reference to FIG. 6.);
and applying the plurality of time-varying spectral modifications to a reference signal using a filtering operation, thus generating a time-varying spectrally- augmented signal (Cartwright Paragraph 201 - Another embodiment of the invention, which includes microphone equalization augmentation, will be described with reference to FIG. 5. In the FIG. 5 example, training data (e.g., features 111B of FIG. 1B) are augmented (e.g., by function/unit 103B of FIG. 1B) by applying thereto, during each epoch of a training loop, a filter (e.g., a different filter for each different epoch) having a randomly chosen linear magnitude response. The characteristics of the filter (for each epoch) are determined from a randomly chosen microphone tilt (e.g., a tilt, in dB/octave, chosen from a normal distribution of microphone tilts),).

Regarding Claim 10, Cartwright teaches all of the limitations of claim 9. Cartwright also teaches defining a model representative of a plurality of acoustic variations to the speech signal associated with a change in a relative position of a speaker and a microphone (Cartwright Paragraphs 81-86 and 108 - "Typically, training data are gathered (e.g., for each zone) by having the user utter the wake word in the vicinity of the intended zone, for example at the center and extreme edges of a couch. Utterances may be repeated several times. The user then moves to the next zone and continues until all zones have been covered," para [0081]-[0086]; "The goal of predicting the acoustic zone (in which the user is located) may be to inform a microphone selection or adaptive beamforming scheme that attempts to pick up sound from the acoustic zone of the user more effectively, for example, in order to better recognize a command that follows the wake word," para [0108]),
and defining a model representative of a plurality of acoustic variations to the speech signal associated with adaptive beamforming (Cartwright Paragraphs 81-86 and 108 - "Typically, training data are gathered (e.g., for each zone) by having the user utter the wake word in the vicinity of the intended zone, for example at the center and extreme edges of a couch. Utterances may be repeated several times. The user then moves to the next zone and continues until all zones have been covered," para [0081]-[0086]; "The goal of predicting the acoustic zone (in which the user is located) may be to inform a microphone selection or adaptive beamforming scheme that attempts to pick up sound from the acoustic zone of the user more effectively, for example, in order to better recognize a command that follows the wake word," para [0108]).

Regarding Claim 11, Cartwright teaches all of the limitations of claim 9. Cartwright also teaches modeling the plurality of acoustic variations to the speech signal as a statistical distribution (Cartwright Paragraphs 178 and 179 - Examples of types of augmentations that may be applied (e.g., by augmentation function 103B of FIG. 1B on training data features) in accordance with embodiments of the invention include (but are not limited to) the following... Fixed spectrum stationary noise: For example, for each utterance in a training set (e.g., each utterance of or indicated by training set 110, and thus each utterance of or indicated by feature set 111B), select a random SNR from a distribution (e.g., normal distribution with mean 45 dB, and standard deviation 10 dB) of SNR values);
modeling the plurality of acoustic variations to the speech signal as a mathematical model representative of the plurality of acoustic variations associated with a particular use-case scenario (Cartwright Paragraphs 178 and 179 - Examples of types of augmentations that may be applied (e.g., by augmentation function 103B of FIG. 1B on training data features) in accordance with embodiments of the invention include (but are not limited to) the following... Fixed spectrum stationary noise: For example, for each utterance in a training set (e.g., each utterance of or indicated by training set 110, and thus each utterance of or indicated by feature set 111B), select a random SNR from a distribution (e.g., normal distribution with mean 45 dB, and standard deviation 10 dB) of SNR values);
and generating, via a machine learning model, a mapping of the plurality of acoustic variations to one or more feature coefficients of a target domain (Cartwright Paragraphs 178 and 179 - Examples of types of augmentations that may be applied (e.g., by augmentation function 103B of FIG. 1B on training data features) in accordance with embodiments of the invention include (but are not limited to) the following... Fixed spectrum stationary noise: For example, for each utterance in a training set (e.g., each utterance of or indicated by training set 110, and thus each utterance of or indicated by feature set 111B), select a random SNR from a distribution (e.g., normal distribution with mean 45 dB, and standard deviation 10 dB) of SNR values).

Regarding Claim 12, Cartwright teaches all of the limitations of claim 9. Cartwright also teaches that defining the model representative of the plurality of acoustic variations to the speech signal includes receiving one or more inputs associated with one or more of speaker location and speaker orientation (Cartwright Paragraph 81 - Typically, training data are gathered (e.g., for each zone) by having the user utter the wake word in the vicinity of the intended zone, for example at the center and extreme edges of a couch. Utterances may be repeated several times. The user then moves to the next zone and continues until all zones have been covered).

Regarding Claim 14, Cartwright teaches all of the limitations of claim 12. Cartwright also teaches training a speech processing system using the time-varying spectrally- augmented signal and the one or more inputs associated with one or more of speaker location and speaker orientation (Cartwright Paragraphs 81 and 129 - 104A: Feature extraction function/unit. This function (or unit) takes as input the augmented training data 112A (e.g., time domain PCM audio data) and extracts therefrom features 113A (e.g., Mel Frequency Cepstral Coefficients (MFCC), "logmelspec" (logarithm of powers of bands spaced to occupy equal or substantially equal parts of the Mel spectrum) coefficients, coefficients which are indicative of powers of bands spaced to occupy at least roughly equal parts of the log spectrum, and/or Perceptual Linear Predictor (PLP) coefficients) for training the model 114A. The PLP helps figure out the speaker’s location and orientation.).

Regarding Claim 15, Cartwright teaches all of the limitations of claim 9. Cartwright also teaches training a speech processing system using the time-varying spectrally- augmented signal, thus defining a trained speech processing system (Cartwright Paragraphs 81 and 129 - 104A: Feature extraction function/unit. This function (or unit) takes as input the augmented training data 112A (e.g., time domain PCM audio data) and extracts therefrom features 113A (e.g., Mel Frequency Cepstral Coefficients (MFCC), "logmelspec" (logarithm of powers of bands spaced to occupy equal or substantially equal parts of the Mel spectrum) coefficients, coefficients which are indicative of powers of bands spaced to occupy at least roughly equal parts of the log spectrum, and/or Perceptual Linear Predictor (PLP) coefficients) for training the model 114A. This system is using a time-varying spectrally-augmented signal).

Regarding Claim 16, Cartwright teaches all of the limitations of claim 15. Cartwright also teaches performing speech processing via the trained speech processing system, wherein the trained speech processing system is executed on at least one computing device (Cartwright Paragraph 116 and Figure 3A - FIG. 3A is a block diagram that shows examples of components of an apparatus (5) that may be configured to perform at least some of the methods disclosed herein. In some examples, apparatus 5 may be or may include a personal computer, a desktop computer, a graphics processing unit (GPU), or another local device that is configured to provide audio processing).

Regarding Claim 17, Cartwright teaches a computing system comprising: a memory (Cartwright Paragraph 23 - Aspects of the invention include a system configured (e.g., programmed) to perform any embodiment of the inventive method or steps thereof, and a tangible, non-transitory, computer readable medium which implements non-transitory storage of data (for example, a disc or other tangible storage medium) which stores code for performing (e.g., code executable to perform) any embodiment of the inventive method or steps thereof.);
and a processor configured to (Cartwright Paragraph 23 - For example, embodiments of the inventive system can be or include a programmable general purpose processor, digital signal processor, GPU, or microprocessor, programmed with software or firmware and/or otherwise configured to perform any of a variety of operations on data, including an embodiment of the inventive method or steps thereof. Such a general purpose processor may be or include a computer system including an input device, a memory, and a processing subsystem that is programmed (and/or otherwise configured) to perform an embodiment of the inventive method (or steps thereof) in response to data asserted thereto.)
define a model representative of a plurality of acoustic variations to a speech signal, thus defining a plurality of time-varying spectral modifications (Cartwright Paragraph 180 - Variable spectrum semi-stationary noise: For example, select a random SNR (as for the fixed spectrum stationary noise example), and also select a random stationary noise spectrum from a distribution (for example, a distribution of linear slope values in dB/octave, or a distribution over DCT values of the log mel spectrum (cepstral)). Then, apply the noise at the chosen level (determined by the selected SNR value) with the selected shape. In some embodiments, the shape of the noise is varied slowly over time by, for example, choosing a rate of change for each cepstral value per second and using that to modulate the shape of the noise being applied (e.g., during one epoch, or in between performance of successive epochs). An example of variable spectrum semi stationary noise augmentation will be described with reference to FIG. 6.)
and wherein the processor is further configured to apply the plurality of time-varying spectral modifications to a reference signal using a filtering operation, thus generating a time-varying spectrally-augmented signal (Cartwright Paragraph 201 - Another embodiment of the invention, which includes microphone equalization augmentation, will be described with reference to FIG. 5. In the FIG. 5 example, training data (e.g., features 111B of FIG. 1B) are augmented (e.g., by function/unit 103B of FIG. 1B) by applying thereto, during each epoch of a training loop, a filter (e.g., a different filter for each different epoch) having a randomly chosen linear magnitude response. The characteristics of the filter (for each epoch) are determined from a randomly chosen microphone tilt (e.g., a tilt, in dB/octave, chosen from a normal distribution of microphone tilts),).

Regarding Claim 18, Cartwright teaches all of the limitations of claim 17. Cartwright also teaches defining a model representative of a plurality of acoustic variations to the speech signal associated with a change in a relative position of a speaker and a microphone (Cartwright Paragraphs 81-86 and 108 - "Typically, training data are gathered (e.g., for each zone) by having the user utter the wake word in the vicinity of the intended zone, for example at the center and extreme edges of a couch. Utterances may be repeated several times. The user then moves to the next zone and continues until all zones have been covered," para [0081]-[0086]; "The goal of predicting the acoustic zone (in which the user is located) may be to inform a microphone selection or adaptive beamforming scheme that attempts to pick up sound from the acoustic zone of the user more effectively, for example, in order to better recognize a command that follows the wake word," para [0108]),
and defining a model representative of a plurality of acoustic variations to the speech signal associated with adaptive beamforming (Cartwright Paragraphs 81-86 and 108 - "Typically, training data are gathered (e.g., for each zone) by having the user utter the wake word in the vicinity of the intended zone, for example at the center and extreme edges of a couch. Utterances may be repeated several times. The user then moves to the next zone and continues until all zones have been covered," para [0081]-[0086]; "The goal of predicting the acoustic zone (in which the user is located) may be to inform a microphone selection or adaptive beamforming scheme that attempts to pick up sound from the acoustic zone of the user more effectively, for example, in order to better recognize a command that follows the wake word," para [0108]).

Regarding Claim 19, Cartwright teaches all of the limitations of claim 17. Cartwright also teaches modeling the plurality of acoustic variations to the speech signal as a statistical distribution (Cartwright Paragraphs 178 and 179 - Examples of types of augmentations that may be applied (e.g., by augmentation function 103B of FIG. 1B on training data features) in accordance with embodiments of the invention include (but are not limited to) the following... Fixed spectrum stationary noise: For example, for each utterance in a training set (e.g., each utterance of or indicated by training set 110, and thus each utterance of or indicated by feature set 111B), select a random SNR from a distribution (e.g., normal distribution with mean 45 dB, and standard deviation 10 dB) of SNR values);
modeling the plurality of acoustic variations to the speech signal as a mathematical model representative of the plurality of acoustic variations associated with a particular use-case scenario (Cartwright Paragraphs 178 and 179 - Examples of types of augmentations that may be applied (e.g., by augmentation function 103B of FIG. 1B on training data features) in accordance with embodiments of the invention include (but are not limited to) the following... Fixed spectrum stationary noise: For example, for each utterance in a training set (e.g., each utterance of or indicated by training set 110, and thus each utterance of or indicated by feature set 111B), select a random SNR from a distribution (e.g., normal distribution with mean 45 dB, and standard deviation 10 dB) of SNR values);
and generating, via a machine learning model, a mapping of the plurality of acoustic variations to one or more feature coefficients of a target domain (Cartwright Paragraphs 178 and 179 - Examples of types of augmentations that may be applied (e.g., by augmentation function 103B of FIG. 1B on training data features) in accordance with embodiments of the invention include (but are not limited to) the following... Fixed spectrum stationary noise: For example, for each utterance in a training set (e.g., each utterance of or indicated by training set 110, and thus each utterance of or indicated by feature set 111B), select a random SNR from a distribution (e.g., normal distribution with mean 45 dB, and standard deviation 10 dB) of SNR values).

Regarding Claim 20, Cartwright teaches all of the limitations of claim 17. Cartwright also teaches that defining the model representative of the plurality of acoustic variations to the speech signal includes receiving one or more inputs associated with one or more of speaker location and speaker orientation (Cartwright Paragraph 81 - Typically, training data are gathered (e.g., for each zone) by having the user utter the wake word in the vicinity of the intended zone, for example at the center and extreme edges of a couch. Utterances may be repeated several times. The user then moves to the next zone and continues until all zones have been covered).

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claim(s) 5, 13, and 21 is/are rejected under 35 U.S.C. 103 as being unpatentable over Cartwright in view of Fernandez et al. (US 20190272818 A1) (Further referred to as “Fernandez”).

Regarding Claim 5, Cartwright teaches all of the limitations of claim 1. Fernandez further teaches that applying the plurality of time-varying spectral modifications to the reference signal using the filtering operation includes one or more of: applying the plurality of time-varying spectral modifications to the reference signal using a plurality of time-varying parameters in time domain filtering (Fernandez Paragraph 103 - In one example, the process and computer program start at block 1100 and thereafter proceeds to block 1102. Block 1102 illustrates reconstructing the audio signal from the parametric representation by stacking together consecutive voiced frames to form contiguous voiced regions. Next, a phase is performed to synthesize as voiced region, as illustrated at reference numeral 1104. In the phase synthesis phase, block 1106 illustrates generating a sequence of consecutive pitch cycle onsets according to a desired synthesis pitch contour. Next, block 1108 illustrates generating a sequence of glottal pulses, with each pulse multiplied by its corresponding gain factor. Thereafter, block 1110 illustrates adding aspiration noise constructed for the entire voiced region by amplitude modulation of a 500 Hz high passed Gaussian noise signal. Next, block 1112 illustrates converting the LSF parameters associated with each pitch cycle into auto-regression filter coefficients. Thereafter, block 1114 illustrates performing time-varying filtering of the glottal source with auto-regressive vocal tract coefficients. Next, block 1116 illustrates interleaving voiced and unvoiced regions using an overall-add procedure, and the process ends.); 
and applying the plurality of time-varying spectral modifications to the reference signal using a plurality of time-varying multiplication factors in frequency domain filtering (Fernandez Paragraph 48 - In one example, reconstruction controller 412 may resynthesize contiguous voiced frames together to form contiguous voiced regions. Reconstruction controller 412 may interleave the voiced regions with the unvoiced regions, which have been preserved in raw sample form. First, reconstruction controller 412 may synthesize a voiced region for selected prosodic features 420, such as pitch, by generating a sequence of consecutive pitch cycle onsets according to a desired synthesis pitch contour. In one example, reconstruction controller 412 may next generate a sequence of glottal pulse cycles, scaled by a gain factor, and with added aspiration noise to generate the glottal-source and vocal-tract parameters associated with each pitch cycle by interpolating between the corresponding parameters associated with the cycle's surrounding edge frames. In addition, restriction controller 412 may then synthesize a voiced region by next generating a sequence of glottal pulses with each pulse multiplied by its corresponding gain factor. Restriction controller 412 may optionally synthesize a voiced region by next adding aspiration noise constructed for the entire voiced region by amplitude modulation of a 500-Hz-high-passed Gaussian noise signal. Restriction controller 412 may synthesize a voiced region by next converting the LSF parameters associated with each pitch cycle into auto-regression filter coefficients. Finally, restriction controller 412 may synthesize a voiced region by next interleaving voice and unvoiced regions using an overall-add procedure.).
Cartwright and Fernandez are both considered to be analogous to the claimed invention because both are directed to systems and methods for training a machine learning model for speech detection. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the machine learning speech recognition system of Cartwright with Fernandez's support for applying time-varying factors using multiplication factors, because doing so would allow for perturbing audio data for machine learning (Fernandez Paragraphs 48 and 103 - In addition, restriction controller 412 may then synthesize a voiced region by next generating a sequence of glottal pulses with each pulse multiplied by its corresponding gain factor. Restriction controller 412 may optionally synthesize a voiced region by next adding aspiration noise constructed for the entire voiced region by amplitude modulation of a 500-Hz-high-passed Gaussian noise signal. Restriction controller 412 may synthesize a voiced region by next converting the LSF parameters associated with each pitch cycle into auto-regression filter coefficients. Next, block 1108 illustrates generating a sequence of glottal pulses, with each pulse multiplied by its corresponding gain factor. Thereafter, block 1110 illustrates adding aspiration noise constructed for the entire voiced region by amplitude modulation of a 500 Hz high passed Gaussian noise signal. Next, block 1112 illustrates converting the LSF parameters associated with each pitch cycle into auto-regression filter coefficients.).

Regarding Claim 13, Cartwright teaches all of the limitations of claim 9. Fernandez further teaches that applying the plurality of time-varying spectral modifications to the reference signal using the filtering operation includes one or more of: applying the plurality of time-varying spectral modifications to the reference signal using a plurality of time-varying parameters in time domain filtering (Fernandez Paragraph 103 - In one example, the process and computer program start at block 1100 and thereafter proceeds to block 1102. Block 1102 illustrates reconstructing the audio signal from the parametric representation by stacking together consecutive voiced frames to form contiguous voiced regions. Next, a phase is performed to synthesize as voiced region, as illustrated at reference numeral 1104. In the phase synthesis phase, block 1106 illustrates generating a sequence of consecutive pitch cycle onsets according to a desired synthesis pitch contour. Next, block 1108 illustrates generating a sequence of glottal pulses, with each pulse multiplied by its corresponding gain factor. Thereafter, block 1110 illustrates adding aspiration noise constructed for the entire voiced region by amplitude modulation of a 500 Hz high passed Gaussian noise signal. Next, block 1112 illustrates converting the LSF parameters associated with each pitch cycle into auto-regression filter coefficients. Thereafter, block 1114 illustrates performing time-varying filtering of the glottal source with auto-regressive vocal tract coefficients. Next, block 1116 illustrates interleaving voiced and unvoiced regions using an overall-add procedure, and the process ends.); 
and applying the plurality of time-varying spectral modifications to the reference signal using a plurality of time-varying multiplication factors in frequency domain filtering (Fernandez Paragraph 48 - In one example, reconstruction controller 412 may resynthesize contiguous voiced frames together to form contiguous voiced regions. Reconstruction controller 412 may interleave the voiced regions with the unvoiced regions, which have been preserved in raw sample form. First, reconstruction controller 412 may synthesize a voiced region for selected prosodic features 420, such as pitch, by generating a sequence of consecutive pitch cycle onsets according to a desired synthesis pitch contour. In one example, reconstruction controller 412 may next generate a sequence of glottal pulse cycles, scaled by a gain factor, and with added aspiration noise to generate the glottal-source and vocal-tract parameters associated with each pitch cycle by interpolating between the corresponding parameters associated with the cycle's surrounding edge frames. In addition, restriction controller 412 may then synthesize a voiced region by next generating a sequence of glottal pulses with each pulse multiplied by its corresponding gain factor. Restriction controller 412 may optionally synthesize a voiced region by next adding aspiration noise constructed for the entire voiced region by amplitude modulation of a 500-Hz-high-passed Gaussian noise signal. Restriction controller 412 may synthesize a voiced region by next converting the LSF parameters associated with each pitch cycle into auto-regression filter coefficients. Finally, restriction controller 412 may synthesize a voiced region by next interleaving voice and unvoiced regions using an overall-add procedure.).
Cartwright and Fernandez are both considered to be analogous to the claimed invention because both are directed to systems and methods for training a machine learning model for speech detection. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the machine learning speech recognition system of Cartwright with Fernandez's support for applying time-varying factors using multiplication factors, because doing so would allow for perturbing audio data for machine learning (Fernandez Paragraphs 48 and 103 - In addition, restriction controller 412 may then synthesize a voiced region by next generating a sequence of glottal pulses with each pulse multiplied by its corresponding gain factor. Restriction controller 412 may optionally synthesize a voiced region by next adding aspiration noise constructed for the entire voiced region by amplitude modulation of a 500-Hz-high-passed Gaussian noise signal. Restriction controller 412 may synthesize a voiced region by next converting the LSF parameters associated with each pitch cycle into auto-regression filter coefficients. Next, block 1108 illustrates generating a sequence of glottal pulses, with each pulse multiplied by its corresponding gain factor. Thereafter, block 1110 illustrates adding aspiration noise constructed for the entire voiced region by amplitude modulation of a 500 Hz high passed Gaussian noise signal. Next, block 1112 illustrates converting the LSF parameters associated with each pitch cycle into auto-regression filter coefficients.).

Regarding Claim 21, Cartwright teaches all of the limitations of claim 17. Fernandez further teaches that applying the plurality of time-varying spectral modifications to the reference signal using the filtering operation includes one or more of: applying the plurality of time-varying spectral modifications to the reference signal using a plurality of time-varying parameters in time domain filtering (Fernandez Paragraph 103 - In one example, the process and computer program start at block 1100 and thereafter proceeds to block 1102. Block 1102 illustrates reconstructing the audio signal from the parametric representation by stacking together consecutive voiced frames to form contiguous voiced regions. Next, a phase is performed to synthesize as voiced region, as illustrated at reference numeral 1104. In the phase synthesis phase, block 1106 illustrates generating a sequence of consecutive pitch cycle onsets according to a desired synthesis pitch contour. Next, block 1108 illustrates generating a sequence of glottal pulses, with each pulse multiplied by its corresponding gain factor. Thereafter, block 1110 illustrates adding aspiration noise constructed for the entire voiced region by amplitude modulation of a 500 Hz high passed Gaussian noise signal. Next, block 1112 illustrates converting the LSF parameters associated with each pitch cycle into auto-regression filter coefficients. Thereafter, block 1114 illustrates performing time-varying filtering of the glottal source with auto-regressive vocal tract coefficients. Next, block 1116 illustrates interleaving voiced and unvoiced regions using an overall-add procedure, and the process ends.); 
and applying the plurality of time-varying spectral modifications to the reference signal using a plurality of time-varying multiplication factors in frequency domain filtering (Fernandez Paragraph 48 - In one example, reconstruction controller 412 may resynthesize contiguous voiced frames together to form contiguous voiced regions. Reconstruction controller 412 may interleave the voiced regions with the unvoiced regions, which have been preserved in raw sample form. First, reconstruction controller 412 may synthesize a voiced region for selected prosodic features 420, such as pitch, by generating a sequence of consecutive pitch cycle onsets according to a desired synthesis pitch contour. In one example, reconstruction controller 412 may next generate a sequence of glottal pulse cycles, scaled by a gain factor, and with added aspiration noise to generate the glottal-source and vocal-tract parameters associated with each pitch cycle by interpolating between the corresponding parameters associated with the cycle's surrounding edge frames. In addition, restriction controller 412 may then synthesize a voiced region by next generating a sequence of glottal pulses with each pulse multiplied by its corresponding gain factor. Restriction controller 412 may optionally synthesize a voiced region by next adding aspiration noise constructed for the entire voiced region by amplitude modulation of a 500-Hz-high-passed Gaussian noise signal. Restriction controller 412 may synthesize a voiced region by next converting the LSF parameters associated with each pitch cycle into auto-regression filter coefficients. Finally, restriction controller 412 may synthesize a voiced region by next interleaving voice and unvoiced regions using an overall-add procedure.).
Cartwright and Fernandez are both considered to be analogous to the claimed invention because both are directed to systems and methods for training a machine learning model for speech detection. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the machine learning speech recognition system of Cartwright with Fernandez's support for applying time-varying factors using multiplication factors, because doing so would allow for perturbing audio data for machine learning (Fernandez Paragraphs 48 and 103 - In addition, restriction controller 412 may then synthesize a voiced region by next generating a sequence of glottal pulses with each pulse multiplied by its corresponding gain factor. Restriction controller 412 may optionally synthesize a voiced region by next adding aspiration noise constructed for the entire voiced region by amplitude modulation of a 500-Hz-high-passed Gaussian noise signal. Restriction controller 412 may synthesize a voiced region by next converting the LSF parameters associated with each pitch cycle into auto-regression filter coefficients. Next, block 1108 illustrates generating a sequence of glottal pulses, with each pulse multiplied by its corresponding gain factor. Thereafter, block 1110 illustrates adding aspiration noise constructed for the entire voiced region by amplitude modulation of a 500 Hz high passed Gaussian noise signal. Next, block 1112 illustrates converting the LSF parameters associated with each pitch cycle into auto-regression filter coefficients.).

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure: Every et al. (US 9830899 B1) and Every et al. (US 9502048 B2).
Every et al. (US 9830899 B1) discloses “systems and methods for controlling adaptivity of noise cancellation” (Every – Abstract).
Every et al. (US 9502048 B2) discloses “adaptive noise reduction of an acoustic signal using a sophisticated level of control to balance the tradeoff between speech loss distortion and noise reduction” (Every – Abstract).
Please see the additional references listed in form PTO-892 for more details.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to UTHEJ KUNAMNENI whose telephone number is (571)272-5428. The examiner can normally be reached M-F 8:00-5:00.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Bhavesh Mehta, can be reached at (571) 272-7453. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/UTHEJ KUNAMNENI/               Examiner, Art Unit 2656   

/BHAVESH M MEHTA/               Supervisory Patent Examiner, Art Unit 2656