Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Information Disclosure Statement
The information disclosure statement (IDS) submitted on August 22, 2019 is in compliance with the provisions of 37 CFR 1.97.  Accordingly, the information disclosure statement is being considered by the examiner.
Election/Restriction
Applicant’s election without traverse of claims 1-18 in the reply filed on September 08, 2021 is acknowledged.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-2, 5, and 8-9 are rejected under 35 U.S.C. 103 as being unpatentable over Fejgin (U.S. Patent Application Publication No. 20210082444) in view of Kamath Koteshwara (U.S. Patent No. 10127918).
Regarding claim 1, Fejgin teaches a method for training a machine learned model configured to determine one or more characteristics associated with an audio signal, comprising: obtaining an audio signal, the audio signal comprising an unlabeled audio signal ([0042] - The machine learning module 210 may, for example, receive the input audio signal 205 via an 
However, Fejgin does not teach the method for sampling the audio signal to select one or more sampled slices; and receiving, as an output of the machine learned model, one or more determined characteristics associated with the audio signal, the one or more determined characteristics comprising one or more reconstructed portions of the audio signal temporally adjacent to the one or more sampled slices or an estimated distance between two sampled slices.
Kamath Koteshwara does teach the method for sampling the audio signal to select one or more sampled slices ([Col 2 Row 37] - The audio data may include one or more audio samples); and receiving, as an output of the machine learned model, one or more determined characteristics associated with the audio signal, the one or more determined characteristics comprising one or more reconstructed portions of the audio signal temporally adjacent to the one or more sampled slices or an estimated distance between two sampled slices ([Col 6 Rows 40-56] – As the quantization processes and/or training data are different between reconstructing missing audio samples and clipped audio samples, a neural network trained to reconstruct missing audio samples will generate a different prediction than a neural network trained to reconstruct clipped audio sample. Thus, the signal reconstructor 330 may include a first neural network trained to generate forward-looking audio data predictions to reconstruct missing audio samples, a second neural network trained to generate forward-looking audio data predictions to reconstruct clipped audio samples, and a third neural network trained to generate backward-looking audio data predictions to reconstruct the clipped audio samples. The first 
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Fejgin to incorporate the teachings of Kamath Koteshwara in order to implement a method of sampling specific slices of the audio signal and a method of receiving characteristics of the reconstructed audio signal. Doing so enables the system to generate audio data using only the forward-looking neural network for low latency applications or to generate audio data using both neural networks for mid to high latency applications (Kamath Koteshwara Col 2 Rows 3-8).
	Regarding claim 2, Fejgin in view of Kamath Koteshwara teaches all of the limitations as in claim 1, above.
	However, Fejgin does not teach the method of claim 1, wherein the one or more sampled slices comprise a single sampled slice; and wherein the one or more determined characteristics comprise a reconstructed preceding portion of the audio signal temporally adjacent to the single sampled slice and a reconstructed successive portion of the audio signal temporally adjacent to the single sampled slice.
	Kamath Koteshwara does teach the method of claim 1, wherein the one or more sampled slices comprise a single sampled slice ([Col 2 Row 37] - The audio data may include one or more audio samples); and wherein the one or more determined characteristics comprise a reconstructed preceding portion of the audio signal temporally adjacent to the single sampled slice and a reconstructed successive portion of the audio signal temporally adjacent to the single sampled slice ([Col 6 Rows 40-56] – As the quantization processes and/or training data are different 
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Fejgin to incorporate the teachings of Kamath Koteshwara in order to implement a method of selecting a single sampled slice and a method wherein the one or more determined characteristics comprise a reconstructed preceding portion of the audio signal temporally adjacent to the single sampled slice. Doing so enables the system to generate audio data using only the forward-looking neural network for low latency applications or to generate audio data using both neural networks for mid to high latency applications (Kamath Koteshwara Col 2 Rows 3-8).
Regarding claim 5, Fejgin in view of Kamath Koteshwara teaches all of the limitations as in claim 1, above.
However, Fejgin does not teach the method of claim 1, wherein the one or more sampled slices comprise at least a first sampled slice and a second sampled slice separated by a temporal 
Kamath Koteshwara does teach the method of claim 1, wherein the one or more sampled slices comprise at least a first sampled slice and a second sampled slice separated by a temporal gap (Col 4 Rows 21-23 – For example, the first missing segment 214 may correspond to a single packet of data being lost while the second missing segment 216 may correspond to two or more packets of data being lost); and wherein the one or more determined characteristics associated with the audio signal comprise a reconstructed portion of the audio signal corresponding to at least a portion of the temporal gap (Col 4 Rows 24-58 – As illustrated in FIG. 2B, audio chart 220 illustrates input audio data 222 that includes a clipped segment 224 and a missing segment 226. The clipped segment 224 corresponds to a series of audio samples in the input audio data 222 having values equal to a saturation threshold associated with the microphone, which occurs when an output of the microphone is saturated due to a loud user utterance and/or a loud environment. The missing segment 226 corresponds to an ideal waveform that would have been captured by the microphone if the microphone were not saturated. In order to reconstruct the input audio data 222, the server(s) 120 may generate reconstructed audio data corresponding to the missing segment 226).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Fejgin to incorporate the teachings of Kamath Koteshwara in order to implement a method wherein the one or more sampled slices comprise at least a first sampled slice and a second sampled slice separated by a temporal gap; and wherein the one or more determined characteristics associated with the audio signal comprise a 
Regarding claim 8, Fejgin in view of Kamath Koteshwara teaches all of the limitations as in claim 5, above.
Fejgin teaches the method of claim 5, wherein the one or more corresponding ground truth characteristics of the audio signal comprise a ground truth portion of the audio signal corresponding to the reconstructed portion of the audio signal ([0044] - According to this example, the loss function generating module 225 receives the input audio signal 205 and uses the input audio signal 205 as the “ground truth” for error determination. However, in some alternative implementations the loss function generating module 225 may receive ground truth data from the optional ground truth module 220. Such implementations may, for example, involve tasks such as speech enhancement or speech denoising, in which the ground truth is not the original input audio signal. Whether the ground truth data is the input audio signal 205 or data that is received from the optional ground truth module, the loss function generating module 225 evaluates the output audio signal according to a loss function algorithm and the ground truth data, and provides a loss function value 230 to the machine learning module 210); and wherein the loss function comprises a mean-square error loss function determined based at least in part on a difference between the ground truth portion of the audio signal and the reconstructed portion of the audio signal ([0071] - In some examples, the training process may involve: receiving, by the neural network and via the interface system, an input training audio signal; generating, by the neural network and based on the input training audio signal, an encoded training audio signal; 
Regarding claim 9, Fejgin in view of Kamath Koteshwara teaches all of the limitations as in claim 1, above.
However, Fejgin does not teach the method of claim 1, wherein the one or more sampled slices comprise a first sampled slice and a second sampled slice separated by a temporal gap; and wherein the one or more determined characteristics comprise an estimated time distance between the first sampled slice and the second sampled slice.
Kamath Koteshwara does teach the method of claim 1, wherein the one or more sampled slices comprise a first sampled slice and a second sampled slice separated by a temporal gap (Col 
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Fejgin to incorporate the teachings of Kamath Koteshwara in order to implement a method wherein the one or more sampled slices comprise a first sampled slice and a second sampled slice separated by a temporal gap; and wherein the one or more determined characteristics comprise an estimated time distance between the first sampled slice and the second sampled slice. Doing so enables the system to generate audio data using only the forward-looking neural network for low latency applications or to generate audio data using both neural networks for mid to high latency applications (Kamath Koteshwara Col 2 Rows 3-8).
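For illustration only, the claim 9 limitation discussed above (a first and a second sampled slice separated by a temporal gap, with the gap being the quantity to be estimated) can be sketched as follows. This is a minimal sketch under stated assumptions and is not taken from the application or from any cited reference; the function name sample_two_slices and the parameters slice_len and max_gap are hypothetical.

```python
import random


def sample_two_slices(signal, slice_len, max_gap, rng):
    """Return two non-overlapping slices of `signal` and the gap between them.

    Hypothetical helper: the gap (in samples) is the value a model would be
    asked to estimate from the two slices alone.
    """
    gap = rng.randint(0, max_gap)
    last_start = len(signal) - (2 * slice_len + gap)
    if last_start < 0:
        raise ValueError("signal too short for the requested slices and gap")
    start = rng.randint(0, last_start)
    first = signal[start:start + slice_len]
    second_start = start + slice_len + gap
    second = signal[second_start:second_start + slice_len]
    return first, second, gap


# Usage: the samples equal their own indices, so the gap can be read back.
signal = list(range(100))
first, second, gap = sample_two_slices(signal, 10, 20, random.Random(0))
```

Because the gap is known at sampling time, no manual label is needed, which is consistent with the unlabeled audio signal recited in claim 1.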
Claim 6 is rejected under 35 U.S.C. 103 as being unpatentable over Fejgin in view of Kamath Koteshwara, and further in view of Atti (U.S. Patent Application Publication No. 20180204578).

However, Fejgin in view of Kamath Koteshwara does not teach the method of claim 5, wherein any two successive sampled slices of the one or more sampled slices comprise non-overlapping sampled slices separated by one or more temporal frames to reduce or eliminate leakage between the two successive sampled slices during training.
Atti does teach the method of claim 5, wherein any two successive sampled slices of the one or more sampled slices comprise non-overlapping sampled slices separated by one or more temporal frames to reduce or eliminate leakage between the two successive sampled slices during training ([0047] - Some encoders improve temporal alignment of two channels by shifting both channels. For example, a first channel may be causally shifted by half of the mismatch amount, and a second channel may be non-causally shifted by half of the mismatch amount, resulting in a temporal alignment of the two channels. However, proposed systems use only non-causal shifting of one channel to improve temporal alignment of the channels. For example, a target channel (e.g., a lagging channel), can be non-causally shifted in order to align the reference channel and the target channel. Since only the target channel is shifted to temporally align the channels, the target channel is shifted by a larger amount than it would be if both causal and non-causal shifts were used to align the channels. When one channel, i.e., the target channel, is the only channel shifted based on a determined mismatch value, a mid channel and a side channel (obtained from downmixing the first channel and the second channel) will demonstrate an increase in inter harmonic noise or spectral leakage. This inter harmonic noise (e.g., artifacts) is more dominant in the side channel, when window rotation (e.g., the amount of non-causal shift) is quite large (e.g., greater than 1-2 ms)).
Claim 7 is rejected under 35 U.S.C. 103 as being unpatentable over Fejgin in view of Kamath Koteshwara, further in view of Chang (U.S. Patent Application Publication No. 20210166705), and further in view of Salvi (U.S. Patent Application Publication No. 20190035113).
Regarding claim 7, Fejgin in view of Kamath Koteshwara teaches all of the limitations as in claim 5, above.
However, Fejgin in view of Kamath Koteshwara does not teach the method wherein the encoder network comprises a plurality of convolutional layers, a max pooling layer, and a fully connected layer; and wherein the decoder network comprises an identical copy of at least a subset of the plurality of convolutional layers arranged in a reverse order, with the max pooling layer replaced by a nearest-neighbor upsampling layer.
Chang does teach the method of claim 5, wherein the encoder network comprises a plurality of convolutional layers, a max pooling layer, and a fully connected layer ([0014] - The DNN generation model may be the CNN in a structure in which a convolutional layer performing encoding functionality and a deconvolutional layer performing decoding functionality are symmetrically provided. [0075] - which differs from a CNN classification 
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Fejgin in view of Kamath Koteshwara to incorporate the teachings of Chang in order to implement a method wherein the encoder network comprises a plurality of convolutional layers, a max pooling layer, and a fully connected layer. Doing so allows for improved speech call quality by extending a narrowband speech signal to a wideband speech signal (Chang [0001]).
However, Fejgin in view of Kamath Koteshwara in view of Chang does not teach the method wherein the decoder network comprises an identical copy of at least a subset of the plurality of convolutional layers arranged in a reverse order, with the max pooling layer replaced by a nearest-neighbor upsampling layer.
Salvi does teach the method wherein the decoder network comprises an identical copy of at least a subset of the plurality of convolutional layers arranged in a reverse order, with the max pooling layer replaced by a nearest-neighbor upsampling layer (Figure 2A – Nearest Upsampling Layer. [0046] – In an embodiment, strided convolutions are used instead of pooling. In an embodiment, each stage of the decoder portion of the encoder/decoder neural network model 110 uses a nearest upsampling layer followed by two convolutional layers).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Fejgin in view of Kamath Koteshwara in view of Chang to incorporate the teachings of Salvi in order to implement a method wherein the decoder network comprises an identical copy of at least a subset of the plurality of convolutional layers arranged in a reverse order, with the max pooling layer replaced by a nearest-neighbor .
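For illustration only, the substitution recited in claim 7 (a max pooling layer on the encoder side replaced by a nearest-neighbor upsampling layer on the decoder side) can be sketched on a one-dimensional sequence. This is a minimal sketch under stated assumptions and is not drawn from the application, Chang, or Salvi; the function names max_pool1d and nearest_upsample1d are hypothetical.

```python
def max_pool1d(xs, k):
    """Keep the largest value in each non-overlapping window of size k."""
    return [max(xs[i:i + k]) for i in range(0, len(xs) - k + 1, k)]


def nearest_upsample1d(xs, k):
    """Repeat each value k times (nearest-neighbor upsampling)."""
    return [x for x in xs for _ in range(k)]


# Usage: pooling halves the length, upsampling restores it.
pooled = max_pool1d([1, 3, 2, 5], 2)
restored = nearest_upsample1d(pooled, 2)
```

The upsampled sequence has the original length but not the original values, so the upsampling step restores resolution rather than inverting the pooling step.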
Claim 10 is rejected under 35 U.S.C. 103 as being unpatentable over Fejgin in view of Kamath Koteshwara in view of Chang, and further in view of Hedayatnia (U.S. Patent No. 11043214).
Regarding claim 10, Fejgin in view of Kamath Koteshwara teaches all of the limitations as in claim 9, above.
However, Fejgin in view of Kamath Koteshwara does not teach the method of claim 9, wherein the encoder network comprises a plurality of convolutional layers; wherein the first sampled slice and the second sampled slice are each input into the encoder network to determine a first embedding representation and a second embedding representation, respectively; wherein the first embedding representation and the second embedding representation are concatenated into a single vector; and wherein the single vector is input into a fully connected feed forward network to obtain a scalar output.
Chang does teach the method of claim 9, wherein the encoder network comprises a plurality of convolutional layers ([0014] - The DNN generation model may be the CNN in a structure in which a convolutional layer performing encoding functionality and a deconvolutional layer performing decoding functionality are symmetrically provided. [0075] - which differs from a CNN classification model generally including a convolutional layer, a pooling layer, and a fully connected layer (FCL)). 
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Fejgin in view of Kamath Koteshwara to 
However, Fejgin in view of Kamath Koteshwara in view of Chang does not teach the method wherein the first sampled slice and the second sampled slice are each input into the encoder network to determine a first embedding representation and a second embedding representation, respectively; wherein the first embedding representation and the second embedding representation are concatenated into a single vector; and wherein the single vector is input into a fully connected feed forward network to obtain a scalar output.
Hedayatnia does teach the method of claim 9, wherein the encoder network comprises a plurality of convolutional layers; wherein the first sampled slice and the second sampled slice are each input into the encoder network to determine a first embedding representation and a second embedding representation, respectively ([Col 5 Rows 19-27] - In another embodiment, the system(s) 120 may determine a first word embedding data vector corresponding to a first word of the system generated response and a second word embedding data vector corresponding to a first word of the previous utterance. [Col 38 Rows 36-38] – a neural network may include a number of layers, from input layer 1 1510 through output layer N 1520); wherein the first embedding representation and the second embedding representation are concatenated into a single vector ([Col 4 Rows 17-21] – Thus the first data may include multiple word embedding data vectors (explained further below), where each word embedding data vector represents one word of the previous utterance); and wherein the single vector is input into a fully connected feed forward network to obtain a scalar output ([Col 28 Rows 1-4] – The context encoder 950 may concatenate the metadata 1116 with the past user utterances 1112 and/or the past system 
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Fejgin in view of Kamath Koteshwara in view of Chang to incorporate the teachings of Hedayatnia in order to implement a method wherein the encoder network comprises a plurality of convolutional layers; wherein the first sampled slice and the second sampled slice are each input into the encoder network to determine a first embedding representation and a second embedding representation, respectively; wherein the first embedding representation and the second embedding representation are concatenated into a single vector; and wherein the single vector is input into a fully connected feed forward network to obtain a scalar output. Doing so allows the audio signal to be converted into text data, which may then be provided to various text-based software applications (Hedayatnia Col 1 Rows 16-18).
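For illustration only, the data flow recited in claim 10 (two slices passed through a shared encoder, the two embeddings concatenated into a single vector, and that vector passed through a fully connected layer to yield a scalar) can be sketched as follows. This is a minimal sketch under stated assumptions and is not taken from the application or from any cited reference: the encoder here is a single linear layer with a ReLU standing in for the recited convolutional layers, and the names encode and scalar_head, along with all weights, are hypothetical.

```python
def encode(slice_, weights):
    """Stand-in encoder: one linear layer followed by ReLU.

    Assumption: the claim recites convolutional layers; a linear layer is
    used here only to keep the sketch short.
    """
    return [max(0.0, sum(w * x for w, x in zip(row, slice_))) for row in weights]


def scalar_head(emb_a, emb_b, head_weights, bias):
    """Concatenate two embeddings and map them to a single scalar."""
    joined = emb_a + emb_b  # list concatenation into a single vector
    return sum(w * v for w, v in zip(head_weights, joined)) + bias


# Usage: the same encoder weights are applied to both slices.
enc_w = [[1.0, 0.0], [0.0, 1.0]]
emb_first = encode([1.0, -2.0], enc_w)
emb_second = encode([3.0, 4.0], enc_w)
output = scalar_head(emb_first, emb_second, [1.0, 1.0, 1.0, 1.0], 0.5)
```

Sharing the encoder weights across both slices means the two embeddings lie in the same space before concatenation.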
Claims 12-17 are rejected under 35 U.S.C. 103 as being unpatentable over Fejgin in view of Kamath Koteshwara, and further in view of Jia (U.S. Patent Application Publication No. 20210217404).
Regarding claim 12, Fejgin in view of Kamath Koteshwara teaches all of the limitations as in claim 1, above.
However, Fejgin in view of Kamath Koteshwara does not teach the method of claim 1, wherein sampling the audio signal to select one or more sampled slices comprises determining an audio spectrogram for each of the one or more sampled slices; and wherein inputting the one or 	
Jia does teach the method of claim 1, wherein sampling the audio signal to select one or more sampled slices comprises determining an audio spectrogram for each of the one or more sampled slices ([0016] - The spectrogram generation neural network may be a sequence-to-sequence attention neural network that is trained to predict mel spectrograms from a sequence of phoneme or grapheme inputs. The spectrogram generation neural network may optionally include an encoder neural network, an attention layer, and a decoder neural network); and wherein inputting the one or more sampled slices into the machine learned model comprises inputting the respective audio spectrogram for each of the one or more sampled slices into the machine learned model ([0002] - Neural networks are machine learning models that employ multiple layers of operations to predict one or more outputs from one or more inputs. [0005] - The embedding may be computed using an independently-trained speaker encoder network (also referred to herein as a speaker verification neural network, or as a speaker encoder) which encodes an arbitrary length speech spectrogram into a fixed dimensional embedding vector. [0016] - The spectrogram generation neural network may concatenate the speaker embedding vector with outputs of the encoder neural network that are provided as input to the attention layer. [0031] - The spectrogram generation engine 120 may receive input text to synthesize and receive the speaker vector determined by the speaker encoder engine 110 and, in response, generate an audio representation of speech of that input text in a voice of a target speaker. [0058] - In some implementations, the spectrogram generation neural network concatenates the speaker embedding vector with outputs of the encoder neural network that are provided as input to the attention layer).

Regarding claim 13, Fejgin in view of Kamath Koteshwara in view of Jia teaches all of the limitations as in claim 12, above.
However, Fejgin in view of Kamath Koteshwara does not teach the method of claim 12, wherein the one or more determined characteristics comprise the one or more portions of the audio signal; wherein receiving, as an output of the machine learned model, the one or more determined characteristics associated with the audio signal, comprises receiving, as an output of the machine learned model, a respective reconstructed audio spectrogram for each of the one or more sampled slices; and wherein the method further comprises determining a respective reconstructed portion of the audio signal for each of the reconstructed audio spectrograms.
Jia does teach the method of claim 12, wherein the one or more determined characteristics comprise the one or more portions of the audio signal; wherein receiving, as an output of the machine learned model, the one or more determined characteristics associated with the audio signal, comprises receiving, as an output of the machine learned model, a respective 

Regarding claim 14, Fejgin in view of Kamath Koteshwara teaches all of the limitations as in claim 1, above.
Fejgin teaches the method of claim 1, wherein the machine-learned model comprises a multi-head machine-learned model comprising the encoder network and a plurality of decoder networks; wherein each decoder network is configured to perform a different auxiliary task ([0042] - According to some examples, the elements of the system 200, including but not limited to the machine learning module 210, may be implemented via one or more control systems such as the control system 115. The machine learning module 210 may, for example, receive the input audio signal 205 via an interface system such as the interface system 110. In some instances, the machine learning module 210 may be configured to implement one or more neural networks, such as the neural networks disclosed herein. [0050] - Here, a first portion of the neural network 
However, Fejgin in view of Kamath Koteshwara does not teach the method wherein the one or more sampled slices are input into the encoder network to obtain one or more embeddings; wherein the one or more embeddings are input into each decoder network to obtain one or more respective determined characteristics associated with the audio signal for each different auxiliary task.
Jia does teach the method wherein the one or more sampled slices are input into the encoder network to obtain one or more embeddings ([0039] - The LSTM speaker encoder maps a sequence of mel spectrogram frames computed from a speech utterance of arbitrary length, to a fixed-dimensional embedding vector, known as a d-vector or speaker vector); wherein the one or more embeddings are input into each decoder network to obtain one or more respective determined characteristics associated with the audio signal for each different auxiliary task ([0038] - The LSTM speaker encoder is used to condition the synthesis network on a reference speech signal from the desired target speaker. Good generalization can be achieved using a reference speech signal which captures the characteristics of different speakers. Good generalization can lead to the identification of these characteristics using only a short adaptation signal, independent of its phonetic content and background noise. These objectives are satisfied using a speaker-discriminative model trained on a text-independent speaker verification task. The LSTM speaker encoder may be a speaker-discriminative audio embedding network, which is not limited to a closed set of speakers. [0045] - FIG. 2 is a block diagram of an example system 200 during training to synthesize speech. The example system 200 includes a speaker encoder 210, a synthesizer 220, and a vocoder 230. The synthesizer 220 includes a text encoder 222, an 
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Fejgin in view of Kamath Koteshwara to incorporate the teachings of Jia in order to implement the method wherein the one or more sampled slices are input into the encoder network to obtain one or more embeddings. Doing so allows a system to be able to leverage the knowledge of speaker variability learned by the speaker encoder in order to generalize well and synthesize natural speech from speakers that were never seen during training using only a few seconds of audio from each one (Jia [0005]).
Regarding claim 15, Fejgin teaches a computing system, comprising: at least one processor ([0042] - According to some examples, the elements of the system 200, including but not limited to the machine learning module 210, may be implemented via one or more control systems such as the control system 115. The machine learning module 210 may, for example, receive the input audio signal 205 via an interface system such as the interface system 110); a machine learned audio reconstruction model comprising at least one tangible, non-transitory 
However, Fejgin does not teach the computing system comprising: a machine learned audio reconstruction model comprising an encoder network, the encoder network comprising a plurality of convolutional layers, wherein the encoder network is trained to receive one or more sampled slices of an audio signal and output a respective embedding for each of the one or more sampled slices of the audio signal; and a decoder network, the decoder network comprising an identical copy of at least a subset of the plurality of convolutional layers arranged in a reverse order; wherein the decoder network is trained to receive the respective embedding for each of the one or more sampled slices of the audio signal and output one or more reconstructed portions of the audio signal; selecting the one or more sampled slices of the audio signal; inputting the one or more sampled slices of the audio signal into the encoder network of the machine learned model; wherein the one or more reconstructed portions of the audio signal correspond to one or more portions of the audio signal temporally adjacent to the one or more sampled slices of the audio signal; and receiving, as an output of the encoder network, the respective embedding for each of the one or more sampled slices of the audio signal; and inputting the respective embedding for each of the one or more sampled slices of the audio signal into the decoder network of the machine learned model; and receiving, as an output of the decoder network, the one or more reconstructed portions of the audio signal.
Kamath Koteshwara does teach the computing system comprising selecting the one or more sampled slices of the audio signal ([Col 2 Row 37] – The audio data may include one or more audio samples); inputting the one or more sampled slices of the audio signal into the encoder network ([Col 6 Rows 6-12] – Thus, the quantizer 320 may perform (324) nonuniform 
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Fejgin to incorporate the teachings of Kamath Koteshwara in order to implement the computing system comprising: selecting the one or more sampled slices of the audio signal; inputting the one or more sampled slices of the audio signal into the encoder network of the machine learned model; wherein the one or more reconstructed portions of the audio signal correspond to one or more portions of the audio signal temporally 
However, Fejgin in view of Kamath Koteshwara does not teach the computing system comprising: a machine learned audio reconstruction model comprising an encoder network, the encoder network comprising a plurality of convolutional layers, wherein the encoder network is trained to receive one or more sampled slices of an audio signal and output a respective embedding for each of the one or more sampled slices of the audio signal; and a decoder network, the decoder network comprising an identical copy of at least a subset of the plurality of convolutional layers arranged in a reverse order; wherein the decoder network is trained to receive the respective embedding for each of the one or more sampled slices of the audio signal and output one or more reconstructed portions of the audio signal; and receiving, as an output of the encoder network, the respective embedding for each of the one or more sampled slices of the audio signal; and inputting the respective embedding for each of the one or more sampled slices of the audio signal into the decoder network of the machine learned model; and receiving, as an output of the decoder network, the one or more reconstructed portions of the audio signal.
Jia does teach the computing system comprising: a machine learned audio reconstruction model comprising an encoder network, the encoder network comprising a plurality of convolutional layers, wherein the encoder network is trained to receive one or more sampled slices of an audio signal and output a respective embedding for each of the one or more sampled slices of the audio signal ([0002] - Neural networks are machine learning models that employ multiple layers of operations to predict one or more outputs from one or more inputs. [0039] - 
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Fejgin in view of Kamath Koteshwara to incorporate the teachings of Jia in order to implement the computing system comprising: a machine learned audio reconstruction model comprising an encoder network, the encoder network comprising a plurality of convolutional layers, wherein the encoder network is trained to receive one or more sampled slices of an audio signal and output a respective embedding for each of the one or more sampled slices of the audio signal; and a decoder network, the decoder network comprising an identical copy of at least a subset of the plurality of convolutional layers arranged in a reverse order; wherein the decoder network is trained to receive the respective embedding for each of the one or more sampled slices of the audio signal and output one or more reconstructed portions of the audio signal; and receiving, as an output of the encoder network, the respective embedding for each of the one or more sampled slices of the audio signal; and inputting the respective embedding for each of the one or more sampled slices of the audio signal into the decoder network of the machine learned model; and receiving, as an output of the decoder network, the one or more reconstructed portions of the audio signal. Doing so allows a 
Regarding claim 16, Fejgin in view of Kamath Koteshwara in view of Jia teaches all of the limitations as in claim 15, above.
Kamath Koteshwara teaches the computing system of claim 15, wherein the one or more sampled slices comprise a single sampled slice ([Col 2 Row 37] – The audio data may include one or more audio samples); and wherein the one or more reconstructed portions of the audio signal comprise a reconstructed preceding portion of the audio signal temporally adjacent to the single sampled slice and a reconstructed successive portion of the audio signal temporally adjacent to the single sampled slice ([Col 6 Rows 40-56] – As the quantization processes and/or training data are different between reconstructing missing audio samples and clipped audio samples, a neural network trained to reconstruct missing audio samples will generate a different prediction than a neural network trained to reconstruct clipped audio samples. Thus, the signal reconstructor 330 may include a first neural network trained to generate forward-looking audio data predictions to reconstruct missing audio samples, a second neural network trained to generate forward-looking audio data predictions to reconstruct clipped audio samples, and a third neural network trained to generate backward-looking audio data predictions to reconstruct the clipped audio samples. The first neural network and the second neural network may include identical components, but due to the differences in quantization processes and training data, the first neural network may generate different predictions than the second neural network).

Regarding claim 17, Fejgin in view of Kamath Koteshwara in view of Jia teaches all of the limitations as in claim 15, above.
Kamath Koteshwara teaches the computing system of claim 15, wherein the one or more sampled slices comprise at least a first sampled slice and a second sampled slice separated by a temporal gap ([Col 2 Row 37] – The audio data may include one or more audio samples); and wherein the one or more reconstructed portions of the audio signal comprise a reconstructed portion of the audio signal corresponding to at least a portion of the temporal gap ([Col 4 Rows 24-58] – As illustrated in FIG. 2B, audio chart 220 illustrates input audio data 222 that includes a clipped segment 224 and a missing segment 226. The clipped segment 224 corresponds to a series of audio samples in the input audio data 222 having values equal to a saturation threshold associated with the microphone, which occurs when an output of the microphone is saturated due to a loud user utterance and/or a loud environment. The missing segment 226 corresponds to an ideal waveform that would have been captured by the microphone if the microphone were not 
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Fejgin to incorporate the teachings of Kamath Koteshwara in order to implement the computing system, wherein the one or more sampled slices comprise at least a first sampled slice and a second sampled slice separated by a temporal gap; and wherein the one or more reconstructed portions of the audio signal comprise a reconstructed portion of the audio signal corresponding to at least a portion of the temporal gap. Doing so enables the system to generate audio data using only the forward-looking neural network for low latency applications or to generate audio data using both neural networks for mid to high latency applications (Kamath Koteshwara Col 2 Rows 3-8).
Claim 18 is rejected under 35 U.S.C. 103 as being unpatentable over Fejgin in view of Kamath Koteshwara in view of Jia, and further in view of Atti.
Regarding claim 18, Fejgin in view of Kamath Koteshwara in view of Jia teaches all of the limitations as in claim 15, above.
However, Fejgin in view of Kamath Koteshwara in view of Jia does not teach the computing system of claim 15, wherein the computing system comprises a mobile computing device.
Atti does teach the computing system of claim 15, wherein the computing system comprises a mobile computing device ([0118] - The device 600 may include a wireless telephone, a mobile communication device, a mobile phone, a smart phone, a cellular phone, a laptop computer, a desktop computer, a computer, a tablet computer, a set top box, a personal digital assistant (PDA), a display device, a television, a gaming console, a music player, a radio, 
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Fejgin in view of Kamath Koteshwara in view of Jia to incorporate the teachings of Atti in order to implement the computing system that comprises a mobile computing device. Doing so allows the device to be small, lightweight, and easily carried by a user (Atti [0003]).
Allowable Subject Matter
Claims 3, 4, and 11 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims. 
Regarding claim 3, the following is a statement of reasons for the indication of allowable subject matter:  The prior art does not anticipate or render obvious a decoder network in which the last convolutional layer has twice as many output channels as claimed.
The closest prior art, Chang (U.S. Patent No. 20210166705) fails to anticipate or render obvious the above described limitations. Chang does teach the method of claim 2, wherein the encoder network comprises a plurality of convolutional layers, a max pooling layer, and a fully connected layer; wherein the decoder network comprises an identical copy of at least a subset of the plurality of convolutional layers arranged in a reverse order.
Regarding claim 4, the following is a statement of reasons for the indication of allowable subject matter:  The prior art could not overcome or render obvious an average mean square error loss function determined based at least in part on a difference between the first ground truth 
The closest prior art, Fejgin (U.S. Patent No. 20210082444) fails to anticipate or render obvious the above described limitations. Fejgin does teach the method of claim 2, wherein the one or more corresponding ground truth characteristics of the audio signal comprise a first ground truth portion of the audio signal corresponding to the reconstructed preceding portion of the audio signal and a second ground truth portion of the audio signal corresponding to the reconstructed successive portion of the audio signal.
Regarding claim 11, the following is a statement of reasons for the indication of allowable subject matter:  The prior art does not anticipate or render obvious a loss function that comprises a cross-entropy loss between the ground truth temporal gap and the estimated time distance between the first sampled slice and the second sampled slice as claimed.
The closest prior art, Park (U.S. Patent No. 20210335381) fails to anticipate or render obvious the above described limitations. Park does teach the method of claim 9, wherein the loss function comprises a cross-entropy loss, but it does not disclose that the function is between a ground truth temporal gap and the estimated time distance between the first sampled slice and the second sampled slice.

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. 
Sung (U.S. Patent No. 20200234720) teaches an audio reconstruction method and device which use machine learning. Thagadur Shivappa (U.S. Patent No. 9905233) teaches methods and 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to ETHAN DANIEL KIM whose telephone number is (571) 272-1405.  The examiner can normally be reached on Monday - Friday 9:00 - 5:00.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Richemond Dorvil can be reached on (571) 272-7602. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR 
/ETHAN DANIEL KIM/
Examiner, Art Unit 2659   

/RICHEMOND DORVIL/
Supervisory Patent Examiner, Art Unit 2658