DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Oath/Declaration
Examiner notes that Applicant has not submitted an Oath/Declaration.
Information Disclosure Statement
The information disclosure statement (IDS) submitted on 2022-05-12 is in compliance with the provisions of 37 CFR 1.97.  Accordingly, the information disclosure statement is being considered by the examiner.

Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection.  Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114.  Applicant's submission filed on 2022-09-01 has been entered. The status of the claims is as follows:
Claims 1-20 remain pending in the application.
Claims 1, 8, and 14 are amended.
Response to Arguments
Applicant's arguments in response to rejections under 35 USC 103 have been fully considered but they are not persuasive. 
Applicant argues on Remarks Page 12 that “In particular, Bowman does not disclose a separate transition network for generating a prior distribution by inputting a value generated from a previous latent distribution.”  In response to applicant's arguments against the references individually, one cannot show nonobviousness by attacking references individually where the rejections are based on combinations of references.  See In re Keller, 642 F.2d 413, 208 USPQ 871 (CCPA 1981); In re Merck & Co., 800 F.2d 1091, 231 USPQ 375 (Fed. Cir. 1986).  In this case, Applicant has argued that Bowman does not teach a separate transition network.  Examiner has not claimed that Bowman teaches this, but instead this was taught by Menick in a combination of references.
Applicant argues on Remarks Pages 13-14 that “Menick does not use an ordered sequence of observations to train parameters” and “The code c for the neighboring datum is randomly picked from the K nearest neighbors to c (the code for x) c, rather than from a previous latent distribution for a previous observation directly preceding the current observation in the sequence. In other words, c is not picked from a latent distribution generated from the data of the previous iteration of the training process.”  In response to applicant's arguments against the references individually, one cannot show nonobviousness by attacking references individually where the rejections are based on combinations of references.  See In re Keller, 642 F.2d 413, 208 USPQ 871 (CCPA 1981); In re Merck & Co., 800 F.2d 1091, 231 USPQ 375 (Fed. Cir. 1986).  Examiner points out that Bowman does teach an ordered sequence of observations to train parameters, and does generate a latent distribution based on directly preceding observations, as Bowman teaches predicting missing words in the current sentences based on the latent distributions of the immediately preceding sentences.  See Bowman, Abstract:  “By examining paths through this latent space, we are able to generate coherent novel sentences that interpolate between known sentences”, as well as Page 5 Section 5:  “We claim that the our VAE's global sentence features make it especially well suited to the task of imputing missing words in otherwise known sentences.”
Furthermore, Menick discloses data “from the preceding time step” going into the transition network in [0036]:  “determining a code for the observation of the preceding time step based on the parameters of the data-conditional probability distribution; providing the code as input to a prior neural network” and [0071]:  “Rather than having a constant prior probability distribution over the latent space, the prior neural network can generate a prior probability distribution that models the code for a given observation conditioned on the code for the preceding observation in the ordering.”
Nevertheless, the arguments against Menick are moot.  While, as stated above, Examiner believes the combination of Bowman and Menick still reads on the amended claims, Examiner believes that the Chung reference recited below is even stronger than Menick, and thus in the interest of compact prosecution, now relies on the combination of Bowman and Chung in this renewed round of Continued Examination.

Claim Rejections - 35 USC § 103
Claims 1-20 are rejected under 35 U.S.C. 103 as being unpatentable over Bowman et al. (“Generating Sentences from a Continuous Space”; hereinafter “Bowman”) in view of Chung et al. (“A Recurrent Latent Variable Model for Sequential Data”; hereinafter “Chung”).
As per Claim 1, Bowman teaches a method of training a recurrent machine-learned model having an encoder network and a decoder network the method comprising (Bowman, Abstract, discloses a recurrent model:  “The standard recurrent neural network language model (rnnlm) generates sentences one word at a time and does not work from an explicit global sentence representation. In this work, we introduce and study an rnn-based variational autoencoder generative model that incorporates distributed latent representations of entire sentences.”  Bowman, Introduction Para 3 on Page 1, discloses a method of training:  “Our contributions are as follows: We propose a variational autoencoder architecture for text and discuss some of the obstacles to training it as well as our proposed solutions.”  Bowman, Section 3 on Page 3, discloses the encoder and decoder:  “We adapt the variational autoencoder to text by using single-layer lstm rnns (Hochreiter and Schmidhuber, 1997) for both the encoder and the decoder, essentially forming a sequence autoencoder with the Gaussian prior acting as a regularizer on the hidden code.”)
obtaining a sequence of observations (Bowman, Section 3 on Page 3, discloses:  “We adapt the variational autoencoder to text by using single-layer lstm rnns (Hochreiter and Schmidhuber, 1997) for both the encoder and the decoder, essentially forming a sequence autoencoder with the Gaussian prior acting as a regularizer on the hidden code.”  Here, Bowman discloses a “sequence autoencoder”, in which next values of a sequence are predicted based on previous values of a sequence.  This is explicitly illustrated in Bowman Table 3:

    [Image: media_image1.png (Bowman, Table 3)]
)

for each observation in the sequence, repeatedly performing the steps of: 
generating a current latent distribution for a current observation by applying the encoder network to the current observation and values of the encoder network for one or more previous observations, the current latent distribution representing a distribution for a latent state of the current observation given a value of the current observation and a latent state for the one or more previous observations (Bowman, Section 3 on Page 3 as shown above, states that the authors “adapt the variational autoencoder”.  Bowman provides some background on this in Page 2 Section 2.1 and Section 2.2:

    [Image: media_image2.png (Bowman, Page 2, Sections 2.1 and 2.2)]

As shown in Section 2.1, “Phienc” is the encoder function, “x” is an observation, and “z” is a “learned code” (or, a “latent state”) of “x”.  In Section 2.2, it is disclosed that in the variational autoencoder, Phienc produces a probability distribution for “z” called q(z|x).  This may be called a “current latent distribution”, and it is generated by applying the encoder to the current observation.  It represents a distribution of the latent state “z”.  Bowman, Section 3, also discloses that they use “LSTM” for “both the encoder and the decoder”.  LSTM is a recurrent model, and therefore latent states from one or more previous observations are input to the encoder.  This is explicitly illustrated in Figure 1:

    [Image: media_image3.png (Bowman, Figure 1)]
)

generating a prior distribution [by inputting a value generated] from a previous latent distribution for at least a previous observation directly preceding the current observation in the sequence [directly to an input layer of a transition network], the prior distribution representing a distribution for the latent state of the current observation given the latent state for the one or more previous observations independent of the value of the current observation (Bowman, Page 2 Section 2.2, discloses:  “This model imposes a prior distribution on the hidden codes z which enforces a regular geometry over codes and makes it possible to draw proper samples from the model using ancestral sampling.”  Here Bowman discloses a prior distribution, and it is for an estimated latent state z.  This latent distribution z is from a previous observation directly preceding the current observation in the sequence, and therefore the generated prior distribution is for an estimated latent state for the one or more previous observations from a previous latent distribution.  The prior distribution represents a distribution of the latent state of the current observation (“z” is the latent state). As discussed above, the latent state of the current observation is partly based on the latent state of a previous observation directly preceding the current observation in the sequence.  This is shown by Bowman Figure 1, which shows a latent distribution z being generated for a sentence.  The latent state of this sentence is used to calculate the next sentence, or next words in the sentence, and is thus directly preceding the current observation in the sequence.  
For example, Bowman states on Page 1 Abstract:  “By examining paths through this latent space, we are able to generate coherent novel sentences that interpolate between known sentences.”  Bowman thus uses the latent states of the immediately preceding sentences in order to predict the missing words in the sentence of the current observation, as stated in Page 5 Section 5: “We claim that the our VAE's global sentence features make it especially well suited to the task of imputing missing words in otherwise known sentences.”  Because the latent state(s) of the one or more previous observations were calculated before the current observation was input, those latent state(s) are independent of the value of the current observation.)
generating an estimated latent state for the current observation from the current latent distribution (Bowman, Page 2 Section 2.2 Last Paragraph, discloses:  “We train our models with stochastic gradient descent, and at each gradient step we estimate the reconstruction cost using a single sample from q(z|x)”, where “q (z|x)” is a probability of “z given x”, wherein z is the latent state. A single sample from this is an estimated latent state from the current latent distribution “q (z|x)”).
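Examiner’s Note:  As a minimal sketch of what drawing “a single sample from q(z|x)” can look like, the Python below assumes that q(z|x) is a diagonal Gaussian whose mean and log-variance were output by an encoder.  The function name, the parameter values, and the use of the reparameterized form z = mu + sigma * eps are illustrative assumptions, not details taken from Bowman.

```python
import numpy as np

def sample_latent(mu, log_var, rng):
    """Draw one sample z ~ N(mu, diag(exp(log_var))) as z = mu + sigma * eps."""
    eps = rng.standard_normal(mu.shape)      # eps ~ N(0, I)
    return mu + np.exp(0.5 * log_var) * eps  # shift and scale the noise

# Hypothetical encoder outputs for one observation (illustrative values only).
mu = np.array([0.5, -1.0, 2.0])
log_var = np.array([0.0, -2.0, 0.5])

rng = np.random.default_rng(0)
z = sample_latent(mu, log_var, rng)  # a single sample: the estimated latent state
```

The single draw z plays the role of the estimated latent state; a different seed would yield a different estimate from the same distribution.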
generating a predicted likelihood for observing a subsequent observation that comes after the current observation in the sequence given the latent state for the current observation by applying the decoder network to the estimated latent state for the current observation (Bowman, Page 2 Section 2.1 Para 2, discloses:  “a probabilistic decoder model p(x|z =Phienc(x)), and maximizes the likelihood of an example x conditioned on z, the learned code for x.”  Here, the decoder generates a predicted likelihood of the next observation, based on the estimated latent state z.  As shown above, Bowman discloses using previous sentences to predict the subsequent sentence, or subsequent missing words in a sentence, as also shown in Bowman Page 5 Table 3 below:

    [Image: media_image4.png (Bowman, Page 5, Table 3)]
)
and determining a loss for the current observation including a combination of a prediction loss and a divergence loss, the prediction loss indicating a difference between the predicted likelihood and the subsequent observation, and the divergence loss indicating a measure of difference between the current latent distribution and the prior distribution (Bowman, Page 2 Section 2.2 Para 3, discloses:

    [Image: media_image5.png (Bowman, Page 2, Section 2.2: objective function, Equation 1)]

Here, Bowman discloses an “objective function”, which is simply the negative of a “loss function” (the signs are switched).  This includes a divergence loss (the first term in the equation, with “KL”), which is a KL divergence between the latent distribution q(z | x) and the prior distribution p(z).  This also includes a prediction loss (the second term in the equation), which indicates a difference between the predicted likelihood and the subsequent observation.  The second term in this equation is an expected negative reconstruction error, the error being between the predicted likelihood and the subsequent observation.  Examiner’s Note:  Further confirmation that the second term in the equation is an “expected negative reconstruction error” can be found in the work that Bowman cites for this objective, “Auto-Encoding Variational Bayes” by Kingma et al. on Page 4:  “The first term is (the KL divergence of the approximate posterior from the prior) acts as a regularizer, while the second term is a an expected negative reconstruction error.”)
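Examiner’s Note:  The sign relationship between an objective and a loss can be sketched in Python as follows.  The sketch assumes a diagonal Gaussian q(z|x) and a standard normal prior p(z), for which the KL divergence has a well-known closed form; the function names and numeric values are hypothetical, and log_likelihood stands in for a single-sample estimate of log p(x|z).

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) )."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)

def objective(mu, log_var, log_likelihood):
    """Quantity to maximize: -KL + reconstruction log-likelihood."""
    return -kl_to_standard_normal(mu, log_var) + log_likelihood

def loss(mu, log_var, log_likelihood):
    """Quantity to minimize: divergence loss + prediction loss (negated objective)."""
    return kl_to_standard_normal(mu, log_var) - log_likelihood

# Illustrative values only.
mu = np.array([1.0, 0.0])
log_var = np.zeros(2)
log_likelihood = -3.0

loss_value = loss(mu, log_var, log_likelihood)            # 0.5 (KL) + 3.0 (prediction)
objective_value = objective(mu, log_var, log_likelihood)  # same magnitude, opposite sign
```

Maximizing objective_value and minimizing loss_value are therefore the same optimization problem.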
	backpropagating one or more error terms from the loss function to update parameters of the encoder network and the decoder network (Bowman, Page 2 Section 2.2 concludes:  “We train our models with stochastic gradient descent, and at each gradient step we estimate the reconstruction cost using a single sample from q(z|x), but compute the kl divergence term of the cost function in closed form, again following Kingma and Welling (2015).” Here, Bowman discloses “stochastic gradient descent”, which, when applied to neural networks such as Bowman's LSTMs, updates parameters using gradients computed by backpropagation.  Bowman as shown above disclosed a loss function, and has disclosed that their network comprises an encoder and decoder network.)
	However, Bowman does not explicitly teach the machine-learned model having a transition network; that the generating a prior distribution is by inputting a value generated from a previous latent distribution for at least a previous observation directly preceding the current observation in the sequence directly to an input layer of a transition network; and backpropagating one or more error terms from the loss function to the transition network.
	Chung teaches the machine-learned model having a transition network;  generating a prior distribution by inputting a value generated from a previous latent distribution for at least a previous observation directly preceding the current observation in the sequence directly to an input layer of a transition network (Chung, Page 4, discloses:  “The VRNN contains a VAE at every timestep. However, these VAEs are conditioned on the state variable ht-1 of an RNN. This addition will help the VAE to take into account the temporal structure of the sequential data. Unlike a standard VAE, the prior on the latent random variable is no longer a standard Gaussian distribution, but follows the distribution

    [Image: media_image6.png (Chung, Page 4: equation for the conditional prior distribution)]

where μ0,t and σ0,t denote the parameters of the conditional prior distribution. Moreover, the generating distribution will not only be conditioned on zt but also on ht-1 such that:

    [Image: media_image7.png (Chung, Page 4: equation for the generating distribution)]

where μx,t and σx,t denote the parameters of the generating distribution, ϕtprior and ϕtdec can be any highly flexible function such as neural networks. ϕtx and ϕtz can also be neural networks, which extract features from xt and zt, respectively. We found that these feature extractors are crucial for learning complex sequences.”  Here, Chung discloses a transition network (“ϕtprior… can be any highly flexible function such as neural networks”) from which is generated a prior distribution (“ϕtprior”).  Input directly into this transition network (“ϕtprior(ht-1)”) is a value generated from a previous latent distribution (“ht-1”, a hidden, or “latent” state) for at least a previous observation directly preceding the current observation in the sequence (“t-1”)).
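Examiner’s Note:  A network that maps the previous state directly to the parameters of a prior distribution can be sketched in Python as follows.  Chung states only that the prior function “can be any highly flexible function such as neural networks”; the single tanh hidden layer, the softplus on the standard deviation, the parameter names, and the dimensions below are assumptions made for illustration.

```python
import numpy as np

def prior_network(h_prev, params):
    """Map the previous hidden state h_{t-1} to the prior parameters (mu, sigma)."""
    hidden = np.tanh(params["W1"] @ h_prev + params["b1"])
    mu = params["W_mu"] @ hidden + params["b_mu"]
    # Softplus keeps every standard deviation strictly positive.
    sigma = np.log1p(np.exp(params["W_sigma"] @ hidden + params["b_sigma"]))
    return mu, sigma

rng = np.random.default_rng(0)
h_dim, hid_dim, z_dim = 8, 16, 4  # hypothetical sizes
params = {
    "W1": rng.standard_normal((hid_dim, h_dim)) * 0.1,
    "b1": np.zeros(hid_dim),
    "W_mu": rng.standard_normal((z_dim, hid_dim)) * 0.1,
    "b_mu": np.zeros(z_dim),
    "W_sigma": rng.standard_normal((z_dim, hid_dim)) * 0.1,
    "b_sigma": np.zeros(z_dim),
}

h_prev = rng.standard_normal(h_dim)  # stands in for the state at time t-1
mu_prior, sigma_prior = prior_network(h_prev, params)
```

Note that prior_network receives only h_prev; the current observation is not an argument, so the resulting prior does not depend on the value of the current observation.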
	determining a loss function of the sequence of observations as a combination of the losses for each observation in the sequence (Bowman did not explicitly recite this, but Chung does on Page 5:

    [Image: media_image8.png (Chung, Page 5: objective function summed over timesteps t = 1 to T)]

Here, the loss function is over a sequence of observations as a combination of losses for each observation in the sequence, as evidenced by the capital Sigma summation sign from t = 1 to T.  Also note that this objective function is much like Bowman’s objective function.  It is a combination of KL divergence and prediction loss.) 
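Examiner’s Note:  The combination of per-observation losses into a single sequence loss can be sketched in Python as follows.  The sketch assumes diagonal Gaussian latent and prior distributions and takes the per-timestep log-likelihoods as given inputs; the function names and toy values are hypothetical and are not drawn from Chung.

```python
import numpy as np

def kl_diag_gaussians(mu_q, sigma_q, mu_p, sigma_p):
    """Closed-form KL( N(mu_q, diag(sigma_q^2)) || N(mu_p, diag(sigma_p^2)) )."""
    return np.sum(
        np.log(sigma_p / sigma_q)
        + (sigma_q**2 + (mu_q - mu_p) ** 2) / (2.0 * sigma_p**2)
        - 0.5
    )

def sequence_loss(posteriors, priors, log_likelihoods):
    """Sum, over timesteps, of (divergence loss + prediction loss)."""
    total = 0.0
    for (mu_q, sigma_q), (mu_p, sigma_p), log_lik in zip(
        posteriors, priors, log_likelihoods
    ):
        total += kl_diag_gaussians(mu_q, sigma_q, mu_p, sigma_p) - log_lik
    return total

# Toy three-step sequence in which each latent distribution equals its prior,
# so every KL term is zero and only the prediction terms contribute.
unit = (np.zeros(2), np.ones(2))
posteriors = [unit, unit, unit]
priors = [unit, unit, unit]
log_likelihoods = [-1.0, -2.0, -0.5]

total_loss = sequence_loss(posteriors, priors, log_likelihoods)
```

Each loop iteration corresponds to one term under the summation sign, so the sequence loss is the sum of the losses for the individual observations.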
backpropagating one or more error terms from the loss function to the transition network (Chung, Page 4, End of Top Paragraph, discloses:  “The inference model can then be trained through standard backpropagation technique for stochastic gradient descent.”  Chung, as shown above on Page 4, discloses that the transition network is a neural network with parameters:  “where μx,t and σx,t denote the parameters of the generating distribution, ϕtprior and ϕtdec can be any highly flexible function such as neural networks.”)
Bowman and Chung are analogous art because they are both in the field of endeavor of machine learning.
It would have been obvious before the effective filing date of the invention to combine the recurrent variational autoencoder with KL divergence from a prior distribution of Bowman with the variational recurrent neural network with KL divergence from a prior distribution generated via a neural network of Chung.  One would be motivated to do so in order to model temporal dependencies, which would allow one to achieve performance gains on predictions of sequential data (Chung Page 2:  “With these considerations in mind, we suggest that our model variability should induce temporal dependencies across timesteps… We demonstrate that for the speech modelling tasks, the VRNN-based models significantly outperform the RNN-based models and the VRNN model that does not integrate temporal dependencies between latent random variables.”)

	As per Claim 2, the combination of Bowman and Chung teaches the method of Claim 1.  Bowman teaches wherein the estimated latent state for the current observation is generated by sampling one or more values from the latent distribution for the current observation (Bowman, Page 2 Section 2.2 Last Paragraph, discloses:  “We train our models with stochastic gradient descent, and at each gradient step we estimate the reconstruction cost using a single sample from q(z|x)”, where “q (z|x)” is a probability of “z given x”, wherein z is the latent state. A “single sample” (sampling one or more values) from this is an estimated latent state from the current latent distribution “q (z|x)”).

As per Claim 3, the combination of Bowman and Chung teaches the method of Claim 2.  Bowman teaches wherein generating the predicted likelihood comprises generating one or more predicted likelihoods of observing the subsequent observation by applying the decoder network to the one or more sampled values from the latent distribution for the current observation (Bowman, Page 2 Section 2.1 Para 2, discloses:  “a probabilistic decoder model p(x|z =Phienc(x)), and maximizes the likelihood of an example x conditioned on z, the learned code for x.”  Here, the decoder generates (maximizes) a predicted likelihood of x, based on the latent state z, which is the “learned code” (“sampled value”) from the latent distribution for the current observation x.)

As per Claim 4, the combination of Bowman and Chung teaches the method of Claim 3.  Bowman teaches wherein the prediction loss is an expected value of the one or more predicted likelihoods (Bowman, Page 2 Section 2.2 Para 3, discloses the “E” term:

    [Image: media_image9.png (Bowman, Page 2, Section 2.2 Para 3: objective function with the expectation “E” term)]
)

As per Claim 5, the combination of Bowman and Chung teaches the method of Claim 1.  Bowman teaches wherein the divergence loss is a Kullback-Leibler divergence between the prior distribution and the current latent distribution (Bowman, Page 2 Section 2.2 Para 3, in Equation 1 shown in the screenshot above, discloses KL divergence between the latent distribution q(z | x) and the prior distribution p(z)).

As per Claim 6, the combination of Bowman and Chung teaches the method of Claim 1.  Bowman teaches wherein the current latent distribution is defined by a set of statistical parameters of a probability distribution, and wherein the encoder network is configured to output the set of statistical parameters. (Bowman provides some background on this in Page 2 in Section 2.1 and Section 2.2:

    [Image: media_image2.png (Bowman, Page 2, Sections 2.1 and 2.2)]

As shown in Section 2.1, “Phienc” is the encoder function, “x” is an observation, and “z” is a “learned code” (or, a “latent state”) of “x”.  In Section 2.2, it is disclosed that in the variational autoencoder, Phienc produces a probability distribution for “z” called q(z|x).  This may be called a “current latent distribution”, and it is generated by applying the encoder to the current observation.  It represents a distribution of the latent state “z” and is thus a latent distribution.  A probability distribution is defined by parameters (for example, mean and variance).  As Bowman stated above, this distribution (and thus its parameters) has been produced by the encoder network, so therefore the encoder network is configured to output the set of statistical parameters.) 

As per Claim 7, the combination of Bowman and Chung teaches the method of Claim 1.  Bowman teaches the value is sampled from the previous latent distribution (Bowman, Page 2 Section 2.2 Last Paragraph, discloses:  “We train our models with stochastic gradient descent, and at each gradient step we estimate the reconstruction cost using a single sample from q(z|x)”, where “q (z|x)” is a probability of “z given x”, wherein z is the latent state. A “single sample” (sampling one or more values) from this is an estimated latent state from the current latent distribution “q (z|x)”).
However, Bowman does not explicitly teach wherein the prior distribution is defined by a set of statistical parameters of a probability distribution, wherein the value is sampled from the previous latent distribution, and wherein generating the prior distribution comprises: applying the transition network to the value sampled from the previous latent distribution to generate one or more corresponding output values; estimating the set of statistical parameters for the prior distribution from the one or more output values.
Chung teaches wherein the prior distribution is defined by a set of statistical parameters of a probability distribution, wherein the value is sampled from the previous latent distribution, and wherein generating the prior distribution comprises: applying the transition network to the value sampled from the previous latent distribution to generate one or more corresponding output values; estimating the set of statistical parameters for the prior distribution from the one or more output values.  (Recall above that Bowman teaches a value sampled from the previous latent distribution. Chung, Page 4, discloses:  “The VRNN contains a VAE at every timestep. However, these VAEs are conditioned on the state variable ht-1 of an RNN. This addition will help the VAE to take into account the temporal structure of the sequential data. Unlike a standard VAE, the prior on the latent random variable is no longer a standard Gaussian distribution, but follows the distribution

    [Image: media_image6.png (Chung, Page 4: equation for the conditional prior distribution)]

where μ0,t and σ0,t denote the parameters of the conditional prior distribution. Moreover, the generating distribution will not only be conditioned on zt but also on ht-1 such that:

    [Image: media_image7.png (Chung, Page 4: equation for the generating distribution)]

where μx,t and σx,t denote the parameters of the generating distribution, ϕtprior and ϕtdec can be any highly flexible function such as neural networks. ϕtx and ϕtz can also be neural networks, which extract features from xt and zt, respectively. We found that these feature extractors are crucial for learning complex sequences.”  Here, Chung discloses wherein the prior distribution (“ϕtprior”) is defined by a set of statistical parameters of a probability distribution (“where μ0,t and σ0,t denote the parameters of the conditional prior distribution”).  
Examiner also notes that like Bowman shown above, Chung also discloses wherein the value is sampled from the previous latent distribution (“ht-1”, which is a hidden, or “latent” state). This is derived from a sample from the previous latent distribution, as Chung Page 3 discloses:  “Their model, called STORN, first generates a sequence of samples z = (z1, …, zT ) from the sequence of independent latent random variables. At each timestep, the transition function f from Eq. (1) computes the next hidden state ht based on the previous state ht-1, the previous output xt-1 and the sampled latent random variables zt.”  Here, Chung discloses a work that performs sampling, then states that they are basing their work on this:  “These approaches are closely related to the approach proposed in this paper. However, there is a major difference in how the prior distribution over the latent random variable is modelled. Unlike the aforementioned approaches, our approach makes the prior distribution of the latent random variable at timestep t dependent on all the preceding inputs via the RNN hidden state ht-1 (see Eq. (5))”  Therefore, Chung still discloses the sampling, but with some difference in how the distribution is modelled. 
and wherein generating the prior distribution comprises: applying the transition network to the value sampled from the previous latent distribution to generate one or more corresponding output values (As just shown above, both Bowman and Chung disclose a value sampled from the previous latent distribution.  Chung above discloses applying the transition network to a value (“ϕtprior(ht-1)”), and applying the network will generate an output value.  The value ht-1 is derived from the sampled value zt-1 (“computes the next hidden state ht based on … the sampled latent random variables zt.”) and therefore the combined function of the derivation of ht-1 and the prior generation neural network comprises the transition network, which is applied to the sampled value zt-1.)
estimating the set of statistical parameters for the prior distribution from the one or more output values (Chung above discloses:  “μ0,t and σ0,t denote the parameters of the conditional prior distribution” and “ϕtprior and ϕtdec can be any highly flexible function such as neural networks”.  Thus ϕtprior is a neural network that estimates the values of the parameters μ0,t and σ0,t)).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Bowman and Chung for at least the reasons recited in Claim 1.

As per Claim 8, Claim 8 is a non-transitory computer-readable medium claim corresponding to method claim 1.  The difference is that it recites a non-transitory computer-readable medium and a processor.  Chung, [0117], discloses:  “Embodiments of the subject matter described in this specification can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions encoded on a tangible non-transitory storage medium for execution by, or to control the operation of, data processing apparatus”.  Claim 8 is rejected for the same reasons as claim 1.

As per Claim 9, Claim 9 is a non-transitory computer-readable medium claim corresponding to method claim 2.  The difference is that it recites a non-transitory computer-readable medium and a processor.  Claim 9 is rejected for the same reasons as claim 2.

As per Claim 10, Claim 10 is a non-transitory computer-readable medium claim corresponding to method claim 4.  The difference is that it recites a non-transitory computer-readable medium and a processor.  Claim 10 is rejected for the same reasons as claim 4.

As per Claim 11, Claim 11 is a non-transitory computer-readable medium claim corresponding to method claim 5.  The difference is that it recites a non-transitory computer-readable medium and a processor.  Claim 11 is rejected for the same reasons as claim 5.

As per Claim 12, Claim 12 is a non-transitory computer-readable medium claim corresponding to method claim 6.  The difference is that it recites a non-transitory computer-readable medium and a processor.  Claim 12 is rejected for the same reasons as claim 6.

As per Claim 13, Claim 13 is a non-transitory computer-readable medium claim corresponding to method claim 7.  The difference is that it recites a non-transitory computer-readable medium and a processor.  Claim 13 is rejected for the same reasons as claim 7.

As per Claim 14, Claim 14 is a model claim corresponding to method claim 1.  The difference is that it recites a computer readable storage medium.  Chung, Page 8 Acknowledgements, discloses:  “The authors would like to thank the developers of Theano [1]. Also, the authors thank Kyunghyun Cho, Kelvin Xu and Sungjin Ahn for insightful comments and discussion. We acknowledge the support of the following agencies for research funding and computing support: Ubisoft, Nuance Foundation, NSERC, Calcul Québec, Compute Canada, the Canada Research Chairs and CIFAR.”  Here, Chung discloses the use of “computing support” as well as “Theano”, which is a CPU and GPU mathematical compiler.  Thus, Chung discloses the use of a model comprising a computer readable storage medium.  Claim 14 is rejected for the same reasons as claim 1.

As per Claim 15, Claim 15 is a model claim corresponding to method claim 2.  The difference is that it recites a computer readable storage medium.  Claim 15 is rejected for the same reasons as claim 2.

As per Claim 16, Claim 16 is a model claim corresponding to method claim 3.  The difference is that it recites a computer readable storage medium.  Claim 16 is rejected for the same reasons as claim 3.

As per Claim 17, Claim 17 is a model claim corresponding to method claim 4.  The difference is that it recites a computer readable storage medium.  Claim 17 is rejected for the same reasons as claim 4.

As per Claim 18, Claim 18 is a model claim corresponding to method claim 5.  The difference is that it recites a computer readable storage medium.  Claim 18 is rejected for the same reasons as claim 5.

As per Claim 19, Claim 19 is a model claim corresponding to method claim 6.  The difference is that it recites a computer readable storage medium.  Claim 19 is rejected for the same reasons as claim 6.

As per Claim 20, Claim 20 is a model claim corresponding to method claim 7.  The difference is that it recites a computer readable storage medium.  Claim 20 is rejected for the same reasons as claim 7.

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. 
Le et al. (“Variational Memory Encoder-Decoder”) is based on the work of cited prior art reference Chung, and Page 3 Section 3 discloses:  “With an external memory module, VMED explicitly models the dependencies between latent random variables across subsequent timesteps” and on Page 4 discloses:  “Armed with the prior, we follow a recurrent generative process by alternatively using the memory to compute the MoG and using latent variable z sampled from the MoG to update the memory and produce the output conditional distribution.”
Bayer et al. (“Learning Stochastic Recurrent Networks”) is the work that Chung is based on, and discloses sampling and prior distributions.
Serban et al. (“A Hierarchical Latent Variable Encoder-Decoder Model for Generating Dialogues”) discloses the use of encoder, decoder, latent variables, and sampling for sequential data (dialogues).
Kingma et al. (“Auto-Encoding Variational Bayes”) is cited by Chung for the KL divergence term of the cost function.
Wayne et al. (WO 2018/142378 A1) discloses a prior generation neural network and a decoder, with the previous time step being input to the prior network in [43]:  “the system 100 further provides latent variables 116 for the previous time step and/or an updated hidden state of a controller network 106 for the time step (both of which are described later) as input to the prior generation network 114” and sampling in [44]:  “The system determines the latent variables 116 for the time step by drawing a random sample from the prior distribution generated by the prior generation network 114 for the time step.”
Any inquiry concerning this communication or earlier communications from the examiner should be directed to LEONARD A SIEGER whose telephone number is (571)272-9710. The examiner can normally be reached M-F 8:00 am - 5:00 pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Ann Lo, can be reached at (571) 272-9767. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/L.A.S./Examiner, Art Unit 2126
/VIKER A LAMARDO/Primary Examiner, Art Unit 2126