DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
This Action is non-final and is in response to the claims filed June 7, 2019. Claims 1-20 are currently pending, and all of claims 1-20 are rejected.
Information Disclosure Statement
The information disclosure statement (IDS) submitted on 2019-09-30 is in compliance with the provisions of 37 CFR 1.97.  Accordingly, the information disclosure statement is being considered by the examiner.
Oath/Declaration
Examiner notes that Applicant has not submitted an Oath/Declaration.

Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.


Claims 14-20 are rejected under 35 U.S.C. 101 because the claimed invention is directed to non-statutory subject matter.  The claims do not fall within at least one of the four categories of patent eligible subject matter because they are directed to “software per se”.  The claims recite “a recurrent machine-learned model stored on a computer readable storage medium”.  There is no language in the Specification that limits the “recurrent machine learned 
Claim Rejections - 35 USC § 103
Claims 1-20 are rejected under 35 U.S.C. 103 as being unpatentable over Bowman et al. (“Generating Sentences from a Continuous Space”; hereinafter “Bowman”) in view of Menick et al. (US 2021/0004677 A1; hereinafter “Menick”).
As per Claim 1, Bowman teaches a method of training a recurrent machine-learned model having an encoder network and a decoder network, the method comprising (Bowman, Abstract, discloses a recurrent model:  “The standard recurrent neural network language model (rnnlm) generates sentences one word at a time and does not work from an explicit global sentence representation. In this work, we introduce and study an rnn-based variational autoencoder generative model that incorporates distributed latent representations of entire sentences.”  Bowman, Introduction Para 3 on Page 1, discloses a method of training:  “Our contributions are as follows: We propose a variational autoencoder architecture for text and discuss some of the obstacles to training it as well as our proposed solutions.”  Bowman, Section 3 on Page 3, discloses the encoder and decoder:  “We adapt the variational autoencoder to text by using single-layer lstm rnns (Hochreiter and Schmidhuber, 1997) for both the encoder and the decoder, essentially forming a sequence autoencoder with the Gaussian prior acting as a regularizer on the hidden code.”)
obtaining a sequence of observations (Bowman, Section 3 on Page 3, discloses:  “We adapt the variational autoencoder to text by using single-layer lstm rnns (Hochreiter and Schmidhuber, 1997) for both the encoder and the decoder, essentially forming a sequence autoencoder with the Gaussian prior acting as a regularizer on the hidden code.”  Here, Bowman discloses a “sequence autoencoder”, in which next values of a sequence are predicted based on previous values of a sequence.  This is explicitly illustrated in Bowman Table 3:

[Image: media_image1.png, Bowman Table 3]
)

for each observation in the sequence, repeatedly performing the steps of: 
generating a current latent distribution for a current observation by applying the encoder network to the current observation and values of the encoder network for one or more previous observations, the current latent distribution representing a distribution for a (Bowman, Section 3 on Page 3 as shown above, states that the authors “adapt the variational autoencoder”.  Bowman provides some background on this in Page 2 Section 2.1 and Section 2.2:

[Image: media_image2.png, Bowman Page 2, Sections 2.1 and 2.2]

As shown in Section 2.1, “Phienc” is the encoder function, “x” is an observation, and “z” is a “learned code” (or, a “latent state”) of “x”.  In Section 2.2, it is disclosed that in the variational autoencoder, Phienc produces a probability distribution for “z” called q(z|x).  This may be called a “current latent distribution”, and it is generated by applying the encoder to the current observation.  It represents a distribution of the latent state “z”.  Bowman, Section 3, also discloses that they use “LSTM” for “both the encoder and the decoder”.  LSTM is a recurrent model, and therefore latent states from one or more previous observations are input to the encoder.  This is explicitly illustrated in Figure 1:

[Image: media_image3.png, Bowman Figure 1]
)

generating a prior distribution for an estimated latent state for the one or more previous observations generated from previous latent distributions for the one or more previous observations, the prior distribution representing a distribution for the latent state of the current observation given the latent state for the one or more previous observations independent of the value of the current observation (Bowman, Page 2 Section 2.2, discloses:  “This model imposes a prior distribution on the hidden codes z which enforces a regular geometry over codes and makes it possible to draw proper samples from the model using ancestral sampling.”  Here Bowman discloses a prior distribution, and it is for an estimated latent state z.  As shown above, previous observations are taken into account when generating z, and therefore the generated prior distribution is for an estimated latent state for the one or more previous observations from previous latent distributions.  The prior distribution represents a distribution of the latent state of the current observation (“z” is the latent state). As discussed above, the latent state of the current observation is partly based on the latent state of one or more previous observations.  Since the latent state(s) of one or more previous observations were calculated prior to input of the current observation, the latent state(s) of one or more previous observations are independent of the value of the current observation.)
generating an estimated latent state for the current observation from the current latent distribution (Bowman, Page 2 Section 2.2 Last Paragraph, discloses:  “We train our models with stochastic gradient descent, and at each gradient step we estimate the reconstruction cost using a single sample from q(z|x)”, where “q (z|x)” is a probability of “z given x”, wherein z is the latent state. A single sample from this is an estimated latent state from the current latent distribution “q (z|x)”).
generating a predicted likelihood for observing a subsequent observation given the latent state for the current observation by applying the decoder network to the estimated latent state for the current observation (Bowman, Page 2 Section 2.1 Para 2, discloses:  “a probabilistic decoder model p(x|z =Phienc(x)), and maximizes the likelihood of an example x conditioned on z, the learned code for x.”  Here, the decoder generates a predicted likelihood of the next observation, based on the estimated latent state z.)
and determining a loss for the current observation including a combination of a prediction loss and a divergence loss, the prediction loss increasing as the predicted likelihood for the subsequent observation decreases, and the divergence loss indicating a measure of (Bowman, Page 2 Section 2.2 and Para 3, discloses:

[Image: media_image4.png, Bowman Page 2, Section 2.2, objective function]

Here, Bowman discloses an “objective function”, which is the negative of a “loss function”.  When the signs of the terms are switched to form a “loss function”, the second term becomes negative, and therefore the loss increases as the Expectation (predicted likelihood) decreases.  The KL divergence is between the latent distribution q(z | x) and the prior distribution p(z)).
	backpropagating one or more error terms from the loss function to update parameters of the encoder network and the decoder network (Bowman, Page 2 Section 2.2 concludes:  “We train our models with stochastic gradient descent, and at each gradient step we estimate the reconstruction cost using a single sample from q(z|x), but compute the kl divergence term of the cost function in closed form, again following Kingma and Welling (2015).”  Here, Bowman discloses “stochastic gradient descent”, which in neural network training uses gradients obtained by backpropagation.  As shown above, Bowman discloses a loss function and discloses that the network comprises an encoder network and a decoder network.)
	However, Bowman does not explicitly teach the machine-learned model having a transition network; that the generating a prior distribution is by applying the transition network to an estimated latent state; and backpropagating one or more error terms from the loss function to the transition network.
	Menick teaches the machine-learned model having a transition network; generating a prior distribution by applying the transition network to an estimated latent state (Menick, [0049], discloses:  “The system described in this specification trains a prior neural network in tandem with an encoder neural network and a decoder neural network in a variational autoencoder framework.”  Here, Menick discloses that the model comprises, in addition to the encoder and decoder networks, also a “prior neural network”.   Menick, Para [0072], discloses:  “For each of the observations 114 in the current batch, the system 100 processes the neighboring code 128 for the observation using the prior neural network 106. The prior neural network 106 is configured to process the neighboring code to generate an output that includes the parameters of a prior probability distribution 124 that models the code for the observation. Like the encoding probability distribution 116, the prior probability distribution 124 is a probability distribution over the latent space.”  Here, Menick discloses that the “prior neural network” is a network that produces a “prior probability distribution”, which is analogous to the function of the “transition network” of the instant claim.  The “code” to which the prior neural network is applied may be considered an estimated latent state, as it comes from the output of the encoder, and the prior distribution is said to be “over the latent space”.  This is also shown well in Fig. 1:

[Image: media_image5.png, Menick Fig. 1]
)
determining a loss function of the sequence of observations as a combination of the losses for each observation in the sequence (Bowman did not explicitly recite this, but Menick does in [0088]:

[Image: media_image6.png, Menick [0088], loss function]

Here, the loss function is over a sequence of observations as a combination of losses for each observation in the sequence, as evidenced by the capital Sigma summation sign.  Also note that this loss function is much like Bowman’s objective function.  It is a divergence loss and prediction loss, wherein the loss increases as the predicted likelihood decreases.) 
backpropagating one or more error terms from the loss function to the transition network (Menick, [0089], discloses backpropagating for all 3 of the encoder, decoder, and prior (transition) networks:  “The system may determine the gradients of the loss function with respect to the parameters of the encoder neural network, the decoder neural network, and the prior neural network in any appropriate manner, for example, using backpropagation.”)

It would have been obvious before the effective filing date of the invention to combine the recurrent variational autoencoder with KL divergence from a prior distribution of Bowman with the variational autoencoder with KL divergence from a prior distribution generated via a “prior network” of Menick.  One would be motivated to do so in order to gain increased efficiency by reducing the cost of computations needed to generate the prior distribution, and to achieve better results because the prior distribution better reflects the surrounding observations.   The benefits of Menick’s approach are explained well in the accompanying paper that was included in the provisional application by Graves, Menick, and van den Oord (“Associative Compression Networks for Representation Learning”), Abstract:  (“This paper introduces Associative Compression Networks (ACNs), a new framework for variational autoencoding with neural networks. The system differs from existing variational autoencoders (VAEs) in that the prior distribution used to model each code is conditioned on a similar code from the dataset. In compression terms this equates to sequentially transmitting the dataset using an ordering determined by proximity in latent space. Since the prior need only account for local, rather than global variations in the latent space, the coding cost is greatly reduced, leading to rich, informative codes. Crucially, the codes remain informative when powerful, autoregressive decoders are used, which we argue is fundamentally difficult with normal VAEs.”)
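
For illustration only, the combined loss discussed for Claim 1 can be sketched as follows. This is a minimal sketch under stated assumptions, not code from Bowman, Menick, or the application: the function names `step_loss` and `sequence_loss` are hypothetical, and the per-observation log-likelihoods and KL divergence values are assumed to have been computed already.

```python
def step_loss(log_likelihood, kl_divergence):
    """Loss for one observation: prediction loss plus divergence loss.

    The prediction loss is taken here as the negative log-likelihood, so it
    grows as the predicted likelihood of the subsequent observation shrinks.
    """
    prediction_loss = -log_likelihood
    return prediction_loss + kl_divergence


def sequence_loss(log_likelihoods, kl_divergences):
    """Combine the per-observation losses by summing over the sequence."""
    if len(log_likelihoods) != len(kl_divergences):
        raise ValueError("one KL value is needed per observation")
    return sum(step_loss(ll, kl) for ll, kl in zip(log_likelihoods, kl_divergences))


# Hypothetical values for a three-observation sequence.
example_total = sequence_loss([-1.0, -0.5, -2.0], [0.25, 0.5, 0.25])
```

In this sketch a lower log-likelihood yields a larger loss, which mirrors the sign change described above when an objective function is rewritten as a loss function.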

	As per Claim 2, the combination of Bowman and Menick teaches the method of Claim 1.  Bowman teaches wherein the estimated latent state for the current observation is (Bowman, Page 2 Section 2.2 Last Paragraph, discloses:  “We train our models with stochastic gradient descent, and at each gradient step we estimate the reconstruction cost using a single sample from q(z|x)”, where “q (z|x)” is a probability of “z given x”, wherein z is the latent state. A “single sample” (sampling one or more values) from this is an estimated latent state from the current latent distribution “q (z|x)”).

As per Claim 3, the combination of Bowman and Menick teaches the method of Claim 2.  Bowman teaches wherein generating the predicted likelihood comprises generating one or more predicted likelihoods of observing the subsequent observation by applying the decoder network to the one or more sampled values from the latent distribution for the current observation (Bowman, Page 2 Section 2.1 Para 2, discloses:  “a probabilistic decoder model p(x|z =Phienc(x)), and maximizes the likelihood of an example x conditioned on z, the learned code for x.”  Here, the decoder generates (maximizes) a predicted likelihood of x, based on the latent state z, which is the “learned code” (“sampled value”) from the latent distribution for the current observation x.)

As per Claim 4, the combination of Bowman and Menick teaches the method of Claim 3.  Bowman teaches wherein the prediction loss is an expected value of the one or more predicted likelihoods (Bowman, Page 2 Section 2.2 Para 3, discloses the “E” term:

[Image: media_image7.png, Bowman Page 2, Section 2.2, Equation 1]
)

As per Claim 5, the combination of Bowman and Menick teaches the method of Claim 1.  Bowman teaches wherein the divergence loss is a Kullback-Leibler divergence between the prior distribution and the current latent distribution (Bowman, Page 2 Section 2.2 Para 3, in the equation 1 shown in the screenshot above, discloses KL divergence between the latent distribution q(z | x) and the prior distribution p(z)).
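
As an illustrative aid, a Kullback-Leibler divergence between a latent distribution and a prior distribution can be computed in closed form when both are Gaussian. The sketch below assumes diagonal Gaussians for both distributions, which is an assumption made here for illustration and is not a quotation of either reference; the function name `gaussian_kl` and its parameters are hypothetical.

```python
import math


def gaussian_kl(mu_q, var_q, mu_p, var_p):
    """KL(q || p) for diagonal Gaussians q = N(mu_q, var_q), p = N(mu_p, var_p).

    Each argument is a list with one entry per latent dimension; variances
    must be strictly positive.
    """
    total = 0.0
    for mq, vq, mp, vp in zip(mu_q, var_q, mu_p, var_p):
        # Per-dimension closed form: 0.5 * (log(vp/vq) + (vq + (mq-mp)^2)/vp - 1)
        total += 0.5 * (math.log(vp / vq) + (vq + (mq - mp) ** 2) / vp - 1.0)
    return total
```

The divergence is zero when the two distributions coincide and grows as the latent distribution moves away from the prior.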

As per Claim 6, the combination of Bowman and Menick teaches the method of Claim 1.  Bowman teaches wherein the current latent distribution is defined by a set of statistical parameters of a probability distribution, and wherein the encoder network is configured to output the set of statistical parameters. (Bowman provides some background on this in Page 2 in Section 2.1 and Section 2.2:

[Image: media_image2.png, Bowman Page 2, Sections 2.1 and 2.2]

As shown in Section 2.1, “Phienc” is the encoder function, “x” is an observation, and “z” is a “learned code” (or, a “latent state”) of “x”.  In Section 2.2, it is disclosed that in the variational autoencoder, Phienc produces a probability distribution for “z” called q(z|x).  This may be called a “current latent distribution”, and it is generated by applying the encoder to the current observation.  It represents a distribution of the latent state “z” and is thus a latent distribution.  A probability distribution is defined by parameters (for example, mean and variance).  As Bowman stated above, this distribution (and thus its parameters) is produced by the encoder network, so the encoder network is configured to output the set of statistical parameters.) 
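
To illustrate how a latent distribution defined by statistical parameters can be used, the sketch below draws a single sample from a distribution described by a mean and a variance per latent dimension. The diagonal Gaussian form, the function name `sample_latent`, and the example values are assumptions for illustration only and are not taken from the references.

```python
import random


def sample_latent(mean, variance, rng):
    """Draw one latent vector z from a diagonal Gaussian with the given parameters."""
    # z = mean + standard deviation * standard normal noise, per dimension.
    return [m + (v ** 0.5) * rng.gauss(0.0, 1.0) for m, v in zip(mean, variance)]


# Hypothetical encoder output for one observation: two latent dimensions.
rng = random.Random(0)
z = sample_latent([0.0, 1.0], [1.0, 0.25], rng)
```

With a variance of zero the sample collapses to the mean, which shows that the mean and variance alone determine the distribution in this sketch.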

As per Claim 7, the combination of Bowman and Menick teaches the method of Claim 1.  Menick teaches wherein the prior distribution is defined by a set of statistical parameters of a probability distribution, and wherein generating the prior distribution comprises: applying the transition network to one or more values sampled from the previous latent distributions to generate one or more corresponding output values; estimating the set of statistical parameters for the prior distribution from the one or more output values.  (Menick, Para [0083], discloses:  “The system assigns an updated code to the given observation based on the parameters of the encoding probability distribution over the latent space (208). For example, the system may assign an updated code to the given observation that is given by a vector representing the mean of the encoding probability distribution.”  Here, Menick discloses one or more values sampled from previous latent distributions (“assigns an updated code to the given observation based on the parameters of the encoding probability distribution”, wherein the “encoding probability distribution” is the latent distribution “over the latent space”).  This latent distribution is generated “previous” to the generation of the prior distribution.  Menick, Para [0084], discloses that this “updated code” is used to derive a “neighboring code”:  “The system selects a “neighboring” code that is assigned to an additional observation (i.e., that is different than the given observation) based on a similarity of the neighboring code to the updated code assigned to the given observation (210)”.  This is part of the sampling process, and thus the “neighboring code” becomes the value sampled from the latent distribution that is input to the transition network.  
Menick, Para [0085], discloses the use of the “neighboring code”:  “The system provides the neighboring code as input to the prior neural network, which is configured to process the neighboring code in accordance with current parameter values of the prior neural network to generate as output parameters of a prior probability distribution over the latent space”.  Here Menick discloses that the prior distribution is defined by a set of statistical parameters of a probability distribution, and that these parameters are calculated by generating output values from the “prior neural network” (transition network).  This output from the transition network is used to estimate the statistical parameters representing the prior distribution.)
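
For illustration, one possible reading of the Claim 7 steps (applying a transition network to sampled values, then estimating statistical parameters from the outputs) is sketched below. This is a hypothetical sketch and not Menick's implementation: `toy_transition` is a stand-in for a learned network, and estimating the parameters as a sample mean and variance is an assumption made here.

```python
def estimate_prior_parameters(transition, sampled_latents):
    """Apply a transition function to each sampled latent value and
    estimate a per-dimension mean and variance from the outputs."""
    outputs = [transition(z) for z in sampled_latents]
    n = len(outputs)
    dim = len(outputs[0])
    mean = [sum(o[d] for o in outputs) / n for d in range(dim)]
    variance = [sum((o[d] - mean[d]) ** 2 for o in outputs) / n for d in range(dim)]
    return mean, variance


def toy_transition(z):
    # Hypothetical stand-in for a learned transition network.
    return [0.5 * value + 1.0 for value in z]


prior_mean, prior_variance = estimate_prior_parameters(
    toy_transition, [[0.0], [2.0], [4.0]]
)
```

The returned mean and variance play the role of the set of statistical parameters that define the prior distribution in this sketch.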

As per Claim 8, Claim 8 is a non-transitory computer-readable medium claim corresponding to method claim 1.  The difference is that it recites a non-transitory computer-readable medium and a processor.  Menick, [0117], discloses:  “Embodiments of the subject matter described in this specification can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions encoded on a tangible non-transitory storage medium for execution by, or to control the operation of, data processing apparatus”.  Claim 8 is rejected for the same reasons as claim 1.

As per Claim 9, Claim 9 is a non-transitory computer-readable medium claim corresponding to method claim 2.  The difference is that it recites a non-transitory computer-readable medium and a processor.  Claim 9 is rejected for the same reasons as claim 2.

As per Claim 10, Claim 10 is a non-transitory computer-readable medium claim corresponding to method claim 4.  The difference is that it recites a non-transitory computer-readable medium and a processor.  Claim 10 is rejected for the same reasons as claim 4.

As per Claim 11, Claim 11 is a non-transitory computer-readable medium claim corresponding to method claim 5.  The difference is that it recites a non-transitory computer-readable medium and a processor.  Claim 11 is rejected for the same reasons as claim 5.

As per Claim 12, Claim 12 is a non-transitory computer-readable medium claim corresponding to method claim 6.  The difference is that it recites a non-transitory computer-readable medium and a processor.  Claim 12 is rejected for the same reasons as claim 6.

As per Claim 13, Claim 13 is a non-transitory computer-readable medium claim corresponding to method claim 7.  The difference is that it recites a non-transitory computer-readable medium and a processor.  Claim 13 is rejected for the same reasons as claim 7.

As per Claim 14, Claim 14 is a model claim corresponding to method claim 1.  The difference is that it recites a computer readable storage medium.  Menick, [0117], discloses:  “Embodiments of the subject matter described in this specification can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions encoded on a tangible non-transitory storage medium for execution by, or to control the operation of, data processing apparatus”.  Claim 14 is rejected for the same reasons as claim 1.

As per Claim 15, Claim 15 is a model claim corresponding to method claim 2.  The difference is that it recites a computer readable storage medium.  Claim 15 is rejected for the same reasons as claim 2.

As per Claim 16, Claim 16 is a model claim corresponding to method claim 3.  The difference is that it recites a computer readable storage medium.  Claim 16 is rejected for the same reasons as claim 3.

As per Claim 17, Claim 17 is a model claim corresponding to method claim 4.  The difference is that it recites a computer readable storage medium.  Claim 17 is rejected for the same reasons as claim 4.

As per Claim 18, Claim 18 is a model claim corresponding to method claim 5.  The difference is that it recites a computer readable storage medium.  Claim 18 is rejected for the same reasons as claim 5.

As per Claim 19, Claim 19 is a model claim corresponding to method claim 6.  The difference is that it recites a computer readable storage medium.  Claim 19 is rejected for the same reasons as claim 6.

As per Claim 20, Claim 20 is a model claim corresponding to method claim 7.  The difference is that it recites a computer readable storage medium.  Claim 20 is rejected for the same reasons as claim 7.

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. 
Blundell et al. (WO 2018/172513 A1), in [00082], discloses an augmented loss function which includes the KL divergence between a posterior and prior distribution, and in [0011], includes recurrent neural networks.
Clayton et al. (US 2017/0206464 A1), in [0037], discloses a variational autoencoder in which a loss function includes KL divergence which is used to measure the difference between prior and posterior distributions.
Zheng (US 10,346,524 B1) also discloses a variational autoencoder in which the loss function includes KL divergence.
Macready et al. (US 2021/0089884 A1), in [0098], also discloses a variational autoencoder in which the loss function includes KL divergence.
Dunning et al. (US 2019/0370637 A1), in [0073] and [0117], also discloses a variational autoencoder in which the loss function includes KL divergence.
Rolfe et al. (US 2019/0244680 A1), in [0232-0234], also discloses a variational autoencoder in which the loss function includes KL divergence.
Kingma et al. (“Auto-Encoding Variational Bayes”) introduces the Variational Auto-Encoder including KL divergence.
Fabius et al. ("Variational Recurrent Auto-Encoders") proposes a recurrent version of the variational auto-encoder (similar to Bowman, which was relied upon).
Chung et al. ("Recurrent Latent Variable Model for Sequential Data") also proposes a recurrent version of the variational auto-encoder, like Fabius and Bowman.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to LEONARD A SIEGER whose telephone number is (571)272-9710.  The examiner can normally be reached on M-F 8:00 am - 5:00 pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Ann Lo, can be reached at (571) 272-9767.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-

/L.A.S./Examiner, Art Unit 2126
/NICHOLAS KLICOS/Primary Examiner, Art Unit 2145