DETAILED ACTION
Notice of Pre-AIA or AIA Status
1.	The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
2.	This communication is in response to the Applicant’s submission filed 26 May 2020, where:
Claims 3, 4, 6-8, 10-13, 42, and 43 have been amended.
Claims 14-41 have been cancelled.
New claims 44-48 are presented for examination.
Claims 1-13 and 42-48 are pending.
Claims 1-13 and 42-48 are rejected.
Examiner notes that a separate page of the abstract does not accompany the application; however, under the Office rules, the abstract for a national stage application filed under 35 U.S.C. 371 may be found on the front page of the Patent Cooperation Treaty publication filed with the Office on 26 May 2020. See MPEP § 1893.03(e). Accordingly, no action is required by Applicant.
Information Disclosure Statement
3.	An information disclosure statement was submitted on 11 January 2021. The submission complies with the provisions of 37 CFR 1.97. Accordingly, the examiner considered the information disclosure statement.
Claim Rejections - 35 USC § 112
4.	The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.
5.	Claims 4, 5, 46, and 47 are rejected under 35 U.S.C. 112(b) as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor regards as the invention.
The term “most similar” in claim 4, line 4, and in claim 46, line 4, is a relative term which renders the claim indefinite. The term “most similar” is not defined by the claim, the specification does not provide a standard for ascertaining the requisite degree (see, e.g., Specification ¶¶ 0008, 0084), and one of ordinary skill in the art would not be reasonably apprised of the scope of the invention. For example, the claims respectively recite “identifying . . . a predetermined number of codes that are most similar to the updated code assigned to the given observation; . . . .” Though at best “a predetermined number of codes” would be “most similar,” the criterion used for determining the “most similar” of these “predetermined number of codes” is not set out. Accordingly, claims 4 and 46 are rejected under Section 112(b) as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor regards as the invention.
Claims 5 and 47 depend directly or indirectly from claims 4 and 46, respectively, and are rejected as depending from a rejected claim; further, the claims fail to cure the deficiencies of claims 4 and 46.
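For illustration of why the absence of a stated criterion matters, the following minimal Python sketch ranks the same set of codes against one query code under two common similarity measures. The function names, the example codes, and the choice of Euclidean distance and cosine similarity are assumptions made only for this illustration; none of them is taken from the Specification or the claims.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length codes."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero codes."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(codes, query, k, metric):
    """Return indices of the k codes ranked most similar to the query."""
    if metric == "euclidean":
        # Smaller distance means more similar.
        key = lambda i: euclidean_distance(codes[i], query)
    elif metric == "cosine":
        # Larger cosine means more similar, so sort on the negative.
        key = lambda i: -cosine_similarity(codes[i], query)
    else:
        raise ValueError("unknown metric: " + metric)
    return sorted(range(len(codes)), key=key)[:k]


# Hypothetical two-dimensional codes and a query code.
codes = [[10.0, 0.0], [1.0, 1.0], [0.0, 2.0]]
query = [1.0, 0.0]

nearest_by_distance = most_similar(codes, query, 1, "euclidean")
nearest_by_angle = most_similar(codes, query, 1, "cosine")
```

In this example the two measures select different codes as the single most similar code, which is the kind of ambiguity in scope that the rejection identifies.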
Claim Rejections - 35 U.S.C. § 103
6.	The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
7.	The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. 	Determining the scope and contents of the prior art.
2. 	Ascertaining the differences between the prior art and the claims at issue.
3. 	Resolving the level of ordinary skill in the pertinent art.
4. 	Considering objective evidence present in the application indicating obviousness or nonobviousness.
8.	Claims 1, 3, 6, 8, 9, 11, 13, 42, 43, 45, and 48 are rejected under 35 U.S.C. 103 as being unpatentable over US Published Application 20190228312 to Andoni et al. [hereinafter Andoni] in view of Greff et al., “Tagger: Deep Unsupervised Perceptual Grouping,” arXiv (2016) [hereinafter Greff].
Regarding claims 1, 42, and 43, Andoni teaches [a] method for training an encoder neural network, a decoder neural network, and a prior neural network (Andoni, Fig. 1A, teaches an encoder neural network, a decoder neural network, and a prior neural network), [o]ne or more non-transitory computer storage media (Andoni ¶ 0087) of claim 42, and [a] system comprising one or more computers and one or more storage devices (Andoni ¶ 0089 teaches the system may take the form of a computer program product on a computer-readable storage medium or device having computer-readable program code (e.g., instructions) embodied or stored in the storage medium or device. . . . [T]he system 100 may be implemented using one or more computer hardware devices) of claim 43, comprising:
receiving training data (Andoni ¶ 0035 teaches [d]uring operation in the training mode (Fig. 1A), training data is provided to the neural networks 110, 120, 170 (that is, receiving training data)) for training the encoder neural network, the decoder neural network, and the prior neural network (Andoni ¶ 0013 teaches [t]he system 100 may generally operate in two modes of operation: training mode and use mode. Fig. 1A corresponds to an example of the training mode and Fig. 1C corresponds to an example of the use mode), wherein the training data comprises a plurality of observations, and wherein each observation lies in an observation space (Andoni ¶ 0039 teaches [w]hen a new data sample (e.g., readings from multiple sensors) is received, the new data sample may be passed through the clustering and anomaly detection system);
assigning a respective initial code to each observation included in the training data (Andoni ¶ 0014 teaches the first feature may include continuous features (e.g., real numbers), categorical features (e.g., enumerated values, true/false values, etc.), and/or time-series data. . . . [E]numerated values with more than two possibilities are converted into binary one-hot encoded data (that is, assigning a respective initial code to each observation included in the training data)), wherein a code is numerical representation of an observation (Andoni ¶ 0014 teaches enumerated values with more than two possibilities are converted into binary one-hot encoded data. To illustrate, if the possible values for a variable are "cat," "dog," or "sheep," the variable is converted into a 3-bit value (that is, a numerical representation) where 100 represents "cat," 010 represents "dog," and 001 represents "sheep" (that is, a code is numerical representation of an observation));
training the encoder neural network, the decoder neural network, and the prior neural network on the training data (Andoni Fig. 1A teaches a training mode (Examiner annotations in text-boxes):

[Image: Andoni, Fig. 1A, with Examiner annotations in text boxes]

by repeatedly performing the following operations (Andoni ¶ 0035 teaches input data may be separated into a training set (e.g., 90% of the data) and a testing set (e.g., 10% of the data). The training set may be passed through the system 100 of FIG. 1 during a training epoch (that is, training). The trained system may then be run against the testing set to determine an average loss in the testing set. This process may then be repeated (that is, repeatedly performing) for additional epochs (that is, training the encoder neural network, the decoder neural network, and the prior neural network on the training data by repeatedly performing the following operations)):
selecting a batch of training data (Andoni ¶ 0014 teaches [t]he first input data 101 may be part of a larger data set (that is, selecting a batch of training data));
for each given observation in the selected batch (Andoni ¶ 0014 teaches [t]he first input data 101 . . . may include first features 102 (that is, “features” include each given observation in the selected batch)):
providing the given observation as input to the encoder neural network, which is configured to process the given observation in accordance with current parameter values of the encoder neural network (Andoni ¶ 0017 teaches the second neural network(s) 120 include a variational autoencoder (VAE). The second neural network(s) 120 may receive second input data 104 as input. In a particular aspect, the second input data 104 is generated by a data augmentation process 180 based on a combination of the first input data 101 and the third input data 192) to generate as output parameters of a data-conditional encoding probability distribution over a latent state space (Andoni ¶ 0019 teaches [t]he encoder network 210 may include an input layer 201 including an input node for each of the n first features 102 and an input node for each of the k second features 105. . . . A "latent" layer 203 serves as an output layer of the encoder network 210 and an input layer of the decoder network 220. . . . [T]he encoder network 210 generates values μe, ∑e, which are data vectors having mean and variance values for each of the latent space features. The resulting distribution is sampled to generate the values (denoted "z") in the "latent" layer 203 (that is, to generate as output parameters of a data-conditional encoding probability distribution over a latent state space));
* * *
sampling one or more latent variables from the data-conditional encoding probability distribution (Andoni ¶ 0076 teaches one or more decoding layers configured to generate a reconstruction of the first features based on sampled data from the latent space (that is, sampling one or more latent variables from the data-conditional encoding probability distribution));
providing the latent variables as input to the decoder neural network (Andoni ¶ 0076 teaches one or more decoding layers configured to generate a reconstruction of the first features based on sampled data from the latent space (that is, providing the latent variables as input to the decoder neural network)), which is configured to process the latent variables in accordance with current parameter values of the decoder neural network (Andoni ¶ 0033 teaches Nparams is the number of parameters being adjusted in the system 100 (e.g., link weights, bias functions, bias values, etc. across the neural networks 110, [second NN(s) (VAE)] 120, 170) (that is, “the number of parameters being adjusted” is configured to process the latent variables in accordance with current parameter values of the decoder neural network)) to generate as output parameters of an observation probability distribution over the observation space (Andoni ¶ 0020 teaches [t]he decoder network 220 may approximately reverse the process performed by the encoder network 210 with respect to the n features. Thus, the decoder network 220 may include one or more hidden layers 204 and an output layer 205. The output layer 205 outputs a reconstruction of each of the n input features and a variance (σ²) value for each of the reconstructed features);
determining a gradient of a loss function (Andoni ¶ 0033 teaches [t]he calculator/detector 130 may initiate adjustment at one or more of the first neural network 110, the second neural network(s) 120, or the third neural network 170, based on the aggregate loss L. For example, link weights, bias functions, bias values, etc. may be modified via backpropagation to minimize the aggregate loss L using stochastic gradient descent (that is, determining a gradient of a loss function)), wherein the loss function is based on: (i) a measure of similarity between the data-conditional encoding probability distribution and the prior probability distribution (Andoni ¶ 0028 teaches loss function determination for an entry should also consider distance from clusters. In a particular aspect, cluster distance is incorporated into loss calculation using two Kullback-Leibler (KL) divergences (that is, “cluster distance” is a measure of similarity between the data-conditional encoding probability distribution and the prior probability distribution)), and (ii) a likelihood of the given observation based on the observation probability distribution (Andoni ¶ 0022 teaches the reconstruction loss function LR_confeature for a continuous feature is represented by Gaussian loss (that is, “Gaussian loss” is a maximum likelihood estimation loss function, which is a likelihood of the given observation based on the observation probability distribution)); and
adjusting the current parameter values of the encoder neural network, the decoder neural network, and the prior neural network based on the gradient (Andoni ¶ 0033 teaches [t]he calculator/detector 130 may initiate adjustment at one or more of the first neural network 110, the second neural network(s) 120, or the third neural network 170, based on the aggregate loss L (that is, adjusting the current parameter values of the encoder neural network, the decoder neural network, and the prior neural network based on the gradient)).
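For orientation only, the two loss terms mapped above, namely (i) a divergence between the encoding distribution and a prior distribution and (ii) a likelihood of the observation under the observation distribution, can be sketched generically in Python as follows. The sketch assumes that all three distributions are diagonal Gaussians; it is a textbook-style objective and is not Andoni's aggregate loss L or the Applicant's claimed loss, and every function and parameter name is hypothetical.

```python
import math


def kl_diag_gaussians(mu_q, var_q, mu_p, var_p):
    """KL(q || p) between two diagonal Gaussians given per-dimension means and variances."""
    return 0.5 * sum(
        math.log(vp / vq) + (vq + (mq - mp) ** 2) / vp - 1.0
        for mq, vq, mp, vp in zip(mu_q, var_q, mu_p, var_p)
    )


def gaussian_nll(x, mu, var):
    """Negative log-likelihood of observation x under a diagonal Gaussian."""
    return 0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mu, var)
    )


def per_observation_loss(x, enc_mu, enc_var, prior_mu, prior_var, dec_mu, dec_var):
    """Sum of term (i), the encoder-to-prior divergence, and term (ii), the observation likelihood term."""
    divergence = kl_diag_gaussians(enc_mu, enc_var, prior_mu, prior_var)
    reconstruction = gaussian_nll(x, dec_mu, dec_var)
    return divergence + reconstruction


# Hypothetical one-dimensional example values.
example_loss = per_observation_loss(
    x=[0.0],
    enc_mu=[1.0], enc_var=[1.0],
    prior_mu=[0.0], prior_var=[1.0],
    dec_mu=[0.0], dec_var=[1.0],
)
```

In a gradient-based training loop, the gradient of such a scalar with respect to the network parameters would be what drives the parameter adjustment; the sketch computes only the scalar value.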
Though Andoni teaches training an encoder neural network, a decoder neural network, and a prior neural network that provides latent space feature selection for the encoder and decoder neural networks, Andoni, however, does not explicitly teach -
* * *
determining an updated code for the given observation based on the parameters of the data-conditional encoding probability distribution;
assigning the updated code to the given observation;
selecting a code that is assigned to an additional observation based on a similarity of the code assigned to the additional observation and the updated code assigned to the given observation;
providing the code assigned to the additional observation as input to the prior neural network, which is configured to process the code in accordance with current parameter values of the prior neural network to generate as output parameters of a prior probability distribution over the latent state space;
* * *
But Greff teaches -
* * *
determining an updated code for the given observation based on the parameters of the data-conditional encoding probability distribution (Greff, Fig. 2, teaches Illustration of the TAG framework used for training. (Examiner annotation in text box):

[Image: Greff, Fig. 2, with Examiner annotation in text box]

Greff at p. 2, “2. iTerative Amortized Grouping (TAG) - Grouping,” first paragraph, teaches enable neural networks to split inputs and internal representations into coherent groups that can be processed separately; Greff, Figure 2, caption, teaches that [t]he system learns by denoising its input over iterations using several groups to distribute the representation (that is, based on the parameters of the data-conditional encoding probability distribution). Each group, represented by several panels of the same color (that is, the representation of a “group” is a code), maintains its own estimate of reconstructions zi of the input, and corresponding masks mi, which encode the parts of the input that this group is responsible for representing. These estimates are updated over iterations by the same network, that is, each group and iteration share the weights of the network and only the inputs to the network differ (that is, determining an updated code for the given observation));
assigning the updated code to the given observation (Greff, right portion of Figure 2, caption, recites [i]n each iteration zi-1 and mi-1 from the previous iteration, are used to compute a likelihood term L(mi-1) and modeling error δzi-1. These four quantities are fed to the parametric mapping to produce zi and mi for the next iteration. During learning, all inputs to the network are derived from the corrupted input as shown here (that is, assigning the updated code to the given observation));
selecting a code that is assigned to an additional observation based on a similarity of the code (Greff at p. 2, “2 iTerative Amortized Grouping (TAG),” third paragraph, teaches [p]rocessing of the input is split into K different groups, but it is left up to the network to learn how to best use this ability in a given problem, such as classification. To make the task of instance segmentation easy, we keep the groups symmetric in the sense that each group is processed by the same underlying model. We introduce latent binary variables gk,j to encode if input element xj is assigned to group k (that is, “K different groups” are an additional observation based on a similarity of the code)) assigned to the additional observation and the updated code assigned to the given observation (Greff at p. 2, “2. iTerative Amortized Grouping (TAG) - Amortized Iterative Inference,” first paragraph, teaches [w]e want our model to reason not only about the group assignments but also about the representation of each group. This amounts to inference over two sets of variables: the latent group assignments and the individual group representations; A formulation very similar to mixture models for which exact inference is typically intractable. For these models it is a common approach to approximate the inference in an iterative manner by alternating between (re-)estimation of these two sets (that is, assigned to the additional observation and the updated code assigned to the given observation); see also Greff at p. 7, “3.3 Classification”;
[Examiner notes that the sequence progression by Greff of “i,” “i+1,” etc., represents “additional observations”]);
providing the code assigned to the additional observation as input to the prior neural network, which is configured to process the code in accordance with current parameter values of the prior neural network (Greff at p. 4, “2.1 Definition of the TAG Mechanism - Inputs,” fourth paragraph, teaches [t]his amounts to providing each group information about how likely each input element belongs to them rather than some other group) to generate as output parameters of a prior probability distribution over the latent state space (Greff at p. 4, “2.1 Definition of the TAG Mechanism - Inputs,” first paragraph, teaches [a] parametric mapping (here a neural network) (that is, providing the code assigned to the additional observation as input to the prior neural network) then produces the new estimates $m_k^{i+1}$ and $z_k^{i+1}$. The initial values for $m_k^0$ are randomized, and $z_k^0$ is set to the data mean for all k. Because $z_k$ are continuous variables, their likelihood is a function over all possible values of $z_k$, and not all of this information can be easily represented; see also Greff, Figure 2);
* * *
Andoni and Greff are from the same or similar field of endeavor. Andoni teaches jointly training a plurality of neural networks that include a variational autoencoder and another neural network that performs latent space cluster mapping operations. Greff teaches a neural network to group the representations of different objects in an iterative manner as parametric mapping for an autoencoder. Thus, it would have been obvious to a person having ordinary skill in the art as of the effective filing date of the Applicant’s invention to modify Andoni pertaining to latent space cluster mapping operations with the grouping objective representations of Greff.
The motivation for doing so is to achieve improved classification performance over convolutional networks despite being fully connected, by making use of the grouping mechanism. (Greff, Abstract).
Examiner notes that the terms "computer" and "storage devices" recited in Applicant's claims are interpreted to be well-known hardware structures.
Examiner notes that the Applicant’s preamble does not afford patentable weight to the Applicant’s claims because the claim preamble is not “necessary to give life, meaning, and vitality” to the claim. Moreover, because the Applicant’s preamble merely states the purpose or intended use of the invention rather than any distinct definition of any of the claimed invention’s limitations, the preamble is not considered a limitation and is of no significance to claim construction.
Regarding claims 3 and 45, the combination of Andoni and Greff teaches all of the limitations of claims 1 and 43, respectively, as described above in detail. 
Andoni teaches -
wherein determining an updated code for the given observation based on the parameters of the data-conditional encoding probability distribution comprises determining the updated code to be the mean vector output by the encoder neural network (Andoni ¶ 0003 teaches [t]he encoder of the VAE produces a mean and a variance (deterministically), which provides a probability distribution in a latent space. During training, that mean and variance is used to randomly sample from a Gaussian distribution to get an encoded vector (that is, as the “encoded vector” is an output based upon a series of samples, this encoded vector output is determining the updated code to be the mean vector output by the encoder neural network), which is then (deterministically) decoded).
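The passage quoted from Andoni ¶ 0003 refers to two quantities: the mean produced by the encoder and a vector randomly sampled using that mean and variance. The following Python sketch, offered only as an illustration under the assumption of a diagonal Gaussian, shows how each quantity may be computed. The function and variable names are hypothetical and appear in neither the reference nor the Specification.

```python
import math
import random


def sample_latent(mean, variance, rng):
    """Draw one vector from a diagonal Gaussian with the given mean and variance."""
    return [m + math.sqrt(v) * rng.gauss(0.0, 1.0) for m, v in zip(mean, variance)]


# Hypothetical encoder outputs for one observation.
mean = [0.5, -1.0]
variance = [0.04, 0.09]

rng = random.Random(0)  # fixed seed so the draw is repeatable

# A randomly sampled encoded vector; it varies from draw to draw.
sampled_vector = sample_latent(mean, variance, rng)

# A code taken directly as the mean vector; it is deterministic.
code_from_mean = list(mean)
```

When the variance is zero in every dimension, the sampled vector coincides with the mean vector; otherwise the two generally differ.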
Regarding claims 6 and 48, the combination of Andoni and Greff teaches all of the limitations of claims 1 and 43, respectively, as described above in detail.
Andoni teaches -
further comprising, after adjusting the current parameter values of the encoder neural network, the decoder neural network, and the prior neural network based on the gradient for each observation in a batch, for each observation included in the training set:
providing the observation as input to the encoder neural network (Andoni, Fig 1B, teaches input data (that is, observation) (Examiner annotations in text box):

[Image: Andoni, Fig. 1B, with Examiner annotations in text box]

Andoni ¶ 0017 teaches the second neural network(s) 120 include a variational autoencoder (VAE). The second neural network(s) 120 may receive second input data 104 as input (that is, providing the observation as input to the encoder neural network)), which is configured to process the observation in accordance with current parameter values of the encoder neural network to generate as output parameters of a data-conditional encoding probability distribution over the latent state space (Andoni ¶ 0019 teaches [t]he encoder network 210 may include an input layer 201 including an input node for each of the n first features 102 and an input node for each of the k second features 105 (that is, configured to process the observation in accordance with current parameter values of the encoder neural network to generate as output parameters of a data-conditional encoding probability distribution over the latent state space));
* * *
Greff teaches -
* * *
determining an updated code for the observation based on the parameters of the data-conditional encoding probability distribution (Greff, Fig. 2, teaches Illustration of the TAG framework used for training. (Examiner annotation in text box):

[Image: Greff, Fig. 2, with Examiner annotation in text box]

Greff at p. 2, “2. iTerative Amortized Grouping (TAG) - Grouping,” first paragraph, teaches enable neural networks to split inputs and internal representations into coherent groups that can be processed separately; Greff, Figure 2, caption, teaches that [t]he system learns by denoising its input over iterations using several groups to distribute the representation (that is, based on the parameters of the data-conditional encoding probability distribution). Each group, represented by several panels of the same color (that is, the representation of a “group” is a code), maintains its own estimate of reconstructions zi of the input, and corresponding masks mi, which encode the parts of the input that this group is responsible for representing. These estimates are updated over iterations by the same network, that is, each group and iteration share the weights of the network and only the inputs to the network differ (that is, determining an updated code for the given observation)); and
assigning the updated code to the observation (Greff, right portion of Figure 2, caption, recites [i]n each iteration zi-1 and mi-1 from the previous iteration, are used to compute a likelihood term L(mi-1) and modeling error δzi-1. These four quantities are fed to the parametric mapping to produce zi and mi for the next iteration. During learning, all inputs to the network are derived from the corrupted input as shown here (that is, assigning the updated code to the given observation)).
Regarding claim 8, the combination of Andoni and Greff teaches all of the limitations of claim 1, as described in detail above. 
Andoni teaches -
wherein assigning an initial code to an observation comprises sampling the code from a predetermined probability distribution (Andoni ¶ 0003 teaches [d]uring training, that mean and variance is used to randomly sample from a Gaussian distribution to get an encoded vector (that is, sampling the code from a predetermined probability distribution); Andoni ¶ 0019 teaches each of the clusters has its own Gaussian distribution (that is, a predetermined probability distribution), the VAE may [be] considered a Gaussian Mixture Model (GMM) VAE).
Regarding claim 9, the combination of Andoni and Greff teaches all of the limitations of claim 8, as described above in detail. 
Andoni teaches -
wherein the predetermined probability distribution is a standard Normal probability distribution (Andoni ¶ 0016 teaches the third neural network 170 may “place” clusters into different parts of latent feature space, where each of those individual clusters follows a distribution (e.g., a Gaussian normal distribution) (that is, a standard Normal probability distribution)).
Regarding claim 11, the combination of Andoni and Greff teaches all of the limitations of claim 1, as described in detail above. 
Andoni teaches - 
wherein the encoder neural network comprises a convolutional neural network (Andoni ¶ 0059 teaches [t]he neural network topology may also indicate the interconnections (e.g., axons or links) between nodes. In some aspects, layers nodes may be used instead of or in addition to single nodes. Examples of layer types include . . . convolutional neural network (CNN) layers (that is, the encoder neural network comprises a convolutional neural network)).
Regarding claim 13, the combination of Andoni and Greff teaches all of the limitations of claim 1, as described above in detail.
Andoni teaches -
wherein the prior neural network comprises a feedforward neural network (Andoni ¶ 0042 teaches topologies of the neural networks 110, 120, 170 may be determined prior to training the neural networks . . . 170 (that is, prior neural network); Andoni ¶ 0046 teaches neural network topology may be "evolved" using a genetic algorithm 310; of this genetic evolution, Andoni ¶ 0058 teaches [p]arameters of the genetic algorithm 310 may include . . . whether to evolve a feedforward or recurrent neural network, etc. (that is, the prior neural network comprises a feedforward neural network)).
9.	Claims 2 and 44 are rejected under 35 U.S.C. 103 as being unpatentable over US Published Application 20190228312 to Andoni et al. [hereinafter Andoni] in view of Greff et al., “Tagger: Deep Unsupervised Perceptual Grouping,” arXiv (2016) [hereinafter Greff] and Carl Doersch, “Tutorial on Variational Autoencoders,” arXiv (2016) [hereinafter Doersch].
Regarding claims 2 and 44, the combination of Andoni and Greff teaches all of the limitations of claims 1 and 43, respectively, as described above in detail.
Though Andoni and Greff teach the features of Gaussian distributions of an autoencoder, the combination of Andoni and Greff does not explicitly teach -
the data-conditional encoding probability distribution is a Gaussian distribution with a predetermined covariance matrix; and
the output of the encoder neural network defines a mean vector of the data-conditional encoding probability distribution.
But Doersch teaches -
the data-conditional encoding probability distribution is a Gaussian distribution with a predetermined covariance matrix; and
the output of the encoder neural network defines a mean vector of the data-conditional encoding probability distribution (Doersch at p. 3, “1.1 Preliminaries: Latent Variable Models,” second paragraph, teaches [i]n [variational autoencoders], the choice of this output distribution (that is, the output of the encoder neural network) is often Gaussian, i.e. P(X|z ; θ) = N (X| f(z; θ), σ² ∗ I) (that is, a Gaussian distribution). That is, it has mean f(z; θ) and covariance equal to the identity matrix I times some scalar σ (which is a hyperparameter) (that is, the data-conditional encoding probability distribution is a Gaussian distribution with a predetermined covariance matrix; and the output of the encoder neural network defines a mean vector of the data-conditional encoding probability distribution);
[Examiner notes that an “identity matrix” is a matrix containing 1’s on the diagonal, and accordingly, Doersch teaches a predetermined covariance matrix (see Specification ¶ 0082 (“a predetermined covariance matrix (e.g., a diagonal covariance matrix with only 1’s on the diagonal).”))]).
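As an illustration of the quoted passage, a Gaussian whose covariance is fixed in advance to a scalar multiple of the identity matrix is fully specified once a mean vector is supplied. The Python sketch below assumes that form, with the covariance taken as σ² times the identity; the function names and example values are hypothetical and are not drawn from Doersch or the Specification.

```python
import math


def predetermined_covariance(dim, sigma=1.0):
    """Covariance fixed ahead of time: sigma squared times the identity matrix."""
    return [[sigma ** 2 if i == j else 0.0 for j in range(dim)] for i in range(dim)]


def isotropic_gaussian_log_density(x, mean, sigma=1.0):
    """Log density of x under a Gaussian with the given mean and covariance sigma^2 * I."""
    dim = len(x)
    squared_error = sum((xi - mi) ** 2 for xi, mi in zip(x, mean))
    normalizer = -0.5 * dim * math.log(2.0 * math.pi * sigma ** 2)
    return normalizer - squared_error / (2.0 * sigma ** 2)


# Only the mean vector varies per input; the covariance never changes.
mean_vector = [0.2, -0.3]
covariance = predetermined_covariance(len(mean_vector))
log_density_at_mean = isotropic_gaussian_log_density(mean_vector, mean_vector)
```

With sigma set to 1, the covariance reduces to a matrix with only 1's on the diagonal, matching the example given in Specification ¶ 0082.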
Andoni, Greff and Doersch are from the same or similar field of endeavor. Andoni teaches jointly training a plurality of neural networks that include a variational autoencoder and another neural network that performs latent space cluster mapping operations. Greff teaches a neural network to group the representations of different objects in an iterative manner as parametric mapping for an autoencoder. Doersch teaches a deterministic function learned from data. Thus, it would have been obvious to a person having ordinary skill in the art as of the effective filing date of the Applicant’s invention to modify the combination of Andoni and Greff pertaining to latent space cluster mapping operations with the Gaussian distribution and predetermined covariance matrix to produce a mean vector of the encoder of Doersch. 
The motivation for doing so is the appeal of VAEs, which are built on top of standard function approximators (neural networks) and can be trained with stochastic gradient descent. VAEs have already shown promise in generating many kinds of complicated data, including handwritten digits, faces, house numbers, CIFAR images, physical models of scenes, segmentation, and predicting the future from static images. (Doersch, Abstract).
10.	Claims 4, 5, 10, 46, and 47 are rejected under 35 U.S.C. 103 as being unpatentable over US Published Application 20190228312 to Andoni et al. [hereinafter Andoni] in view of Greff et al., “Tagger: Deep Unsupervised Perceptual Grouping,” arXiv (2016) [hereinafter Greff] and US Published Application 20200387798 to Hewage et al. [hereinafter Hewage].
Regarding claims 4 and 46, the combination of Andoni and Greff teaches all of the limitations of claims 1 and 43, respectively, as described above in detail.
Though Andoni and Greff teach the feature of latent space cluster mapping in connection with autoencoders, the combination of Andoni and Greff does not explicitly teach - 
identifying, from amongst the codes currently assigned to each observation, a predetermined number of codes that are most similar to the updated code assigned to the given observation; and
selecting a code randomly from amongst the identified codes.
But Hewage teaches -
wherein selecting a code assigned to an additional observation comprises:
identifying, from amongst the codes currently assigned to each observation, a predetermined number of codes that are most similar to the updated code assigned to the given observation (Hewage Fig. 2a teaches a Machine Learning technique 200 for use in classifying input data samples 201a (Examiner annotations in text box): 
[Figure: Hewage Fig. 2a with Examiner annotations (media_image4.png)]
Hewage ¶ 0151 teaches [t]he autoencoder 220 includes an adversarial network 210 for use in enforcing the label vector Y 208 (that is, the updated code) to be one-hot or label-like; Hewage ¶ 0134 teaches the discriminator function f learns to distinguish between label vectors Y 206 generated from the generator function g(xn) and categorical samples y′; Hewage ¶ 0152 teaches [t]he adversarial network 210 may be configured, during training, for training the one or more hidden layer(s) 210c and 210d to distinguish (that is, identifying, from amongst the codes) between label vectors y 206 and sample vectors from the categorical distribution of the set of one-hot vectors 210a of the same dimension as the label vector y 206 (that is, the “set of one-hot vectors 210a” is identifying . . . a predetermined number of codes that are most similar to the updated code assigned to the given observation). The label vector generator loss function value LGY is associated with label vector y 206 and is for use in training the encoder network 202a to enforce the categorical distribution of the set of one-hot vectors 210a onto the label vector y 206. The size of the label vector y 206 may be based on the number of classes, categories and/or states (that is, given observations) that are to be classified from the input data samples 201a); and
selecting a code randomly from amongst the identified codes (Hewage ¶ 0131 teaches the discriminator network 210 follows a generative adversarial network approach in which the generator is the encoder recurrent neural network 202a and the discriminator network 210 learns to distinguish between samples from a categorical distribution 210a (e.g. random one-hot vectors) (that is, the “categorical distribution 210a” is selecting a code randomly from amongst the identified codes) and the label vector y 206 representation generated by the encoder network 202a).
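For illustration only, the two recited steps (identifying a predetermined number of most similar codes, then selecting one at random) can be sketched as follows. The claims do not state a similarity criterion, so the Euclidean distance used here is purely an assumption, as are the function name `select_similar_code` and the parameter `k`; nothing in this sketch is taken from Hewage or the Specification.

```python
import math
import random

def select_similar_code(updated_code, assigned_codes, k, rng=None):
    """Pick one code at random from the k codes nearest to updated_code."""
    if rng is None:
        rng = random.Random()
    # ASSUMPTION: "most similar" is read as smallest Euclidean distance.
    ranked = sorted(assigned_codes, key=lambda c: math.dist(c, updated_code))
    # Step 1: keep a predetermined number (k) of the closest codes.
    candidates = ranked[:k]
    # Step 2: select uniformly at random from the identified codes.
    return rng.choice(candidates)

codes = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [9.0, 9.0]]
picked = select_similar_code([0.9, 0.9], codes, k=2, rng=random.Random(0))
```

A different similarity measure would change which codes fall within the k candidates, which is why the choice of measure is flagged as an assumption.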
Andoni, Greff and Hewage are from the same or similar field of endeavor. Andoni teaches jointly training a plurality of neural networks that include a variational autoencoder and another neural network that performs latent space cluster mapping operations. Greff teaches a neural network to group the representations of different objects in an iterative manner as parametric mapping for an autoencoder. Hewage teaches an adversarial component of an autoencoder that allows clustering of data in an unsupervised fashion. Thus, it would have been obvious to a person having ordinary skill in the art as of the effective filing date of the Applicant’s invention to modify the combination of Andoni and Greff pertaining to latent space cluster mapping operations with the adversarial component of Hewage.
The motivation for doing so is to constrain the latent representations of the sequence-to-sequence network 260 to be label-like, which allows classification/labelling of the latent representations in relation to neural activity encoding one or more bodily variables or combinations thereof. (Hewage ¶ 0120).
Regarding claims 5 and 47, the combination of Andoni, Greff, and Hewage teaches all of the limitations of claims 4 and 46, respectively, as described above in detail. 
Hewage teaches -
wherein identifying the predetermined number of codes further comprises:
determining, for each code of the predetermined number of codes, that the code was not previously selected during a current pass through the training data (Hewage ¶ 0176 illustrates an example clustering 300 of a set of label vector(s) y 206 in which the latent vector z was regularised for an ML technique using an ideal or optimal set of hyperparameters. This example illustrates an idealised scenario in which all the vector labels output by the ML technique that have clustered together in cluster region 302 belong to true state S1, all the vector labels that have clustered together in cluster region 304 belong to true state S2, and all the vector labels that have clustered together in cluster region 306 belong to true state S3. In this case, given that ML technique outputs vector labels that cluster together and which belong to the same states, this may be an indication that temporal correlation has been minimised or even eliminated between adjacent input data samples (e.g. neural sample data sequences). Thus, the ML technique has been trained not to over represent the occurrence of a “temporal pattern” within the input data samples; Hewage ¶ 0119 teaches [t]he number of elements of yk 206 may correspond to the number of unique bodily variable labels that are to be classified. Alternatively, the number of elements of yk 206 may also correspond to the expected number of bodily variable labels that may be found when using an unlabelled bodily variable training dataset).
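For illustration only, the recited check that a code “was not previously selected during a current pass through the training data” can be sketched by tracking which code indices have already been chosen. The `used` set, the function name `select_unused_similar`, and the Euclidean distance are all assumptions made for this sketch and are not drawn from Hewage or the Specification.

```python
import math
import random

def select_unused_similar(updated_code, assigned_codes, k, used, rng=None):
    """Return the index of a random code among the k nearest unused codes."""
    if rng is None:
        rng = random.Random()
    # Exclude every index already selected during the current pass.
    available = [i for i in range(len(assigned_codes)) if i not in used]
    # ASSUMPTION: similarity is measured by Euclidean distance.
    available.sort(key=lambda i: math.dist(assigned_codes[i], updated_code))
    candidates = available[:k]
    if not candidates:
        return None  # every code has been used in this pass
    choice = rng.choice(candidates)
    used.add(choice)  # record the selection for the rest of the pass
    return choice

pass_codes = [[0.0], [1.0], [2.0]]
used_this_pass = set()
order = [select_unused_similar([0.0], pass_codes, 1, used_this_pass) for _ in range(3)]
```

Clearing `used_this_pass` at the start of each pass would restore the full set of codes as candidates.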
Regarding claim 10, the combination of Andoni and Greff teaches all of the limitations of claim 1, as described above in detail. 
Andoni teaches -
wherein:
* * *
the output of the prior neural network comprises, for each dimension of the prior probability distribution: (i) a mean parameter, (ii) a standard deviation parameter (Andoni ¶ 0016 teaches the third neural network 170 may output values μp and Σp, as shown at 172, where μp and Σp represent mean (that is, a mean parameter) and variance (that is, a standard deviation parameter) of a distribution (e.g., a Gaussian normal distribution), respectively, and the subscript “p” is used to denote that the values will be used as priors for cluster distance measurement, as further described below.), and (iii) a weighting parameter (Andoni ¶ 0033 teaches calculator/detector 130 may initiate adjustment at one or more of the first neural network 110, the second neural network(s) 120, or the third neural network 170, based on the aggregate loss L. For example, link weights (that is, weighting parameter), bias functions, bias values, etc. may be modified via backpropagation to minimize the aggregate loss L using stochastic gradient descent), for each component of the Gaussian mixture distribution for the dimension (Andoni ¶ 0019 teaches that [b]ecause each of the clusters has its own Gaussian distribution, the VAE may be considered a Gaussian Mixture Model (GMM) VAE (that is, each component of the Gaussian mixture distribution for the dimension)).
Though Andoni and Greff teach the feature of a “prior neural network” providing a Gaussian distribution, the combination of Andoni and Greff, however, does not explicitly teach -
the prior probability distribution is a multi-dimensional probability distribution (Hewage ¶ 0008 teaches [t]he plurality of contiguous neural sample data sequences 104a to 104l forms a time series of neural sample data sequences of one or more dimensions. For example, if the one or more neurological signal(s) 102 correspond to a multi-channel neurological signal with M channels, then the plurality of contiguous neural sample data sequences 104a to 104l would be a time series of neural data samples of dimension M (that is, “multi-channel” of M channels is a multi-dimensional probability)) where each dimension of the prior probability distribution is a Gaussian mixture probability distribution (Hewage ¶ 0111 teaches [t]he probability distribution P(z) 214 may be selected from one or more probability distributions or combinations thereof from the group of: a Laplacian distribution; a Gamma distribution; a Gaussian distribution (that is, each dimension of the prior probability distribution is a Gaussian mixture probability distribution)); and
* * *
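For illustration only, a multi-dimensional prior in which each dimension is a Gaussian mixture parameterised by a weight, a mean, and a standard deviation per component can be sketched as follows. The data layout of `params` and the function names are assumptions made for this sketch; they are not taken from Andoni, Hewage, or the Specification.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a one-dimensional Gaussian N(mean, std^2) at x."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))

def mixture_prior_log_prob(z, params):
    """Log density of z under independent per-dimension Gaussian mixtures.

    ASSUMPTION: params[d] is a list of (weight, mean, std) tuples, one per
    mixture component of dimension d, with weights summing to 1.
    """
    total = 0.0
    for z_d, components in zip(z, params):
        # Each dimension's density is the weighted sum of its components.
        density = sum(w * gaussian_pdf(z_d, m, s) for w, m, s in components)
        total += math.log(density)
    return total

# Two dimensions, two components each (hypothetical values).
example_params = [
    [(0.5, -1.0, 1.0), (0.5, 1.0, 1.0)],
    [(0.3, 0.0, 0.5), (0.7, 2.0, 1.5)],
]
example_log_prob = mixture_prior_log_prob([0.0, 1.0], example_params)
```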
Andoni, Greff and Hewage are from the same or similar field of endeavor. Andoni teaches jointly training a plurality of neural networks that include a variational autoencoder and another neural network that performs latent space cluster mapping operations. Greff teaches a neural network to group the representations of different objects in an iterative manner as parametric mapping for an autoencoder. Hewage teaches an adversarial component of an autoencoder that allows clustering of data in an unsupervised fashion. Thus, it would have been obvious to a person having ordinary skill in the art as of the effective filing date of the Applicant’s invention to modify the combination of Andoni and Greff pertaining to latent space cluster mapping operations with the adversarial component of Hewage.
The motivation for doing so is to constrain the latent representations of the sequence-to-sequence network 260 to be label-like, which allows classification/labelling of the latent representations in relation to neural activity encoding one or more bodily variables or combinations thereof. (Hewage ¶ 0120).
11.	Claim 7 is rejected under 35 U.S.C. 103 as being unpatentable over US Published Application 20190228312 to Andoni et al. [hereinafter Andoni] in view of Greff et al., “Tagger: Deep Unsupervised Perceptual Grouping,” arXiv (2016) [hereinafter Greff] and James M. Joyce, “Kullback-Leibler Divergence,” Int’l Encyclopedia of Statistical Science (2011) [hereinafter Joyce].
Regarding claim 7, the combination of Andoni and Greff teaches all of the limitations of claim 1, as described in detail above. 
Andoni teaches -
wherein the loss function is given by a sum of terms comprising: (i) a Kullback-Leibler divergence measure between the data-conditional encoding probability distribution and the prior probability distribution (Andoni ¶ 0028 teaches loss function determination for an entry should also consider distance from clusters. In a particular aspect, cluster distance is incorporated into loss calculation using two Kullback-Leibler (KL) divergences), . . . .
Though Andoni and Greff teach the feature of a “Kullback-Leibler Divergence” supplemented to include a penalty that encourages small latent sizes, the combination of Andoni and Greff, however, does not explicitly teach -
. . . and (ii) a negative logarithm of the likelihood of the given observation based on the observation probability distribution (Joyce, left column of p. 721, “Kullback-Leibler Divergence,” first full paragraph teaches [t]here is also a connection between KL-divergence and maximum likelihood estimation [which is a negative logarithm of the likelihood of the given observation based on the observation probability distribution]).
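For illustration only, a loss formed as the sum of (i) a Kullback-Leibler divergence between an encoding distribution and a prior and (ii) a negative log-likelihood of the observation can be sketched as follows. The sketch assumes diagonal Gaussian distributions throughout so that the divergence has a closed form; that assumption, and all function and parameter names, are introduced here and are not taken from Andoni, Joyce, or the Specification.

```python
import math

def kl_diag_gaussians(mu_q, std_q, mu_p, std_p):
    """KL(q || p) for diagonal Gaussians q = N(mu_q, std_q^2), p = N(mu_p, std_p^2)."""
    return sum(
        math.log(sp / sq) + (sq ** 2 + (mq - mp) ** 2) / (2.0 * sp ** 2) - 0.5
        for mq, sq, mp, sp in zip(mu_q, std_q, mu_p, std_p)
    )

def gaussian_nll(x, mu_x, std_x):
    """Negative log-likelihood of x under a diagonal Gaussian N(mu_x, std_x^2)."""
    return sum(
        0.5 * math.log(2.0 * math.pi * s ** 2) + (xi - m) ** 2 / (2.0 * s ** 2)
        for xi, m, s in zip(x, mu_x, std_x)
    )

def loss(x, mu_x, std_x, mu_q, std_q, mu_p, std_p):
    """Sum of term (i), the KL divergence, and term (ii), the negative log-likelihood."""
    return kl_diag_gaussians(mu_q, std_q, mu_p, std_p) + gaussian_nll(x, mu_x, std_x)
```

Under these assumptions the first term is zero when the encoding distribution equals the prior, and the second term is smallest when the observation coincides with the mean of the observation distribution.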
Andoni, Greff, and Joyce are from the same or similar field of endeavor. Andoni teaches jointly training a plurality of neural networks that include a variational autoencoder and another neural network that performs latent space cluster mapping operations. Greff teaches a neural network to group the representations of different objects in an iterative manner as parametric mapping for an autoencoder. Joyce teaches Kullback-Leibler divergence is an information-based measure of disparity among probability distributions. Thus, it would have been obvious to a person having ordinary skill in the art as of the effective filing date of the Applicant’s invention to modify the combination of Andoni and Greff pertaining to latent space cluster mapping operations with the information-based measure of Joyce.
The motivation for doing so is to minimize the Kullback-Leibler divergence from the empirical distribution. (Joyce, left column of p. 721, “Kullback-Leibler Divergence,” first full paragraph). 
12.	Claim 12 is rejected under 35 U.S.C. 103 as being unpatentable over US Published Application 20190228312 to Andoni et al. [hereinafter Andoni] in view of Greff et al., “Tagger: Deep Unsupervised Perceptual Grouping,” arXiv (2016) [hereinafter Greff] and Van den Oord et al., “Conditional Image Generation with PixelCNN Decoders,” NIPS (2016) [hereinafter Oord].
Regarding claim 12, the combination of Andoni and Greff teaches all of the limitations of claim 1, as described above in detail. 
Though Andoni and Greff teach the features of neural networks in an autoencoder application, the combination of Andoni and Greff, however, does not explicitly teach -
wherein the decoder neural network comprises an autoregressive neural network.
But Oord teaches -
wherein the decoder neural network comprises an autoregressive neural network (Oord at p. 1, “1. Introduction,” second paragraph, teaches adapting and improving a convolutional variant of the PixelRNN architecture. . . . The basic idea of the architecture is to use autoregressive connections to model images pixel by pixel, decomposing the joint image distribution as a product of conditionals (that is, the decoder neural network comprises an autoregressive neural network)
[Examiner notes the Specification points to Oord as an example autoregressive neural network for the decoder neural network (Specification ¶ 0087)]).
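For illustration only, the autoregressive idea quoted above (decomposing a joint distribution over pixels into a product of conditionals) can be sketched with binary pixels. The conditional rule in `conditional_prob_one` is a hypothetical stand-in; in a PixelCNN the conditionals are computed by a convolutional network, which this sketch does not attempt to reproduce.

```python
import math

def conditional_prob_one(previous_pixels):
    """Toy stand-in for p(x_i = 1 | x_1, ..., x_{i-1})."""
    # ASSUMPTION: a hand-written rule used only to make the sketch runnable.
    if not previous_pixels:
        return 0.5
    return 0.9 if previous_pixels[-1] == 1 else 0.1

def autoregressive_log_prob(pixels):
    """log p(x) = sum over i of log p(x_i | x_1, ..., x_{i-1})."""
    total = 0.0
    for i, x in enumerate(pixels):
        # Each pixel is conditioned only on the pixels that precede it.
        p_one = conditional_prob_one(pixels[:i])
        total += math.log(p_one if x == 1 else 1.0 - p_one)
    return total

example = autoregressive_log_prob([1, 1, 0])
```

Because every conditional is a valid distribution, the probabilities of all sequences of a given length sum to one.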
Andoni, Greff, and Oord are from the same or similar field of endeavor. Andoni teaches jointly training a plurality of neural networks that include a variational autoencoder and another neural network that performs latent space cluster mapping operations. Greff teaches a neural network to group the representations of different objects in an iterative manner as parametric mapping for an autoencoder. Oord teaches an autoregressive neural network as a decoder in an autoencoder. Thus, it would have been obvious to a person having ordinary skill in the art as of the effective filing date of the Applicant’s invention to modify the combination of Andoni and Greff pertaining to latent space cluster mapping operations with the decoder autoregressive neural network of Oord.
The motivation for doing so is that the conditional PixelCNN, which uses autoregressive connections to model images pixel by pixel, can serve as a powerful decoder in an image autoencoder. (Oord, Abstract).
Conclusion
13.	The prior art made of record and not relied upon is considered pertinent to applicant's disclosure:
(US Patent 9990687 to Kaufhold et al.) teaches embeddings from autoencoders and intermediate hidden representations from a deep analyzer provide a subset of the benefits of deep embeddings.
(US Published Application 20180225812 to DiVerdi et al.) teaches based on the feature vector and determined latent variables, the systems and methods generate a plurality of determined image edits for the digital image, which includes determining a plurality of set of potential image attribute values and selecting a plurality of sets of determined image attribute values from the plurality of sets of potential image attribute values wherein each set of determined image attribute values comprises a determined image edit of the plurality of image edits.
(Makhzani et al., “PixelGAN Autoencoders,” NIPS (2017)) teaches convolutional autoregressive neural network on pixels (PixelCNN) that is conditioned on a latent code, and the recognition path uses a generative adversarial network (GAN) to impose a prior distribution on the latent code. We show that different priors result in different decompositions of information.
14.	Any inquiry concerning this communication or earlier communications from the Examiner should be directed to KEVIN L. SMITH whose telephone number is (571) 272-5964. Normally, the Examiner is available on Monday-Thursday 0730-1730. 
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the Examiner by telephone are unsuccessful, the Examiner’s supervisor, KAKALI CHAKI can be reached on 571-272-3719. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR. Status information for unpublished applications is available through Private PAIR only. For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/K.L.S./
Examiner, Art Unit 2122
/BRIAN M SMITH/Primary Examiner, Art Unit 2122