DETAILED ACTION

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Priority

Acknowledgment is made of applicant's claim for foreign priority based on an application filed in the United Kingdom of Great Britain and Northern Ireland on 13 June 2019. It is noted, however, that the priority claim has not been entered because it was not filed during the required time period, as mentioned in the miscellaneous communication to applicant dated 25 May 2021. Applicant may wish to consider filing a petition to accept an unintentionally delayed claim for priority.

Response to Amendment
This action is in response to the submission filed 3 November 2021 for application 16/507,025. Currently, claims 1, 5, 8-11, and 14-17 are amended. Claims 19 and 20 are cancelled. Claims 21 and 22 are newly added. Claims 1-18, 21, and 22 are pending and have been examined.
Applicant’s arguments with respect to the objection to the specification have been fully considered and are persuasive. The objection to the specification has been withdrawn.
Applicant’s arguments with respect to the objection to the abstract have been fully considered and are persuasive. The objection to the abstract has been withdrawn.
Applicant’s arguments with respect to the §112(b) rejection of claims 19 and 20 have been fully considered and are persuasive. The §112(b) rejection of claims 19 and 20 has been withdrawn.
Applicant’s arguments with respect to the §101 rejection of claims 19 and 20 have been fully considered and are persuasive. The §101 rejection of claims 19 and 20 has been withdrawn.
Applicant’s arguments with respect to the §101 rejection of claims 1-7, 11-13, and 16-20 have been fully considered and are persuasive. The §101 rejection of claims 1-7, 11-13, and 16-20 has been withdrawn.

Response to Arguments
Applicant’s arguments, see Page 13 of remarks, filed 3 November 2021, with respect to the rejection of claim 1 under 35 USC §103 have been fully considered but are not persuasive. Applicant submits that the proposed combination fails to teach or suggest each and every element of claim 1. Specifically, applicant argues that Kingma and Gal, either alone or in combination, do not disclose at least the limitations c) through f) from claim 1. Examiner respectfully disagrees. The combination of Gal (pages 1 and 3) and Kingma (page 3) teaches c) through f). 
The Gal reference states on Page 1, Column 2, Paragraph 1, that "an acquisition function (often based on the model’s uncertainty) decides which data points" and that "The acquisition function selects one or more points from a pool of unlabelled data points." Page 3, Column 1, Paragraph 2 states that "the uncertainty in the weights induces prediction uncertainty," and Page 3, Column 1, Section 4 states, "For example, we might look for images with high predictive variance and choose those to ask an expert to label – in the hope that these will decrease model uncertainty." Under the broadest reasonable interpretation, examiner is interpreting these passages as teaching c) from amongst a plurality of potential next features to observe, searching for a target feature of the feature vector which maximizes a measure of expected reduction in uncertainty in a distribution of said weights, noting that "from a pool of unlabelled data points" corresponds to "from amongst a plurality of potential next features" and "selects one or more points" corresponds to "searching for a target feature of the feature vector." Page 3, Paragraph 2 of the Kingma reference states, "the datapoint x. A probabilistic decoder, since given a code z it produces a distribution over the possible corresponding values of x," which under the broadest reasonable interpretation, examiner is interpreting as "of the generative network given the observed data points so far," noting that the probabilistic decoder corresponds to the generative network.
Gal teaches d) outputting a request to collect a target data point comprising a value of at least the target feature, on Pages 1 and 3. Page 1, Column 2, Paragraph 1 of Gal states, "data points to ask an external oracle for a label," and Page 3, Column 1, Section 4 states, "we might look for images with high predictive variance and choose those to ask an expert to label," which under the broadest reasonable interpretation, examiner is interpreting as d) outputting a request to collect a target data point comprising a value of at least the target feature.
Gal teaches e) receiving the target data point in response to the request, on Page 1. Page 1, Column 2, Paragraph 1 states, "the selected data points, these are added to the training set," which under the broadest reasonable interpretation, examiner is interpreting as e) receiving the target data point in response to the request.
Gal teaches f) further training of the model based on the received target data point, on Page 1. Page 1, Column 2, Paragraph 1 states, "and a new model is trained on the updated training set," which under the broadest reasonable interpretation, examiner is interpreting as f) further training of the model based on the received target data point.
Furthermore, applicant's arguments, see pages 16 and 17, with respect to the rejection of the dependent claims under 35 USC § 103 have been fully considered but are not persuasive, because the dependent claims depend from independent claim 1 and the cited combination of references teaches every element of the amended claims, as shown above.
Also, applicant's arguments, see page 16, with respect to the newly amended feature “wherein at least one of the observed data points is an incomplete observation, the incomplete observation 
Applicant's arguments, see page 17, with respect to the new independent claims 21 and 22, have been fully considered but are not persuasive. These claims are rejected under 35 USC §103 for the same reasons as set forth above in the rejection of independent claim 1. 

Claim Objections
Claims 12 and 18 are objected to because although they are not rejected, they are dependent on independent claim 1 which is being rejected under 35 USC § 103.

Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.

Claim 21 is rejected under 35 U.S.C. 101 because the claimed invention is directed to non-statutory subject matter. The claim does not fall within at least one of the four categories of patent eligible subject matter because the claimed invention is directed to a signal per se.
Claim 21 recites "one or more memory devices" for storing device-executable code, wherein the claimed memory devices are not limited to statutory elements. The one or more memory devices are described in paragraph [036] of the specification, and that description is not limited to statutory embodiments because the elements are presented as exemplary embodiments without limiting the meaning of memory devices. Accordingly, the claimed memory devices are not limited to statutory elements only and are thus non-statutory. The claim is not patent eligible.
In order to overcome this rejection Examiner recommends an option of amending Claim 21 to indicate that the memory device is limited to being non-transitory.


Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary.  Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.
Claims 1, 2, 4-9, 11, 13-17, 21, and 22 are rejected under 35 U.S.C. 103 as being unpatentable over Kingma et al (Auto-Encoding Variational Bayes, 2014) in view of Gal et al (Deep Bayesian Active Learning with Image Data, 2017).
	
Regarding claim 1
Kingma teaches: A computer-implemented method of training a model comprising one or more neural networks including ([Page 11, Section C] In our example we used relatively simple neural networks) at least a generative network ([Figure 1] Solid lines denote the generative model) the generative network having a latent vector as an input vector and a feature vector as an output vector ([Page 3, Paragraph 2] In a similar vein we will refer to p(x|z) as a probabilistic decoder, since given a code z it produces a distribution over the possible corresponding values of x. [Page 7, Paragraph 5 (Section: Likelihood lower bound)] We trained generative models (decoders). Note: probabilistic decoder corresponds to the generative network, given a code z corresponds to a latent vector as an input vector, x corresponds to feature vector as an output vector) each element of the feature vector representing a different one of a set of features ([Page 6, Section 5] We trained generative models of images from the MNIST and Frey Face datasets), wherein weights applied by at least some nodes in the generative network are each modelled as a probabilistic distribution ([Page 3, Paragraph 2] we will refer to p(x|z) as a probabilistic decoder. [Page 11, Section C] In variational auto-encoders, neural networks are used as probabilistic encoders and decoders. [Page 11, Section C.1] and where {W1, W2, b1, b2} are the weights); the method comprising:
a) obtaining one or more observed data points, each comprising a respective subset of feature values, wherein within each subset, each feature value is a value of a corresponding one of a subset of the features in the feature vector ([Page 4, Algorithm 1] Random minibatch of M datapoints (drawn from full dataset). [Page 4, Paragraph 2, below equation (7)] Given multiple datapoints from a dataset X with N datapoints, we can construct an estimator of the marginal likelihood lower bound of the full dataset, based on minibatches. [Page 4, Paragraph 2, below equation (8)] where the minibatch X^M = {x^(i)}, i = 1, ..., M, is a randomly drawn sample of M datapoints from the full dataset X with N datapoints. In our experiments we found that the number of samples L per datapoint can be set to 1 as long as the minibatch size M was large enough, e.g. M = 100);
b) training the model based on the one or more observed data points to learn values of the weights of the generative network which map the latent vector to the feature vector ([Page 3, Paragraph 2] In a similar vein we will refer to p(x|z) as a probabilistic decoder, since given a code z it produces a distribution over the possible corresponding values of x. [Page 4, Paragraph 4] Subsequently, the sample z^(i,l) is then input to function log p(x^(i)|z^(i,l)), which equals the probability density (or mass) of datapoint x^(i) under the generative model, given z^(i,l). [Page 11, Section C.1] where {W1, W2, b1, b2} are the weights. Note: Decoder corresponds to generative network, z corresponds to latent vector, and x corresponds to feature vector);
of the generative network given the observed data points so far ([Page 3, Paragraph 2] the datapoint x. A probabilistic decoder, since given a code z it produces a distribution over the possible corresponding values of x. Note: probabilistic decoder corresponds to the generative network).
However, Kingma does not explicitly disclose: c) from amongst a plurality of potential next features to observe, searching for a target feature of the feature vector which maximizes a measure of expected reduction in uncertainty in a distribution of said weights; d) outputting a request to collect a target data point comprising at least the target feature; e) receiving the target data point in response to the request, and f)  further training of the model based on the received target data point.
Gal teaches, in an analogous system: c) from amongst a plurality of potential next features to observe, searching for a target feature of the feature vector which maximizes a measure of expected reduction in uncertainty in a distribution of said weights ([Page 1, Column 2, Paragraph 1] an acquisition function (often based on the model’s uncertainty) decides which data points. The acquisition function selects one or more points from a pool of unlabelled data points. [Page 3, Column 1, Paragraph 2] The uncertainty in the weights induces prediction uncertainty. [Page 3, Column 1, Section 4] For example, we might look for images with high predictive variance and choose those to ask an expert to label – in the hope that these will decrease model uncertainty. Note: from a pool of unlabelled data points corresponds to from amongst a plurality of potential next features and selects one or more points corresponds to searching for a target feature of the feature vector);
d) outputting a request to collect a target data point comprising a value of at least the target feature ([Page 1, Column 2, Paragraph 1] data points to ask an external oracle for a label. [Page 3, Column 1, Section 4] we might look for images with high predictive variance and choose those to ask an expert to label);
e) receiving the target data point in response to the request, and ([Page 1, Column 2, Paragraph 1] the selected data points, these are added to the training set);
f)  further training of the model based on the received target data point ([Page 1, Column 2, Paragraph 1] and a new model is trained on the updated training set).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the system of Kingma to incorporate the teachings of Gal to use an acquisition function (often based on the model’s uncertainty) to decide which data points to ask an external oracle for a label. One would have been motivated to do this modification because doing so would give the benefit of often resulting in dramatic reductions in the amount of labelling required to train an ML system (and therefore cost and time) as taught by Gal paragraph [Page 1, Column 2, Paragraph 1].
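
For illustration, the pool-based loop described in the cited Gal passage (an acquisition function selects a pool point, an oracle labels it, the point is added to the training set, and a new model is trained) can be sketched in Python as below. This is a minimal sketch under stated assumptions and not code from Gal or Kingma: the names `acquire`, `active_learning_loop`, `toy_fit`, and `toy_oracle` are hypothetical, predictive variance is used as one example of an uncertainty-based acquisition function, and the toy model simply treats points far from any labelled point as more uncertain.

```python
# Hypothetical sketch of a pool-based active learning loop; all names are illustrative.

NOISE = (-1.0, 0.0, 1.0)  # fixed perturbations standing in for sampled weights


def predictive_variance(samples):
    """Variance of the sampled predictions for one pool point."""
    mean = sum(samples) / len(samples)
    return sum((s - mean) ** 2 for s in samples) / len(samples)


def acquire(pool, sample_predictions):
    """Acquisition function: index of the pool point with the highest predictive variance."""
    scores = [predictive_variance(sample_predictions(x)) for x in pool]
    return max(range(len(pool)), key=scores.__getitem__)


def toy_fit(train):
    """Toy stand-in for training: predictions spread out with distance to labelled data."""
    def sample_predictions(x):
        nearest_x, nearest_y = min(train, key=lambda pair: abs(pair[0] - x))
        distance = abs(nearest_x - x)
        return [nearest_y + distance * eps for eps in NOISE]
    return sample_predictions


def toy_oracle(x):
    """Toy stand-in for the external oracle (e.g. a human expert) that supplies a label."""
    return int(x > 0.5)


def active_learning_loop(train, pool, oracle, fit, rounds):
    """Train, select a pool point, request its label, add it, and retrain."""
    train, pool = list(train), list(pool)      # work on copies of the inputs
    model = fit(train)
    for _ in range(rounds):
        if not pool:
            break
        x = pool.pop(acquire(pool, model))     # select the most uncertain pool point
        train.append((x, oracle(x)))           # request and receive its label
        model = fit(train)                     # train a new model on the updated set
    return model, train


model, labelled = active_learning_loop(
    [(0.0, 0)], [0.1, 0.5, 0.9], toy_oracle, toy_fit, rounds=2
)
```

In this toy run the loop first requests the pool point farthest from the single labelled point and then the point farthest from both labelled points, so the training set grows by one point per round.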

Regarding claim 2
The system of Kingma and Gal teaches: The method of claim 1 (as shown above).
Gal further teaches: wherein the request comprises a message requesting a human user or group of human users to collect the target data point ([Page 1, Column 2, Paragraph 1] data points to ask an external oracle for a label. An oracle (often a human expert) labels the selected data points).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the system of Kingma to incorporate the teachings of Gal to ask an external oracle for a label. One would have been motivated to do this modification because doing so would give the benefit of often resulting in dramatic reductions in the amount of labelling required to train an ML system (and therefore cost and time) as taught by Gal paragraph [Page 1, Column 2, Paragraph 1].

Regarding claim 4
Kingma teaches: The method of claim 1, wherein at least some connections between nodes in the generative network are each modelled as a probabilistic distribution ([Page 11, Section C] In variational auto-encoders, neural networks are used as probabilistic encoders and decoders. [Page 11, Section C.1] In this case let p(x|z) be a multivariate Bernoulli whose probabilities are computed from z with a fully-connected neural network with a single hidden layer. Note: Decoders correspond to generative network).

Regarding claim 5
Kingma teaches: The method of claim 1, wherein the neural networks of the model further include an inference network having the feature vector as an input vector and the latent vector as an output vector, the inference network and the generative network thus forming an encoder and decoder respectively of a variational auto-encoder ([Page 3, Paragraph 2] In this paper we will therefore also refer to the recognition model q(z|x) as a probabilistic encoder, since given a datapoint x it produces a distribution (e.g. a Gaussian) over the possible values of the code z from which the datapoint x could have been generated. In a similar vein we will refer to p(x|z) as a probabilistic decoder, since given a code z it produces a distribution over the possible corresponding values of x. [Page 11, Section C] In variational auto-encoders, neural networks are used as probabilistic encoders and decoders. Note: Encoder corresponds to inference network, decoder corresponds to generative network, z corresponds to latent vector, and x corresponds to feature vector), and the training further comprises learning weights of the inference network which map the feature vector to the latent vector ([Page 3, Paragraph 2] From a coding theory perspective, the unobserved variables z have an interpretation as a latent representation or code. In this paper we will therefore also refer to the recognition model q(z|x) as a probabilistic encoder, since given a datapoint x it produces a distribution (e.g. a Gaussian) over the possible values of the code z from which the datapoint x could have been generated. [Page 11, Section C.2] where {W3, W4, W5, b3, b4, b5} are the weights. Note: Encoder corresponds to inference network, z corresponds to latent vector, and x corresponds to feature vector).
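
For illustration, the encoder and decoder mapping quoted above (the encoder takes a datapoint x to a Gaussian over the code z, and the decoder takes a code z to a distribution over x) can be sketched in Python as below. This is a minimal sketch under stated assumptions, not code from Kingma: the function names, layer sizes, and random initialisation are hypothetical; the weight names follow the {W1, W2, b1, b2} and {W3, W4, W5, b3, b4, b5} labels quoted in this action; and the weights here are fixed untrained numbers rather than learned values or distributions.

```python
import math
import random


def affine(W, b, v):
    """Compute W v + b for a weight matrix W (list of rows) and a bias vector b."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]


def encode(x, W3, b3, W4, b4, W5, b5):
    """Inference network q(z|x): mean and log-variance of a Gaussian over the code z."""
    h = [math.tanh(t) for t in affine(W3, b3, x)]
    return affine(W4, b4, h), affine(W5, b5, h)


def sample_latent(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, 1), one value per latent dimension."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]


def decode(z, W1, b1, W2, b2):
    """Generative network p(x|z): one Bernoulli probability per feature of x."""
    h = [math.tanh(t) for t in affine(W1, b1, z)]
    return [1.0 / (1.0 + math.exp(-t)) for t in affine(W2, b2, h)]


def random_matrix(rows, cols, rng):
    """Small random weights; an arbitrary initialisation chosen for this sketch."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
n_features, n_hidden, n_latent = 4, 3, 2  # hypothetical sizes
W3, b3 = random_matrix(n_hidden, n_features, rng), [0.0] * n_hidden
W4, b4 = random_matrix(n_latent, n_hidden, rng), [0.0] * n_latent
W5, b5 = random_matrix(n_latent, n_hidden, rng), [0.0] * n_latent
W1, b1 = random_matrix(n_hidden, n_latent, rng), [0.0] * n_hidden
W2, b2 = random_matrix(n_features, n_hidden, rng), [0.0] * n_features

x = [1.0, 0.0, 1.0, 1.0]                         # feature vector in
mu, log_var = encode(x, W3, b3, W4, b4, W5, b5)  # distribution over the latent vector
z = sample_latent(mu, log_var, rng)              # one sampled code
x_probs = decode(z, W1, b1, W2, b2)              # distribution over the feature vector
```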

Regarding claim 6
Kingma teaches: The method of claim 5, wherein the weights applied by at least some nodes in the inference network are each modelled as a probabilistic distribution ([Page 11, Section C] In variational auto-encoders, neural networks are used as probabilistic encoders and decoders. In our example we used relatively simple neural networks, namely multi-layered perceptrons (MLPs). For the encoder we used a MLP with Gaussian output. [Page 11, Section C.2] where {W3, W4, W5, b3, b4, b5} are the weights. Note: Encoder corresponds to inference network and neural network is a network of artificial neurons or nodes).


Regarding claim 7
Kingma teaches: The method of claim 5, wherein at least some connections between nodes in the inference network are each modelled as a probabilistic distribution ([Page 11, Section C] In variational auto-encoders, neural networks are used as probabilistic encoders and decoders. In our example we used relatively simple neural networks, namely multi-layered perceptrons (MLPs). For the encoder we used a MLP with Gaussian output.  Note: Encoder corresponds to inference network and neural network is a network of artificial neurons or nodes).

Regarding claim 8
The system of Kingma and Gal teaches: The method of claim 1 (as shown above).
Gal further teaches: wherein at least one of the observed data points is an incomplete observation, the incomplete observation comprising values for at least one, but not all, of the features in the feature vector ([Page 1, Column 2, Paragraph 1] unlabelled data points. Note: Unlabelled data points correspond to incomplete observation because the label is missing).

Regarding claim 9
The system of Kingma and Gal teaches: The method of claim 1 (as shown above).
Gal further teaches: comprising repeating a)-f) over multiple iterations, each iteration including the received target data point from a previous iteration amongst the observed data points ([Page 1, Column 2, Paragraph 1] An oracle (often a human expert) labels the selected data points, these are added to the training set, and a new model is trained on the updated training set. This process is then repeated, with the training set increasing in size over time. Note: Training set corresponds to the observed data points and selected data points corresponds to the target data points).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the system of Kingma to incorporate the teachings of Gal to train a new model on the updated training set and repeat the process. One would have been motivated to do this modification because doing so would give the benefit of often resulting in dramatic reductions in the amount of labelling required to train an ML system (and therefore cost and time) as taught by Gal paragraph [Page 1, Column 2, Paragraph 1].

Regarding claim 11
The system of Kingma and Gal teaches: The method of claim 1 (as shown above).
Gal further teaches: wherein a measure of uncertainty comprises a measure of a difference between i) an entropy of said distribution given the observed data points and  ii) an expectation of the entropy given the observed data points and the potential feature ([Page 3, Column 2, Paragraph 1] Formula).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the system of Kingma to incorporate the teachings of Gal to use the formula on Page 3, Column 2, Paragraph 1. One would have been motivated to do this modification because doing so would give the benefit of choosing pool points that are expected to maximize the information gained about the model parameters as taught by Gal paragraph [Page 3, Column 2, Paragraph 1].
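
The action cites the formula on Page 3, Column 2, Paragraph 1 of Gal without reproducing it. Going by the quoted motivation (choosing pool points that are expected to maximize the information gained about the model parameters), the quantity appears to be a mutual information of the general form: entropy of the averaged prediction minus the expected entropy of the individual predictions. A minimal Python sketch of that general form follows. The function names are hypothetical, and estimating the expectation by averaging over a handful of sampled weight settings is an assumption of this sketch, not something stated in the action.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mutual_information(sampled_probs):
    """Entropy of the mean prediction minus the mean entropy of the sampled predictions.

    sampled_probs holds one class-probability vector per sampled weight setting.
    """
    n = len(sampled_probs)
    n_classes = len(sampled_probs[0])
    mean = [sum(p[c] for p in sampled_probs) / n for c in range(n_classes)]
    expected_entropy = sum(entropy(p) for p in sampled_probs) / n
    return entropy(mean) - expected_entropy


# Two sampled weight settings that agree: no information to gain about the weights.
score_agree = mutual_information([[0.5, 0.5], [0.5, 0.5]])
# Two sampled weight settings that confidently disagree: the score is at its largest.
score_disagree = mutual_information([[1.0, 0.0], [0.0, 1.0]])
```

When every sampled weight setting gives the same prediction the score is zero, and when two settings disagree with full confidence on a two-class problem the score equals log 2.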

Regarding claim 13
Kingma teaches: The method of claim 1, wherein each of the observed data points is labelled with a classification ([Page 6, Section 5] We trained generative models of images from the MNIST and Frey Face datasets. Note: MNIST corresponds to a labelled dataset).

Regarding claim 14
The system of Kingma and Gal teaches: The method of claim 13 (as shown above).
([Page 1, Column 2, Paragraph 1] labels the selected data points, these are added to the training set)
and repeating a)-f) over multiple iterations, each iteration including the received target data point from a previous iteration of the multiple iterations amongst the observed data points ([Page 1, Column 2, Paragraph 1] This process is then repeated, with the training set increasing in size over time. Note: Repeated corresponds to multiple iterations).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the system of Kingma to incorporate the teachings of Gal to label the selected data points and train the new model on the updated training set and repeating the process. One would have been motivated to do this modification because doing so would give the benefit of often resulting in dramatic reductions in the amount of labelling required to train an ML system (and therefore cost and time) as taught by Gal paragraph [Page 1, Column 2, Paragraph 1].

Regarding claim 15
The system of Kingma and Gal teaches: The method of claim 14 (as shown above).
Gal further teaches: further comprising using the model to predict the classification of a further data point after one or more of the iterations of said further training ([Page 6, Column 2, Section 5.5] Our task is to classify the images as malignant or benign. [Page 7, Column 2, Paragraph 1] The process is repeated until all pool points have been exhausted. Note: Pool point corresponds to data point and repeated corresponds to iterations).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the system of Kingma to incorporate the teachings of Gal to classify the image and repeat the process. One would have been motivated to do this modification because doing so would give the benefit of assessing the proposed technique with a real world test case as taught by Gal paragraph [Page 6, Column 2, Section 5.5].

Regarding claim 16
The system of Kingma and Gal teaches: The method of claim 13 (as shown above).
Gal further teaches: wherein a measure of uncertainty comprises a measure of a combination of: - a difference between i) an entropy of said distribution given the observed data points and ii) an expectation of the entropy given the observed data points and the potential feature; and - a measure of conditional mutual information in the weights of the generative network and a predicted classification given the observed data points and the potential feature ([Page 3, Column 2, Paragraph 1] Formula).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the system of Kingma to incorporate the teachings of Gal to use the formula on Page 3, Column 2, Paragraph 1. One would have been motivated to do this modification because doing so would give the benefit of choosing pool points that are expected to maximize the information gained about the model parameters as taught by Gal paragraph [Page 3, Column 2, Paragraph 1].

Regarding claim 17
The system of Kingma and Gal teaches: The method of claim 16 (as shown above).
Gal further teaches: wherein said measure of conditional mutual information comprises a measure of RP where: <<FORMULA>> where H is the entropy, E is the expectation, p is said distribution, X_o is a vector of the observed data points, θ is a vector of the weights of the generative network, x_id is the feature value of feature d of the feature vector in data point i, and y_i is a predicted classification ([Page 3, Column 2, Paragraph 1] Formula).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the system of Kingma to incorporate the teachings of Gal to use the formula on Page 3, Column 2, Paragraph 1. One would have been motivated to do this modification because doing so would give the benefit of choosing pool points that are expected to maximize the information gained about the model parameters as taught by Gal paragraph [Page 3, Column 2, Paragraph 1].

Regarding claim 21
Kingma teaches: One or more memory devices storing device-executable code for training a model comprising one or more neural networks including ([Page 11, Section C] In our example we used relatively simple neural networks) at least a generative network ([Figure 1] Solid lines denote the generative model) the generative network having a latent vector as an input vector and a feature vector as an output vector ([Page 3, Paragraph 2] In a similar vein we will refer to p(x|z) as a probabilistic decoder, since given a code z it produces a distribution over the possible corresponding values of x. [Page 7, Paragraph 5 (Section: Likelihood lower bound)] We trained generative models (decoders). Note: probabilistic decoder corresponds to the generative network, given a code z corresponds to a latent vector as an input vector, x corresponds to feature vector as an output vector) each element of the feature vector representing a different one of a set of features ([Page 6, Section 5] We trained generative models of images from the MNIST and Frey Face datasets), wherein weights applied by at least some nodes in the generative network are each modelled as a probabilistic distribution ([Page 3, Paragraph 2] we will refer to p(x|z) as a probabilistic decoder. [Page 11, Section C] In variational auto-encoders, neural networks are used as probabilistic encoders and decoders. [Page 11, Section C.1] and where {W1, W2, b1, b2} are the weights); the code when executed by a processing apparatus direct the processing apparatus to perform operations comprising ([Page 7, Figure 2 legend] Computation took around 20-40 minutes per million training samples with a Intel Xeon CPU running at an effective 40 GFLOPS):
a) obtaining one or more observed data points, each comprising a respective subset of feature values, wherein within each subset, each feature value is a value of a corresponding one of a subset of the features in the feature vector ([Page 4, Algorithm 1] Random minibatch of M datapoints (drawn from full dataset). [Page 4, Paragraph 2, below equation (7)] Given multiple datapoints from a dataset X with N datapoints, we can construct an estimator of the marginal likelihood lower bound of the full dataset, based on minibatches. [Page 4, Paragraph 2, below equation (8)] where the minibatch X^M = {x^(i)}, i = 1, ..., M, is a randomly drawn sample of M datapoints from the full dataset X with N datapoints. In our experiments we found that the number of samples L per datapoint can be set to 1 as long as the minibatch size M was large enough, e.g. M = 100);
b) training the model based on the one or more observed data points to learn values of the weights of the generative network which map the latent vector to the feature vector ([Page 3, Paragraph 2] In a similar vein we will refer to p(x|z) as a probabilistic decoder, since given a code z it produces a distribution over the possible corresponding values of x. [Page 4, Paragraph 4] Subsequently, the sample z^(i,l) is then input to function log p(x^(i)|z^(i,l)), which equals the probability density (or mass) of datapoint x^(i) under the generative model, given z^(i,l). [Page 11, Section C.1] where {W1, W2, b1, b2} are the weights. Note: Decoder corresponds to generative network, z corresponds to latent vector, and x corresponds to feature vector);
of the generative network given the observed data points so far ([Page 3, Paragraph 2] the datapoint x. A probabilistic decoder, since given a code z it produces a distribution over the possible corresponding values of x. Note: probabilistic decoder corresponds to the generative network).
However, Kingma does not explicitly disclose: c) from amongst a plurality of potential next features to observe, searching for a target feature of the feature vector which maximizes a measure of expected reduction in uncertainty in a distribution of said weights; d) outputting a request to collect a target data point comprising at least the target feature; e) receiving the target data point in response to the request, and f)  further training of the model based on the received target data point.
Gal teaches, in an analogous system: c) from amongst a plurality of potential next features to observe, searching for a target feature of the feature vector which maximizes a measure of expected reduction in uncertainty in a distribution of said weights ([Page 1, Column 2, Paragraph 1] an acquisition function (often based on the model’s uncertainty) decides which data points. The acquisition function selects one or more points from a pool of unlabelled data points. [Page 3, Column 1, Paragraph 2] The uncertainty in the weights induces prediction uncertainty. [Page 3, Column 1, Section 4] For example, we might look for images with high predictive variance and choose those to ask an expert to label – in the hope that these will decrease model uncertainty. Note: from a pool of unlabelled data points corresponds to from amongst a plurality of potential next features and selects one or more points corresponds to searching for a target feature of the feature vector);
d) outputting a request to collect a target data point comprising a value of at least the target feature ([Page 1, Column 2, Paragraph 1] data points to ask an external oracle for a label. [Page 3, Column 1, Section 4] we might look for images with high predictive variance and choose those to ask an expert to label);
e) receiving the target data point in response to the request, and ([Page 1, Column 2, Paragraph 1] the selected data points, these are added to the training set);
f)  further training of the model based on the received target data point ([Page 1, Column 2, Paragraph 1] and a new model is trained on the updated training set).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the system of Kingma to incorporate the teachings of Gal to use an acquisition function (often based on the model’s uncertainty) to decide which data points to ask an external oracle for a label. 

Regarding claim 22
Kingma teaches: A computer for training a model comprising one or more neural networks including ([Page 11, Section C] In our example we used relatively simple neural networks) at least a generative network ([Figure 1] Solid lines denote the generative model), the generative network having a latent vector as an input vector and a feature vector as an output vector ([Page 3, Paragraph 2] In a similar vein we will refer to p (x|z) as a probabilistic decoder, since given a code z it produces a distribution over the possible corresponding values of x. [Page 7, Paragraph 5 (Section: Likelihood lower bound)] We trained generative models (decoders). Note: probabilistic decoder corresponds to the generative network, given a code z corresponds to a latent vector as an input vector, x corresponds to feature vector as an output vector), each element of the feature vector representing a different one of a set of features ([Page 6, Section 5] We trained generative models of images from the MNIST and Frey Face datasets), wherein weights applied by at least some nodes in the generative network are each modelled as a probabilistic distribution ([Page 3, Paragraph 2] we will refer to p (x|z) as a probabilistic decoder. [Page 3, Section C] In variational auto-encoders, neural networks are used as probabilistic encoders and decoders. [Page 11, Section C.1] and where {W1;W2; b1; b2} are the weights); the computer comprising: at least one processor; at least one memory device storing computer-executable code that, in ([Page 7, Figure 2 legend] Computation took around 20-40 minutes per million training samples with a Intel Xeon CPU running at an effective 40 GFLOPS):
a) obtain one or more observed data points, each comprising a respective subset of feature values, wherein within each subset, each feature value is a value of a corresponding one of a subset of the features in the feature vector ([Page 4, Algorithm 1] Random minibatch of M datapoints (drawn from full dataset). [Page 4, Paragraph 2, below equation (7)] Given multiple datapoints from a dataset X with N datapoints, we can construct an estimator of the marginal likelihood lower bound of the full dataset, based on minibatches. [Page 4, Paragraph 2, below equation (8)] where the minibatch $X^M = \{x^{(i)}\}_{i=1}^{M}$ is a randomly drawn sample of M datapoints from the full dataset X with N datapoints. In our experiments we found that the number of samples L per datapoint can be set to 1 as long as the minibatch size M was large enough, e.g. M = 100); 
b) train the model based on the one or more observed data points to learn values of the weights of the generative network which map the latent vector to the feature vector ([Page 3, Paragraph 2] In a similar vein we will refer to p (x|z) as a probabilistic decoder, since given a code z it produces a distribution over the possible corresponding values of x. [Page 4, Paragraph 4] Subsequently, the sample z(i,l) is then input to function log p (x(i)|z(i,l)), which equals the probability density (or mass) of datapoint x(i) under the generative model, given z(i;l). [Page 11, Section C.1] where {W1;W2; b1; b2} are the weights. Note: Decoder corresponds to generative network, z corresponds to latent vector, and x corresponds to feature vector); 
([Page 3, Paragraph 2] the datapoint x. A probabilistic decoder, since given a code z it produces a distribution over the possible corresponding values of x. Note: probabilistic decoder corresponds to the generative network).
However, Kingma does not explicitly disclose: c) from amongst a plurality of potential next features to observe, searching for a target feature of the feature vector which maximizes a measure of expected reduction in uncertainty in a distribution of said weights; d) output a request to collect a target data point comprising at least the target feature; e) receive the target data point in response to the request, and f) further train the model based on the received target data point.
Gal teaches, in an analogous system: c) from amongst a plurality of potential next features to observe, searching for a target feature of the feature vector which maximizes a measure of expected reduction in uncertainty in a distribution of said weights ([Page 1, Column 2, Paragraph 1] an acquisition function (often based on the model’s uncertainty) decides which data points. The acquisition function selects one or more points from a pool of unlabelled data points. [Page 3, Column 1, Paragraph 2] The uncertainty in the weights induces prediction uncertainty. [Page 3, Column 1, Section 4] For example, we might look for images with high predictive variance and choose those to ask an expert to label – in the hope that these will decrease model uncertainty. Note: from a pool of unlabelled data points corresponds to from amongst a plurality of potential next features and selects one or more points corresponds to searching for a target feature of the feature vector);
d) output a request to collect a target data point comprising a value of at least the target feature ([Page 1, Column 2, Paragraph 1] data points to ask an external oracle for a label. [Page 3, Column 1, Section 4] we might look for images with high predictive variance and choose those to ask an expert to label);
e) receive the target data point in response to the request, and ([Page 1, Column 2, Paragraph 1] the selected data points, these are added to the training set);
f) further train the model based on the received target data point ([Page 1, Column 2, Paragraph 1] and a new model is trained on the updated training set).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the system of Kingma to incorporate the teachings of Gal to use an acquisition function (often based on the model’s uncertainty) to decide which data points to ask an external oracle for a label. One would have been motivated to do this modification because doing so would give the benefit of often resulting in dramatic reductions in the amount of labelling required to train an ML system (and therefore cost and time) as taught by Gal paragraph [Page 1, Column 2, Paragraph 1].

Claim 3 is rejected under 35 U.S.C. 103 as being unpatentable over Kingma et al (Auto-Encoding Variational Bayes, 2014) in view of Gal et al (Deep Bayesian Active Learning with Image Data, 2017) and further in view of Wu et al (Active learning with label correlation exploration for multi-label image classification, 2017).
Regarding claim 3
The system of Kingma and Gal teaches: The method of claim 1 (as shown above).
However, the system of Kingma and Gal does not explicitly disclose: wherein the request comprises a signal to an automated process requesting the automated process to collect the target data point.

Wu teaches, in an analogous system: wherein the request comprises a signal to an automated process requesting the automated process to collect the target data point ([Page 578, Column 2, Paragraph 2] we introduced self-training [22] into multilabel active learning, incorporating automated annotation into the traditional active learning process. Note: incorporating automated annotation corresponds to a signal to an automated process requesting the automated process to collect the target data point).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the system of Kingma and Gal to incorporate the teachings of Wu to use automated annotation. One would have been motivated to do this modification because doing so would give the benefit of significantly reducing the annotation workload of human experts and outperforming other state-of-the-art approaches as taught by Wu paragraph [Page 578, Column 2, Paragraph 2].

Claim 10 is rejected under 35 U.S.C. 103 as being unpatentable over Kingma et al (Auto-Encoding Variational Bayes, 2014) in view of Gal et al (Deep Bayesian Active Learning with Image Data, 2017) and further in view of Ivanov et al (Universal Conditional Machine, 2018).
Regarding claim 10
The system of Kingma and Gal teaches: The method of claim 9 (as shown above).
 ([Page 1, Column 2, Paragraph 1] This process is then repeated, with the training set).
However, the system of Kingma and Gal does not explicitly disclose: to impute one or more of the feature values of a further data point.
Ivanov teaches, in an analogous system: to impute one or more of the feature values of a further data point ([Page 6, Section 4.2, Paragraph 2] Our model provides more flexible way of feature imputation. It allows to generate a number of different imputations for each object from the distribution on the missed features).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the system of Kingma and Gal to incorporate the teachings of Ivanov to use imputation. One would have been motivated to do this modification because doing so would give the benefit of increasing the quality of classifier or regressor which is built to solve the problem as taught by Ivanov paragraph [Page 6, Section 4.2, Paragraph 7].

Conclusion

Regarding claims 12 and 18, no prior art was found.
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. Gheyas et al (2010) discloses a neural network-based framework for the reconstruction of incomplete data sets.
THIS ACTION IS MADE FINAL.  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  


Any inquiry concerning this communication or earlier communications from the examiner should be directed to CHAITANYA RAMESH JAYAKUMAR whose telephone number is (571)272-3369. The examiner can normally be reached Mon-Fri 7am-1pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Omar Fernandez Rivas, can be reached at (571)272-2589. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like 



/CHAITANYA R JAYAKUMAR/ Examiner, Art Unit 2128
/ABDULLAH AL KAWSAR/ Supervisory Patent Examiner, Art Unit 2127