DETAILED ACTION
This communication is responsive to application 15/822,884 with RCE filed 10/29/2021.
The claim set filed 04/30/2021 is under consideration and includes two independent claims, 1 and 15. Claim status is as follows:
Amended claims: 1-5, 8-10, 15, 17 and 19
Canceled claims: 6-7, 11-14, 16 and 18
Original claims: 20-28

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after allowance or after an Office action under Ex Parte Quayle, 25 USPQ 74, 453 O.G. 213 (Comm'r Pat. 1935). Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, prosecution in this application has been reopened pursuant to 37 CFR 1.114.  Applicant's submission filed on 10/29/2021 has been entered.

Information Disclosure Statement
As required by M.P.E.P. 609(c), the applicant’s submissions of the Information Disclosure Statements dated 10/29/2021 and 11/08/2021 are acknowledged by the examiner and the cited references have been considered in the examination of the claims now pending.

Response to Remarks
This application has been transferred within the Office. The new examiner has reviewed the full case history and considered the additional references presented in both IDS listings. The IDSs present an additional 38 references, and the remarks do not point out the significance of the new art as material to patentability in relation to applicant’s withdrawal of the application from a condition of allowance. Reconsideration of patentability on all statutory grounds is given as follows:

Double Patenting
The nonstatutory double patenting rejection is based on a judicially created doctrine grounded in public policy (a policy reflected in the statute) so as to prevent the unjustified or improper timewise extension of the “right to exclude” granted by a patent and to prevent possible harassment by multiple assignees. A nonstatutory double patenting rejection is appropriate where the conflicting claims are not identical, but at least one examined application claim is not patentably distinct from the reference claim(s) because the examined application claim is either anticipated by, or would have been obvious over, the reference claim(s). See, e.g., In re Berg, 140 F.3d 1428, 46 USPQ2d 1226 (Fed. Cir. 1998); In re Goodman, 11 F.3d 1046, 29 USPQ2d 2010 (Fed. Cir. 1993); In re Longi, 759 F.2d 887, 225 USPQ 645 (Fed. Cir. 1985); In re Van Ornum, 686 F.2d 937, 214 USPQ 761 (CCPA 1982); In re Vogel, 422 F.2d 438, 164 USPQ 619 (CCPA 1970); In re Thorington, 418 F.2d 528, 163 USPQ 644 (CCPA 1969).
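A timely filed terminal disclaimer in compliance with 37 CFR 1.321(c) or 1.321(d) may be used to overcome an actual or provisional rejection based on nonstatutory double patenting provided the reference application or patent either is shown to be commonly owned with the examined application, or claims an invention made as a result of activities undertaken within the scope of a joint research agreement. See MPEP § 717.02 for applications subject to examination under the first inventor to file provisions of the AIA as explained in MPEP § 2159. See MPEP § 2146 et seq. for applications not subject to examination under the first inventor to file provisions of the AIA. A terminal disclaimer must be signed in compliance with 37 CFR 1.321(b).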
The USPTO Internet website contains terminal disclaimer forms which may be used. Please visit www.uspto.gov/patent/patents-forms. The filing date of the application in which the form is filed determines what form (e.g., PTO/SB/25, PTO/SB/26, PTO/AIA/25, or PTO/AIA/26) should be used. A web-based eTerminal Disclaimer may be filled out completely online using web-screens. An eTerminal Disclaimer that meets all requirements is auto-processed and approved immediately upon submission. For more information about eTerminal Disclaimers, refer to www.uspto.gov/patents/process/file/efs/guidance/eTD-info-I.jsp.
All pending claims are provisionally rejected on the ground of nonstatutory double patenting as being unpatentable over claims 1-28 of copending Application No. 17/516,230. Although the claims at issue are not identical, they are not patentably distinct from each other because they both recite substantially identical subject matter. This is a provisional nonstatutory double patenting rejection because the patentably indistinct claims have not in fact been patented. See the following comparison table:
Instant Application: 15/822,884
Copending Application: 17/516,230
Claim 1. 
A method for instantiating a machine learning system for classifying one or more observed elements of an input dataset, the input dataset labelled with one or more observed labels drawn from a plurality of binary labels, the method executed by at least one processor in communication with at least one memory and comprising: 




instantiating an approximating posterior system in the at least one memory, the approximating posterior system modeling an approximating posterior distribution, the approximating posterior distribution approximating a true posterior distribution corresponding to the prior distribution which models label noise as a set of stochastic label flips on the plurality of binary labels, each label flip indicating a belief in the correctness of a corresponding label, the approximating posterior system operable to generate a posterior probability for one or more label flips for the observed element given the one or more observed labels; and 

training at least one of the recognition system and the approximating posterior system based on a training dataset, the prior distribution, and the approximating posterior distribution.




wherein at least one of the prior and posterior distribution comprises a spike, the spike defining a high-probability state where no labels are flipped

A method for instantiating a machine learning system for classifying one or more observed elements of an input dataset, the input dataset labelled with one or more observed labels drawn from a plurality of binary labels, the method executed by at least one processor in communication with at least one memory and comprising: 




instantiating an approximating posterior system in the at least one memory, the approximating posterior system approximating a true posterior distribution corresponding to a prior distribution which models label noise as a set of stochastic label flips on the plurality of binary labels, each label flip indicating a belief in the correctness of a corresponding label, the approximating posterior system operable to generate a posterior probability for one or more label flips for the observed element given the one or more observed labels; and 



training at least one of the recognition system and the approximating posterior distribution based on a training dataset, the prior distribution, and the approximating posterior distribution.

Claim 6. 
wherein the at least one of the prior and posterior distributions comprises a spike, the spike defining a high-probability state.


The method of claim 1 wherein the method further comprises 

instantiating a shared transformation system in the at least one memory, the shared transformation system operable to receive the one or more observed elements and to generate a representation of the observed elements based on one or more shared parameters, wherein the recognition system and the approximating posterior system are operable to receive the representation of the observed elements as input.
Claim 2. 
A method according to claim 1 wherein the method further comprises 

instantiating a shared transformation system in the at least one memory, the shared transformation system operable to receive the one or more observed elements and to generate a representation of the observed elements based on one or more shared parameters, wherein the recognition system and the approximating posterior system are operable to receive the representation of the observed elements as input.
Claim 3. 
The method of claim 2 wherein training at least one of the recognition system and the approximating posterior system comprises 

training the shared transformation system to generate the one or more shared parameters.
Claim 3. 
A method according to claim 2 wherein training at least one of the recognition system and the approximating posterior system comprises 

training the shared transformation system to generate the one or more shared parameters.
Claim 4. 
The method of claim 1 wherein at least one of the recognition system and the approximating posterior system comprises a deep neural network.

The method of claim 1 further comprising 


selecting the prior distribution, and selecting the approximating posterior distribution, wherein at least one of selecting the prior distribution, and selecting the approximating posterior distribution comprises selecting a predetermined distribution.

A method according to claim 1 wherein at least one of the recognition system and the approximating posterior system comprises a deep neural network.

A method according to claim 1 wherein at least one of 

selecting the prior distribution, selecting the recognition model, and selecting the approximating posterior distribution comprises selecting a predetermined distribution or model.

The method of claim 1 wherein the approximating posterior distribution comprises 

a directed graphical model, the directed graphical model operable to generate, given an initial class, a probability of label change for each of one or more remaining classes.
Claim 8. 
A method according to claim 1 wherein the approximating posterior distribution comprises 

a directed graphical model, the directed graphical model operable to generate, given an initial class, a probability of label change for each of one or more remaining classes.
Claim 9. 
The method of claim 1 wherein the method further comprises 

selecting one of the prior distribution and the approximating posterior distribution based on a previous selection of the other one of the prior distribution and the approximating posterior distribution.
Claim 9. 
A method according to claim 1 wherein the method further comprises 

selecting one of the prior distribution and the approximating posterior distribution based on a previous selection of the other one of the prior distribution and the approximating posterior distribution.
Claim 10. 
The method of claim 9 wherein 

the prior distribution comprises a Boltzmann distribution with a spike and the approximating posterior distribution comprises a factorial distribution with a spike.
Claim 10. 
A method according to claim 9 wherein 

the prior distribution comprises a Boltzmann distribution with a spike and the approximating posterior distribution comprises a factorial distribution with a spike.
Claim 15. 
A method for instantiating a machine learning system for classifying one or more observed elements of an input dataset, the input dataset labelled with one or more observed labels, the method executed by at least one processor in communication with at least one memory and comprising: 

instantiating an inference system in the at least one memory, the inference system modelling a joint probability distribution over a plurality of variables, the plurality of variables comprising the observed labels and one or more true labels, the joint probability distribution conditional on the input dataset; 

instantiating an auxiliary system in the at least one memory, the auxiliary system modelling an auxiliary probability distribution over the plurality of variables independently of the input dataset; 



training at least one of the representation system and the inference system 






based on a first modified lower bound defined over at least a noisy subset of the input dataset, the noisy subset comprising input data and associated observed labels, the first modified lower bound comprising an original lower bound based on the joint probability distribution 





and an additional term based on the auxiliary probability distribution.

A method for instantiating a machine learning system for classifying one or more observed elements of an input dataset, the input dataset labelled with one or more observed labels, the method executed by at least one processor in communication with at least one memory and comprising: 

instantiating an inference system in the at least one memory, the inference system modelling a joint probability distribution over a plurality of variables, the plurality of variables comprising the observed labels and one or more true labels, the joint probability distribution conditional on the input dataset; 

instantiating an auxiliary system in the at least one memory, the auxiliary system modelling an auxiliary probability distribution over the plurality of variables independently of the input dataset; 



training at least one of the representation system and the inference system based on a training dataset, the joint distribution, and the auxiliary probability distribution.

Claim 16.
based on a first modified lower bound defined over at least a noisy subset of the input dataset, the noisy subset comprising input data and associated observed labels, the first modified lower bound based at least in part on the auxiliary probability distribution.

Claim 18. 
and an additional term based on the auxiliary probability distribution.

The method of claim 15 wherein training at least one of the representation system and the inference system comprises 

training at least one of the representation system and the inference system based on a modified clean lower bound defined over a clean subset of the input dataset, the clean subset disjoint from the noisy subset and comprising input data, associated observed labels, and associated true labels, the modified clean lower bound based at least in part on the auxiliary probability distribution.
Claim 17. 
The method of claim 16 wherein training at least one of the representation system and the inference system comprises 

training at least one of the representation system and the inference system based on a modified clean lower bound defined over a clean subset of the input dataset, the clean subset disjoint from the noisy subset and comprising input data, associated observed labels, and associated true labels, the modified clean lower bound based at least in part on the auxiliary probability distribution.
Claim 19. 
The method of claim 15 wherein training at least one of the representation system and the inference system comprises 

determining a gradient over the additional term and training at least one of the representation system and the inference system based on the gradient.
Claim 19. 
The method of claim 18 wherein training at least one of the representation system and the inference system comprises 

determining a gradient over the additional term and training at least one of the representation system and the inference system based on the gradient.
Claim 20. 
The method of claim 19 wherein determining the gradient over the additional term comprises 

determining a first gradient over a positive phase of the additional term and a second gradient over a negative phase of the additional term, the first gradient determined analytically and the second gradient determined by approximation.


The method of claim 19 wherein determining the gradient over the additional term comprises 

determining a first gradient over a positive phase of the additional term and a second gradient over a negative phase of the additional term, the first gradient determined analytically and the second gradient determined by approximation.
Claim 21. 
The method of claim 20 wherein training at least one of the representation system and the inference system comprises 

optimization of an objective function by expectation maximization.
Claim 21. 
The method of claim 20 wherein training at least one of the representation system and the inference system comprises 

optimization of an objective function by expectation maximization.
Claim 22. 
The method of claim 15 wherein the auxiliary probability distribution is fixed while training at least one of the representation system and the inference system.
Claim 22. 
The method of claim 15 wherein the auxiliary probability distribution is fixed while training at least one of the representation system and the inference system.
Claim 23. 
The method of claim 22 comprising training the auxiliary probability distribution based on information independent of the training dataset prior to training at least one of the representation system and the inference system.
Claim 23. 
The method of claim 22 comprising training the auxiliary probability distribution based on information independent of the training dataset prior to training at least one of the representation system and the inference system.
Claim 24. 
The method of claim 15 wherein training at least one of the representation system and the inference system comprises 

training at least one of the representation system and the inference system based on an optimization function comprising a first term based on the joint probability distribution and independent of the auxiliary probability distribution and based on a second term based on the auxiliary probability distribution, 

wherein at least one of the first and second terms is scaled by a scaling factor.
Claim 24. 
The method of claim 15 wherein training at least one of the representation system and the inference system comprises 

training at least one of the representation system and the inference system based on an optimization function comprising a first term based on the joint probability distribution and independent of the auxiliary probability distribution and based on a second term based on the auxiliary probability distribution, 

wherein at least one of the first and second terms is scaled by a scaling factor.
Claim 25. 
The method of claim 24 comprising 

setting the scaling factor to a first value for a first iteration of training and setting the scaling factor to a second value for a second iteration of training.
Claim 25. 
The method of claim 24 comprising 

setting the scaling factor to a first value for a first iteration of training and setting the scaling factor to a second value for a second iteration of training.
Claim 26. 
The method of claim 25 wherein the scaling factor is monotonically decreasing during training.
Claim 26. 
The method of claim 25 wherein the scaling factor is monotonically decreasing during training.
Claim 27. 
The method of claim 15 wherein the inference system models the joint probability distribution based on an undirected graphical model, 

the undirected graphical model comprising one or more undirected edges representing one or more 

The method of claim 15 wherein the inference system models the joint probability distribution based on an undirected graphical model, 

the undirected graphical model comprising one or more undirected edges representing one or more 

The method of claim 27 wherein the inference system models the auxiliary probability distribution based on an auxiliary undirected graphical model, 

the auxiliary undirected graphical model comprising an undirected subgraph of the undirected graphical model of the inference system.
Claim 28. 
The method of claim 27 wherein the inference system models the auxiliary probability distribution based on an auxiliary undirected graphical model, 

the auxiliary undirected graphical model comprising an undirected subgraph of the undirected graphical model of the inference system.



Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.

All pending claims are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. In determining whether the claims are subject matter eligible, the examiner applies guidance under MPEP 2106.
Step 1: Is the claim to a process, machine, manufacture, or composition of matter? Yes—all claims fall within one of the four statutory categories as all claims are method/process.
Step 2A, prong one: Does the claim recite an abstract idea, law of nature or natural phenomenon? Yes—the claims, under the broadest reasonable interpretation, recite an abstract idea. In this case, claims fall within the enumerated grouping of abstract idea being “Mathematical Concepts”, but for the recitation of generic computer components. In particular, the claims recite: 
Claim 1:
“instantiating a recognition system… by generating a classification probability for the at least one observed element” (mathematical calculation, i.e. probabilistic); 
“instantiating an approximating posterior system… approximating a true posterior distribution corresponding to the prior distribution which models label noise as a set of stochastic label flips on the plurality of binary labels… generate a posterior probability” (mathematical calculations, i.e. approximating with stochastic probability) 
“training at least one of… based on a training dataset, the prior distribution, and the approximating posterior distribution” (mathematical relationship, i.e. distributional)
Claim 15: 
“instantiating an inference system… modelling a joint probability distribution over a plurality of variables…” (mathematical calculations, mathematical relationships)
“instantiating an auxiliary system… modelling an auxiliary probability distribution over the plurality of variables…” (mathematical calculations, mathematical relationships)
“instantiating a representation system… characterize the one or more interactions between the plurality of variables of the joint distribution” (mathematical relationships)
“training at least one of… based on the joint probability distribution and an additional term based on the auxiliary probability distribution” (mathematical relationships)
Step 2A, prong two: Does the claim recite additional elements that integrate the judicial exception into a practical application? No. The judicial exception is not integrated into a practical application. The additional elements are as follows:
Limitations are performed by “processor in communication with at least one memory”. It is important to note that a general purpose computer that applies a judicial exception, such as an abstract idea, by use of conventional computer functions does not qualify as a particular machine per MPEP 2106.05(b). These elements are recited at a high level of generality such that they amount to no more than mere instructions to apply the exception using a generic computer component per MPEP 2106.05(f).

Limitations further recite “models label noise as a set of stochastic label flips”. This is considered part of the abstract idea, mathematical concepts, as is detailed by the dozens of equations disclosed by the instant specification, which even states that “arbitrary machine learning recognition systems may be incorporated into the architecture” (emphasis on “arbitrary”). The limitation amounts to mere instructions to apply an exception per MPEP 2106.05(f). There is no evidence of improvement to the functioning of a computer, such as MNIST/CIFAR benchmarking, to support any unexpected results.
Further, the different systems do not require different hardware but are rather partitioned functions of a software system where systems may all use the same shared memory, see MPEP 2106.05(b). Limitations further recite that training may comprise “modified lower bound defined over at least a noisy subset”. This is considered part of the abstract idea, mathematical concepts, as the specification details the bound according to equations.
Accordingly, these additional elements do not integrate the abstract idea into a practical application. The claims are directed to the abstract idea of mathematical concepts.
Step 2B: Does the claim recite additional elements that amount to significantly more than the judicial exception? No—the claims do not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to integration of the abstract idea into a practical application, the additional elements of using a processor and memory to 
Moreover, the claims recite the following limitations, which are considered with respect to guidance per MPEP 2106.05(a)-(h).
Limitations of “instantiating… elements of the input” (MPEP 2106.05(g), pre-solution activity)
Limitation of “models label noise as a set of stochastic label flips” (MPEP 2106.05(f), “apply it”)
Limitation of “training” (MPEP 2106.05(d)(b) routine). The term training is widely used as routine in the art, see MPEP 2106.05(d). Supporting evidence of elements being known is indicated by way of Omidshafiei et al., “Hierarchical Bayesian Noise Inference for Robust Real-time Probabilistic Object Classification” at Figs 1, 8-9.
The claims are not patent eligible. This rejection applies to both independent claims 1 and 15 as well as to all pending dependent claims 2-5, 8-10, 17 and 19-28. The dependent claims when analyzed as a whole are held to be patent ineligible under 35 U.S.C. 101 because the additional recited limitation(s) fail(s) to establish that the claim(s) is/are not directed to an abstract idea, as they recite further embellishment of the judicial exception.
Dependent claims 2-3 disclose shared parameters for generative training, which is akin to a type of joint learning or transfer learning. This is considered part of the abstract idea as mathematical manipulation of variables via parameterization. The language concerning “generate a representation” is not clearly defined and therefore does not provide a meaningful limitation that carries technical significance, see MPEP 2106.05(e).
Dependent claim 4 discloses a deep neural network to perform the functionality. Various DNN architectures were deployed extensively throughout the art well before the effective filing date, circa 2016. The limitation amounts to a field of use (deep networks) and is a thin limitation that does no more than “apply it”, see MPEP 2106.05(h)(f).

Dependent claim 8 discloses a directed graph for multiclass label flip probabilities. The directed graph is a technological environment for calculating probabilities, see MPEP 2106.05(h). Evidence of such elements being known is per Bornschein et al., “Bidirectional Helmholtz Machines”.
Dependent claim 9 discloses prior/posterior selection based on selection of the other prior/posterior. This is simply a mixture ratio as the second selection is inherent from the first selection. Accordingly, the claim does not provide a meaningful limitation, see MPEP 2106.05(e).
Dependent claim 10 discloses a Boltzmann distribution and factorial with spike. The limitation is considered part of the abstract idea, mathematical concepts as it pertains to decomposing probabilistic distributions.
Dependent claim 17 discloses training with bounding on data subsets clean/noisy using auxiliary probabilistic distribution. The term auxiliary here simply refers to a software partition, hence the specification referring to “partition function”, i.e., not auxiliary in the sense of hardware. Regardless, the language of aux and/or clean is non-functional descriptive matter. The operational limitation is setting limits on bounding condition which is math, the judicial exception.
Dependent claims 19-21 disclose gradient determination with first/second gradients and a function for expectation maximization. Expectation maximization is log-likelihood and gradient is differentiable. The limitation makes abundantly clear that claims are mathematical calculation.
Dependent claims 22-23 disclose training while holding the auxiliary distribution fixed and independent of other training data. This is akin to freezing a parameter; however, there is no indication of temporal resolution or any reference to the time for which certain parts of training are fixed. There is no indication that this improves the functioning of a computer and the functionality is considered an extra-solutionary 
Dependent claims 24-26 disclose a scaling factor for iterative values monotonically decreasing during the training of combined probability distributions and optimization function. This is considered part of the abstract idea, mathematical concepts.
Dependent claims 27-28 disclose undirected graphical model with edges and subgraph for modeling variables of the joint distribution. An undirected graphical model is simply a restricted Boltzmann machine by another name. The class of models amounts to a field of use or technological environment to apply the judicial exception, see MPEP 2106.05(h)(f). For evidentiary support of such elements being known, see Wang et al., “Paired Restricted Boltzmann Machine for Linked Data”.
Taken alone, their additional elements do not amount to significantly more than the above-identified judicial exception (the abstract idea). Looking at the limitations as an ordered combination adds nothing that is not already present when looking at the elements taken individually. There is no indication that the combination of elements improves the functioning of a computer or improves any other technology. Their collective functions merely provide conventional computer implementation.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
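A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.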

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claims 1-5 and 9-10 are rejected under 35 U.S.C. 103 as being unpatentable over: 
Ororbia et al., “Online Semi-Supervised Learning with Deep Hybrid Boltzmann Machines and Denoising Autoencoders” (ICLR, arXiv: 1511.06964v7) hereinafter Ororbia, in view of 
Awasthi et al., “Efficient Learning of Linear Separators under Bounded Noise” (CMU, arXiv: 1503.03594v1) hereinafter Awasthi.
With respect to claim 1, Ororbia teaches: 
A method for instantiating a machine learning system for classifying one or more observed elements of an input dataset, the input dataset labelled with one or more observed labels drawn from a plurality of binary labels, the method executed by at least one processor in communication with at least one memory and {Ororbia discloses semi-supervised joint learning with deep hybrid Boltzmann and denoising autoencoders, see [Sect.3] and Alg.1 labeled/unlabeled input. Samples are maintained [P.7 Sect3.3.3] suggesting computer hardware to process the algorithmic stochastic calculations} comprising: 
instantiating a recognition system in the at least one memory, the recognition system operable to classify elements of the input dataset by generating a classification probability for the at least one observed element {Ororbia [P.5 Sect3.1] “recognition network” Eq.(10) objective function with KL-divergence is probabilistic, where recognition is a classification task and instantiating is initialization of weights/parameters, see [P.8] Alg.1. The technique is generative throughout}; 
instantiating an approximating posterior system in the at least one memory {Ororbia discloses posteriors being mean-field (QMF) of labeled and unlabeled together with recognition (Qrec), [P.8] Alg.1 per bold function “posteriors Qreclab… posteriors QMFlab and QMFunlab”. See also [P.3 Sect3.1 ¶3] “model the joint distribution”}; 
training at least one of the recognition system and the approximating posterior distribution based on a training dataset, the prior distribution, and the approximating posterior distribution {Ororbia [P.6-7 Sect3.3-3.3.1] discloses training/learning as a joint/hybrid optimization comprising the recognition network (DHBM) and an autoencoder (DHDA). The distribution of data trained on is with respect to labeled/unlabeled data (semi-supervised) with conditional probabilities, see appendix C.2 [P.16] “training set” regarding label/unlabeled data as used in Algorithm 1, Fig 1 “bi-directional”}.
However, Ororbia does not appear to disclose “label flips”. 
Awasthi teaches:
the approximating posterior system approximating a true posterior distribution corresponding to a prior distribution which models label noise as a set of stochastic label flips on the plurality of binary labels, each label flip indicating a belief in the correctness of a corresponding label, the approximating posterior system operable to generate a posterior probability for one or more label flips for the observed element given the one or more observed labels {Awasthi teaches Massart noise bounding conditions for learning half-spaces where probability is calculated over label flips, see [P.4 ¶2]. Conditional label probabilities are evaluated with respect to parameter β and Bayesian classification evaluates any continuous distribution with a pdf over binary {0,1}, see [P.9 Thrm.2], [P.6 Last¶]. Further, w* which is “cleaned” by minimizing error/loss, see [P.7 ¶1]. Finally, the functionality describes noise rate for every example controlled by an adversary [P.2 ¶2-3]};
	Both Ororbia and Awasthi are directed to modeling distributions with noise thus being analogous. A person having ordinary skill in the art would have considered it obvious prior to the effective filing date to utilize the Massart noise for label flips disclosed by Awasthi in combination with joint learning of Ororbia for the motivation of providing stronger guarantees under noise conditions for probabilistic label flips with noise rate, noting “it is well known that under this model, we can get faster rates… achieving arbitrarily small excess error for learning linear separators” (Awasthi [P.2 ¶2]).

With respect to claim 2, the combination of Ororbia and Awasthi teaches the method of claim 1 wherein the method further comprises 
instantiating a shared transformation system in the at least one memory, the shared transformation system operable to receive the one or more observed elements and to generate a representation of the observed elements based on one or more shared parameters, wherein the recognition system and the approximating posterior system are operable to receive the representation of the observed elements as input {Ororbia [P.3] Fig 1 illustrates “fully bi-directional” where weights are calculated both bottom-up and top-down with SBEN “ensemble at inference time”. The optimization is utilized for an architecture of “joint” learning, i.e. shared [P.6]. The received input continually updates per “evolutionary nature of the input distribution” [P.10 ¶6]}.

With respect to claim 3, the combination of Ororbia and Awasthi teaches the method of claim 2 wherein 
	training at least one of the recognition system and the approximating posterior system comprises training the shared transformation system to generate the one or more shared parameters {Ororbia [P.6-7 Sect3.3-3.3.1] training being the joint parameterization, joint hybrid framework }.

With respect to claim 4, the combination of Ororbia and Awasthi teaches the method of claim 1 wherein 
	at least one of the recognition system and the approximating posterior system comprises a deep neural network {Ororbia Figs 1-2 deep hybrid generative models, e.g. Boltzmann or Autoencoder}.

With respect to claim 5, the combination of Ororbia and Awasthi teaches the method of claim 1 further comprising 
	selecting the prior distribution, and selecting the approximating posterior distribution, wherein at least one of selecting the prior distribution, and selecting the approximating posterior distribution comprises selecting a predetermined distribution {Ororbia [P.15 Sect.B] discloses “we use Gibbs sampling” is selecting further detailed over distributions per [P.16] Alg.4. Additionally, [P.16 ¶2] “Model selection was performed” with the distribution belonging to selected model}.

Claims 6-7 (Canceled).

With respect to claim 9, the combination of Ororbia and Awasthi teaches the method of claim 1 wherein the method further comprises 
selecting one of the prior distribution and the approximating posterior distribution based on a previous selection of the other one of the prior distribution and the approximating posterior distribution {Ororbia discloses “mixing” with respect to Gibbs sampling during MCMC learning and/or mini-batching, both of which are selecting the respective distribution. Mixing ratio specifies selection of one based on the other. See [P.7 Last¶], [P.9 Last¶], [P.15 Sect.B ¶1]}.

With respect to claim 10, the combination of Ororbia and Awasthi teaches the method of claim 9 wherein 
	the prior distribution comprises a Boltzmann distribution with a spike and the approximating posterior distribution comprises a factorial distribution with a spike {Ororbia [P.8 Alg.1] “approximate factorial posteriors” of the specified distributions and with Boltzmann replete, hence Title. A spike is statewise inference per [P.15 Sect.B ¶1] “each time we make a call to update the hybrid model’s parameters, we sample a new state xt+1” and with stochastic maximum calculated [P.16 Alg.4]. See also [P.7 Sect3.3.2] “running the mean-field equations for a single step”}.

Claims 11-14 (Canceled).

Claim 8 is rejected under 35 U.S.C. 103 as being unpatentable over Ororbia and Awasthi in view of 
Omidshafiei et al., “Hierarchical Bayesian Noise Inference for Robust Real-time Probabilistic Object Classification” (arXiv: 1605.01042v2) hereinafter Omidshafiei.
With respect to claim 8, the combination of Ororbia and Awasthi teaches the method of claim 1. Omidshafiei teaches wherein 
	the approximating posterior distribution comprises a directed graphical model, the directed graphical model operable to generate, given an initial class, a probability of label change for each of one or more remaining classes {Omidshafiei Fig 3(b) illustrates directed graphical model with per-class noise parameters. The modeling is a hierarchical multi-class/label classification [P.6-7 SectV.A] and a posterior is calculated probabilistically for each class, see e.g. [P.4-5] Eqs. (10-11), (21-22), [P.3 Sect.III].}.
	Omidshafiei is directed to predictive modeling with noisy labels thus being analogous. A person having ordinary skill in the art would have considered it obvious prior to the effective filing date to utilize the hierarchical multi-class classifier of Omidshafiei in combination with Ororbia for the motivation that “… Ii must be inferred simultaneously modeling the noise distribution associated with that class” (Omidshafiei [P.3 Sect.III ¶1]), and with further benefit being “second advantage is that is simplifies the posterior probability calculations used within Markov chain Monte Carlo (MCMC)” (Omidshafiei [P.4 ¶1]) where MCMC is similarly utilized per Ororbia [P.15 Sect.B ¶1].

Claims 15, 19 and 22-25 are rejected under 35 U.S.C. 103 as being unpatentable over Ororbia in view of 
Achille et Soatto, “Information Dropout: Learning Optimal Representations Through Noise” (ICLR, arXiv: 1611.01353v1), hereinafter Achille.
With respect to claim 15, Ororbia teaches: 
A method for instantiating a machine learning system for classifying one or more observed elements of an input dataset, the input dataset labelled with one or more observed labels, the method executed by at least one processor in communication with at least one memory and {Ororbia discloses semi-supervised joint learning with deep hybrid Boltzmann and denoising autoencoders, see [Sect.3] and Alg.1 labeled/unlabeled input. Samples are maintained [P.7 Sect3.3.3] suggesting computer hardware to process the algorithmic stochastic calculations} comprising: 
instantiating an inference system in the at least one memory, the inference system modelling a joint probability distribution over a plurality of variables, the plurality of variables comprising the observed labels and one or more true labels, the joint probability distribution conditional on the input dataset {Ororbia [P.3 ¶4] “model the joint distribution” where joint learning over variable/parameters is detailed [P.6-7 Sect3.3-3.3.1] and with dataset being semi-supervised (labeled/unlabeled) utilizing conditional distributions introduced e.g., [P.2 ¶3] “conditional p(y|x)”. Instantiating is initialization of parameter/weights, see [P.8 Alg.1] “initial model parameters”}; 
instantiating an auxiliary system in the at least one memory, the auxiliary system modelling an auxiliary probability distribution over the plurality of variables independently of the input dataset {Ororbia [P.4] “auxiliary network” described “we propose augmenting the DHBM architecture with a co-model, or separate auxiliary network… effectively fused with the deep architecture of interest” which suggests independent modeling for fusing architectures, and where data being “probability distribution” is noted per same page}; 
instantiating a representation system in the at least one memory, the representation system operable to characterize the one or more interactions between the plurality of variables of the joint distribution {Ororbia [P.6] Fig 2 illustrates system interaction between variables, e.g. µ, v, and ey. [P.6] “all layers of the hybrid architecture are globally coordinated during learning” and further suggests a normalization/regularization which is a characterization of variable interaction [P.9 ¶2], [P.16 ¶1]}; 
However, Ororbia does not teach a “modified lower bound”.
Achille teaches: 
training at least one of the representation system and the inference system based on a first modified lower bound defined over at least a noisy subset of the input dataset, the noisy subset comprising input data and associated observed labels, the first modified lower bound comprising an original lower bound based on the joint probability distribution and an additional term based on the auxiliary probability distribution {Achille [P.6 Sect.5] equation details training a VAE with loss being a log-likelihood with KL for “minimizing the negative variational lower-bound” emphasis lower-bound. The conditional probabilistic distributions are generalized through use of variable z being latent. The functionality addresses noise vis-à-vis information dropout, hence Title. A simplified version of the formula is noted per [P.2 Sect.2] as true posterior from prior. See also [P.4 Eq.2]}.
	Achille is directed to predictive modeling of distributions with noise thus being analogous. A person having ordinary skill in the art would have considered it obvious prior to the effective filing date to utilize the variational lower-bound of Achille in combination with the variational autoencoder of Ororbia’s joint training optimization as a substitution of loss functions for the same class of model among known techniques to yield predictable results and/or for the motivation of bounding where “we allow the parameters of the noise to change on a per-sample basis” (Achille [P.2 Last¶], [P.4 ¶2]).

Claim 16 (Canceled).
Claim 18 (Canceled).

With respect to claim 19, the combination of Ororbia and Achille teaches the method of claim 15 wherein 
	training at least one of the representation system and the inference system comprises determining a gradient over the additional term and training at least one of the representation system and the inference system based on the gradient {Achille describes optimization of the terms whereby [P.6 ¶1] “loss can be optimized easily using stochastic gradient descent… back-propagate the gradient”}.

With respect to claim 22, the combination of Ororbia and Achille teaches the method of claim 15 wherein 
	the auxiliary probability distribution is fixed while training at least one of the representation system and the inference system {Achille [P.4 Sect.4 Last¶] discloses where “fix this noise distribution… fix a prior distribution”. See also Ororbia [P.5 ¶2] fixed}.

With respect to claim 23, the combination of Ororbia and Achille teaches the method of claim 22 wherein 
	training the auxiliary probability distribution based on information independent of the training dataset prior to training at least one of the representation system and the inference system {Ororbia [P.7 Sect3.3.1 ¶2] “pre-training” and/or [P.10 ¶1] “self-training” is training a dataset prior to training the remaining distribution}.

With respect to claim 24, the combination of Ororbia and Achille teaches the method of claim 15 wherein 
	training at least one of the representation system and the inference system comprises training at least one of the representation system and the inference system based on an optimization function comprising a first term based on the joint probability distribution and independent of the auxiliary probability distribution and based on a second term based on the auxiliary probability distribution, wherein at least one of the first and second terms is scaled by a scaling factor {Achille scaling factor is [P.2 Last¶] “we allow a scaling constant β in front of the KL-divergence term, which can be changed freely” where β is training via loss optimization for dropout per at least [P.4 Eq.2] or [P.6 Eq.2] and which is considered with respect to a second term}.

With respect to claim 25, the combination of Ororbia and Achille teaches the method of claim 24 comprising 
	setting the scaling factor to a first value for a first iteration of training and setting the scaling factor to a second value for a second iteration of training {Achille [P.4 ¶1] “iterate the process to obtain incrementally improved representations” by [P.2 Last¶] “rescaling… choosing a different scale for the KL-divergence term can indeed lead to improvements” again at [P.7 Last¶] “suitably chosen scaling factor”, [P.6 Sect.6 ¶1]}.

Claim 17 is rejected under 35 U.S.C. 103 as being unpatentable over Ororbia and Achille in view of Awasthi. 
With respect to claim 17, the combination of Ororbia and Achille teaches the method of claim 15 wherein 
	training at least one of the representation system and the inference system comprises training at least one of the representation system and the inference system based on a modified clean lower bound {Achille teaches training with lower bound as previously addressed, [P.6 Sect.5]} 
	However, the combination of Ororbia and Achille does not detail “clean subset disjoint from the noisy subset”.
	Awasthi teaches:
defined over a clean subset of the input dataset, the clean subset disjoint from the noisy subset and comprising input data, associated observed labels, and associated true labels, the modified clean lower bound based at least in part on the auxiliary probability distribution {Awasthi per [P.4 ¶2] “we refer to D’ as the ‘noisy’ distribution and to distribution D over instances (x,sign(w*·x)) as the ‘clean’ distribution” are independent/disjoint subsets for respective learned half-spaces whereupon [P.11 ¶3-4] details “our lower bound… Theorem 3” calculates hinge-loss over respective class of halfspace}.
	One having ordinary skill in the art would have considered it obvious prior to the effective filing date to modify a modified lower bound such as that of Achille as being limited to the respective clean halfspace of Awasthi for the motivation of calculating excess classifier error over specified samples (Awasthi [P.11 ¶4 Thrm.3]).

Claims 20-21 are rejected under 35 U.S.C. 103 as being unpatentable over Ororbia and Achille in view of: 
Xu et Ou “Joint Stochastic Approximation Learning of Helmholtz Machines” (ICLR, Tsinghua, arXiv: 1603.06170v1) hereinafter Xu.
With respect to claim 20, the combination of Ororbia and Achille teaches the method of claim 19. Xu teaches wherein 
	determining the gradient over the additional term comprises determining a first gradient over a positive phase of the additional term and a second gradient over a negative phase of the additional term, the first gradient determined analytically and the second gradient determined by approximation {Xu [P.2 ¶3] “The key is to formulate two gradients as expectations” where [P.4 ¶3] “estimated gradients for θ and φ” are wrt corresponding simultaneous partial derivatives of [P.3] Eqs. (2)-(3) for joint stochastic approximation having auxiliary inference model “pairing a generative model pθ(x,h) with an auxiliary inference model qφ(h|x)”}.
	Xu is directed to predictive modeling with bounded distribution thus being analogous. A person having ordinary skill in the art would have considered it obvious prior to the effective filing date to utilize the first and second gradient optimization of Xu in combination with Ororbia and Achille for the motivation of “optimizing the marginal log-likelihood and simultaneously minimizing the inclusive KL-divergence” (Xu [P.3 ¶1]).

With respect to claim 21, the combination of Ororbia, Achille and Xu teaches the method of claim 20 wherein 
	training at least one of the representation system and the inference system comprises optimization of an objective function by expectation maximization {Xu [P.3] Eq.(2) symbol E is expectation and described as expectation values in relation to root function Eq.(1) [P.2 Sect2.1]}. 
However, Xu discloses minimization. 
Ororbia discloses expectation maximization per at least [P.16 Alg.4] “stochastic maximum likelihood” where symbol ∇ denotes gradient(s). One of ordinary skill in the art would have considered it .

Claim 26 is rejected under 35 U.S.C. 103 as being unpatentable over Ororbia and Achille in view of: 
Korenkevych et al., “Benchmarking Quantum Hardware for Training of Fully Visible Boltzmann Machines” (arXiv: 1611.04528v1) hereinafter Korenkevych.
With respect to claim 26, the combination of Ororbia and Achille teaches the method of claim 25. Korenkevych teaches wherein 
	the scaling factor is monotonically decreasing during training {Korenkevych [P.4 ¶2] “The time-dependent weightings A/B are monotonically decreasing”. See training iteratively Figs 9 and 13 at [P.19] with footnote “scaling factor”}.
	Korenkevych is directed to model training over distributions thus being analogous. A person having ordinary skill in the art would have considered it obvious prior to the effective filing date to specify the scaling factor of Achille as weighted monotonic decrease per Korenkevych as applying known techniques to known methods to yield predictable results and/or in order to further “assess the accuracy of the learned models” (Korenkevych [P.12 ¶1]) so as to achieve “better fit” [P.19 ¶1].

Claims 27-28 are rejected under 35 U.S.C. 103 as being unpatentable over Ororbia and Achille in view of: 
Benedetti et al., “Quantum-assisted learning of graphical models with arbitrary pairwise connectivity” (NASA-Ames, arXiv: 1609.02542v1) hereinafter Benedetti.
With respect to claim 27, the combination of Ororbia and Achille teaches the method of claim 15 wherein 
	the inference system models the joint probability distribution based on an undirected graphical model {Ororbia [P.3 Sect3.1 ¶2-3] “restricted Boltzmann machine… joint distribution” where restricted Boltzmann machine is an undirected graphical model, Fig 1, [P.2 Sect.2 ¶3]}, 
	However, Ororbia does not expressly disclose “edges”.
	Benedetti teaches: 
the undirected graphical model comprising one or more undirected edges representing one or more interactions between the plurality of variables of the joint distribution {Benedetti [P.2 Sect.II ¶2] “interaction graph G = (V,E) where V and E are the set of vertices and edges” described for Boltzmann probability distribution. Further, [P.5 Sect.V.A ¶1] “learn the joint probability distribution” as illustrated Fig 6 chimera graph with edges}.
	Benedetti is directed to model estimations over probabilistic distributions thus being analogous. A person having ordinary skill in the art would have considered it obvious prior to the effective filing date to utilize the graphing of Benedetti for the Boltzmann distributions of Ororbia for the motivation of using a powerful tool for interacting in which one may “model the joint probability distributions of all the variables of interest” (Benedetti [P.1 Sect.1 ¶1]).

With respect to claim 28, the combination of Ororbia, Achille and Benedetti teaches the method of claim 27 wherein 
	the inference system models the auxiliary probability distribution based on an auxiliary undirected graphical model, the auxiliary undirected graphical model comprising an undirected subgraph of the undirected graphical model of the inference system {Benedetti discloses [P.3 ¶3,5] “… i of the original probabilistic model we associate a subgraph… replicate the state of each logical variable si inside the corresponding subgraph” of Fig 6 chimera graph}.

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Chase P Hinckley whose telephone number is (571)272-7935. The examiner can normally be reached M-F 9:00 - 5:00.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Miranda M. Huang can be reached on 571-270-7092. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/CHASE P. HINCKLEY/Examiner, Art Unit 2124
/LUIS A SITIRICHE/Primary Examiner, Art Unit 2126