DETAILED ACTION
This action is in response to claims filed 11 December, 2017 for application 15/838000 filed 11 December, 2017. Currently claims 1-20 are pending.
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Information Disclosure Statement
The listing of references in the specification is not a proper information disclosure statement.  37 CFR 1.98(b) requires a list of all patents, publications, or other information submitted for consideration by the Office, and MPEP § 609.04(a) states, "the list may not be incorporated into the specification but must be submitted in a separate paper."  Therefore, unless the references have been cited by the examiner on form PTO-892, they have not been considered.
The information disclosure statement lists over 1,000 documents without any indication of relevance. The IDS has been considered insofar as it is acknowledged as being present.
Specification
The disclosure is objected to because it contains an embedded hyperlink and/or other form of browser-executable code. Applicant is required to delete the embedded hyperlink and/or other form of browser-executable code; references to websites should be limited to the top-level domain name without any prefix such as http:// or other browser-executable code. See MPEP § 608.01.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary.  Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly 
Claims 1-3, 8, 10-13, 18 and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Lu et al. (Deep Model Based Domain Adaptation for Fault Diagnosis) in view of Kwok (Moderating the Outputs of Support Vector Machine Classifiers) and Si et al. (Bregman Divergence-Based Regularization for Transfer Subspace Learning).

Regarding claims 1, 11 and 20, Lu discloses: A method of modelling data, comprising: 
training an objective function of a … classifier, based on a set of labeled data, to derive a set of classifier weights (“Therefore, our DAFD method is designed to only utilize the labeled samples to obtain the transferable features extraction method, which can also be applied to the unlabeled samples. The final DAFD objective function consists of the following three parts:
1) the basic loss term for building DNN, the form is just the same as (1);
2) the MMD term for reducing the discrepancy between distributions of Ds and Dt ;
3) the weights regularization term for reinforcing the representative features of the original data.” P2299 §IV.A ¶1, see also fig 1, note: the DNN is the first classifier)
approximating a marginalized loss function for an autoencoder (“In general, domain data are composed of data space X and a marginal probability distribution P(X), e.g., {X, P(X)}, where X ∈ X. If source domain Ds and target domain Dt are different, they have different data spaces and marginal distributions, that is Xs ≠ Xt and P(Xs) ≠ P(Xt).” P2298 §III.A.2, “DNN is a kind of artificial neural network, which is constructed with a desirable complex architecture using the deep learning technique [7]. The core idea is to train one layer with an unsupervised representation learning algorithm at a time, and then use the output of this layer as the input for the next layer. This process can be executed iteratively until a desirable structure approached. In this paper, the autoencoder is chosen as the basic single-layer representation model for stacking the deep architecture [28]” P2298 §III.B, “(MMD) is a criterion to estimate the discrepancy between distributions [29]. Compared to many parametric criteria, such as Kullback–Leibler divergence, MMD can estimate a nonparametric distance between various distributions and avoid to calculate the intermediate density, which is always a nontrivial task.” P2298 §III.C); and,
automatically classifying unlabeled data using a compact classifier according to the marginalized loss function (“In this paper, the goal of domain adaptation learning is to learn a transformation function F only with labeled samples from Ds and Dt, which satisfies P(F(Xs)) = P(F(Xt)) and P(Ys|F(Xs)) = P(Yt|F(Xt)). Then, the prediction function built on Ds could be used to classify the unlabeled samples of Dt.” P2298 §III.A.4, see also fig 1).
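
For illustration of the MMD criterion quoted above, the following is a minimal sketch and is not taken from Lu. It assumes a linear kernel, under which the squared MMD reduces to the squared distance between the two sample means; the function name `linear_mmd2` and the synthetic source and target samples are hypothetical.

```python
import numpy as np


def linear_mmd2(xs: np.ndarray, xt: np.ndarray) -> float:
    """Squared MMD under a linear kernel (an assumption of this sketch).

    With a linear kernel the estimate is the squared Euclidean distance
    between the mean of the source samples and the mean of the target samples.
    """
    diff = xs.mean(axis=0) - xt.mean(axis=0)
    return float(diff @ diff)


# Hypothetical data: rows are samples, columns are features.
rng = np.random.default_rng(0)
source = rng.normal(loc=0.0, scale=1.0, size=(200, 5))
target = rng.normal(loc=0.5, scale=1.0, size=(200, 5))

print(linear_mmd2(source, source))  # same sample on both sides
print(linear_mmd2(source, target))  # shifted target distribution
```

A larger value indicates a larger discrepancy between the two sample distributions, which is the quantity a term of this kind penalizes during training.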

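However, Lu does not explicitly disclose: a linear classifier; defining a posterior probability distribution on the set of classifier weights of the linear classifier; or approximating the marginalized loss function as a Bregman divergence, based on the posterior probability distribution on the set of classifier weights learned from the linear classifier.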

Kwok teaches: linear classifier (“In this paper, we extend the use of moderated outputs to the support vector machine (SVM) by making use of a relationship between SVM and the evidence framework. The moderated output is more in line with the Bayesian idea that the posterior weight distribution should be taken into account upon prediction, and it also alleviates the usual tendency of assigning overly high confidence to the estimated class memberships of the test patterns.” abstract);
defining a posterior probability distribution on the set of classifier weights of the linear classifier (“Despite its many successes, SVM relies on only one weight solution (indirectly represented by the set of Lagrangian multipliers) in making predictions. However, from a Bayesian perspective, the weights of any machine, even after learning, still take a certain posterior distribution. Using just one weight solution as the sole representative thus neglects posterior uncertainty in the weights. This often leads to more extreme predicted outputs during testing, and in turn indicates an overly high confidence that the pattern belongs to a particular class. Under the Bayesian framework, the proper way to handle these weight parameters is by marginalization, which involves integrating them out from the conditional distribution. MacKay called the resultant marginalized output the moderated output, and this has been shown to be better in the context of neural networks [19].” p1018 §1 ¶2);
based on the posterior probability distribution on the set of classifier weights learned from the linear classifier (“Despite its many successes, SVM relies on only one weight solution (indirectly represented by the set of Lagrangian multipliers) in making predictions. However, from a Bayesian perspective, the weights of any machine, even after learning, still take a certain posterior distribution. Using just one weight solution as the sole representative thus neglects posterior uncertainty in the weights. This often leads to more extreme predicted outputs during testing, and in turn indicates an overly high confidence that the pattern belongs to a particular class. Under the Bayesian framework, the proper way to handle these weight parameters is by marginalization, which involves integrating them out from the conditional distribution. MacKay called the resultant marginalized output the moderated output, and this has been shown to be better in the context of neural networks [19].” p1018 §1 ¶2).
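
The marginalization described in the quoted passage can be sketched as follows. This is an illustration only: it uses MacKay's closed-form approximation for a logistic output with a Gaussian posterior over the activation, not Kwok's SVM-specific derivation, and the names `moderated_output` and `monte_carlo_output` and the numeric values are hypothetical.

```python
import numpy as np


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


def moderated_output(a_mp: float, s2: float) -> float:
    """MacKay's approximation to the marginalized (moderated) output.

    a_mp is the activation at the most probable weights and s2 is the
    variance of the activation induced by the posterior over the weights.
    """
    kappa = 1.0 / np.sqrt(1.0 + np.pi * s2 / 8.0)
    return float(sigmoid(kappa * a_mp))


def monte_carlo_output(a_mp: float, s2: float, n: int = 100_000) -> float:
    """Direct marginalization by sampling activations from N(a_mp, s2)."""
    rng = np.random.default_rng(0)
    samples = rng.normal(a_mp, np.sqrt(s2), size=n)
    return float(sigmoid(samples).mean())


print(sigmoid(2.0))                 # single weight solution
print(moderated_output(2.0, 4.0))   # pulled toward 0.5 by weight uncertainty
print(monte_carlo_output(2.0, 4.0))
```

The moderated value lies closer to 0.5 than the single-solution value, which corresponds to the less extreme confidence the quoted passage attributes to marginalization.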

Lu and Kwok are both in the same field of endeavor of classifiers and are analogous. Lu teaches a system comprising a classifier with weights and an autoencoder that result in a compact trained classifier. Kwok teaches a linear classifier with a posterior probability distribution on weights and marginalizing data for a neural network. It would have been obvious to one of ordinary skill in the art before the effective filing date to substitute the DNN of Lu with the linear SVM classifier as taught by Kwok to have an autoencoder and linear classifier based training for a 

Si teaches: as a Bregman Divergence (“For practical scenarios with the cross-domain setting, these aforementioned subspace learning algorithms perform poorly because of the violation of the sample i.i.d. assumption. As a consequence, it is urgent to reconsider subspace learning algorithms by taking the distribution difference between the training and the testing samples into account. In this paper, we propose a novel subspace learning framework by developing a Bregman divergence-based regularization, which can be integrated into the existing subspace learning algorithms, e.g., FLDA.” Si P930 ¶3).

Lu, Kwok and Si are all in the same field of endeavor of classifiers and are analogous. Lu teaches a system comprising a classifier with weights and an autoencoder that result in a compact trained classifier. Kwok teaches a linear classifier with a posterior probability distribution on weights and marginalizing data for a neural network. Si teaches the integration of the Bregman divergence into learning algorithms. It would have been obvious to one of ordinary skill in the art before the effective filing date to combine the classifier of Lu and Kwok with the Bregman divergence regularization to yield regularized classifiers using the Bregman divergence measure. One would have been motivated as the Bregman divergence improves results when training and testing data are not independent and identically distributed (Si p930 ¶3).
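
For reference, the standard textbook definition of a Bregman divergence is reproduced below; it is not quoted from Si. The symbol $\phi$ denotes an assumed strictly convex, differentiable generating function.

```latex
D_{\phi}(p, q) = \phi(p) - \phi(q) - \langle \nabla \phi(q),\, p - q \rangle

% Two common instances:
% \phi(x) = \lVert x \rVert_2^2
%   gives  D_{\phi}(p, q) = \lVert p - q \rVert_2^2
% \phi(p) = \sum_i p_i \log p_i
%   gives  D_{\phi}(p, q) = \sum_i p_i \log \frac{p_i}{q_i} - \sum_i p_i + \sum_i q_i
```

The second instance reduces to the Kullback–Leibler divergence when $p$ and $q$ each sum to one.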
Regarding claims 2 and 12, Lu does not explicitly disclose: The method according to claim 1, wherein the data comprises data selected from the group consisting of semantic data, text data, and text documents.

Kwok teaches: wherein the data comprises data selected from the group consisting of semantic data, text data, and text documents (“Finally, notice that our results also concur with others [8], [22] that the SVM is very suitable for text categorization. The micro-averaged break-even point obtained here is 85%, which is among the best known results in this collection. Favorable results using the SVM have also been reported on the older Reuters-22 173 collection [9].” P1030 ¶3).

Lu, Kwok and Si are all in the same field of endeavor of classifiers and are analogous. Lu teaches a system comprising a classifier with weights and an autoencoder that result in a compact trained classifier. Kwok teaches a linear classifier with a posterior probability distribution on weights and marginalizing data for a neural network that implements text categorization. It would have been obvious to one of ordinary skill in the art before the effective filing date to modify the trained classifier of Lu, Kwok and Si to be able to classify text as taught by Kwok to yield predictable results. One would be motivated to do so as linear classifiers (SVMs) are proficient at text categorization (Kwok P1030 ¶3).

Regarding claims 3 and 13, Lu discloses: The method according to claim 1, wherein the autoencoder comprises a neural network, wherein said training comprises training the neural network (“DNN is a kind of artificial neural network, which is constructed with a desirable complex architecture using the deep learning technique [7]. The core idea is to train one layer with an unsupervised representation learning algorithm at a time, and then use the output of this layer as the input for the next layer. This process can be executed iteratively until a desirable structure approached. In this paper, the autoencoder is chosen as the basic single-layer representation model for stacking the deep architecture [28]” p2298 §III.B).

Regarding claims 8 and 18, Lu does not explicitly disclose: The method according to claim 1, wherein the Bregman divergence is determined assuming that all data samples induce a loss.

Si teaches: wherein the Bregman divergence is determined assuming that all data samples induce a loss (“The objective function F(W) is designed for specific applications, e.g., it minimizes the data classification loss in the selected subspace according to different assumptions or intuitions. For example, FLDA selects a subspace, where the trace ratio of the within-class scatter matrix and the between-class scatter matrix is minimized.” Si P931 ¶1).



Regarding claim 10, Lu does not explicitly disclose: The method according to claim 1, wherein the posterior probability distribution on the set of classifier weights is estimated using with a Markov chain Monte Carlo method.

Kwok teaches: wherein the posterior probability distribution on the set of classifier weights is estimated using with a Markov chain Monte Carlo method (“In this paper, we have focused on classification problems. Extension to regression problems for the calculation of error bars is also straight-forward. Potentially, the ability to transform SVM output to posterior class probability estimate can yield a lot more benefits [20], such as compensating for different prior probabilities and using an ensemble of networks. These will be investigated in the future. Finally, as this work is based on the evidence framework, variability in the hyperparameter [i.e., the regularization parameter in (5)] is ignored. This issue will be studied and the application of other Bayesian techniques, like Markov Chain Monte Carlo methods [44] and the Gaussian process [45], will also be considered.” P1030 §VI ¶2).

Lu, Kwok and Si are all in the same field of endeavor of classifiers and are analogous. Lu teaches a system comprising a classifier with weights and an autoencoder that result in a compact trained classifier. Kwok teaches a linear classifier with a posterior probability distribution on weights and marginalizing data for a neural network that implements text categorization. It would have been obvious to one of ordinary skill in the art before the effective filing date to modify the trained classifier of Lu, Kwok and Si to use other Bayesian posterior probability techniques such as Markov chain Monte Carlo to yield predictable results.

Claims 4, 5, 14 and 15 are rejected under 35 U.S.C. 103 as being unpatentable over Lu in view of Kwok and Si and further in view of Rezende et al. (Stochastic Backpropagation and Approximate Inference in Deep Generative Models).

Regarding claims 4 and 14, Lu does not explicitly disclose: The method according to claim 1, wherein the autoencoder comprises a denoising autoencoder.

Rezende teaches: wherein the autoencoder comprises a denoising autoencoder (“Denoising autoencoders (DAE) (Vincent et al., 2010) introduce a random corruption to the encoder network and attempt to minimize the expected reconstruction error under this corruption noise with additional regularisation terms. In our variational approach, the recognition distribution q(ξ|v) can be interpreted as a stochastic encoder in the DAE setting” p8 §relation to denoising auto-encoders).

Lu, Kwok, Si and Rezende are all in the same field of endeavor of classifiers and are analogous. Rezende further teaches the use of known denoising autoencoders. It would have been obvious to one of ordinary skill in the art before the effective filing date to combine the linear classifier and autoencoder based training for a compact classifier as taught by Lu, Kwok and Si with the denoising autoencoder of Rezende to yield a more robust system. One would be motivated to use denoising autoencoders as they minimize error with additional regularization (Rezende p8).

Regarding claims 5 and 15, Lu discloses: comprises a neural network trained according to stochastic gradient descent training using randomly selected data samples, wherein a gradient is calculated using back propagation of errors (p2299 §IV.B wherein backpropagation and gradient descent are used to train the autoencoder).

However, Lu does not explicitly disclose: wherein the denoising autoencoder is denoised stochastically.

Rezende teaches: wherein the denoising autoencoder is denoised stochastically (“Denoising autoencoders (DAE) (Vincent et al., 2010) introduce a random corruption to the encoder network and attempt to minimize the expected reconstruction error under this corruption noise with additional regularisation terms. In our variational approach, the recognition distribution q(ξ|v) can be interpreted as a stochastic encoder in the DAE setting. There is then a direct correspondence between the expression for the free energy (11) and the reconstruction error and regularization terms used in denoising auto-encoders (c.f. equation (4) of Bengio et al. (2013)). Thus, we can see denoising auto-encoders as a realisation of variational inference in latent variable models.” p8 §relation to denoising auto-encoders).

Lu, Kwok, Si and Rezende are all in the same field of endeavor of classifiers and are analogous. Rezende further teaches the use of known denoising autoencoders. It would have been obvious to one of ordinary skill in the art before the effective filing date to combine the linear classifier and autoencoder based training for a compact classifier as taught by Lu, Kwok and Si with the denoising autoencoder of Rezende to yield a more robust system. One would be motivated to use denoising autoencoders as they minimize error with additional regularization (Rezende p8).

Claims 6 and 16 are rejected under 35 U.S.C. 103 as being unpatentable over Lu in view of Kwok and Si and further in view of Wang et al. (Multi-task support vector machines for feature selection with shared knowledge discovery).

Regarding claims 6 and 16, Lu does not explicitly disclose: The method according to claim 1, wherein said training comprises training the objective function of the linear classifier with a bag of words, wherein the linear classifier comprises a support vector machine classifier with squared hinge loss and L2 regularization.

Wang teaches: wherein said training comprises training the objective function of the linear classifier with a bag of words, wherein the linear classifier comprises a support vector machine classifier with squared hinge loss and L2 regularization (“Feature selection with shared Information (FSSI) [31]: It employs the least squared loss function and ℓ2,1-norm regularization for each single task, and a joint trace-norm minimization to exploit shared information among multiple tasks.” P750.5., “Specifically, we propose to use hinge loss function with the ℓ2,1-norm regularization to learn feature selection matrix for each task. Sparsity of the feature selection matrix helps us to discover the correlations within each task while choosing distinctive features” p751 §5 ¶1).

Lu, Kwok, Si and Wang are all in the same field of endeavor of classifiers and are analogous. Wang further teaches the use of known bag-of-words, squared hinge loss and L2 regularization techniques. It would have been obvious to one of ordinary skill in the art before the effective filing date to combine the linear classifier and autoencoder based training for a compact classifier as taught by Lu, Kwok and Si with the specific linear classifier techniques as taught by Wang to yield a more robust system. One .
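
For illustration of the limitation recited in claims 6 and 16, the following is a minimal sketch of a squared hinge loss objective with L2 regularization evaluated on bag-of-words count vectors. It follows the claim language rather than any particular reference; the function name, the two-word vocabulary, the weights and the regularization strength `lam` are all hypothetical.

```python
import numpy as np


def squared_hinge_l2_objective(w, X, y, lam):
    """Mean squared hinge loss plus an L2 penalty on the weights.

    X holds one bag-of-words count vector per row; y holds labels in {-1, +1}.
    """
    margins = 1.0 - y * (X @ w)
    loss = np.maximum(0.0, margins) ** 2
    return float(loss.mean() + lam * (w @ w))


# Hypothetical vocabulary ["good", "bad"]; each row counts word occurrences.
X = np.array([[2.0, 0.0],   # document containing "good" twice
              [0.0, 1.0]])  # document containing "bad" once
y = np.array([1.0, -1.0])
w = np.array([0.5, -0.5])

print(squared_hinge_l2_objective(w, X, y, lam=0.1))  # 0.175
```

In this example the first document satisfies the margin and contributes no loss, while the second falls short of the margin by 0.5 and contributes 0.25 before averaging.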

Claims 7 and 17 are rejected under 35 U.S.C. 103 as being unpatentable over Lu in view of Kwok and Si and further in view of Le et al. (Distributed Representations of Sentences and Documents).
 
Regarding claims 7 and 17, Lu does not explicitly disclose: The method according to claim 1, wherein said training comprises training the objective function of the linear classifier with a bag of words, wherein the linear classifier comprises a Logistic Regression classifier.

Le teaches: wherein said training comprises training the objective function of the linear classifier with a bag of words, wherein the linear classifier comprises a Logistic Regression classifier (“After being trained, the paragraph vectors can be used as features for the paragraph (e.g., in lieu of or in addition to bag-of-words). We can feed these features directly to conventional machine learning techniques such as logistic regression, support vector machines or K-means.” P3 second to last paragraph right column).

Lu, Kwok, Si and Le are all in the same field of endeavor of classifiers and are analogous. Le further teaches the use of known bag-of-words and logistic regression techniques. It would have been obvious to one of ordinary skill in the art before the .

Claims 9 and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Lu in view of Kwok and Si and further in view of Melacci et al. (Laplacian Support Vector Machines Trained in the Primal).

Regarding claim 9, Lu does not explicitly disclose: The method according to claim 1, wherein the posterior probability distribution on the set of classifier weights is estimated using with a Laplace approximation, wherein the Laplace approximation stochastically estimates the set of classifier weights using a covariance matrix constrained to be diagonal.

Melacci teaches: wherein the posterior probability distribution on the set of classifier weights is estimated using with a Laplace approximation, wherein the Laplace approximation stochastically estimates the set of classifier weights using a covariance matrix constrained to be diagonal (“The primal solution of the LapSVM problem is based on an L2 hinge loss, that establishes a direct connection to the Laplacian Regularized Least Square Classifier (LapRLSC) (Belkin et al., 2006). We discuss the similarities between primal LapSVM and LapRLSC and we show that the proposed fast solution can be straightforwardly applied to LapRLSC.” P1151 ¶2, “L is the graph Laplacian associated to S, given by L = D − W, where W is the adjacency matrix of the data graph (the entry in position i, j is indicated with w_ij) and D is the diagonal matrix with the degree of each node (i.e., the element d_ii from D is d_ii = ∑_{j=1}^{n} w_ij).” p1151 §2 ¶1).

Lu, Kwok, Si and Melacci are all in the same field of endeavor of classifiers and are analogous. Melacci further teaches the use of Laplace approximation estimation of classifier weights. It would have been obvious to one of ordinary skill in the art before the effective filing date to combine the linear classifier and autoencoder based training for a compact classifier as taught by Lu, Kwok and Si with the specific linear classifier techniques as taught by Melacci to yield a more robust system. One would be motivated to use these techniques as Laplacian linear classifiers show state of the art performance in semi-supervised learning (Melacci abstract).
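
For illustration of the graph Laplacian construction quoted from Melacci (L = D − W, with D the diagonal degree matrix), the following is a minimal sketch. The function name and the three-node adjacency matrix are hypothetical and do not come from the reference.

```python
import numpy as np


def graph_laplacian(W: np.ndarray) -> np.ndarray:
    """Return L = D - W, where D is diagonal with d_ii = sum_j w_ij."""
    D = np.diag(W.sum(axis=1))
    return D - W


# Hypothetical symmetric adjacency matrix of a three-node data graph.
W = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 2.0],
              [0.0, 2.0, 0.0]])

L = graph_laplacian(W)
print(L)
# [[ 1. -1.  0.]
#  [-1.  3. -2.]
#  [ 0. -2.  2.]]
```

Each row of L sums to zero because the diagonal degree entry equals the sum of the off-diagonal weights in that row.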

Regarding claim 19, it is directed to largely the same subject matter as claims 9 and 10 and is rejected under the same rationale.

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to ERIC NILSSON whose telephone number is (571)272-5246.  The examiner can normally be reached on M-F: 9-5.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an 
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Kakali Chaki can be reached on (571)-272-3719.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.






/ERIC NILSSON/           Primary Examiner, Art Unit 2122