DETAILED ACTION
1.	This communication is in response to Application No. 16/429,425 filed on June 3, 2019 in which claims 1-18 and 20 are presented for examination.

Notice of Pre-AIA or AIA Status
2.	The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Information Disclosure Statement
3.	The information disclosure statement submitted on 06/03/2019 is in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statement is being considered by the examiner.

Claim Objections
4.	Claim 4 is objected to because of the following informalities: 
The limitation “an inference neural network step” does not clarify what comprises a “step” in the context of mutual information loss. Further, it is not apparent how this “step” differs from Claim 5, in which an “inference neural network” is recited.
The limitation “a Gaussian mixture model step” does not clarify what comprises a “step” in the context of mutual information loss. Further, it is not apparent how this “step” differs from Claim 5, in which a “Gaussian mixture model” is recited.
Claim 20 is objected to because of the following informalities:
The limitation recites “The method of Claim 17”; however, Claim 17 is a system claim. The limitation should be corrected to read “The system of Claim 17”.
Appropriate correction is required.

Claim Interpretation
5.	The following is a quotation of 35 U.S.C. 112(f):
(f) Element in Claim for a Combination. – An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof. 

6.	The claims in this application are given their broadest reasonable interpretation using the plain meaning of the claim language in light of the specification as it would be understood by one of ordinary skill in the art.  The broadest reasonable interpretation of a claim element (also commonly referred to as a claim limitation) is limited by the description in the specification when 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is invoked. 
As explained in MPEP § 2181, subsection I, claim limitations that meet the following three-prong test will be interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph:
(A)	the claim limitation uses the term “means” or “step” or a term used as a substitute for “means” that is a generic placeholder (also called a nonce term or a non-structural term having no specific structural meaning) for performing the claimed function; 
(B)	the term “means” or “step” or the generic placeholder is modified by functional language, typically, but not always linked by the transition word “for” (e.g., “means for”) or another linking word or phrase, such as “configured to” or “so that”; and 
(C)	the term “means” or “step” or the generic placeholder is not modified by sufficient structure, material, or acts for performing the claimed function. 
Use of the word “means” (or “step”) in a claim with functional language creates a rebuttable presumption that the claim limitation is to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites sufficient structure, material, or acts to entirely perform the recited function. 
Absence of the word “means” (or “step”) in a claim creates a rebuttable presumption that the claim limitation is not to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is not interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites function without reciting sufficient structure, material or acts to entirely perform the recited function. 
Claim limitations in this application that use the word “means” (or “step”) are being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action. Conversely, claim limitations in this application that do not use the word “means” (or “step”) are not being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action.
7.	This application includes one or more claim limitations that do not use the word “means,” but are nonetheless being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, because the claim limitation(s) uses a generic placeholder that is coupled with functional language without reciting sufficient structure to perform the recited function and the generic placeholder is not preceded by a structural modifier.  
Such claim limitation(s) is/are: 
“an instance mutual information loss branch” in Claim 11 and its dependents.
“a feature mutual information loss branch” in Claim 11 and its dependents.
“training module” in Claim 16.
Because this/these claim limitation(s) is/are being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, it/they is/are being interpreted to cover the corresponding structure described in the specification as performing the claimed function, and equivalents thereof.
If applicant does not intend to have this/these limitation(s) interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, applicant may:  (1) amend the claim limitation(s) to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph (e.g., by reciting sufficient structure to perform the claimed function); or (2) present a sufficient showing that the claim limitation(s) recite(s) sufficient structure to perform the claimed function so as to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph.

Claim Rejections - 35 USC § 112
8.	The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.


9.	Claim 11 and its dependents and Claim 16 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA ), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA  35 U.S.C. 112, the applicant), regards as the invention.

10.	Claim limitations “an instance mutual information loss branch” and “a feature mutual information loss branch” in Independent Claim 11 and its dependents and “training module” in Claim 16 invoke 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph. However, the written description fails to disclose the corresponding structure, material, or acts for performing the entire claimed function and to clearly link the structure, material, or acts to the function. The applicant’s specification does not appear to provide sufficient structure for “an instance mutual information loss branch”, “a feature mutual information loss branch”, and “training module”. The specification simply recites the components without describing how the structure performs the entire function in the claim language. Therefore, the claims are indefinite and are rejected under 35 U.S.C. 112(b) or pre-AIA 35 U.S.C. 112, second paragraph.
Applicant may:
(a)        Amend the claim so that the claim limitation will no longer be interpreted as a limitation under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph; 
(b)        Amend the written description of the specification such that it expressly recites what structure, material, or acts perform the entire claimed function, without introducing any new matter (35 U.S.C. 132(a)); or 
(c)        Amend the written description of the specification such that it clearly links the structure, material, or acts disclosed therein to the function recited in the claim, without introducing any new matter (35 U.S.C. 132(a)).
If applicant is of the opinion that the written description of the specification already implicitly or inherently discloses the corresponding structure, material, or acts and clearly links them to the function so that one of ordinary skill in the art would recognize what structure, material, or acts perform the claimed function, applicant should clarify the record by either: 
(a)        Amending the written description of the specification such that it expressly recites the corresponding structure, material, or acts for performing the claimed function and clearly links or associates the structure, material, or acts to the claimed function, without introducing any new matter (35 U.S.C. 132(a)); or 
(b)        Stating on the record what the corresponding structure, material, or acts, which are implicitly or inherently set forth in the written description of the specification, perform the claimed function. For more information, see 37 CFR 1.75(d) and MPEP §§ 608.01(o) and 2181.

11.	Claims 5 and 16 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention. Claims 5 and 16 recite “end-to-end fashion” in reference to the inference neural network and a Gaussian mixture model. This limitation is indefinite because it is unclear how the inference neural network and Gaussian mixture model are arranged in the “end-to-end fashion” and how this arrangement relates to the instance and feature autoencoders within the system of Claim 11. Further, the applicant’s specification provides no additional support or detail describing the structure or what an “end-to-end fashion” consists of.

12.	Claim 8 and its dependents are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention. Claim 8 recites the limitation “respective autoencoders”. There is insufficient antecedent basis for this limitation in the claim. There is no prior mention of autoencoders in Claims 1 and 6, from which Claim 8 depends. Further, it is unclear what “respective autoencoders” refers to within the context of the claim language; the limitation is therefore insufficient.

Claim Rejections - 35 USC § 101
13.	35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.


14.	Claims 1-18 and 20 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
	Claim 1 recites a method for co-clustering data, comprising: reducing dimensionality for instances and features of an input dataset independently of one another; determining a mutual information loss for the instances and the features independently of one another; cross-correlating the instances and the features, using a processor, based on the mutual information loss, to determine a cross-correlation loss; and determining co-clusters in the input data based on the cross-correlation loss.
	2A Prong 1: The limitation, reducing dimensionality for instances and features of an input dataset independently of one another, as drafted, is a process that, under its broadest reasonable interpretation covers performance of the limitation in the mind but for the recitation of generic computer components. That is, reducing dimensionality between instances and features independently of one another may be performed manually by a user. Further, the limitation, determining a mutual information loss for the instances and the features independently of one another, as drafted, is a process that, under its broadest reasonable interpretation covers performance of the limitation in the mind but for the recitation of generic computer components. That is, determining a mutual information loss for instances and features independently of one another may be performed manually by a user making a determination based on mutual information of the instances and features of a dataset. Further, the limitation, cross-correlating the instances and the features, using a processor, based on the mutual information loss, to determine a cross-correlation loss, as drafted, is a process that, under its broadest reasonable interpretation covers performance of the limitation in the mind but for the recitation of generic computer components. That is, other than reciting “using a processor”, cross-correlating the instances and features based on the mutual information loss to determine a cross-correlation loss may be performed manually by a user cross-correlating the instances and features. Further, the limitation, determining co-clusters in the input data based on the cross-correlation loss, as drafted, is a process that, under its broadest reasonable interpretation covers performance of the limitation in the mind but for the recitation of generic computer components. That is, determining co-clusters based on the cross-correlation loss may be performed manually by a user determining co-clusters. 
If a claim limitation, under its broadest reasonable interpretation, covers performance of the limitation in the mind but for the recitation of generic computer components, then it falls within the “Mental Processes” grouping of abstract ideas. Accordingly, the claim recites an abstract idea.
	2A Prong 2: This judicial exception is not integrated into a practical application. In particular, the claim recites the additional element – a processor. The processor is recited at a high level of generality (i.e., as a generic processor able to cross-correlate instances and features) such that it amounts to no more than mere instructions to apply the exception using generic computer components. Accordingly, this additional element does not integrate the abstract idea into a practical application because it does not impose any meaningful limits on practicing the abstract idea. The claim is directed to an abstract idea.
	2B: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to integration of the abstract idea into a practical application, the additional element of a processor amounts to no more than mere instructions to apply the exception using generic computer components. Mere instructions to apply an exception using generic computer components cannot provide an inventive concept. The claim is not patent eligible.
	For the reasons above, Claim 1 is rejected as being directed to an abstract idea without significantly more. This rejection applies equally to dependent claims 2-10. The additional limitations of the dependent claims are addressed below.
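
By way of illustration only, the mutual information quantity that underlies the “mutual information loss” recited in Claim 1 can be sketched for a small discrete joint probability table. This is a minimal sketch under stated assumptions: the function name mutual_information and both example tables are hypothetical, and the code computes the textbook quantity I(X;Y) in nats, not the applicant's loss as it may be implemented in the specification.

```python
import math


def mutual_information(joint):
    """Return I(X;Y) in nats for a joint probability table (list of rows).

    Hypothetical helper for illustration; `joint` is assumed to sum to 1.
    """
    row_marginals = [sum(row) for row in joint]
    col_marginals = [sum(col) for col in zip(*joint)]
    mi = 0.0
    for i, row in enumerate(joint):
        for j, p in enumerate(row):
            if p > 0.0:  # zero-probability cells contribute nothing
                mi += p * math.log(p / (row_marginals[i] * col_marginals[j]))
    return mi


# Hypothetical example tables: X and Y independent vs. fully dependent.
independent = [[0.25, 0.25], [0.25, 0.25]]
dependent = [[0.5, 0.0], [0.0, 0.5]]

mi_independent = mutual_information(independent)
mi_dependent = mutual_information(dependent)
```

With these tables, the independent case yields zero mutual information and the fully dependent case yields log 2 nats, which is the sense in which a loss built on this quantity can reward or penalize statistical dependence.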

	Claim 2 recites the method of claim 1, further comprising classifying a new instance based on associated new features. At Step 2A Prong 1, Dependent Claim 2 recites the mental process “classifying a new instance based on associated new features”, which may be performed manually by a user. Accordingly, under Step 2A Prong 2 and Step 2B, these additional elements do not integrate the abstract idea into practical application because they do not impose any meaningful limits on practicing the abstract idea, as discussed above in the rejection of claim 1.

	Claim 3 recites the method of claim 1, wherein the instances include documents and the features include words associated with respective documents. Dependent Claim 3 is just another activity specifying that the instances include documents and the features include words associated with the respective documents. Accordingly, under Step 2A Prong 2 and Step 2B, these additional elements do not integrate the abstract idea into practical application because they do not impose any meaningful limits on practicing the abstract idea, as discussed above in the rejection of claim 1.

	Claim 4 recites the method of claim 1, wherein determining the mutual information loss includes an inference neural network step and a Gaussian mixture model step. Dependent Claim 4 is just another activity specifying the use of an inference neural network and a Gaussian mixture model, such that it amounts to no more than mere instructions to apply the exception using generic machine learning components. Accordingly, under Step 2A Prong 2 and Step 2B, these additional elements do not integrate the abstract idea into practical application because they do not impose any meaningful limits on practicing the abstract idea, as discussed above in the rejection of claim 1.

	Claim 5 recites the method of claim 4, further comprising an inference neural network and a Gaussian mixture model in an end-to-end fashion. Dependent Claim 5 is just another activity specifying the arrangement of the inference neural network and Gaussian mixture model in an end-to-end fashion. Accordingly, under Step 2A Prong 2 and Step 2B, these additional elements do not integrate the abstract idea into practical application because they do not impose any meaningful limits on practicing the abstract idea, as discussed above in the rejection of claim 1.

	Claim 6 recites the method of claim 1, wherein determining co-clusters includes optimizing an objective function that includes a respective dimension reconstruction loss term for the instances and for the features and a cross-correlation loss term that includes the determined cross-correlation loss. At Step 2A Prong 1, Dependent Claim 6 recites the mathematical process “optimizing an objective function that includes a respective dimension reconstruction loss term for the instances and for the features and a cross-correlation loss term that includes the determined cross-correlation loss”, which may be performed by mathematical calculation. Accordingly, under Step 2A Prong 2 and Step 2B, these additional elements do not integrate the abstract idea into practical application because they do not impose any meaningful limits on practicing the abstract idea, as discussed above in the rejection of claim 1.

Claim 7 recites the method of claim 6, wherein the objective function is: 
[Equation image media_image1.png: objective function, not recoverable from this text]

where J1 is the reconstruction loss term for the instances, J2 is the reconstruction loss term for the features, J3 is the cross-correlation loss term, θr and θc are dimension reduction parameters for the instances and the features, respectively, and ηr and ηc are mutual information loss parameters for the instances and the features, respectively. Dependent Claim 7 is just another activity specifying the objective function and the corresponding parameters within the defined objective function. Further, at Step 2A Prong 1, determining the objective function may be considered a mathematical process which may be performed by mathematical calculation. Accordingly, under Step 2A Prong 2 and Step 2B, these additional elements do not integrate the abstract idea into practical application because they do not impose any meaningful limits on practicing the abstract idea, as discussed above in the rejection of claim 1.

	Claim 8 recites the method of claim 6, wherein reducing the dimensionality of the instances and the features comprises applying respective autoencoders to the input data. Dependent Claim 8 is just another activity specifying the application of respective autoencoders to input data, such that it amounts to no more than mere instructions to apply the exception using generic machine learning components. Accordingly, under Step 2A Prong 2 and Step 2B, these additional elements do not integrate the abstract idea into practical application because they do not impose any meaningful limits on practicing the abstract idea, as discussed above in the rejection of claim 1.

	Claim 9 recites the method of claim 8, wherein each autoencoder determines a dimension reconstruction loss by reducing the dimensionality of data and then restoring the reduced dimensionality data to an original dimensionality. At Step 2A Prong 1, Dependent Claim 9 recites the mental and mathematical processes “determines a dimension reconstruction loss by reducing the dimensionality of data and then restoring the reduced dimensionality data to an original dimensionality”, which may be performed manually by a user through the use of mathematical calculations. At Step 2A Prong 2 and Step 2B, the additional element “each autoencoder” does not integrate the abstract idea into practical application because it does not impose any meaningful limits on practicing the abstract idea, as discussed above in the rejection of claim 1.
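
The reduce-then-restore calculation recited in Claim 9 can be sketched, for illustration only, with a fixed linear projection standing in for a trained autoencoder. This is an assumption, not the applicant's method: the functions encode, decode, and reconstruction_loss, the projection direction, and the sample points are all hypothetical, and a real autoencoder would learn its encoder and decoder rather than use a fixed direction.

```python
import math


def encode(point, direction):
    # Reduce a 2-D point to one coordinate along the unit vector `direction`.
    return point[0] * direction[0] + point[1] * direction[1]


def decode(code, direction):
    # Restore the 1-D code to the original 2-D space.
    return (code * direction[0], code * direction[1])


def reconstruction_loss(points, direction):
    # Mean squared error between each point and its restored version.
    total = 0.0
    for p in points:
        restored = decode(encode(p, direction), direction)
        total += (p[0] - restored[0]) ** 2 + (p[1] - restored[1]) ** 2
    return total / len(points)


# Hypothetical projection direction (unit vector along the line y = x).
direction = (1 / math.sqrt(2), 1 / math.sqrt(2))

on_line = [(1.0, 1.0), (2.0, 2.0)]    # lie on the projection line
off_line = [(1.0, 0.0), (0.0, 1.0)]   # lose information when projected

loss_on_line = reconstruction_loss(on_line, direction)
loss_off_line = reconstruction_loss(off_line, direction)
```

Points that already lie along the projection direction are restored essentially without error, while the off-line points are each restored to (0.5, 0.5) and incur a loss of 0.5.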

	Claim 10 recites the method of claim 1, further comprising performing text classification using the determined co-clusters. At Step 2A Prong 1, Dependent Claim 10 recites the mental process “performing text classification using the determined co-clusters”, which may be performed manually by a user classifying text documents using the determined co-clusters. Accordingly, under Step 2A Prong 2 and Step 2B, these additional elements do not integrate the abstract idea into practical application because they do not impose any meaningful limits on practicing the abstract idea, as discussed above in the rejection of claim 1.

	Claim 11 recites a data co-clustering system, comprising:  an instance autoencoder configured to reduce a dimensionality for instances of an input dataset; a feature autoencoder configured to reduce a dimensionality for features of an input dataset; an instance mutual information loss branch configured to determining a mutual information loss for the instances; a feature mutual information loss branch configured to determine a mutual information loss for the features; a processor configured to cross-correlate the instances and the features based on the mutual information loss, to determine a cross-correlation loss and to determine co-clusters in the input data based on the cross-correlation loss.
2A Prong 1: The limitation, reduce a dimensionality for instances of an input dataset, as drafted, is a process that, under its broadest reasonable interpretation covers performance of the limitation in the mind but for the recitation of generic computer components. That is, other than reciting “instance autoencoder”, reducing dimensionality for instances may be performed manually by a user. The limitation, reduce a dimensionality for features of an input dataset, as drafted, is a process that, under its broadest reasonable interpretation covers performance of the limitation in the mind but for the recitation of generic computer components. That is, other than reciting “feature autoencoder”, reducing dimensionality for features may be performed manually by a user. Further, the limitation, determining a mutual information loss for the instances, as drafted, is a process that, under its broadest reasonable interpretation covers performance of the limitation in the mind but for the recitation of generic computer components. That is, other than reciting “instance mutual information loss branch”, determining a mutual information loss for instances may be performed manually by a user making a determination. Further, the limitation, determining a mutual information loss for the features, as drafted, is a process that, under its broadest reasonable interpretation covers performance of the limitation in the mind but for the recitation of generic computer components. That is, other than reciting “feature mutual information loss branch”, determining a mutual information loss for features may be performed manually by a user making a determination. 
Further, the limitation, cross-correlate the instances and the features based on the mutual information loss, to determine a cross-correlation loss and to determine co-clusters in the input data based on the cross-correlation loss, as drafted, is a process that, under its broadest reasonable interpretation covers performance of the limitation in the mind but for the recitation of generic computer components. That is, other than reciting “a processor”, cross-correlating the instances and features based on the mutual information loss to determine a cross-correlation loss and to determine co-clusters based on the cross-correlation loss may be performed manually by a user cross-correlating the instances and features to determine the corresponding co-clusters based on the determined loss. If a claim limitation, under its broadest reasonable interpretation, covers performance of the limitation in the mind but for the recitation of generic computer components, then it falls within the “Mental Processes” grouping of abstract ideas. Accordingly, the claim recites an abstract idea.
	2A Prong 2: This judicial exception is not integrated into a practical application. In particular, the claim recites the additional elements – instance autoencoder and feature autoencoder. The instance autoencoder and feature autoencoder are recited at a high level of generality (i.e., as generic autoencoders configured to reduce dimensionality for instances and features respectively) such that they amount to no more than mere instructions to apply the exception using generic computer components. Further, the claim recites additional elements – instance mutual information loss branch and feature mutual information loss branch. The instance mutual information loss branch and feature mutual information loss branch are recited at a high level of generality (i.e., as generic branches configured to determine mutual information loss for instances and features respectively) such that they amount to no more than mere instructions to apply the exception using generic computer components. Further, the claim recites the additional element – a processor. The processor is recited at a high level of generality (i.e., as a generic processor able to cross-correlate instances and features, determine a cross-correlation loss, and determine co-clusters based on the cross-correlation loss) such that it amounts to no more than mere instructions to apply the exception using generic computer components. Accordingly, these additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea. The claim is directed to an abstract idea.
	2B: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to integration of the abstract idea into a practical application, the additional elements of an instance autoencoder and a feature autoencoder amount to no more than mere instructions to apply the exception using generic computer components. Mere instructions to apply an exception using generic computer components cannot provide an inventive concept. As discussed above with respect to integration of the abstract idea into a practical application, the additional elements of an instance mutual information loss branch and a feature mutual information loss branch amount to no more than mere instructions to apply the exception using generic computer components. Mere instructions to apply an exception using generic computer components cannot provide an inventive concept. As discussed above with respect to integration of the abstract idea into a practical application, the additional element of a processor amounts to no more than mere instructions to apply the exception using generic computer components. Mere instructions to apply an exception using generic computer components cannot provide an inventive concept. The claim is not patent eligible.
For the reasons above, Claim 11 is rejected as being directed to an abstract idea without significantly more. This rejection applies equally to dependent claims 12-18 and 20. The additional limitations of the dependent claims are addressed below.

Claim 12 recites the system of claim 11, wherein the processor is further configured to classify a new instance based on associated new features. At Step 2A Prong 1, Dependent Claim 12 recites the mental process “classify a new instance based on associated new features”, which may be performed manually by a user. Accordingly, at Step 2A Prong 2 and Step 2B, the additional element “processor” does not integrate the abstract idea into practical application because it does not impose any meaningful limits on practicing the abstract idea, as discussed above in the rejection of claim 11.

Claim 13 recites the system of claim 11, wherein the instances include documents and the features include words associated with respective documents. Dependent Claim 13 is just another activity specifying that the instances include documents and the features include words associated with the respective documents. Accordingly, under Step 2A Prong 2 and Step 2B, these additional elements do not integrate the abstract idea into practical application because they do not impose any meaningful limits on practicing the abstract idea, as discussed above in the rejection of claim 11.

Claim 14 recites the system of claim 11, wherein the input dataset comprises a matrix having columns that represent one of the features and the instances and rows that represent the other of the features and the instances. Dependent Claim 14 is just another activity specifying that the input dataset comprises a matrix with columns that represent one of the features and the instances and rows that represent the other features and instances. Accordingly, under Step 2A Prong 2 and Step 2B, these additional elements do not integrate the abstract idea into practical application because they do not impose any meaningful limits on practicing the abstract idea, as discussed above in the rejection of claim 11.
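
The data arrangement described in Claims 13 and 14 (documents as instances, words as features, one on the rows and the other on the columns) can be sketched, for illustration only, as a document-term count matrix. The function document_term_matrix and the two sample documents are hypothetical; the choice of rows for documents is an assumption, and transposing the matrix gives the other orientation that Claim 14 also covers.

```python
def document_term_matrix(documents):
    """Build a count matrix with one row per document and one column per word.

    Hypothetical helper for illustration; words are split on whitespace.
    """
    vocabulary = sorted({word for doc in documents for word in doc.split()})
    matrix = [
        [doc.split().count(word) for word in vocabulary]
        for doc in documents
    ]
    return vocabulary, matrix


# Hypothetical two-document input dataset.
docs = ["loss loss cluster", "cluster feature"]
vocabulary, matrix = document_term_matrix(docs)

# Swapping rows and columns puts features on the rows and instances on the columns.
transposed = [list(col) for col in zip(*matrix)]
```

Co-clustering in general operates on both axes of such a matrix at once, grouping rows and columns together rather than clustering only one of them.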

Claim 15 recites the system of claim 11, wherein each mutual information loss branch determines a respective mutual information loss using an inference neural network and a Gaussian mixture model. Dependent Claim 15 is just another activity specifying the use of an inference neural network and a Gaussian mixture model, such that it amounts to no more than mere instructions to apply the exception using generic machine learning components. Accordingly, under Step 2A Prong 2 and Step 2B, the additional element “each mutual information loss branch” does not integrate the abstract idea into practical application because it does not impose any meaningful limits on practicing the abstract idea, as discussed above in the rejection of claim 11.

Claim 16 recites the system of claim 15, further comprising a training module configured to train the inference neural network and a Gaussian mixture model in an end-to-end fashion. Dependent Claim 16 is just another activity specifying the training and arrangement of the inference neural network and Gaussian mixture model in an end-to-end fashion. Accordingly, under Step 2A Prong 2 and Step 2B, the additional element “training module” does not integrate the abstract idea into practical application because it does not impose any meaningful limits on practicing the abstract idea, as discussed above in the rejection of claim 11.

Claim 17 recites the system of claim 11, wherein the processor is further configured to determine co-clusters using optimizing an objective function that includes a respective dimension reconstruction loss term for the instances and for the features and a cross-correlation loss term that includes the determined cross-correlation loss. At Step 2A Prong 1, Dependent Claim 17 recites the mathematical process “optimizing an objective function that includes a respective dimension reconstruction loss term for the instances and for the features and a cross-correlation loss term that includes the determined cross-correlation loss”, which may be performed by mathematical calculation. Accordingly, under Step 2A Prong 2 and Step 2B, the additional element “processor” does not integrate the abstract idea into practical application because it does not impose any meaningful limits on practicing the abstract idea, as discussed above in the rejection of claim 11.

Claim 18 recites the system of Claim 17, wherein the objective function is: 
[Equation image: media_image1.png]
where J1 is the reconstruction loss term for the instances, J2 is the reconstruction loss term for the features, J3 is the cross-correlation loss term, θr and θc are dimension reduction parameters for the instances and the features, respectively, and ηr and ηc are mutual information loss parameters for the instances and the features, respectively. Dependent Claim 18 merely specifies the objective function and the corresponding parameters within the defined objective function. Further, at Step 2A Prong 1, determining the objective function may be considered a mathematical process which may be performed by mathematical calculation. Accordingly, under Step 2A Prong 2 and Step 2B, these additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea, as discussed above in the rejection of claim 11.

Claim 20 recites the method of claim 17, wherein each autoencoder determines a dimension reconstruction loss by reducing the dimensionality of data and then restoring the reduced dimensionality data to an original dimensionality. At Step 2A Prong 1, Dependent Claim 20 recites the mental and mathematical processes “determines a dimension reconstruction loss by reducing the dimensionality of data and then restoring the reduced dimensionality data to an original dimensionality”, which may be performed manually by a user through the use of mathematical calculations. At Step 2A Prong 2 and Step 2B, the additional element “each autoencoder” does not integrate the abstract idea into a practical application because it does not impose any meaningful limits on practicing the abstract idea, as discussed above in the rejection of claim 1.
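For illustration only, the calculation characterized above (reducing the dimensionality of data, restoring it to the original dimensionality, and measuring the resulting error) may be sketched as follows. This is a minimal sketch under stated assumptions and is not taken from the application or from any cited reference: a fixed, untrained linear encoder with a transposed decoder stands in for a trained autoencoder, squared Euclidean distance is assumed as the loss, and all function names are hypothetical.

```python
# Illustrative sketch only: a fixed linear encoder/decoder pair stands in for
# a trained autoencoder. All names here are hypothetical.

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def transpose(matrix):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*matrix)]


def reconstruction_loss(data, encoder_weights):
    """Mean squared Euclidean error between inputs and their reconstructions.

    Each input is reduced to len(encoder_weights) dimensions and then
    restored to its original dimensionality with the transposed weights.
    """
    decoder_weights = transpose(encoder_weights)
    total = 0.0
    for x in data:
        code = matvec(encoder_weights, x)         # reduce dimensionality
        restored = matvec(decoder_weights, code)  # restore dimensionality
        total += sum((a - b) ** 2 for a, b in zip(x, restored))
    return total / len(data)


# Assumed encoder: keep the first two of three coordinates.
W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]
data = [[1.0, 2.0, 0.0],   # lies in the retained subspace, error 0
        [0.0, 1.0, 3.0]]   # third coordinate is lost, error 3**2 = 9
loss = reconstruction_loss(data, W)
print(loss)  # 4.5
```

Each step is ordinary arithmetic on vectors, which is the sense in which the calculation is described above as capable of being performed by mathematical calculation.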

Claim Rejections - 35 USC § 103
15.	The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.


16.	Claims 1-10 are rejected under 35 U.S.C. 103 as being unpatentable over Zhang et al. (hereinafter Zhang) (US PG-PUB 20180165554), in view of Bates et al. (hereinafter Bates) (US PG-PUB 20090300547).
Regarding Claim 1, Zhang teaches a method for co-clustering data, comprising: 
reducing dimensionality (Zhang, Par. [0017], “Clustering acts to effectively reduce the dimensionality of a data set by treating each cluster as a degree of freedom, with a distance from a centroid or other characteristic exemplar of the set. In a non-hybrid system, the distance is a scalar, while in systems that retain some flexibility at the cost of complexity, the distance itself may be a vector.”, thus, dimensionality is reduced by clustering data) for instances and features of an input dataset independently of one another (Zhang, Par. [0020], “Thus, for example, a semantic database may be represented as a set of documents with words or phrases. Words may be ambiguous, such as “apple”, representing a fruit, a computer company, a record company, and a musical artist. In order to effectively use the database, the multiple meanings or contexts need to be resolved. In order to resolve the context, an automated process might be used to exploit available information for separating the meanings, i.e., clustering documents according to their context. This automated process can be difficult as the data set grows, and in some cases the available information is insufficient for accurate automated clustering.”, therefore, instances (documents) and features (words within the documents) are clustered according to context, to reduce dimensionality); 
cross-correlating the instances and the features, using a processor (Zhang, Par. [0116], “Exemplary hardware includes at least one processor coupled to a memory.”, therefore, a processor is used), based on the mutual information loss (See introduction of Bates reference below for mutual information loss), to determine a cross-correlation loss (Zhang, Par. [0049], “In particular, a novel loss function is provided for training autoencoders that are directly coupled with the classification task. A linear classifier is first trained on BoW, then a Bregman Divergence [Banerjee et al. 2004] is derived as the loss function of a subsequent autoencoder. The new loss function gives the autoencoder the information about directions along which the reconstruction should be accurate, and where larger reconstruction errors are tolerated. Informally, this can be considered as a weighting of words based on their correlations with the class label: predictive words should be given large weights in the reconstruction even they are not frequent words, and vice versa.”, thus, instances and features are cross-correlated according to a particular class label (or a derived loss can be applied to unlabeled data) and a Bregman Divergence is derived as the loss function); and 

Zhang does not explicitly disclose determining a mutual information loss for the instances and the features independently of one another;
However, Bates teaches determining a mutual information loss for the instances and the features independently of one another (Bates, Par. [0145], “Where X̂ and Ŷ [are] the two random variables induced by coclustering. Information theory can be used to give a theoretical formulation to the coclustering problem: the optimal co-clustering is one that minimizes the loss in mutual information between the original random variables and the mutual information between the clustered random variables. Given any coclustering, we define a joint distribution”, therefore, loss in mutual information between features and instances is determined);

Zhang does not explicitly disclose determining co-clusters in the input data based on the cross-correlation loss.
However, Bates teaches determining co-clusters in the input data based on the cross-correlation loss (Bates, Par. [0144], “More sophisticated algorithms can be used to carry out co-clustering. In co-clustering, articles and users are clustered to create article-user clustered niches. In a preferred embodiment, the algorithm used is a co-clustering algorithm, licensed by the National Research Council of Canada. The steps of the algorithm are as follows: [0145] (a) Given the number of row clusters k and number of column clusters l, and the initial rating matrix, we normalized the rating matrix, the resulting matrix is actually a non-negative contingency table, which can be regarded as a joint distribution p(x, y), where X and Y and two discrete random variables that take values over the rows and columns.”, thus, co-clusters in the data are determined based on cross-correlation loss between features and instances, which is in turn based on mutual information loss as described in Par. [0145]).
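For illustration only, the quantity described in the quoted passages of Bates (the loss in mutual information between the original random variables and the clustered random variables, computed from a normalized non-negative matrix treated as a joint distribution p(x, y)) may be sketched as follows. This is a minimal sketch under stated assumptions: the function names, the example rating matrix, and the cluster assignments are hypothetical, and the sketch only evaluates the loss for a given assignment of rows and columns to clusters. It does not reproduce the licensed co-clustering algorithm referenced in Bates.

```python
import math

# Illustrative sketch only; all names and the example data are hypothetical.


def normalize(matrix):
    """Scale a non-negative matrix so its entries sum to 1 (a joint p(x, y))."""
    total = sum(sum(row) for row in matrix)
    return [[value / total for value in row] for row in matrix]


def mutual_information(joint):
    """Mutual information (in nats) of a joint distribution given as rows."""
    row_marginal = [sum(row) for row in joint]
    col_marginal = [sum(col) for col in zip(*joint)]
    mi = 0.0
    for i, row in enumerate(joint):
        for j, p in enumerate(row):
            if p > 0.0:
                mi += p * math.log(p / (row_marginal[i] * col_marginal[j]))
    return mi


def cluster_joint(joint, row_labels, col_labels, k, l):
    """Aggregate p(x, y) into a k-by-l joint over row and column clusters."""
    clustered = [[0.0] * l for _ in range(k)]
    for i, row in enumerate(joint):
        for j, p in enumerate(row):
            clustered[row_labels[i]][col_labels[j]] += p
    return clustered


def mutual_information_loss(matrix, row_labels, col_labels, k, l):
    """Mutual information of the original variables minus that of the clusters."""
    joint = normalize(matrix)
    clustered = cluster_joint(joint, row_labels, col_labels, k, l)
    return mutual_information(joint) - mutual_information(clustered)


# Hypothetical 4x4 rating matrix with two obvious row/column blocks.
ratings = [[5, 5, 0, 0],
           [4, 4, 0, 0],
           [0, 0, 3, 3],
           [0, 0, 2, 2]]

# Assignment that follows the block structure versus one that mixes blocks.
good = mutual_information_loss(ratings, [0, 0, 1, 1], [0, 0, 1, 1], 2, 2)
bad = mutual_information_loss(ratings, [0, 1, 0, 1], [0, 1, 0, 1], 2, 2)
```

Under these assumptions, the assignment that follows the block structure loses essentially no mutual information, while the mixed assignment loses all of it, which is the sense in which a preferred co-clustering minimizes the loss.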

It would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the method for reducing dimensionality for instances and features and cross-correlating the instances and features to determine a cross-correlation loss, as disclosed by Zhang to include determining a mutual information loss for instances and features and determining co-clusters based on the cross-correlation loss, as disclosed by Bates. One of ordinary skill in the art would have been motivated to make this modification to enable creating co-clusters that minimize the loss in mutual information between features and instances to exploit similarities, hence improving performance and reducing noise (Bates, Par. [0141-0143], “There are numerous benefits to clustering including: [0142] (a) it improves performance, since the resultant "clustered" matrix has smaller dimensions; and, [0143] (b) it groups similar users or articles together which reduces noise--and finding similarity of some kind is the key concept behind the recommender system.”).

Regarding Claim 2, Zhang in view of Bates teaches the method of claim 1, further comprising classifying a new instance based on associated new features (Zhang, Par. [0052], “training an objective function of a linear classifier, based on a set of labeled data, to derive a set of classifier weights; defining a posterior probability distribution on the set of classifier weights of the linear classifier; approximating a marginalized loss function for an autoencoder as a Bregman divergence, based on the posterior probability distribution on the set of classifier weights learned from the linear classifier; and classifying unlabeled data using a compact classifier according to the marginalized loss function. The data may comprise semantic data, textual data, and may consist essentially of text documents.”, thus, a new instance would be classified by the compact classifier which is based on the loss function considering previous instances and features).

Regarding Claim 3, Zhang in view of Bates teaches the method of claim 1, wherein the instances include documents and the features include words associated with respective documents (Zhang, Par. [0003], “In machine learning, documents are usually represented as Bag of Words (BoW), which nicely reduces a piece of text with arbitrary length to a fixed length vector. Despite its simplicity, BoW remains the dominant representation in many applications including text classification. There has also been a large body of work dedicated to learning useful representations for textual data (Turney and Pantel 2010; Blei, Ng, and Jordan 2003; Deerwester et al. 1990; Mikolov et al. 2013; Glorot, Bordes, and Bengio 2011). By exploiting the co-occurrence pattern of words, one can learn a low dimensional vector that forms a compact and meaningful representation for a document.”, thus, instances include documents and associated features include words associated with respective documents. Further detailed in Par. [0007]).

Regarding Claim 4, Zhang in view of Bates teaches the method of claim 1, wherein determining the mutual information loss includes an inference neural network step (Zhang, Par. [0098], “FIG. 2 shows a preferred embodiment of the method. The labelled set of data is received 201, and used to train a classifier, which in this case is an artificial neural network trained on a “bag of words” representation of the labeled data, using SVM2 with squared hinge loss and l2 regularization as the linear classifier 202. The trained set of weights is then exported in a learning transfer process, to a denoising autoencoder 203.”, thus, a neural network is trained to determine loss and reduce dimensionality between features and instances (described further in preceding Par. [0090-0097])) and a Gaussian mixture model step (Zhang, Par. [0046], “Mixtures of Gaussians and locally weighted regression are two statistical models that offer elegant representations and efficient learning algorithms.”, thus, Gaussian mixture models are used to determine loss as well; further information on the Gaussian distribution used can be found in Par. [0084-0087]).

Regarding Claim 5, Zhang in view of Bates teaches the method of claim 4, further comprising an inference neural network (See Claim 4 above for teaching of inference neural network) and a Gaussian mixture model (See Claim 4 above for teaching of Gaussian mixture model) in an end-to-end fashion (Zhang, Figure 2, depicts the training of a neural network (label 202), determining the loss which may involve GMM (label 205), training the classifier (label 206), and classifying data (label 208) such that the process flow and corresponding architecture occur in an end-to-end fashion).

Regarding Claim 6, Zhang in view of Bates teaches the method of claim 1, wherein determining co-clusters includes optimizing (Zhang, Par. [0015], “After the cost or distance function is defined and formulated as clustering criteria, the clustering process becomes one of optimization according to an optimization process, which itself may be imperfect or provide different optimized results in dependence on the particular optimization employed. For large data sets, a complete evaluation of a single optimum state may be infeasible, and therefore the optimization process subject to error, bias, ambiguity, or other known artifacts.”, thus, co-clusters are determined by an optimization process) an objective function (Zhang, Par. [0061], “Autoencoders learn functions that can reconstruct the inputs. They are typically implemented as a neural network with one hidden layer, and one can extract the activation of the hidden layer as the new representation. Mathematically, a collection of data points X={xi}, xi∈Rd, i∈[1, m] is provided, and the objective function of an autoencoder is thus:”, therefore, an objective function (Equation 1 of the Zhang reference, also shown in the rejection of Claim 7 below) is disclosed and trained/optimized by the autoencoder/neural network) that includes a respective dimension reconstruction loss term for the instances and for the features (Zhang, Par. [0062], “x̃i is the reconstruction”, thus, reconstruction loss of instances and features is captured by x̃i) and a cross-correlation loss term that includes the determined cross-correlation loss (Zhang, Par. [0062], “D is a loss function, such as the squared Euclidean Distance”, thus, cross-correlation loss is also disclosed. Further details on the loss function D can be found in subsequent Par. [0063-0070] where the loss function may also be a Bregman Divergence wherein the linear classifier is trained on a bag of words, as mentioned in the rejection of Claim 1 above).

Regarding Claim 7, Zhang in view of Bates teaches the method of claim 6, wherein the objective function is: 

[Equation image: media_image1.png]

where J1 is the reconstruction loss term for the instances, J2 is the reconstruction loss term for the features (Zhang, Par. [0062], “x̃i is the reconstruction”, thus, reconstruction loss of instances and features is captured by x̃i), J3 is the cross-correlation loss term (Zhang, Par. [0062], “D is a loss function”, thus, cross-correlation loss is also disclosed. Further details on the loss function D can be found in subsequent Par. [0063-0070] where the loss function may also be a Bregman Divergence wherein the linear classifier is trained on a bag of words, as mentioned in the rejection of Claim 1 above), θr and θc are dimension reduction parameters for the instances and the features (Zhang, Par. [0062], “where W∈Rkxd, b∈Rk, W′∈Rdxk, bl∈Rd are the parameters to be learned”, thus, these parameters are learned in order to facilitate dimension reduction for the instances and features), respectively, and ηr and ηc are mutual information loss parameters for the instances and the features, respectively (Zhang, Par. [0068], “Two of the most commonly used loss functions for autoencoders are the squared Euclidean distance and elementwise KL divergence. It is not difficult to verify that they both fall into this family by choosing ƒ as the squared l2 norm and the sum of element-wise entropy respectively. What the two loss functions have in common is that they make no distinction among dimensions of the input. In other words, each dimension of the input is pushed to be reconstructed equally well. While autoencoders trained in this way have been shown to work very well on image data, learning much more interesting and useful features than the original pixel intensity features, they are less appropriate for modeling textual data. The reason is two folds. First, textual data are extremely sparse and high dimensional, where the dimensionality is equal to the vocabulary size.”, thus, a squared Euclidean distance or KL divergence may represent the mutual information loss. As further discussed in Par. [0068], word occurrences/frequent words (features) are favored to be reconstructed accurately to better be able to classify documents (instances). Moreover, Par. [0026-0027] discuss overlapping and non-overlapping clustering according to features and instances).

[Image: media_image2.png]

Equation 1: Zhang reference

[Image: media_image3.png]

Par. [0062] of Zhang reference

Regarding Claim 8, Zhang in view of Bates teaches the method of claim 6, wherein reducing the dimensionality of the instances and the features comprises applying respective autoencoders to the input data (Zhang, Par. [0049], “According to the present technology, the semisupervised approach is adopted, where label information is introduced to guide the feature learning procedure. In particular, a novel loss function is provided for training autoencoders that are directly coupled with the classification task. A linear classifier is first trained on BoW, then a Bregman Divergence [Banerjee et al. 2004] is derived as the loss function of a subsequent autoencoder. The new loss function gives the autoencoder the information about directions along which the reconstruction should be accurate, and where larger reconstruction errors are tolerated. Informally, this can be considered as a weighting of words based on their correlations with the class label:”, thus, autoencoders are applied to the input data, in order to determine correlations and learn discriminative features, to reduce dimensionality of the data).

Regarding Claim 9, Zhang in view of Bates teaches the method of claim 8, wherein each autoencoder determines a dimension reconstruction loss by reducing the dimensionality of data (Zhang, Par. [0049], “The new loss function gives the autoencoder the information about directions along which the reconstruction should be accurate, and where larger reconstruction errors are tolerated. Informally, this can be considered as a weighting of words based on their correlations with the class label: predictive words should be given large weights in the reconstruction even they are not frequent words, and vice versa.”, thus, as mentioned in Par. [0048] “traditional autoencoders suffer from at least two aspects: scalability with the high dimensionality of vocabulary size and dealing with task-irrelevant words. This problem is addressed by introducing supervision via the loss function of autoencoders”, hence the loss function is able to reduce the dimensionality of the data) and then restoring the reduced dimensionality data to an original dimensionality (Zhang, Par. [0011], “In some cases, the user labels or characteristics are known in advance, and the labelled data classified according to the characteristics of the source. In this example, the classifications are predetermined, and the data may be segregated or labelled with the classification, and thereafter the data selective used based on its original classification or classification characteristics.”, thus, the original dimensionality of the data may be retained as well for classification).

Regarding Claim 10, Zhang in view of Bates teaches the method of claim 1, further comprising performing text classification (Zhang, Par. [0005], “A specific class of task in text mining is addressed as an example of an application of the technology: Sentiment Analysis (SA). A special case of SA is addressed as a binary classification problem, where a given piece of text is either of positive or negative attitude. This problem is interesting largely due to the emergence of online social networks, where people consistently express their opinions about certain subjects. Also, it is easy to obtain a large amount of clean labeled data for SA by crawling reviews from websites such as IMDB or Amazon. Thus, SA is an ideal benchmark for evaluating text classification models (and features). However, the technology itself is not limited to this example.”, thus, text classification is performed) using the determined co-clusters (Zhang, Par. [0111], “Rather than implementing an autoencoder that makes a binary determination along an orthogonal axis, the technology may also be used to classify data as belonging to different clusters. See, en.wikipedia.org/wiki/Cluster_analysis. That is, a decision may be made whether a document should be classified within either of two clusters within a data space. The technology may also be extended to higher dimensions, and therefore is not limited to a simple binary determination.”, therefore, co-clustering is used in performing text classification).

17.	Claims 11-18 and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Zhang et al. (hereinafter Zhang) (US PG-PUB 20180165554), in view of Bates et al. (hereinafter Bates) (US PG-PUB 20090300547), further in view of Andoni et al. (hereinafter Andoni) (US PG-PUB 20190228312).
Regarding Claim 11, Zhang teaches a data co-clustering system, comprising: 
an instance autoencoder (Zhang, Par. [0049], “According to the present technology, the semisupervised approach is adopted, where label information is introduced to guide the feature learning procedure. In particular, a novel loss function is provided for training autoencoders that are directly coupled with the classification task.”, thus, an autoencoder is trained to reduce dimensionality according to a loss function) configured to reduce a dimensionality (Zhang, Par. [0017], “Clustering acts to effectively reduce the dimensionality of a data set by treating each cluster as a degree of freedom, with a distance from a centroid or other characteristic exemplar of the set. In a non-hybrid system, the distance is a scalar, while in systems that retain some flexibility at the cost of complexity, the distance itself may be a vector.”, thus, dimensionality is reduced by clustering data) for instances of an input dataset (Zhang, Par. [0020], “Thus, for example, a semantic database may be represented as a set of documents with words or phrases. Words may be ambiguous, such as “apple”, representing a fruit, a computer company, a record company, and a musical artist. In order to effectively use the database, the multiple meanings or contexts need to be resolved. In order to resolve the context, an automated process might be used to exploit available information for separating the meanings, i.e., clustering documents according to their context. This automated process can be difficult as the data set grows, and in some cases the available information is insufficient for accurate automated clustering.”, therefore, instances (documents) and features (words within the documents) are clustered according to context, to reduce dimensionality); 
a feature autoencoder (See introduction of Andoni reference below for teaching of feature autoencoder) configured to reduce a dimensionality (Zhang, Par. [0017], “Clustering acts to effectively reduce the dimensionality of a data set by treating each cluster as a degree of freedom, with a distance from a centroid or other characteristic exemplar of the set. In a non-hybrid system, the distance is a scalar, while in systems that retain some flexibility at the cost of complexity, the distance itself may be a vector.”, thus, dimensionality is reduced by clustering data) for features of an input dataset (Zhang, Par. [0020], “Thus, for example, a semantic database may be represented as a set of documents with words or phrases. Words may be ambiguous, such as “apple”, representing a fruit, a computer company, a record company, and a musical artist. In order to effectively use the database, the multiple meanings or contexts need to be resolved. In order to resolve the context, an automated process might be used to exploit available information for separating the meanings, i.e., clustering documents according to their context. This automated process can be difficult as the data set grows, and in some cases the available information is insufficient for accurate automated clustering.”, therefore, instances (documents) and features (words within the documents) are clustered according to context, to reduce dimensionality); 
a processor (Zhang, Par. [0116], “Exemplary hardware includes at least one processor coupled to a memory.”, therefore, a processor is used) configured to cross-correlate the instances and the features based on the mutual information loss (See introduction of Bates reference below for mutual information loss), to determine a cross-correlation loss (Zhang, Par. [0049], “In particular, a novel loss function is provided for training autoencoders that are directly coupled with the classification task. A linear classifier is first trained on BoW, then a Bregman Divergence [Banerjee et al. 2004] is derived as the loss function of a subsequent autoencoder. The new loss function gives the autoencoder the information about directions along which the reconstruction should be accurate, and where larger reconstruction errors are tolerated. Informally, this can be considered as a weighting of words based on their correlations with the class label: predictive words should be given large weights in the reconstruction even they are not frequent words, and vice versa.”, thus, instances and features are cross-correlated according to a particular class label (or a derived loss can be applied to unlabeled data) and a Bregman Divergence is derived as the loss function) and 

Zhang does not explicitly disclose an instance mutual information loss branch configured to determine a mutual information loss for the instances.
However, Bates teaches an instance mutual information loss branch (Bates, Par. [0102], “Computer system 300 further comprises a User Data Preprocess Module (UDPM) 40. The UDPM 40 operates on user data to generate implicit rating data based on a set of conversion rules. In a preferred embodiment, the UDPM 40 may be part of the Recommender Module 300. In an embodiment of the present invention, user data is stored in database 360, in association with a unique identifier of the user.”, thus, the user data preprocess module is able to generate an implicit rating based on mutual information between instances) configured to determine a mutual information loss for the instances (Bates, Par. [0145], “Where X̂ and Ŷ [are] the two random variables induced by coclustering. Information theory can be used to give a theoretical formulation to the coclustering problem: the optimal co-clustering is one that minimizes the loss in mutual information between the original random variables and the mutual information between the clustered random variables. Given any coclustering, we define a joint distribution”, therefore, loss in mutual information between features and instances is determined); 

Zhang does not explicitly disclose a feature mutual information loss branch configured to determine a mutual information loss for the features.
However, Bates teaches a feature mutual information loss branch (Bates, Par. [0100], “Computer system 300 further comprises a User Data Analysis Module (UDAM) 30. The UDAM 30 performs pattern analysis based on user data. For example, it can perform probability analysis to guess the user's gender, age, and profession. For example, a user who has looked at more than two football articles could be predicted to be male, according to a rule-based algorithm. UDAM 30 may also perform user clustering, article clustering, or user-article co-clustering, as is described in more detail below.”, therefore, the user data analysis module is able to process features based on mutual information) configured to determine a mutual information loss for the features (Bates, Par. [0145], “Where X̂ and Ŷ [are] the two random variables induced by coclustering. Information theory can be used to give a theoretical formulation to the coclustering problem: the optimal co-clustering is one that minimizes the loss in mutual information between the original random variables and the mutual information between the clustered random variables. Given any coclustering, we define a joint distribution”, therefore, loss in mutual information between features and instances is determined); 

Zhang does not explicitly disclose determining co-clusters in the input data based on the cross-correlation loss.
However, Bates teaches to determine co-clusters in the input data based on the cross-correlation loss (Bates, Par. [0144], “More sophisticated algorithms can be used to carry out co-clustering. In co-clustering, articles and users are clustered to create article-user clustered niches. In a preferred embodiment, the algorithm used is a co-clustering algorithm, licensed by the National Research Council of Canada. The steps of the algorithm are as follows: [0145] (a) Given the number of row clusters k and number of column clusters l, and the initial rating matrix, we normalized the rating matrix, the resulting matrix is actually a non-negative contingency table, which can be regarded as a joint distribution p(x, y), where X and Y and two discrete random variables that take values over the rows and columns.”, thus, co-clusters in the data are determined based on cross-correlation loss between features and instances, which is in turn based on mutual information loss as described in Par. [0145]).

Zhang in view of Bates does not explicitly disclose a feature autoencoder.
However, Andoni teaches a feature autoencoder (Andoni, Par. [0002], “First, the autoencoder learns how to encode an input image to a specified number of features, such as Q features, where Q is typically less than P. Second, the autoencoder learns how to decode a feature vector of Q features to generate a “reconstructed” image having P pixels. In a perfect reconstruction, decoding the Q features generated by encoding an input image results in a reconstructed image that is identical to that input image. Once training is completed, the autoencoder can encode an input image to generate a compressed representation and then decode the compressed representation to get back the original input image with hopefully minimal error rate.”, thus, a feature autoencoder is disclosed and aims at reducing dimensionality of input data).

It would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the data co-clustering system for reducing dimensionality for instances and features and cross-correlating the instances and features to determine a cross-correlation loss, as disclosed by Zhang to include determining a mutual information loss for instances and features and determining co-clusters based on the cross-correlation loss, as disclosed by Bates. One of ordinary skill in the art would have been motivated to make this modification to enable creating co-clusters that minimize the loss in mutual information between features and instances to exploit similarities, hence improving performance and reducing noise (Bates, Par. [0141-0143], “There are numerous benefits to clustering including: [0142] (a) it improves performance, since the resultant "clustered" matrix has smaller dimensions; and, [0143] (b) it groups similar users or articles together which reduces noise--and finding similarity of some kind is the key concept behind the recommender system.”).

It would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the data co-clustering system for reducing dimensionality for instances and features, cross-correlating the instances and features to determine a cross-correlation loss, determining a mutual information loss for instances and features, and determining co-clusters based on the cross-correlation loss, as disclosed by Zhang in view of Bates to include the use of a feature autoencoder, as disclosed by Andoni. One of ordinary skill in the art would have been motivated to make this modification to allow for unsupervised learning, thereby enabling better feature detection and selection even when labels are not provided and thus, increasing efficiency and improving feature dimensionality reduction (Andoni, Par. [0003], “A variational autoencoder (VAE) is one way of solving such generative problems. In a VAE, randomness is introduced during training. The encoder of the VAE produces a mean and a variance (deterministically), which provides a probability distribution in a latent space. During training, that mean and variance is used to randomly sample from a Gaussian distribution to get an encoded vector, which is then (deterministically) decoded. During evaluation (i.e., after training is completed), the VAE is used to either encode data (in which case only mean produced by the encoder is used) or to decode a given vector. Thus, returning to the video game trees example described above, slightly different trees may be generated by randomly sampling different vectors (in the latent space), and providing those vectors to the decoder to decode those vectors into trees. Since the input to the decoder will be slightly different, the output will be slightly different as well.”)

Regarding Claim 12, Zhang in view of Bates further in view of Andoni teaches the system of claim 11, wherein the processor is further configured to classify a new instance based on associated new features (Zhang, Par. [0052], “training an objective function of a linear classifier, based on a set of labeled data, to derive a set of classifier weights; defining a posterior probability distribution on the set of classifier weights of the linear classifier; approximating a marginalized loss function for an autoencoder as a Bregman divergence, based on the posterior probability distribution on the set of classifier weights learned from the linear classifier; and classifying unlabeled data using a compact classifier according to the marginalized loss function. The data may comprise semantic data, textual data, and may consist essentially of text documents.”, thus, a new instance would be classified by the compact classifier which is based on the loss function considering previous instances and features).

Regarding Claim 13, Zhang in view of Bates further in view of Andoni teaches the system of claim 11, wherein the instances include documents and the features include words associated with respective documents (Zhang, Par. [0003], “In machine learning, documents are usually represented as Bag of Words (BoW), which nicely reduces a piece of text with arbitrary length to a fixed length vector. Despite its simplicity, BoW remains the dominant representation in many applications including text classification. There has also been a large body of work dedicated to learning useful representations for textual data (Turney and Pantel 2010; Blei, Ng, and Jordan 2003; Deerwester et al. 1990; Mikolov et al. 2013; Glorot, Bordes, and Bengio 2011). By exploiting the co-occurrence pattern of words, one can learn a low dimensional vector that forms a compact and meaningful representation for a document.”, thus, instances include documents and associated features include words associated with respective documents. Further detailed in Par. [0007]).
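
For illustration only, the Bag of Words representation described in the cited passage can be sketched as follows. This is a minimal sketch under stated assumptions and is not code from the Zhang reference: the function names build_vocabulary and bag_of_words, the whitespace tokenization, and the two sample documents are all hypothetical. Each row of the resulting matrix is an instance (a document) and each column is a feature (a word).

```python
from collections import Counter


def build_vocabulary(documents):
    """Return a sorted list of the distinct lowercase words across all documents."""
    return sorted({word for doc in documents for word in doc.lower().split()})


def bag_of_words(document, vocabulary):
    """Map one document of arbitrary length to a fixed-length vector of word counts."""
    counts = Counter(document.lower().split())
    return [counts[word] for word in vocabulary]


# Hypothetical sample documents (not taken from any cited reference).
documents = [
    "the cat sat on the mat",
    "the dog sat",
]

vocabulary = build_vocabulary(documents)
# One row per document (instance), one column per word (feature).
matrix = [bag_of_words(doc, vocabulary) for doc in documents]
```

Both documents map to vectors of the same length (the vocabulary size) regardless of how many words each contains, which is the fixed-length property noted in the quotation.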

Regarding Claim 14, Zhang in view of Bates further in view of Andoni teaches the system of claim 11, wherein the input dataset comprises a matrix having columns that represent one of the features and the instances and rows that represent the other of the features and the instances (Bates, Par. [0188], “In an online recommender system, the data given are a matrix of user-article ratings.”, thus, as also described in Par. [0113] and subsequent chart, the instances and features (users and articles) input are provided as a matrix, with columns and rows that represent features and instances).
The reasons of obviousness have been noted in the rejection of Claim 11 above and applicable herein.

Regarding Claim 15, Zhang in view of Bates further in view of Andoni teaches the system of claim 11, wherein each mutual information loss branch determines a respective mutual information loss using an inference neural network (Zhang, Par. [0098], “FIG. 2 shows a preferred embodiment of the method. The labelled set of data is received 201, and used to train a classifier, which in this case is an artificial neural network trained on a “bag of words” representation of the labeled data, using SVM2 with squared hinge loss and l.sub.2 regularization as the linear classifier 202. The trained set of weights is then exported in a learning transfer process, to a denoising autoencoder 203.”, thus, a neural network is trained to determine loss and reduce dimensionality between features and instances (described further in preceding Par. [0090-0097])) and a Gaussian mixture model (Zhang, Par. [0046], “Mixtures of Gaussians and locally weighted regression are two statistical models that offer elegant representations and efficient learning algorithms.”, thus, Gaussian mixture models are used to determine loss as well; further information on the Gaussian distribution used can be found in Par. [0084-0087]).

Regarding Claim 16, Zhang in view of Bates further in view of Andoni teaches the system of claim 15, further comprising a training module (Zhang, Par. [0119], “The hardware operates under the control of an operating system, and executes various computer software applications, components, programs, objects, modules, etc. to implement the techniques described above.”, thus, a module may be used to facilitate the training) configured to train the inference neural network (See Claim 15 above for teaching of inference neural network) and a Gaussian mixture model (See Claim 15 above for teaching of Gaussian mixture model) in an end-to-end fashion (Zhang, Figure 2, depicts the training of a neural network (label 202), determining the loss which may involve GMM (label 205), training the classifier (label 206), and classifying data (label 208) such that the process flow and corresponding architecture occur in an end-to-end fashion).

Regarding Claim 17, Zhang in view of Bates further in view of Andoni teaches the system of claim 11, wherein the processor is further configured to determine co-clusters using optimizing (Zhang, Par. [0015], “After the cost or distance function is defined and formulated as clustering criteria, the clustering process becomes one of optimization according to an optimization process, which itself may be imperfect or provide different optimized results in dependence on the particular optimization employed. For large data sets, a complete evaluation of a single optimum state may be infeasible, and therefore the optimization process subject to error, bias, ambiguity, or other known artifacts.”, thus, co-clusters are determined by an optimization process) an objective function (Zhang, Par. [0061], “Autoencoders learn functions that can reconstruct the inputs. They are typically implemented as a neural network with one hidden layer, and one can extract the activation of the hidden layer as the new representation. Mathematically, a collection of data points X={xi}, xi∈Rd, i∈[1, m] is provided, and the objective function of an autoencoder is thus:”, therefore, an objective function (Equation 1 of the Zhang reference, also shown in the rejection of Claim 7 above) is disclosed and trained/optimized by the autoencoder/neural network) that includes a respective dimension reconstruction loss term for the instances and for the features (Zhang, Par. [0062], “x̃i is the reconstruction”, thus, reconstruction loss of instances and features are captured by x̃i) and a cross-correlation loss term that includes the determined cross-correlation loss (Zhang, Par. [0062], “D is a loss function, such as the squared Euclidean Distance”, thus, cross-correlation loss is also disclosed. Further details on the loss function D can be found in subsequent Par. [0063-0070] where the loss function may also be a Bregman Divergence wherein the linear classifier is trained on a bag of words, as mentioned in the rejection of Claim 1 above).

Claim 18 recites substantially the same limitations as Claim 7 in the form of a system, therefore it is rejected under the same rationale. 

Regarding Claim 20, Zhang in view of Bates further in view of Andoni teaches the method of claim 17, wherein each autoencoder determines a dimension reconstruction loss by reducing the dimensionality of data (Zhang, Par. [0049], “The new loss function gives the autoencoder the information about directions along which the reconstruction should be accurate, and where larger reconstruction errors are tolerated. Informally, this can be considered as a weighting of words based on their correlations with the class label: predictive words should be given large weights in the reconstruction even they are not frequent words, and vice versa.”, thus, as mentioned in Par. [0048] “traditional autoencoders suffer from at least two aspects: scalability with the high dimensionality of vocabulary size and dealing with task-reelevant words. This problem is addressed by introducing supervision via the loss function of autoencoders”, hence the loss function is able to reduce the dimensionality of the data) and then restoring the reduced dimensionality data to an original dimensionality (Zhang, Par. [0011], “In some cases, the user labels or characteristics are known in advance, and the labelled data classified according to the characteristics of the source. In this example, the classifications are predetermined, and the data may be segregated or labelled with the classification, and thereafter the data selective used based on its original classification or classification characteristics.”, thus, the original dimensionality of the data may be retained as well for classification).
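
For illustration only, the recited sequence of reducing the dimensionality of data, restoring it to the original dimensionality, and measuring a reconstruction loss can be sketched as follows. This is a minimal sketch under stated assumptions and does not reproduce any equation or implementation from the cited references: the fixed ENCODER and DECODER weights, the helper names, and the sample input are hypothetical, the mappings are linear with no training step, and the squared Euclidean distance is used as the loss only because it is the example named in Zhang, Par. [0062].

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def squared_euclidean(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


# Hypothetical fixed weights; a real autoencoder would learn these by training.
ENCODER = [[0.5, 0.5, 0.0, 0.0],   # 4 dimensions -> 2 dimensions
           [0.0, 0.0, 0.5, 0.5]]
DECODER = [[1.0, 0.0],             # 2 dimensions -> 4 dimensions
           [1.0, 0.0],
           [0.0, 1.0],
           [0.0, 1.0]]


def encode(x):
    """Reduce the dimensionality of the input."""
    return matvec(ENCODER, x)


def decode(z):
    """Restore the reduced representation to the original dimensionality."""
    return matvec(DECODER, z)


x = [2.0, 4.0, 1.0, 1.0]            # hypothetical input vector
code = encode(x)                     # reduced representation
restored = decode(code)              # reconstruction in the original dimensionality
loss = squared_euclidean(x, restored)
```

The reconstruction has the same length as the input, and the loss is nonzero wherever the reduced representation cannot preserve the input exactly.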

Conclusion
18.	The prior art made of record and not relied upon is considered pertinent to applicant’s disclosure: 
Clinchant et al. (US PG-PUB 20180024968) disclosed methods which combine denoising autoencoders with domain prediction regularization for domain adaptation tasks.
Song et al. (US PG-PUB 20200065656) disclosed methods, systems, and apparatus for training a neural network using a clustering loss.
Miotto et al. (US PG-PUB 20200327404) disclosed a computing system that obtains sparse vectors, including entity features, which are applied to a plurality of denoising encoders.
Habibian et al. (US PG-PUB 20180129906) disclosed an artificial neural network for tracking a target across a sequence of frames, including cross-correlation and loss layers.
Usama et al. (“Unsupervised Machine Learning for Networking: Techniques, Applications and Research Challenges”) disclosed an overview of various unsupervised machine learning techniques, including autoencoders, inference neural networks, and Gaussian mixture models.
Peng et al. (“Feature Selection Based on Mutual Information: Criteria of Max-Dependency, Max-Relevance, and Min-Redundancy”) disclosed feature selection based on correlation/mutual information.
Badino et al. (“An Auto-encoder based Approach to Unsupervised Learning of Subword Units”) disclosed an autoencoder-based method for unsupervised identification of subword units.
Qiu (“Image and Feature Co-Clustering”) disclosed methods for simultaneously modeling and clustering large sets of images and their visual features.
Vincent et al. (“Extracting and Composing Robust Features with Denoising Autoencoders”) disclosed training denoising autoencoders to extract robust features.

19.	Any inquiry concerning this communication or earlier communications from the examiner should be directed to Devika S Maharaj whose telephone number is 571-272-0829. The examiner can normally be reached Monday - Thursday 7:30am - 4:30pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Alexey Shmatov, can be reached at 571-270-3428. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.




/D.S.M./Examiner, Art Unit 2123
/ALEXEY SHMATOV/Supervisory Patent Examiner, Art Unit 2123