Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
This is the initial Office action issued in response to patent application 16/775,635, filed on 01/29/2020. Claims 1-20, as originally filed, are currently pending and have been considered below. Claims 1, 11, and 17 are independent claims.

Claim Rejections - 35 USC § 102
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
(a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.


Claims 1, 5-9, 11-15, and 17-20 are rejected under 35 U.S.C. 102(a)(2) as being anticipated by Kaufhold et al. (US9990687B1).
Regarding Claim 1,
Kaufhold et al. teaches a method performed on a computing device, the method comprising (Kaufhold et al., FIG. 9 and Col. 27 Lines 44-45, “FIG. 9 is a block diagram of an exemplary embodiment of a Computing Device 900” teaches a method performed on a computing device). 
providing a machine learning model having one or more layers and associated parameters (Kaufhold et al., Col. 12-13 Lines 66-67 and Lines 1-2, “designing a deep embedding architecture, which includes … types and number of layers” teaches a deep embedding architecture (corresponds to a machine learning model) with a plurality of layers. Col 19 Lines 30-40, “optimization parameters are systematically perturbed to search for an improved combination of deep embedding architecture parameters … the settings of one or more free parameters in the deep embedding architecture (i.e., the parameterization of one or more weights and biases and other parameters in the deep embedding architecture 630)” teaches weights, biases and other parameters in the deep embedding architecture).
performing a pretraining stage on the parameters of the machine learning model to obtain pretrained parameters (Kaufhold et al., Col. 12 Lines 1-2, “The parametric t-SNE approach (Van der Maaten, 2009) is separated into three distinct stages, proceeding through (1) pretraining… a sampling-based pretraining step, in general, teaches away from improved optimization properties” teaches performing a pretraining stage to obtain pretrained parameters).
performing a tuning stage on the machine learning model by using labeled training samples to tune the pretrained parameters, the tuning stage including (Kaufhold et al., Col. 13 Lines 7-10, “tuning the parameters of a deep embedding architecture to reproduce, as reliably as possible, the generated embedding for each training sample (i.e., training the deep embedding architecture)” teaches performing a tuning step on the parameters of a deep embedding architecture. Col. 14 Lines 9-12, “in supervised machine learning problems, more labeled training examples typically coincide with better performance… the collection of labeled training data has been manual” teaches utilizing labeled training data for better performance).
performing noise adjustment of the labeled training samples to obtain noise-adjusted training samples (Kaufhold et al., Col. 20-21 Lines 58-67 and Line 1, “optimization process 620 parameters include… parameters governing data augmentation (such as adding noise or deforming, translating and/or rotating high dimensional objects during training)” teaches data augmentation with noise (corresponds to noise adjustment) on the hyper-parameters. Col. 16 Lines 56-60, “produce an intermediate byproduct embedding 445 of high dimensional input objects 410, and those byproduct embeddings 445 are reused as input 410 (and optionally its labels 420) to a train 430 a subsequent preprocessing step's deep analyzer 440” teaches the training sample being labeled).
adjusting the pretrained parameters based at least on the labeled training samples and the noise-adjusted training samples to obtain adapted parameters (Kaufhold et al., Col. 20-21 Lines 58-67 and Line 1, “optimization process 620 parameters include… parameters governing data augmentation (such as adding noise or deforming, translating and/or rotating high dimensional objects during training)” teaches an optimization process to obtain adapted parameters that includes the labeled training samples and the noise-adjusted training samples). 
outputting a tuned machine learning model having the adapted parameters (Kaufhold et al., Col. 13 Lines 10-11, “deploying the trained deep embedding architecture” teaches deploying the trained deep embedding architecture (corresponds to a tuned machine learning model) with the adapted parameters).
Regarding Claim 5,
Kaufhold et al. teaches the method of claim 1, further comprising:
Kaufhold et al. further teaches after the tuning stage, performing a particular task on input data using the tuned machine learning model (Kaufhold et al., Col. 13 Lines 10-13, “deploying the trained deep embedding architecture to convert new high dimensional data objects into approximately the same embedded space as found in step (1)” teaches deploying the trained deep embedding architecture (corresponds to after the tuning stage, the tuned machine learning model) to convert new high dimensional data objects (corresponds to performing a particular task on input data)).
Regarding Claim 6,
Kaufhold et al. teaches the method of claim 1,
Kaufhold et al. further teaches wherein the machine learning model comprises one or more embedding layers and at least one task-specific layer (Kaufhold et al., Col. 16 Lines 30-33, “a deep architecture having many hidden layers (called a deep analyzer 317/440… envisioned for learning a formal embedding 325)” teaches activation layers that learn a formal embedding (corresponds to one or more embedding layers). Col. 16 Lines 30-35, “a deep architecture having many hidden layers… is trained for a particular task (object recognition, generative image modeling, or image translation, for instance)” teaches hidden layers trained for a particular task (corresponds to at least one task-specific layer)).
Regarding Claim 7,
Kaufhold et al. teaches the method of claim 6,
Kaufhold et al. further teaches wherein the one or more embedding layers comprise a lexicon encoder or a transformer encoder (Kaufhold et al., Col. 17 Lines 16-19, “the byproduct embedding 445 may be taken as a hidden layer corresponding to a lower dimensional representation (i.e., the bottleneck of an autoencoder” teaches the hidden layer (corresponds to one or more embedding layers) comprising an autoencoder (corresponds to a lexicon encoder or a transformer encoder)).
Regarding Claim 8,
Kaufhold et al. teaches the method of claim 6,
Kaufhold et al. further teaches wherein the pretraining stage comprises unsupervised learning of the parameters of the one or more embedding layers (Kaufhold et al., Col. 12 Lines 1-2, “The parametric t-SNE approach (Van der Maaten, 2009) is separated into three distinct stages, proceeding through (1) pretraining… a sampling-based pretraining step, in general, teaches away from improved optimization properties” teaches performing a pretraining stage. Col. 17 Lines 13-16, “training a deep analyzer 430 may comprise unsupervised learning of a reconstruction function, such as a deep convolutional autoencoder” teaches unsupervised learning for training a deep analyzer (corresponds to the parameters of the one or more embedding layers)).
Regarding Claim 9,
Kaufhold et al. teaches the method of claim 8,
Kaufhold et al. further teaches wherein the tuning stage adjusts the parameters of the one or more embedding layers and the parameters of the task-specific layer (Kaufhold et al., Col. 13 Lines 4-10, “(2b) designing a training strategy (i.e., tuning optimization algorithm hyper-parameters, including learning rate, momentum, dropout rate by layer, etc.), (2c) tuning the parameters of a deep embedding architecture to reproduce, as reliably as possible, the generated embedding for each training sample (i.e., training the deep embedding architecture)” teaches a tuning stage for tuning (adjusting) the parameters of a deep embedding architecture (consisting of the parameters of the one or more embedding layers and the parameters of the task-specific layer)).
Regarding Claim 11,
Kaufhold et al. teaches a system comprising (Kaufhold et al., FIG. 3 and Col. 15 Lines 65-67, “FIG. 3 illustrates an exemplary embodiment of a method and system for deep embedding and its deployment, in accordance with the present invention” teaches a system). 
a hardware processing unit (Kaufhold et al., Col. 28 Lines 56-58, “Processor 930 may comprise a device and/or set of machine-readable instructions for performing one or more predetermined tasks” teaches a hardware processing unit). 
a storage resource storing computer-readable instructions which, when executed by the hardware processing unit, cause the hardware processing unit to (Kaufhold et al., Col. 30 Lines 25-32, “a computer-readable storage medium or device which when loaded into a computer system is able to carry out the different methods described herein. “Computer program” in the present context means any expression, in any language, code or notation, of a set of instructions intended to cause a system having an information processing capability to perform a particular function either directly or indirectly” teaches a computer-readable storage medium (corresponds to a storage resource) storing a set of instructions that are executed by the processor (corresponds to hardware processing unit)).
receive input data (Kaufhold et al., Col. 23 Lines 56-57, “each input high dimensional object, xs 354, received from the upstream application 353/810” teaches receiving a high dimensional object (corresponds to input data)).
process the input data using a machine learning model having a first layer and a second layer to obtain a result, the first layer having been pretrained in a pretraining stage, the first layer and the second layer having been tuned together using virtual adversarial regularization (Kaufhold et al., Col. 23 Lines 54-60, “for each input high dimensional object, xs 354, received from the upstream application 353/810 the deployed SFF deep embedding device 355/820 produces exactly one low dimensional embedding, as 356, of that high dimensional object” teaches the deep embedding device (corresponds to machine learning model) that processes high dimensional object (corresponds to input data) to produce a low dimensional embedding (corresponds to a result). Col. 16 Lines 30-33, “a deep architecture having many hidden layers (called a deep analyzer 317/440 in contradistinction to the deep embedding architecture 337/620 envisioned for learning a formal embedding 325)” teaches the deep architecture (corresponds to machine learning model) having hidden layers (corresponds to a first layer and a second layer). Col. 16 Lines 9-13, “pre-processing module as first illustrated in stage 310 of FIG. 3. As an introduction, we imagine a scenario in which the activations of a layer in a neural network 445 are to be used as processed input 450 to another downstream module” teaches a pre-processing module (corresponds to pretrained in a pretraining stage) in the activations of a layer in a neural network (corresponds to the first layer). Col. 19 Lines 31-40, “the deep architecture parameters 630 and the optimization 620 hyperparameters are systematically perturbed over an iterative procedure to search for an improved combination of deep embedding architecture parameters 630 and the set of parameters that govern the optimization 620 that searches for the settings of one or more free parameters in the deep embedding architecture (i.e., the parameterization of one or more weights and biases and other parameters in the deep embedding architecture 630)” teaches mitigating perturbations (corresponds to performing adversarial regularization) based on the optimized parameters in the deep embedding architecture (corresponds to the first layer and the second layer)).
output the result (Kaufhold et al., Col. 27 Lines 11-13, “the low dimensional embedding output and associated metadata 356 of the deep embedding module for each high dimensional input object 354” teaches outputting the low dimensional embedding (corresponds to the result)).
Regarding Claim 12,
Kaufhold et al. teaches the system of claim 11,
Kaufhold et al. further teaches wherein the input data comprises a query and a document, and the result characterizes similarity of the query to the document (Kaufhold et al., Col. 29 Lines 39-57, “A user interface can include one or more graphical elements such as, for example… dialog box, static text, text box… A textual and/or graphical element can be used for… query” teaches that the input from the user interface comprises a query. Col. 5 Lines 5-6, “All embeddings produce pairs, where each single input object (e.g., a high dimensional object) is paired with one output object (e.g., a low dimensional embedded object)” teaches that the embedding compares two input objects (corresponds to result characterizes similarity of the query to the document). Col. 2 Lines 54-59, “to cope with high dimensionality in the computation of meaningful similarities between images is to embed a high dimensional data object into a lower dimensional space that still captures most of the objects' salient properties (i.e., its features and/or similarity properties)” teaches that the lower dimensional space (corresponds to the result) captures the similarity properties of the high dimensional data object (corresponds to the query to the document)).
Regarding Claim 13,
Kaufhold et al. teaches the system of claim 11,
Kaufhold et al. further teaches wherein the input data comprises a sentence and the result characterizes a sentiment of the sentence (Kaufhold et al., Col. 16 Lines 40-43, “the hidden layers of the deep analyzer 445 learn to represent increasingly abstract concepts in the raw high dimensional objects 410 (e.g., concepts such as images, speech, or sentences)” teaches the high dimensional objects (corresponds to input data) comprises a sentence. Col. 24 Lines 15-21, “the user is provided a view of every high dimensional object 311/410/510 corresponding to every embedded object 520 in sequence of increasing distance in the embedded space. The user views objects 510 (one view at a time or in group views ordered by distance in the embedded space) and decides only whether all objects in the current view inherit the specific label 420” teaches providing the user a view (corresponds to a sentiment) of every high dimensional object (corresponds to the sentence)).
Regarding Claim 14,
Kaufhold et al. teaches the system of claim 11,
Kaufhold et al. further teaches wherein the input data comprises an image and the result characterizes an object that is present in the image (Kaufhold et al., Col. 16 Lines 40-43, “the hidden layers of the deep analyzer 445 learn to represent increasingly abstract concepts in the raw high dimensional objects 410 (e.g., concepts such as images, speech, or sentences)” teaches the high dimensional objects (corresponds to input data) comprising an image. Col. 5 Lines 5-7, “All embeddings produce pairs, where each single input object (e.g., a high dimensional object) is paired with one output object (e.g., a low dimensional embedded object)” teaches the output object (corresponds to the result characterizes an object that is present in the image)).
Regarding Claim 15,
Kaufhold et al. teaches the system of claim 11, wherein the computer-readable instructions, when executed by the hardware processing unit, cause the hardware processing unit to:
Kaufhold et al. further teaches pretrain the first layer using unsupervised learning (Kaufhold et al., Col. 12 Lines 1-2, “The parametric t-SNE approach (Van der Maaten, 2009) is separated into three distinct stages, proceeding through (1) pretraining… a sampling-based pretraining step, in general, teaches away from improved optimization properties” teaches performing a pretraining stage. Col. 17 Lines 13-16, “training a deep analyzer 430 may comprise unsupervised learning of a reconstruction function, such as a deep convolutional autoencoder” teaches unsupervised learning for training a deep analyzer (corresponds to the parameters of the one or more embedding layers)).
tune the first layer and the second layer using virtual adversarial regularization (Kaufhold et al., Col. 13 Lines 7-10, “tuning the parameters of a deep embedding architecture to reproduce, as reliably as possible, the generated embedding for each training sample (i.e., training the deep embedding architecture)” teaches performing a tuning step on the parameters of a deep embedding architecture. Col. 19 Lines 31-40, “the deep architecture parameters 630 and the optimization 620 hyperparameters are systematically perturbed over an iterative procedure to search for an improved combination of deep embedding architecture parameters 630 and the set of parameters that govern the optimization 620 that searches for the settings of one or more free parameters in the deep embedding architecture (i.e., the parameterization of one or more weights and biases and other parameters in the deep embedding architecture 630)” teaches mitigating perturbations (corresponds to performing adversarial regularization) based on the optimized parameters).
Regarding Claim 17,
Kaufhold et al. teaches a system comprising (Kaufhold et al., FIG. 3 and Col. 15 Lines 65-67, “FIG. 3 illustrates an exemplary embodiment of a method and system for deep embedding and its deployment, in accordance with the present invention” teaches a system).
a hardware processing unit (Kaufhold et al., Col. 28 Lines 56-58, “Processor 930 may comprise a device and/or set of machine-readable instructions for performing one or more predetermined tasks” teaches a hardware processing unit).
a storage resource storing computer-readable instructions which, when executed by the hardware processing unit, cause the hardware processing unit to (Kaufhold et al., Col. 30 Lines 25-32, “a computer-readable storage medium or device which when loaded into a computer system is able to carry out the different methods described herein. “Computer program” in the present context means any expression, in any language, code or notation, of a set of instructions intended to cause a system having an information processing capability to perform a particular function either directly or indirectly” teaches a computer-readable storage medium (corresponds to a storage resource) storing a set of instructions that are executed by the processor (corresponds to hardware processing unit)).
obtain a machine learning model (Kaufhold et al., Col. 30 Lines 20-25, “FIG. 7 illustrates an exemplary embodiment of the modules of Step Three, the export of a deep embedding architecture module 340. A trained deep embedding architecture 337/630 may, for one or more reasons, including the size of the network, the batch size, or other reason, be impractical or incompatible with some deployed hardware” teaches obtaining a deep embedding architecture module (corresponds to a machine learning model)). 
perform a supervised learning process on the machine learning model, the supervised learning process comprising adjusting parameters of the machine learning model based at least on (Kaufhold et al., Col. 14 Lines 48-59, “Sparse (often high cost) labeled image data from an operational imaging sensor is often available for supervised learning of a task (say, object recognition). Sparse, in this context, means that there is, in general, insufficient quantity of labeled training image data to achieve a specific, desired high performance metric requirement with a supervised machine learning algorithm known in the art, but if more labeled image data were available for training, the supervised machine learning algorithm could achieve the desired performance. To increase the quantity of training data available, one approach is to augment existing image sensor data with data rendered from a model” teaches supervised learning on a machine learning model that comprises increasing the quantity of training data (corresponds to adjusting parameters of the machine learning model)).
training loss over labeled training samples, the labeled training samples comprising model inputs and corresponding labels, and deviations in model output of the machine learning model caused by adding noise to the model inputs (Kaufhold et al., Col. 20 Lines 29-35, “the loss function (L in FIG. 6) may correspond to a mean square error between the embedding of the training data and the approximation to the embedding computed by the deep architecture” teaches the loss function (corresponds to training loss) over the embedding of the training data (corresponds to labeled training samples). Col. 16-17 Line 67 and Lines 1-3, “where labeled categories 420/311 of high dimensional objects 410/311 are available, training 430 a deep analyzer 440 may comprise supervised learning of object categories with a convolutional neural network” teaches the labeled training data comprising high dimensional objects (corresponds to model inputs) and corresponding labels. Col. 13 Lines 4-6, “designing a training strategy (i.e., tuning optimization algorithm hyper-parameters, including learning rate, momentum, dropout rate by layer, etc.)” teaches an optimization process (corresponds to adding noise) on the hyper-parameters (corresponds to model inputs) to obtain optimized parameters for training. Col. 8 Lines 28-45, “embedding that illustrates many of the practical difficulties of formal embeddings described above, Van der Maaten & Hinton (Van der Maaten & Hinton, 2008) explains that the process of Stochastic Neighbor Embedding (“SNE”) starts by converting high-dimensional Euclidean distances (with optional weightings) between high dimensional objects into conditional probabilities that represent similarities between the high dimensional objects… The standard deviation, si, for every object, xi, is computed by searching for the value of si that yields an approximately fixed perplexity, where perplexity is 2^H(Pi) and H(Pi) is the Shannon entropy (in bits) of the induced distribution over all high dimensional objects… The corresponding low dimensional embedded vectors corresponding to high dimensional objects xi and xj are yi and yj, respectively. That is, there is a one-to-one mapping (correspondence) between high dimensional objects (each xi) and low dimensional embedded vectors (each yi)” teaches the standard deviation in the low dimensional embedded vectors (corresponds to model output) of the deep embedding architecture module (corresponds to a machine learning model). Col. 20 Lines 58-67, “optimization process 620 parameters include… parameters governing data augmentation (such as adding noise or deforming, translating and/or rotating high dimensional objects during training)” teaches adding noise to the high dimensional objects (corresponds to the model inputs) during training).
Regarding Claim 18,
Kaufhold et al. teaches the system of claim 17, wherein the computer-readable instructions, when executed by the hardware processing unit, cause the hardware processing unit to:
Kaufhold et al. further teaches estimate an adversarial direction in which to add the noise, wherein the adversarial direction for a particular input is a direction in which adding noise to the particular input causes greatest deviation in the model output (Kaufhold et al., Col. 7 Lines 41-51, “To mitigate the perturbations of all low dimensional embedded objects from the addition and/or removal of one or more high dimensional input objects to be embedded, one could initialize embedding forces from an existing embedding. Forces could be added for all added objects and removed for all removed objects. The embedding process could then be forward-propagated a few time steps from where it was stopped with the new population. But the key concerns of modifying existing embedding algorithms (whether the algorithm is completely restarted or only perturbed from a former state near equilibrium)” teaches determining a perturbation direction (corresponds to adversarial direction) in which to add high dimensional input objects to be added (corresponds to the noise). Col. 8 Lines 18-34, “The similarity of high dimensional object, xj, to high dimensional object, xi, is the conditional probability, p(j|i), that xi would pick xj as its neighbor if neighbors were picked in proportion to their probability density under a Gaussian distribution centered at high dimensional object, xi. The self terms, p(i|i), are set to zero, leaving only the pairwise similarities nonzero. For nearby high dimensional object pairs, xi and xj, p(j|i) is relatively high, whereas for widely separated high dimensional objects, p(j|i) will be almost zero (for reasonable values of the variance of the Gaussian, si). The standard deviation, si, for every object, xi, is computed by searching for the value of si that yields an approximately fixed perplexity, where perplexity is 2^H(Pi) and H(Pi) is the Shannon entropy (in bits) of the induced distribution over all high dimensional object” teaches a direction in which adding the high dimensional objects (corresponds to noise to the particular input) causes greatest standard deviation in the low dimensional embedded vectors (corresponds to model output)).
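For reference, the claim 18 limitation (a direction in which adding noise to a particular input causes the greatest deviation in the model output) can be illustrated with a minimal sketch. The sketch is not taken from Kaufhold et al. or from the application; the toy `model`, the finite-difference estimate, and all function names are assumptions chosen only to make the claim language concrete.

```python
import math

def model(x):
    # Hypothetical toy model: far more sensitive to x[0] than to x[1].
    return [3.0 * x[0] + 0.1 * x[1]]

def deviation(f, x, delta):
    # Squared change in model output when delta is added to the input.
    base = f(x)
    moved = f([xi + di for xi, di in zip(x, delta)])
    return sum((a - b) ** 2 for a, b in zip(base, moved))

def normalize(v):
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v] if norm > 0 else v

def estimate_adversarial_direction(f, x, start, step=1e-3, iters=5):
    # Repeatedly follow the finite-difference gradient of the deviation
    # with respect to the perturbation, renormalizing to unit length.
    d = normalize(start)
    for _ in range(iters):
        grad = []
        for i in range(len(d)):
            bumped = list(d)
            bumped[i] += step
            grad.append((deviation(f, x, bumped) - deviation(f, x, d)) / step)
        d = normalize(grad)
    return d
```

Under these assumptions, the estimated unit direction aligns with the input dimension to which the toy model is most sensitive, so noise added along it changes the output more than noise of equal size along the other axis.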
Regarding Claim 19,
Kaufhold et al. teaches the system of claim 17, 
Kaufhold et al. further teaches wherein the machine learning model comprises a layer that outputs word or token embeddings (Kaufhold et al., Col. 12 Lines 22-24, “a formal PCA embedding of the high dimensional objects (words) using a word2vec embedded space, discovered in some cases by a deep architecture” teaches the deep architecture (corresponds to the machine learning model comprising a layer) that outputs the high dimensional objects (corresponds to words)).
the computer-readable instructions, when executed by the hardware processing unit, cause the hardware processing unit to:
add the noise to the word or token embeddings (Kaufhold et al., Col. 20 Lines 65-67, “parameters governing data augmentation (such as adding noise or deforming, translating and/or rotating high dimensional objects during training” teaches adding noise to the high dimensional objects (corresponds to the word)).
Regarding Claim 20,
Kaufhold et al. teaches the system of claim 17, 
Kaufhold et al. further teaches wherein the supervised learning process further comprises adjusting the parameters based at least on deviations in model output of a current iteration of the machine learning model relative to model output of at least one previous iteration of the machine learning model (Kaufhold et al., Col. 14 Lines 48-59, “Sparse (often high cost) labeled image data from an operational imaging sensor is often available for supervised learning of a task (say, object recognition). Sparse, in this context, means that there is, in general, insufficient quantity of labeled training image data to achieve a specific, desired high performance metric requirement with a supervised machine learning algorithm known in the art, but if more labeled image data were available for training, the supervised machine learning algorithm could achieve the desired performance. To increase the quantity of training data available, one approach is to augment existing image sensor data with data rendered from a model” teaches supervised learning on a machine learning model that comprises increasing the quantity of training data (corresponds to adjusting parameters of the machine learning model). Col. 8 Lines 29-34, “The standard deviation, si, for every object, xi, is computed by searching for the value of si that yields an approximately fixed perplexity, where perplexity is 2^H(Pi) and H(Pi) is the Shannon entropy (in bits) of the induced distribution over all high dimensional objects” teaches the standard deviation for every object. Col. 9 Lines 44-49, “add a small amount of noise to every embedded low dimensional vector at the end of every early iteration of the gradient descent at the same time that a momentum term is gradually reduced” teaches the deviation in every embedded low dimensional vector (corresponds to model output) of a current iteration of the deep embedding architecture (corresponds to a machine learning model) relative to low dimensional vector of at least one previous iteration of the deep embedding architecture).
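For reference, the claim 20 limitation (deviations in model output of a current iteration relative to model output of a previous iteration) can be sketched as a simple penalty. The sketch is an assumption made for illustration only; the function name, the squared-difference measure, and the `weight` parameter are hypothetical and are not drawn from Kaufhold et al. or from the application.

```python
def iteration_deviation_penalty(current_outputs, previous_outputs, weight=1.0):
    # Mean squared difference between the outputs produced by the current
    # iteration of a model and by a previous iteration, on the same inputs.
    total = 0.0
    for cur, prev in zip(current_outputs, previous_outputs):
        total += sum((c - p) ** 2 for c, p in zip(cur, prev))
    return weight * total / len(current_outputs)
```

Adding such a penalty to a training loss would discourage the parameters from moving in a way that changes the model output sharply from one iteration to the next.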

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 2-4, 10, and 16 are rejected under 35 U.S.C. 103 as being unpatentable over Kaufhold et al. in view of Chai et al. (US11429862B2).
Regarding Claim 2,
Kaufhold et al. teaches the method of claim 1, 
Kaufhold et al. further teaches wherein the adjusting comprises computing a loss function comprising (Kaufhold et al., Col. 19 Lines 4-9, “the learn deep embedding architecture stage 330 is comprised of a train deep embedding architecture module 335/620 that, in an embodiment, effects a supervised learning of input and embedded object pairs 333/610 such that the learned deep embedding architecture 630 optimizes a loss function” teaches computing a loss function).
… a second term that is proportional to a difference between output of the machine learning model for the labeled training samples and output of the machine learning model for the noise-adjusted training samples (Kaufhold et al., Col. 25 Lines 66-67, “a loss function that measures the difference between real sensor images and synthetic images” teaches the difference between real sensor images (corresponds to output of the machine learning model for the labeled training samples) and synthetic images (corresponds to the machine learning model for the noise-adjusted training samples)).
Kaufhold et al. does not appear to explicitly teach a first term that is proportional to a difference between predictions of the machine learning model and labels of the labeled training samples
However, Chai et al., teaches a first term that is proportional to a difference between predictions of the machine learning model and labels of the labeled training samples (Chai et al., FIG. 4 and Col. 22 and Lines 63-67, “the intermediate loss function is based on the data-label pair (X(l), y), the first output data (X(l)), and the second set of weights (W). Equations (7) and (8) show examples of the intermediate loss function. Thus, the first input data set comprises a batch of training data-label pairs,” teaches the difference between the output data of the neural network (corresponds to predictions of the machine learning model) compared to the training data-label (corresponds to labels of the labeled training samples)).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to add a first term that is proportional to a difference between predictions of the machine learning model and labels of the labeled training samples, as taught by Chai et al., to the adversarial training of machine learning models of Kaufhold et al. The motivation would have been to solve the overfitting problem in machine learning (Chai et al., Col. 7 Lines 23-26, “Regularization is a technique used to solve the overfitting problem in machine learning. In general, regularization techniques work by adding a penalty term to the loss function used in training a DNN”).
Regarding Claim 3,
Kaufhold et al. teaches the method of claim 1, 
Kaufhold et al. further teaches wherein the tuning stage comprises multiple tuning iterations, the method further comprising (Kaufhold et al., Col. 19 Lines 31-33, “the deep architecture parameters 630 and the optimization 620 hyperparameters are systematically perturbed over an iterative procedure” teaches an iterative procedure of tuning). 
Kaufhold et al. does not appear to explicitly teach determining a difference between output of a current iteration of the machine learning model and output of at least one previous iteration of the machine learning model and constraining the adjusting of the parameters based at least on the difference.
However, Chai et al. teaches determining a difference between output of a current iteration of the machine learning model and output of at least one previous iteration of the machine learning model (Chai et al., Col. 15 Lines 40-44, “The distillation loss indicates a difference between the output generated by DNN 106 when machine learning system 104 runs DNN 106 on the same input using high-precision weights 114 (W) and using low-precision weights 116” teaches determining the difference between output generated by the DNN (corresponds to machine learning model) when input utilizes high-precision weights (corresponds to current iteration) and when input utilizes low-precision weights (corresponds to at least one previous iteration)).
Chai et al. further teaches constraining the adjusting of the parameters based at least on the difference (Chai et al., Col. 15 Lines 28-30, “where low-precision weights 116 are constrained to integer powers of 2, machine learning system 104 may use the loss function” teaches constraining the weights (corresponds to the adjusting of the parameters) based on the loss function (corresponds to the difference)).
Regarding Claim 4,
The Kaufhold et al. in view of Chai et al. combination of claim 3 teaches the method of claim 3,
The combination, as described in the rejection of claim 3, further teaches wherein the adjusting comprises performing adversarial regularization based at least on the noise-adjusted training samples (Kaufhold et al., Col. 19 Lines 31-40, “the deep architecture parameters 630 and the optimization 620 hyperparameters are systematically perturbed over an iterative procedure to search for an improved combination of deep embedding architecture parameters 630 and the set of parameters that govern the optimization 620 that searches for the settings of one or more free parameters in the deep embedding architecture (i.e., the parameterization of one or more weights and biases and other parameters in the deep embedding architecture 630)” teaches mitigating perturbations (corresponds to performing adversarial regularization) based on the optimized parameters (corresponds to noise-adjusted training samples)).
The combination further teaches performing proximal point updating of the parameters based at least on the difference (Chai et al., Col. 17 Lines 31-33, “Note that l({tilde over (W)}) is a convex and differentiable relaxation of the negative log likelihood with respect to the quantized parameters” teaches solving a convex optimization problem (corresponds to performing proximal point updating) with respect to the quantized parameters based on differentiable relaxation).
Regarding Claim 10,
Kaufhold et al. teaches the method of claim 9,
Kaufhold et al. further teaches a pairwise ranking layer (Kaufhold et al., Col. 24 Lines 10-13, “a distance threshold from the specific high dimensional object in the embedded space, a count of the number of objects from the closest to the furthest in rank order” teaches a ranking order of the objects in the deep embedding architecture (corresponds to a pairwise ranking layer)).
Kaufhold et al. does not appear to explicitly teach wherein the task-specific layer is selected from group comprising a single-sentence classification layer, a pairwise text similarity layer, a pairwise text classification layer.
However, Chai et al. teaches wherein the task-specific layer is selected from group comprising a single-sentence classification layer, a pairwise text similarity layer, a pairwise text classification layer (Chai et al., Col. 4 Lines 58-64, “DNN 106 receives input data from an input data set 110 and generates output data 112… Output data 112 may include classification data, translated text data” teaches the output of the DNN including classification data (corresponds to a single-sentence classification layer) and translated text data (corresponds to a pairwise text similarity layer). Col. 22-23 Line 67 and Lines 1-17, “the first input data set comprises a batch of training data-label pairs, and as part of determining the first operand, machine learning system 104 may determine the first operand… where B is a total number of data-label pairs in the batch of data-label pairs, each label in the batch of data-label pairs is an element in a set of labels that includes B labels, i is an index, log(⋅) is a logarithm function, N is the total number of layers in the plurality of layers, yi is the i′th label in the set of labels, and Xi,y i (N) is output of the N′th layer of the plurality of layers when DNN 106 is given as input the data of the i′th data-label pair of the batch of data-label pairs” teaches a pairwise text classification layer.).
Regarding Claim 16,
Kaufhold et al. teaches the system of claim 15, wherein the computer-readable instructions, when executed by the hardware processing unit, cause the hardware processing unit to 
Kaufhold et al. does not appear to explicitly teach tune the first layer and the second layer using a proximal point mechanism.
However, Chai et al. teaches tune the first layer and the second layer using a proximal point mechanism (Chai et al., Col. 17 Lines 31-33, “Note that l({tilde over (W)}) is a convex and differentiable relaxation of the negative log likelihood with respect to the quantized parameters” teaches solving a convex optimization problem (corresponds to using a proximal point mechanism) with respect to the quantized parameters based on differentiable relaxation in the deep neural network (corresponds to the first layer and the second layer)).

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Henry T Nguyen whose telephone number is (571)272-8860. The examiner can normally be reached Monday-Friday 8:00am-5:00pm EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Kamran Afshar, can be reached at (571) 272-7796. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (in USA or Canada) or 571-272-1000.

/HENRY TRONG NGUYEN/
Examiner, Art Unit 2125
/BRIAN M SMITH/
Primary Examiner, Art Unit 2122