Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Information Disclosure Statement
The information disclosure statement (IDS) submitted on 2020-04-26 is in compliance with the provisions of 37 CFR 1.97.  Accordingly, the information disclosure statement is being considered by the examiner.
Response to Amendment
The amendment filed 2021-07-16 has been entered.  Claims 1-20 remain pending in the application.  Applicant’s amendments to the claims overcome each and every 112(b) rejection previously set forth in the Non-Final Office Action mailed 2021-04-26.  Applicant’s amendments to the claims also overcome the 101 rejections, as suggested by the Examiner during the interview on 2021-06-30.
Response to Arguments
Applicant's arguments with respect to rejections under 35 U.S.C. 103 have been fully considered but they are not persuasive.  Applicant argues that the combination of Ranzato and Gaidon does not teach “sampling a candidate auxiliary output from a plurality of candidate auxiliary outputs in accordance with a score distribution over the plurality of candidate auxiliary outputs”, and argues that Gaidon [0037] does not teach or suggest a score distribution.  Examiner had shown that Gaidon hints at this concept, as Gaidon’s probability for each candidate is effectively a “distribution”, because a distribution is a collection of probabilities.  Examiner points out that, as was also shown in the Non-Final Office Action Claim 10 mapping, Kohli [0047] discloses that “probability distributions may be stored representing the accumulated candidate output values”, which is followed by sampling the candidate output values in [0058]: “Optionally sampling may be used to select candidate output values to be accumulated and stored in order to maintain a low memory footprint.”  Gaidon [0037] was recited by Examiner in the interest of a thorough mapping showing how all the art applies, but was not strictly necessary in light of the ensuing combination with Kohli.  Ranzato, Gaidon, and Kohli are analogous art because they are in the field of endeavor of machine learning.  It would have been obvious to one of ordinary skill in the art before the effective filing date to combine the structured prediction task of Ranzato and Gaidon, with the sampling from a distribution of Kohli.  The modification would have been obvious because one of ordinary skill in the art would have been motivated to reduce resource usage, as structured prediction tasks often have multiple possibly correct outputs, and training on every single candidate output for each piece of training data would be resource intensive.
Applicant also argues that “according to a task reward function for the machine learning task” is not taught by Gaidon.  Examiner agrees with this argument.  However, Applicant also argues that “The action has not asserted that the cited portions of Volkovs, Sahba, Hussain, Kohli, Risholm, Modarresi, and Vasseur disclose these features.”  Examiner respectfully disagrees.  Examiner firstly points out that, as discussed above, Kohli teaches the “score distribution” as had been stated in the Non-Final Rejection for Claim 10, and is currently stated in the present 103 rejection for Claim 1 below.  Finally, Sahba teaches the “task reward function” as had been stated in the Non-Final Rejection for Claim 5, and is currently stated in the present 103 rejection for Claim 1 below.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-3, 5, 7, 9, 10, 16-18, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Ranzato et al. (“Sequence Level Training with Recurrent Neural Networks”; hereinafter Ranzato) in view of Gaidon (US 2017/0286774 A1), Sahba et al. (“A reinforcement Sahba), and Kohli et al. (US 2013/0156298 A1; hereinafter Kohli).
As per Claim 1, Ranzato teaches a computer-implemented method comprising: obtaining data identifying a neural network to be trained to perform a machine learning task, the neural network being configured to receive an input example and to process the input example in accordance with current values of a plurality of model parameters to generate a model output for the input example (Ranzato, Section 3 Para 1, discloses:  “The learning algorithms we describe in the following sections are agnostic to the choice of the underlying model, as long as it is parametric. In this work, we focus on Recurrent Neural Networks (RNNs) as they are a popular choice for text generation. In particular, we use standard Elman RNNs (Elman, 1990) and LSTMs (Hochreiter & Schmidhuber, 1997). For the sake of simplicity but without loss of generality, we discuss next Elman RNNs. This is a parametric model that at each time step t, takes as input a word wt ∈ W as its input, together with an internal representation ht. W is the the vocabulary of input words. This internal representation ht is a real-valued vector which encodes the history of words the model has seen so far. Optionally, the RNN can also take as input an additional context vector ct, which encodes the context to be used while generating the output. In our experiments ct is computed using an attentive decoder inspired by Bahdanau et al. (2015) and Rush et al. (2015), the details of which are given in Section 6.2 of the supplementary material. The RNN learns a recursive function to compute ht and outputs the distribution over the next word:”  Here, Ranzato discloses a neural network (“Recurrent Neural Network”) to be trained to perform a machine learning task (“text generation”).  
The model receives an input example (“takes as input a word”), and processes the input example in accordance with current values of a plurality of model parameters (“agnostic to the choice of the underlying model, as long as it is parametric”), to generate a model output (“outputs the distribution over the next word”).  The Recurrent Neural Network must be implemented by a computer, and the computer must obtain the data identifying the RNN in order to execute the learning task.)
obtaining initial training data for training the neural network, the initial training data comprising a plurality of training examples and, for each training example, a ground truth output; (Ranzato, Section 3.2.2 Lines 5-9, discloses “We start from the optimal policy and then slowly deviate from it to let the model explore and make use of its own predictions. We first train the RNN with the cross-entropy loss for NXENT epochs using the ground truth sequences. This ensures that we start off with a much better policy than random because now the model can focus on a good part of the search space.”  Here, Ranzato discloses that the initial training data (“we first train”) comprises a plurality of training examples (training for “NXENT epochs”, wherein an “epoch” is known in the art as a complete pass of an entire training dataset), with a ground truth output (“using the ground truth sequences”).  The “sequence” is the output, as disclosed by Ranzato, Introduction:  “From a machine learning perspective, text generation is the problem of predicting a syntactically and semantically correct sequence of consecutive words given some context. For instance, given an image, generate an appropriate caption or given a sentence in English language, translate it into French.”).
generating modified training data from the initial training data (Ranzato, Fig 4 Caption, discloses: “Figure 4: Illustration of MIXER. In the first s unrolling steps (here s = 1), the network resembles a standard RNN trained by XENT. In the remaining steps, the input to each module is a sample from the distribution over words produced at the previous time step. Once the end of sentence is reached (or the maximum sequence length), a reward is computed, e.g., BLEU. REINFORCE is then used to back-propagate the gradients through the sequence of samplers. We employ an annealing schedule on s, starting with s equal to the maximum sequence length T and finishing with s = 1.”  Here, after the initial training epochs, in the next training epochs (“back-propagate” is part of training), the training data is modified dynamically, as each word is added to the predicted sequence (“In the remaining steps, the input to each module is a sample from the distribution over words produced at the previous time step”).)
However, Ranzato does not teach that the generation of modified training data from the initial training data is comprising, for each of one or more training examples of the plurality of training examples in the initial training data: generating an auxiliary output for the training example from the ground truth output for the training example by sampling a candidate auxiliary output from a plurality of candidate auxiliary outputs in accordance with a score distribution over the plurality of candidate auxiliary outputs that is generated from, for each of the plurality of candidate auxiliary outputs, a respective measure of the similarity of the candidate auxiliary output to the ground truth output for the training example, and replacing the ground truth output for the training example according to a task reward function for the machine learning task with the auxiliary output for the training example; and training the neural network on the modified training data.
Gaidon teaches that the generation of modified training data from the initial training data is comprising, for each of one or more training examples of the plurality of training examples in the initial training data: generating an auxiliary output for the training example from the ground truth output for the training example by [sampling] selecting a candidate auxiliary output from a plurality of candidate auxiliary outputs in accordance with [a score distribution over the plurality of candidate auxiliary outputs that is generated from, for each of the plurality of candidate auxiliary outputs,] a respective measure of the similarity of the candidate auxiliary output to the ground truth output for the training example [according to a task reward function for the machine learning task] (Gaidon, Para [0053], discloses:  “First, the system learns from pairs of bounding boxes. The KITTI training ground truth tracks—without any data augmentation—provides approximately 100K training samples when down-sampling the negative sample pairs to yield the same number as all possible positive ones. This can be further increased by using either jittering, allowing for time-skips, or replacing ground truth annotations by strongly-overlapping detections (or even object proposals). These data augmentation strategies are contemplated to boost recognition performance further, or at least contribute to or prevent over-fitting.”  Here, Gaidon discloses, for each of one or more (the quantity is not specified in the claim nor the reference) training examples in the initial training data (“The KITTI training ground truth tracks—without any data augmentation—provides approximately 100K training samples”), generating an auxiliary output for the training example from the ground truth example:  “replacing ground truth annotations by strongly overlapping detections (or even object proposals)”, wherein the auxiliary outputs (annotations) are generated by strongly overlapping detections or object proposals.  
Gaidon also discloses selecting from a plurality of candidates based on a respective measure of similarity between the candidate auxiliary output and the ground truth output, as the ground truth annotations are replaced by “strongly-overlapping” detections.  Here it helps to know that Gaidon is in the field of image object recognition, as stated in the Introduction:  “A system for applying video data to a neural network (NN) for online multi-class multi-object tracking includes a computer programed to perform an image classification method including the operations of receiving a video sequence; detecting candidate objects in each of a previous and a current video frame.”  In a video frame, there may be a plurality of candidates, and so a “strongly-overlapping detection” may be one of a plurality of nearby candidates, and “strongly-overlapping” is a measure of the similarity of the candidate auxiliary output to the ground truth output, as an annotation indicating this object would be much more similar to the ground truth than an annotation indicating a completely foreign object not in the frame at all.  The limitation “according to a task reward function” is taught by Sahba, as shown below.
Gaidon, Para [0037], discloses the concept of a score distribution over candidates:  “In other words, the neural network 56 takes the candidate objects—which can be associated with a target object being tracked in a given previous frame and a new or existing object in a current frame—and creates a data association matrix—a number of targets by a number of detections. The neural network 56 generates a probability (“association score”) in each cell of the matrix corresponding that the detected object in the current frame matches the target object in the previous frame”.  This score is based on a measure of similarity between the ground truth (“target object”) and the candidate object.  The score is described as a probability, and is done for all candidate objects.  A collection of probabilities comprises a distribution.)
(Gaidon Para [0053] discloses replacing the ground truth output with the auxiliary output: “replacing ground truth annotations by strongly-overlapping detections (or even object proposals)”)
and training the neural network on the modified training data.  (Gaidon Para [0053] discloses that this data with replaced ground truth annotations will be used for training, as Gaidon recites in the last sentence: “These data augmentation strategies are contemplated to boost recognition performance further, or at least contribute to or prevent over-fitting”, wherein overfitting is a known challenge with training).
Ranzato and Gaidon are analogous art because they are in the field of endeavor of machine learning.
It would have been obvious to one of ordinary skill in the art before the effective filing date to combine the structured output neural network of Ranzato, with the modified training data with replaced ground truths of Gaidon. The modification would have been obvious because one of ordinary skill in the art would be motivated to boost performance and prevent over-fitting. (Gaidon, [0053]: “These data augmentation strategies are contemplated to boost recognition performance further, or at least contribute to or prevent over-fitting”)
However, the combination of Ranzato and Gaidon thus far fails to explicitly teach sampling a candidate auxiliary output from the plurality of candidate auxiliary outputs in accordance with a score distribution over the plurality of candidate auxiliary outputs.
Kohli teaches sampling a candidate auxiliary output from the plurality of candidate auxiliary outputs in accordance with a score distribution over the plurality of candidate auxiliary outputs (Kohli, Para [0058], discloses “This is the training stage and so particular image elements which reach a given leaf node have specified output values and global variables known from the ground truth training data. A representation of the accumulated candidate output values may be stored 930 using various different methods. The candidate output values may be stored according to global variable bin ranges as described above with reference to FIGS. 3 and 4. Optionally sampling may be used to select candidate output values to be accumulated and stored in order to maintain a low memory footprint.” Here, Kohli discloses sampling from a plurality of candidate output values.  The term “sampling” implies the use of a distribution.  This is evidenced by Kohli [0047]: “In operation, each root and split node of each tree performs a binary test on the input data and based on the result directs the data to the left or right child node. The leaf nodes do not perform any action; they store accumulated candidate output values (and global variable predictions in the embodiments described above with reference to FIGS. 6 and 8). In the case of joint position detection the candidate output values are joint offset vectors representing a distance and direction of an image element from a joint position. For example, probability distributions may be stored representing the accumulated candidate output values”. Additionally, Gaidon as shown above discloses a score distribution over the plurality of candidate auxiliary outputs.)
Ranzato, Gaidon, and Kohli are analogous art because they are in the field of endeavor of machine learning.
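It would have been obvious to one of ordinary skill in the art before the effective filing date to combine the structured prediction task of Ranzato and Gaidon, with the sampling from a distribution of Kohli.  The modification would have been obvious because one of ordinary skill in the art would have been motivated to reduce resource usage, as structured prediction tasks often have multiple possibly correct outputs, and training on every single candidate output for each piece of training data would be resource intensive.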
However, the combination of Ranzato, Gaidon, and Kohli does not explicitly teach wherein the measure of the similarity of the candidate auxiliary output to the ground truth output is a value of a task reward function for the machine learning task for the candidate auxiliary output.
Sahba teaches wherein the measure of the similarity of the [candidate auxiliary] output to the ground truth output is a value of a task reward function for the machine learning task for the [candidate auxiliary] output. (Sahba, Abstract Lines 3-5, discloses:  “The agent uses some images and their ground-truth (manually segmented) version to learn from. A reward function is employed to measure the similarities between the output and the manually segmented images, and to provide feedback to the agent.”  Here, Sahba discloses that the ground truth is a manually segmented image, then discloses that the measure of the similarities between an output and the ground truth is the value of a reward function, which amounts to measuring the similarities between a candidate auxiliary output and the ground truth output when combined with Ranzato, Gaidon, and Kohli.)
Ranzato, Gaidon, Kohli, and Sahba are analogous art because they are in the field of machine learning.
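For illustration, the limitation mapped above (sampling a candidate auxiliary output in accordance with a score distribution generated from a similarity measure, then replacing the ground truth output) can be sketched as follows.  This is a minimal sketch and is not taken from the application or from any cited reference: the function names, the token-overlap reward, and the softmax form of the score distribution with its temperature parameter are all assumptions chosen for the example.

```python
import math
import random


def similarity(candidate, ground_truth):
    """Hypothetical task reward: fraction of positions whose tokens match."""
    matches = sum(c == g for c, g in zip(candidate, ground_truth))
    return matches / max(len(candidate), len(ground_truth), 1)


def score_distribution(candidates, ground_truth, temperature=0.5):
    """Turn per-candidate rewards into probabilities.

    The softmax form is an assumption; the claim only requires a score
    distribution generated from the similarity measures.
    """
    rewards = [similarity(c, ground_truth) for c in candidates]
    top = max(rewards)  # subtract the maximum for numerical stability
    weights = [math.exp((r - top) / temperature) for r in rewards]
    total = sum(weights)
    return [w / total for w in weights]


def sample_auxiliary_output(candidates, ground_truth, rng, temperature=0.5):
    """Sample one candidate in accordance with the score distribution."""
    probs = score_distribution(candidates, ground_truth, temperature)
    return rng.choices(candidates, weights=probs, k=1)[0]


def modify_training_data(training_data, candidates_for, rng):
    """Replace each ground truth output with a sampled auxiliary output."""
    modified = []
    for example, ground_truth in training_data:
        candidates = candidates_for(ground_truth)
        auxiliary = sample_auxiliary_output(candidates, ground_truth, rng)
        modified.append((example, auxiliary))
    return modified
```

In this sketch, candidates that are more similar to the ground truth receive a higher probability of being selected, but less similar candidates can still be sampled.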


	As per Claim 2, the combination of Ranzato, Gaidon, Kohli, and Sahba teaches the method of claim 1 as shown above, as well as wherein the machine learning task is a structured output prediction task.  (Ranzato, Intro Para 3, discloses “This paper proposes a novel training algorithm which results in improved text generation compared to standard models”.  Text generation is a structured output prediction task, as it does not produce simply a scalar value as its output, but in this case a sequence of words.  Ranzato, Section 2 Para 2, casts text generation as a structured prediction problem:  “The idea of improving generation by letting the model use its own predictions at training time (the key proposal of this work) was first advocated by Daume III et al. (2009). In their seminal work, the authors first noticed that structured prediction problems can be cast as a particular instance of reinforcement learning”)

	As per Claim 3, the combination of Ranzato, Gaidon, Kohli, and Sahba teaches the method of claim 1 as shown above, as well as wherein training the neural network on the modified training data comprises training the neural network to generate model outputs for the training examples that match the auxiliary outputs for the training examples using a gradient descent training technique (Gaidon, Para [0053], discloses training the neural network to match auxiliary outputs rather than the ground truth training examples:  “First, the system learns from pairs of bounding boxes. The KITTI training ground truth tracks—without any data augmentation—provides approximately 100K training samples when down-sampling the negative sample pairs to yield the same number as all possible positive ones. This can be further increased by using either jittering, allowing for time-skips, or replacing ground truth annotations by strongly-overlapping detections (or even object proposals). These data augmentation strategies are contemplated to boost recognition performance further, or at least contribute to or prevent over-fitting.”  Gaidon, Para [0056], also discloses training with gradient descent technique:  “The parameters of the disclosed neural network can be learned from scratch on training videos labeled with ground truth tracks using standard stochastic gradient descent with momentum and the hyper-parameters”.  Examiner’s Note:  Though Gaidon recites “learned from scratch on training videos labeled with ground truth tracks”, the mere suggestion that the training videos are labeled does not imply that the labels may not be replaced as in Para [0053], nor that some alternative to gradient descent is used if the ground truth labels are replaced.)

As per Claim 5, the combination of Ranzato, Gaidon, Kohli, and Sahba teaches the method of claim 1 and the measure of the similarity of the candidate auxiliary output to the ground truth output as shown above.  Sahba teaches wherein the measure of the similarity of the [candidate auxiliary] output to the ground truth output is a value of a task reward function for the machine learning task for the [candidate auxiliary] output. (Sahba, Abstract Lines 3-5, discloses:  “The agent uses some images and their ground-truth (manually segmented) version to learn from. A reward function is employed to measure the similarities between the output and the manually segmented images, and to provide feedback to the agent.”  Here, Sahba discloses that the ground truth is a manually segmented image, then discloses that the measure of the similarities between an output and the ground truth is the value of a reward function, which amounts to measuring the similarities between a candidate auxiliary output and the ground truth output when combined with Ranzato and Gaidon.)

As per Claim 7, the combination of Ranzato, Gaidon, Kohli, and Sahba teaches the method of claim 5 as shown above, as well as wherein the machine learning task is a machine translation task, and wherein the task reward function is a BLEU score for the candidate auxiliary output.  (Ranzato, Sec 4.2 “Machine Translation”, discloses a machine translation task, as one of the several machine learning tasks they have used their model for:  “For the translation task, our generative model is an LSTM with 256 hidden units and it uses the same attentive encoder architecture as the one used for summarization.”  Ranzato, Fig. 4 caption, discloses using BLEU as the reward function:  “Once the end of sentence is reached (or the maximum sequence length), a reward is computed, e.g., BLEU”)

As per Claim 9, the combination of Ranzato, Gaidon, Kohli, and Sahba teaches the method of claim 5 as shown above, as well as wherein the machine learning task is an image masking task, and wherein the task reward function is based on (i) a union of pixels that are masked in the candidate auxiliary output and pixels that are masked in the ground truth output and (ii) an intersection of pixels that are masked in the candidate auxiliary output and pixels that are masked in the ground truth output. (Sahba, Abstract Lines 3-5, discloses:  “The agent uses some images and their ground-truth (manually segmented) version to learn from. A reward function is employed to measure the similarities between the output and the manually segmented images, and to provide feedback to the agent.”  Here, Sahba discloses “segmented images”.  “Image segmentation” is another term for “image masking” (see extrinsic evidence https://www.tensorflow.org/tutorials/images/segmentation : “Thus, the task of image segmentation is to train a neural network to output a pixel-wise mask of the image”).  Sahba also discloses a reward function for the image masking task.  Sahba, Section 3.4, further elaborates:  “The rewards and punishments can be defined based on a quality criterion representing how well the object has been segmented in each sub-image. Several criteria can be used for this purpose. A straightforward method is to compare the results with the ground-truth image after each action.  To measure this value for each sub-image, we note that how much the quality has changed after the action. In each sub-image, to improve the quality of the segmented object the agent receives rewards; otherwise it will be punished. A general form for the reward function can be represented as follows [Eq 10] where D is a measure indicating the difference between the quality after and before taking the action. 
It can be calculated using the normalized number of misclassified pixels in the segmented sub-images.”  Here, Sahba discloses the reward function as being based on the “normalized number of misclassified pixels in the segmented sub-images”.  The “number of misclassified pixels” is based on the union of pixels that are masked in the candidate auxiliary output and the ground truth output (in order to identify misclassified pixels, one must know all pixels that have been masked in either image, which is the union), and an intersection of pixels that are masked in the candidate auxiliary output and in the ground truth output (these are correctly classified pixels).  The difference between the union of masked pixels and the intersection of masked pixels results in the number of misclassified pixels, from which the value of the reward function is calculated.)
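The set relationship relied on above (union minus intersection equals the number of misclassified pixels) can be checked with a short sketch.  The function names and the representation of a mask as a set of (row, column) coordinates are assumptions made for this example; they do not come from Sahba or from the application.

```python
def misclassified_pixel_count(candidate_mask, ground_truth_mask):
    """Pixels masked in exactly one of the two masks.

    Each mask is a set of (row, column) tuples (an assumed representation).
    """
    union = candidate_mask | ground_truth_mask
    intersection = candidate_mask & ground_truth_mask
    return len(union) - len(intersection)


def normalized_misclassified(candidate_mask, ground_truth_mask, total_pixels):
    """Misclassified count divided by the number of pixels in the image."""
    count = misclassified_pixel_count(candidate_mask, ground_truth_mask)
    return count / total_pixels


# Example on a 2x2 image: the masks agree on two pixels and disagree on two.
candidate = {(0, 0), (0, 1), (1, 1)}
ground_truth = {(0, 1), (1, 1), (1, 0)}
count = misclassified_pixel_count(candidate, ground_truth)
```

The count computed this way equals the size of the symmetric difference of the two masks, which is why both the union and the intersection enter the calculation.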

As per Claim 10, the combination of Ranzato, Gaidon, Kohli, and Sahba teaches the method of claim 1 as shown above, as well as a score distribution over the plurality of candidate auxiliary outputs, wherein the score for each of the candidate auxiliary outputs in the score distribution is based on the measure of the similarity of the candidate auxiliary output to the ground truth output for the training example. (Gaidon, Para [0037], discloses a score distribution over candidates:  “In other words, the neural network 56 takes the candidate objects—which can be associated with a target object being tracked in a given previous frame and a new or existing object in a current frame—and creates a data association matrix—a number of targets by a number of detections. The neural network 56 generates a probability (“association score”) in each cell of the matrix corresponding that the detected object in the current frame matches the target object in the previous frame”.  This score is based on a measure of similarity between the ground truth (“target object”) and the candidate object.  The score is described as a probability, and is done for all candidate objects.  A collection of probabilities comprises a distribution.)
However, the combination of Ranzato and Gaidon does not explicitly teach sampling a candidate auxiliary output from the plurality of candidate auxiliary outputs in accordance with a score distribution over the plurality of candidate auxiliary outputs.
Kohli teaches sampling a candidate auxiliary output from the plurality of candidate auxiliary outputs in accordance with a score distribution over the plurality of candidate auxiliary outputs (Kohli, Para [0058], discloses “This is the training stage and so particular image elements which reach a given leaf node have specified output values and global variables known from the ground truth training data. A representation of the accumulated candidate output values may be stored 930 using various different methods. The candidate output values may be stored according to global variable bin ranges as described above with reference to FIGS. 3 and 4. Optionally sampling may be used to select candidate output values to be accumulated and stored in order to maintain a low memory footprint.” Here, Kohli discloses sampling from a plurality of candidate output values.  The term “sampling” implies the use of a distribution.  This is evidenced by Kohli [0047]: “In operation, each root and split node of each tree performs a binary test on the input data and based on the result directs the data to the left or right child node. The leaf nodes do not perform any action; they store accumulated candidate output values (and global variable predictions in the embodiments described above with reference to FIGS. 6 and 8). In the case of joint position detection the candidate output values are joint offset vectors representing a distance and direction of an image element from a joint position. For example, probability distributions may be stored representing the accumulated candidate output values”. Additionally, Gaidon as shown above discloses a score distribution over the plurality of candidate auxiliary outputs.)

As per Claim 16, Claim 16 is a system claim corresponding to method Claim 1.  The difference is that Claim 16 recites one or more computers and one or more storage devices.  (Gaidon, Para [0010], discloses one or more computers: “The system comprises a computer programed to perform a method” and in [0011] one or more storage devices:  “Another embodiment of the present disclosure is directed to a non-transitory storage medium storing instructions readable and executable by a computer”).  Claim 16 is rejected for the same reasons as Claim 1.

As per Claim 17, Claim 17 is a non-transitory computer storage medium claim corresponding to method Claim 1.  The difference is that Claim 17 recites one or more computers and a non-transitory computer storage medium.  (Gaidon, Para [0010], discloses one or more computers: “The system comprises a computer programed to perform a method” and in [0011] a non-transitory computer storage medium:  “Another embodiment of the present disclosure is directed to a non-transitory storage medium storing instructions readable and executable by a computer”).  Claim 17 is rejected for the same reasons as Claim 1.

As per Claim 18, Claim 18 is a system claim corresponding to method Claim 3.  Claim 18 is rejected for the same reasons as Claim 3.

As per Claim 20, Claim 20 is a non-transitory computer storage medium claim corresponding to method Claim 10.  Claim 20 is rejected for the same reasons as Claim 10.

Claims 4 and 19 are rejected under 35 U.S.C. 103 as being unpatentable over the combination of Ranzato, Gaidon, Kohli, and Sahba, further in view of Volkovs et al. (“Loss-sensitive Training of Probabilistic Conditional Random Fields”; hereinafter Volkovs).
As per Claim 4, the combination of Ranzato, Gaidon, Kohli, and Sahba teaches the method of claim 3 and training the neural network on the modified training data as shown above.  However, the combination of Ranzato, Gaidon, Kohli, and Sahba does not explicitly teach wherein training the neural network on the modified training data comprises training the neural network using maximum likelihood training.
Volkovs teaches training the neural network using maximum likelihood training. (Volkovs, Section 6 Last Paragraph, discloses training using maximum likelihood:  “We trained CRFs according to maximum likelihood as well as the different loss-sensitive objectives described in Section 3”)
Ranzato, Gaidon, Kohli, Sahba, and Volkovs are analogous art because they are in the field of machine learning.
It would have been obvious to one of ordinary skill in the art before the effective filing date to combine the structured output neural network with modified training data of Ranzato, Gaidon, Kohli, and Sahba, with the maximum likelihood training of Volkovs. The modification would have been obvious because one of ordinary skill in the art would be motivated to achieve consistent performance and retain efficiency as the dataset grows. (Volkovs, Section 3 Line 2:  “In the well-specified case and for large datasets, this would probably not be a problem because of the asymptotic consistency and efficiency properties of maximum likelihood”)

As per Claim 19, Claim 19 is a system claim corresponding to method Claim 4.  The difference is that Claim 19 recites one or more computers and one or more storage devices.  (Gaidon, Para [0010], discloses one or more computers: “The system comprises a computer programed to perform a method” and in [0011] one or more storage devices:  “Another embodiment of the present disclosure is directed to a non-transitory storage medium storing instructions readable and executable by a computer”).  Claim 19 is rejected for the same reasons as Claim 4.


Claim 6 is rejected under 35 U.S.C. 103 as being unpatentable over the combination of Ranzato, Gaidon, Kohli, and Sahba, further in view of Hussain et al. (“A comprehensive survey of handwritten document benchmarks: structure, usage and evaluation”; hereinafter Hussain).
As per Claim 6, the combination of Ranzato, Gaidon, Kohli, and Sahba teaches the method of claim 5 as shown above, as well as negative task reward function (Ranzato, Sec 3.2.1, discloses the use of a loss function, also known as a negative reward function:  “We define our loss as the negative expected reward”).
However, the combination of Ranzato, Gaidon, Kohli, and Sahba does not explicitly teach wherein the machine learning task is a task in which the neural network generates an output that is a sequence of tokens, and wherein the task reward function is a negative edit distance between the ground truth output and the candidate auxiliary output.
Hussain teaches wherein the machine learning task is a task in which the neural network generates an output that is a sequence of tokens, and wherein the task reward function is a [negative] edit distance between the ground truth output and the candidate auxiliary output. (Hussain discloses:  “In this section, we discuss the experimental settings and evaluation metrics that are employed by researchers to solve the problems based on analysis of handwriting. As discussed earlier, the most important of these tasks is handwriting recognition which is carried out at character, word and line levels. Consequently, these systems report results in terms of character and word recognition rates. In some cases, the edit distance between the recognized text and ground truth text is used to quantify the recognition performance.”  Here, Hussain discloses the machine learning task of handwriting recognition, for which the output is a sequence of tokens (characters).  Hussain also discloses quantifying the recognition performance (i.e., task reward function) using the edit distance.  Ranzato, as shown above, discloses a negative task reward function, which, applied to the edit distance of Hussain, results in the negative edit distance.)
Ranzato, Gaidon, Kohli, Sahba, and Hussain are analogous art because they are in the field of endeavor of machine learning.
All of the elements of the claims are known in Hussain and the combination of Ranzato, Gaidon, Kohli, and Sahba. The only difference is the combination of the structured prediction task of Ranzato, Gaidon, Kohli, and Sahba with the edit distance of Hussain into a single device. It would have been obvious to one of ordinary skill in the art to incorporate the edit distance of Hussain into the structured prediction task, since the operation of edit distance is in no way dependent on the operation of the other elements of the claims and the edit distance could be used in combination with the structured prediction task to achieve the predictable results of a structured prediction task with edit distance as the reward function.  Edit distance is well known in the art, as Hussain has indicated in their paper, which is a survey of known techniques (see MPEP 2143 KSR A).
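For illustration only, the negative edit distance reward discussed in this rejection can be sketched as follows.  The sketch is not taken from Ranzato or Hussain; it assumes the Levenshtein form of edit distance (unit cost for insertion, deletion, and substitution), and the function names edit_distance and negative_edit_distance_reward are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two token sequences (assumed unit costs)."""
    # prev[j] holds the distance between the first i-1 tokens of a and the first j tokens of b.
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]  # distance from the first i tokens of a to an empty sequence
        for j, y in enumerate(b, start=1):
            substitution = prev[j - 1] + (0 if x == y else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


def negative_edit_distance_reward(ground_truth, candidate):
    """Hypothetical task reward: higher (closer to zero) means closer to the ground truth."""
    return -edit_distance(ground_truth, candidate)
```

Under these assumptions, a candidate identical to the ground truth receives the maximum reward of zero, and each additional edit lowers the reward by one.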
	
Claim 8 is rejected under 35 U.S.C. 103 as being unpatentable over the combination of Ranzato, Gaidon, Kohli, and Sahba, further in view of Graves et al. (“Towards End-to-End Speech Recognition with Recurrent Neural Networks”; hereinafter Graves).
As per Claim 8, the combination of Ranzato, Gaidon, Kohli, and Sahba teaches the method of claim 5 as shown above, as well as negative task reward function (Ranzato, Sec 3.2.1, discloses the use of a loss function, also known as a negative reward function:  “We define our loss as the negative expected reward”).
However, the combination of Ranzato, Gaidon, Kohli, and Sahba does not explicitly teach wherein the machine learning task is a speech recognition task, and wherein the task reward function is a negative word error rate for the candidate auxiliary output.
Graves teaches wherein the machine learning task is a speech recognition task, and wherein the task reward function is a [negative] word error rate for the candidate auxiliary output.  (Graves, Abstract, discloses a speech recognition task: “This paper presents a speech recognition system that directly transcribes audio data with text, without requiring an intermediate phonetic representation”.  Graves, Section 1 Page 2 Bottom of Left column, discloses:  “The basic system is enhanced by a new objective function that trains the network to directly optimize the word error rate.”  Graves, Section 4 Page 5 bottom of left column, describes word error rate as a “loss function”:  “However, for many loss functions (including word error rate) this could be optimized by only recalculating that part of the loss corresponding to the alignment change. For our experiments, five samples per sequence gave sufficiently low variance gradient estimates for effective training.”  Ranzato, as shown above, discloses negative task reward function, which results in the negative word error rate.)

All of the elements of the claims are known in Graves and the combination of Ranzato, Gaidon, Kohli, and Sahba. The only difference is the combination of the structured prediction task of Ranzato, Gaidon, Kohli, and Sahba with the word error rate of Graves into a single device. It would have been obvious to one of ordinary skill in the art to incorporate the word error rate of Graves into the structured prediction task, since the operation of word error rate is in no way dependent on the operation of the other elements of the claims and the word error rate could be used in combination with the structured prediction task to achieve the predictable results of a structured prediction task with word error rate as the reward function.  Word error rate is well known in the art, as Graves indicates in Section 4 Top of Right Column:  “In speech recognition, for example, the standard measure is the word error rate (WER), defined as the edit distance between the true word sequence and the most probable word sequence emitted by the transcriber” (see MPEP 2143 KSR A).
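For illustration only, the negative word error rate reward discussed in this rejection can be sketched as follows.  The sketch is not taken from Graves or Ranzato; it assumes whitespace tokenization and the common normalization of the word-level edit distance by the number of reference words, and the function names word_error_rate and negative_wer_reward are hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the reference length (assumed normalization)."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")
    # Levenshtein distance over words, computed one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if r == h else 1)
            curr.append(min(substitution, prev[j] + 1, curr[j - 1] + 1))
        prev = curr
    return prev[-1] / len(ref)


def negative_wer_reward(reference, hypothesis):
    """Hypothetical task reward: zero for a perfect transcript, lower for more errors."""
    return -word_error_rate(reference, hypothesis)
```

Negating the error rate turns a quantity to be minimized into a reward to be maximized, which is the sense in which the loss and the negative reward are treated as interchangeable above.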

Claim 11 is rejected under 35 U.S.C. 103 as being unpatentable over the combination of Ranzato, Gaidon, Kohli, and Sahba, further in view of Santos et al. (“Improving the fitness of high-dimensional biomechanical models via data-driven stochastic exploration”; hereinafter Santos).
As per Claim 11, the combination of Ranzato, Gaidon, Kohli, and Sahba teaches the method of claim 10 as shown above.  However, the combination of Ranzato, Gaidon, Kohli, and Sahba does not teach wherein the score distribution is a stationary distribution.
Santos teaches wherein the score distribution is a stationary distribution. (Santos, Section II A, discloses sampling from a stationary distribution:  “Ideally, we would know our target distribution π(θ) (the posterior distribution p(θ|X), or the probability of the model parameter set θ given observed data X) and it would be a simple analytical expression or easy to sample from directly. When this is not the case, we can still estimate the target distribution by sampling from a stationary distribution π∗(θ) that is proportional to the target distribution π(θ) and draw inferences on these results. We can estimate the expectation of any function of θ using averages of samples drawn after convergence to the stationary distribution.”)
Ranzato, Gaidon, Kohli, Sahba, and Santos are analogous art because they are in the field of endeavor of machine learning.
It would have been obvious to one of ordinary skill in the art before the effective filing date to combine the structured prediction task with sampling of candidate outputs of Ranzato, Gaidon, Kohli, and Sahba, with the sampling from a stationary distribution of Santos. The modification would have been obvious because one of ordinary skill in the art would have been motivated to use a sample from the same distribution for the entire training duration, so that resampling will not be required during later training iterations, thereby increasing the efficiency of training (Santos, Sec II A:  “We can estimate the expectation of any function of θ using averages of samples drawn after convergence to the stationary distribution. The key to the MCMC approach is the use of an appropriate sampling scheme to construct ergodic Markov chains that converge to such a stationary distribution. If run long enough, these chains will 

Claims 12 and 13 are rejected under 35 U.S.C. 103 as being unpatentable over the combination of Ranzato, Gaidon, Kohli, and Sahba, further in view of Risholm et al. (“Bayesian characterization of uncertainty in intra-subject non-rigid registration”; hereinafter Risholm).
As per Claim 12, the combination of Ranzato, Gaidon, Kohli, and Sahba teaches the method of claim 10 as shown above, as well as the score for each of the candidate outputs being based on the measure of the similarity.  However, the combination of Ranzato, Gaidon, Kohli, and Sahba does not teach wherein the score for each of the candidate outputs is based on the measure of the similarity scaled by a temperature hyper-parameter that controls a concentration of the score distribution.
Risholm teaches wherein the score for each of the candidate outputs is based on the measure of the similarity scaled by a temperature hyper-parameter that controls a concentration of the score distribution.  (Risholm, Sec 2.1 p 540 right column, discloses:  “By Bayes' theorem, the posterior distribution on deformation and hyper-parameters T ~ {Ts, Tt} can be written as: [Eq. 2]. Assuming the dissimilarity and regularization energies (Es and Er respectively) as the sufficient statistics of the posterior p(u | T,m,f), then the Boltzmann's distribution obtained from these energies has maximum entropy that satisfies this sufficiency property. Modeled with Boltzmann's distribution, the likelihood with a SSD energy functional is now equivalent to a voxel-based i.i.d. Gaussian noise model with zero mean and variance Ts.”  Here, Risholm discloses a hyperparameter T comprising values Ts, Tt. Risholm also discloses that Ts is a variance term.  The term “variance” describes the concentration of a distribution, indicating how spread out the values are around the mean.  Risholm, Section 2.1 p 541 Para 2, goes on to describe this T parameter as a “temperature hyperparameter”:  “With fixed values of the temperature hyper-parameters (HPs) T , the posterior distribution on the displacement field takes the following form:  [Eq. 3]”.  Therefore, Risholm discloses a temperature hyperparameter that controls a concentration of the score distribution.  Also, Risholm discloses that T scales the similarity measure in Eq. 3, where the values of functions labelled “E” appear in the numerator and the temperature hyperparameter “T” appears in the denominator.  “E” is described in Risholm as being related to similarity in Sec 2 Lines 7-9: “image dissimilarity measure Es(u; f, m) and a regularization term E(u) on the displacements.”  Therefore, the temperature hyperparameter scales the similarity measure.)
Ranzato, Gaidon, Kohli, Sahba, and Risholm are analogous art because they are in the field of endeavor of machine learning.
It would have been obvious to one of ordinary skill in the art before the effective filing date to combine the structured prediction task with sampling of candidate outputs of Ranzato, Gaidon, Kohli, and Sahba, with the temperature hyperparameter of Risholm. The modification would have been obvious because one of ordinary skill in the art would have been motivated to have the ability to fine tune the score distribution, to thus improve the accuracy of training of the structured prediction task, which depends on sampling auxiliary outputs from the score distribution (Risholm, Intro p 539 Last Paragraph:  “We demonstrate the method with a Sum of Squared Differences (SSD) based likelihood, where the Boltzmann temperature models the 

As per Claim 13, the combination of Ranzato, Gaidon, Kohli, Sahba, and Risholm teaches the method of claim 12 as shown above, as well as wherein the score for each of the candidate auxiliary outputs is proportional to the scaled measure of the similarity exponentiated.  (Risholm, Eq 3, discloses terms with functions labelled Es and Et representing similarity measures in the numerators with respective temperature hyperparameters Ts and Tt in the denominators, together representing the scaled values of the similarity measure.  One can also see in Eq. 3 that these values are inside an “exp” term, and are thus exponentiated.)
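For illustration only, a score distribution in which each score is proportional to the exponentiated, temperature-scaled similarity can be sketched as follows.  The sketch is not taken from Risholm or any other cited reference; it assumes that a higher similarity value means a closer match (for example, a negative edit distance), and the function names score_distribution and sample_candidate are hypothetical.

```python
import math
import random


def score_distribution(similarities, temperature):
    """Normalized scores with score_i proportional to exp(similarity_i / temperature)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [s / temperature for s in similarities]
    # Subtracting the maximum before exponentiating avoids overflow and leaves
    # the normalized result unchanged.
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]


def sample_candidate(candidates, similarities, temperature, rng=random):
    """Draw one candidate in accordance with the score distribution."""
    probabilities = score_distribution(similarities, temperature)
    return rng.choices(candidates, weights=probabilities, k=1)[0]
```

In this sketch a small temperature concentrates the probability mass on the most similar candidate, while a large temperature spreads it toward a uniform distribution, which is one way to read "controls a concentration of the score distribution".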

Claim 14 is rejected under 35 U.S.C. 103 as being unpatentable over the combination of Ranzato, Gaidon, Kohli, and Sahba, further in view of Modarresi et al. (US 2017/0116530 A1; hereinafter Modarresi).
As per Claim 14, despite reciting “The method of Claim 9”, “wherein sampling the candidate output comprises” lacks antecedent basis, and appears to be directed to Claim 10.  Examiner is interpreting the claim as reciting “The method of Claim 10”.  The combination of Ranzato, Gaidon, Kohli, and Sahba teaches the method of claim 10 as shown above.  However, the combination of Ranzato, Gaidon, Kohli, and Sahba does not explicitly teach wherein sampling the candidate output comprises: sampling the candidate output using stratified sampling.
Modarresi teaches wherein sampling the candidate output comprises: sampling the candidate output using stratified sampling.  (Modarresi, Para [0047], discloses:  “Sampling of any form may be used, such as random sampling or stratified sampling.”)
Ranzato, Gaidon, Kohli, Sahba, and Modarresi are analogous art because they are in the field of endeavor of machine learning.
It would have been obvious to one of ordinary skill in the art before the effective filing date to combine the structured prediction task with sampling of candidate outputs of Ranzato, Gaidon, Kohli, and Sahba, with the stratified sampling of Modarresi.  The modification would have been obvious because one of ordinary skill in the art would have been motivated to improve the accuracy of the structured prediction model (Modarresi [0047]: “With stratified sampling, a population (e.g., set of potential values for a parameter) is divided into different subgroups or strata and, thereafter, samples are randomly selected proportionally from the different strata in accordance with respective probabilities. As such, stratified sampling can be 

Claim 15 is rejected under 35 U.S.C. 103 as being unpatentable over the combination of Ranzato, Gaidon, Kohli, and Sahba, further in view of Vasseur et al. (US 2017/0279834 A1; hereinafter Vasseur).
As per Claim 15, despite reciting “The method of Claim 9”, “wherein sampling the candidate output comprises” lacks antecedent basis, and appears to be directed to Claim 10.  Examiner is interpreting the claim as reciting “The method of Claim 10”.  The combination of Ranzato, Gaidon, Kohli, and Sahba teaches the method of claim 10 as shown above.  However, the combination of Ranzato, Gaidon, Kohli, and Sahba does not explicitly teach wherein sampling the candidate auxiliary output comprises: sampling the candidate output using importance sampling.
Vasseur teaches wherein sampling the candidate auxiliary output comprises: sampling the candidate output using importance sampling. (Vasseur, Para [0132], discloses:  “For each managed UPC 516, perform an importance sampling of the anomaly, as described below”)
Ranzato, Gaidon, Kohli, Sahba, and Vasseur are analogous art because they are in the field of endeavor of machine learning.
It would have been obvious to one of ordinary skill in the art before the effective filing date to combine the structured prediction task with sampling of candidate outputs of Ranzato, Gaidon, Kohli, and Sahba, with the importance sampling of Vasseur.  The modification would have been obvious because one of ordinary skill in the art would have been motivated to improve the accuracy of training by avoiding bias from the training set (Vasseur [0132]: “Recall 

Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to LEONARD A SIEGER whose telephone number is (571)272-9710.  The examiner can normally be reached on M-F 8:00 am - 5:00 pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Ann Lo, can be reached at (571) 272-9767.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-

/L.A.S./Examiner, Art Unit 2126 
/ANN J LO/Supervisory Patent Examiner, Art Unit 2126