DETAILED ACTION

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
This office action is in response to Applicant’s reply filed on 2/2/2021.

Status of Claims

The following claims are pending in this office action: 1-20
The following claims are amended in this office action: 1-5, 11-15, 20
The following claims are rejected: 1-20
Claims 1-20 are presented for examination.

Specification
The amendment to the Specification is acknowledged and the objection to the drawing has been withdrawn. 

Response to Arguments 

Applicant’s arguments with respect to the Drawings have been fully considered and are persuasive. The objection to the Drawing of 10/01/2020 has been withdrawn after review of the corrected Specification submission.
Applicant’s arguments with respect to the 112(b) rejections to Claims 1-20 have been fully considered and are persuasive. The rejections to Claims 1-20 of 10/01/2020 have been withdrawn as a result of the Applicant’s amendments to Independent Claims 1, 11 and 20, Dependent Claims 4 and 14, and Dependent Claims 5 and 15 which addressed the indefiniteness.
Applicant’s arguments with respect to claims 1, 3-8, 11, 13-18 and 20 have been considered but are moot because the arguments are directed to amended limitations that have not been previously examined.
Regarding Applicant’s arguments with respect to claims 1, 3-8, 11, 13-18 and 20, Applicant asserts that none of the cited references teach or suggest each and every limitation of the independent claims. Specifically, Applicant amended each of the independent claims to further distinguish Applicant’s claim language from the cited references as follows:
“after calculating the unsupervised component, updating the target prediction vector associated with the particular input vector to include at least a portion of the prediction vector.”
Applicant states that these limitations make clear that the target prediction vector is updated after the unsupervised component is calculated and the target prediction vector is updated to include at least a portion of the prediction vector.
	The Examiner respectfully disagrees. Upon further review of the Sajjadi reference, and in light of the interpretation required by the amended claims, Sajjadi teaches after calculating the unsupervised component (Sajjadi Section 2 – Related Work recites the following:
“Another notable example is semi-supervised learning with ladder networks [28] in which the sums of supervised and unsupervised loss functions are simultaneously minimized by backpropagation. In this method, a feedforward model, is assumed to be an encoder. The proposed network consists of a noisy encoder path and a clean one. A decoder is added to each layer of the noisy path. This decoder is supposed to reconstruct a clean activation of each layer. The unsupervised loss function is the difference between the output of each layer in clean path and its corresponding reconstruction from the noisy path.”

The unsupervised loss function is the difference between the output of each layer in the clean path and its corresponding reconstruction from the noisy path, and this difference is minimized via backpropagation (i.e. after calculating the unsupervised loss, updating the target prediction vector)), 
updating the target prediction vector associated with the particular input vector to include at least a portion of the prediction vector (Sajjadi Section 3 – Method recites the following: 
“We can design batches to contain replications of training samples so we can easily optimize this transformation/stability loss function. If we use data augmentation, we put different transformed versions of an unlabeled data in the mini-batch instead of replication. This unsupervised loss function can be used with any backpropagation-based algorithm. Even though, every mini-batch contains replications of a training sample, these are used to calculate a single backpropagation signal avoiding gradient bias and not adversely affecting convergence. It is also possible to combine this loss with any supervised loss function.” 

“The proposed loss minimizes the l2-norm of the difference between predictions of multiple transformed versions of a sample, but it does not impose any restrictions on the individual elements of a single prediction vector.”

The unsupervised loss function is combined with any supervised loss function and used with any backpropagation-based algorithm, every mini-batch contains replications of a training sample, and the proposed loss minimizes the l2-norm of the difference between predictions of multiple transformed versions of a sample without imposing any restrictions on the individual elements of a single prediction vector (i.e. after calculating the unsupervised component, updating the target prediction vector associated with the particular input vector)). 
	Therefore Sajjadi meets the amended limitations as recited above.
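For illustration only, the transformation/stability loss quoted above from Sajjadi Section 3 may be sketched as follows. The sketch assumes squared l2 differences summed over every pair of passes of one sample; the function name and the list-of-lists input format are hypothetical and are not taken from Sajjadi or from the claims.

```python
from itertools import combinations


def transformation_stability_loss(predictions):
    """Sum of squared l2 distances between every pair of prediction vectors.

    `predictions` holds the network outputs for several randomly
    transformed passes of the same sample (hypothetical input format).
    """
    total = 0.0
    for p, q in combinations(predictions, 2):
        total += sum((a - b) ** 2 for a, b in zip(p, q))
    return total


# Three passes of one unlabeled sample through the network.
passes = [[0.7, 0.3], [0.6, 0.4], [0.9, 0.1]]
loss = transformation_stability_loss(passes)
```

The loss is zero only when every pass yields the same prediction vector, which is the stability property the quoted passage describes.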

Regarding Applicant’s argument that Coleman fails to teach or suggest the limitations of “after calculating the unsupervised component, updating the target prediction vector associated with the particular input vector to include at least a portion of the prediction vector,” the Examiner finds the argument persuasive. 
However, Coleman is not relied upon to teach the amended limitation “after calculating the unsupervised component, updating the target prediction vector associated with the particular input vector to include at least a portion of the prediction vector.”
	Therefore, the rejection of claims 2 and 12 remains.

Regarding the amended independent claims 11 and 20, Sajjadi teaches the amended claim language of “after calculating the unsupervised component, updating the target prediction vector associated with the particular input vector to include at least a portion of the prediction vector.” as cited in claim 1. 
Therefore, claims 11 and 20 are rejected under the same rationale.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.


Claims 1, 3-8, 11, 13-18, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Ros Sanchez et al. (US 2017/0262735 A1, herein Ros Sanchez) in view of Sajjadi et al. (“Regularization With Stochastic Transformations and Perturbations for Deep Semi-Supervised Learning”, herein Sajjadi).

Regarding Claim 1.
Ros Sanchez discloses a method, comprising: receiving a set of training data for a deep neural network (Ros Sanchez [0025] lines 1-13 discloses “The recent trend in deep learning has been to strive for even deeper models…Some use an ensemble of classifiers, trained on a small but representative subset of a larger dataset, to label a larger unlabelled dataset.”, Representative subset of a larger dataset and deeper models (i.e. training data and deep neural network)), 
wherein the set of training data includes a plurality of input vectors and a plurality of label vectors, each label vector in the plurality of label vectors corresponding to a particular input vector in the plurality of input vectors (Ros Sanchez [0003] lines 1-22 discloses “One of the building blocks of a CNN is a ‘convolutional layer’ which receives a two-dimensional array of values as an input… Thus, for a given filter, the corresponding small areas of each input two-dimensional image produce a single output value, which is one pixel of the respective two-dimensional output.”, input as input vector, output value as label vectors); 
and training the deep neural network utilizing the set of training data by: 
processing one of the s in the plurality of input vectors (Ros Sanchez [0037] lines 4-13 discloses “The training process uses training data comprising image data encoding training images and annotation data labelling corresponding areas of the training images. The areas are preferably individual pixels of the training images, although they might alternatively be “super-pixels” or other structures instead of individual pixels. The annotation data specifies one of a number of a predetermined set of object categories, and indicates that the corresponding area of the image is an image of an object which is in the object category specified by the annotation data.”, object categories (i.e. prediction vectors)), and 
for each prediction vector in the plurality of prediction vectors corresponding to the particular input vector (Ros Sanchez [0037] lines 4-13 as recited above, annotation data and image (i.e. each prediction vector and the particular input vector)): 
Ros Sanchez discloses a weighting function (Ros Sanchez [0078] lines 1-16 recites “To achieve reasonable per-class accuracy, weighted cross-entropy (WCE) was employed in the definition of the loss function…and the weighting function is given by: 
    [Image: media_image1.png, the weighting function equation of Ros Sanchez [0078]]
”, Weighted cross-entropy (WCE) (i.e. encompassing a supervised component)); however, Ros Sanchez fails to explicitly disclose computing a loss term associated with the particular input vector by combining a supervised component and an unsupervised component according to a weighting function, wherein the unsupervised component is calculated by comparing the prediction vector with a target prediction vector associated with the particular input vector, and after calculating the unsupervised component, updating the target prediction vector associated with the particular input vector to include at least a portion of the prediction vector.
Sajjadi teaches computing a loss term associated with the particular input vector by combining a supervised component and an unsupervised component according to a weighting function (Sajjadi Section 1 - Introduction [2] discloses the following:
“Based on this observation, we introduce an unsupervised loss function optimized by gradient descent that takes advantage of this randomization effect and minimizes the difference in predictions of multiple passes of a data sample through the network during the training phase, which leads to better generalization in testing time. The proposed unsupervised loss function specifically regularizes the network based on the variations caused by randomized data augmentation, dropout and randomized max-pooling schemes. This loss function can be combined with any supervised loss function.” 
Unsupervised loss function and supervised loss function (i.e. unsupervised component and supervised component)), 
wherein the unsupervised component is calculated by comparing the prediction vector with a target prediction vector associated with the particular input vector (Sajjadi Section 2 - Related Work [3] recites the following: 
“Another example of semi-supervised learning with ConvNets [24], which is used for text categorization. The work in [25] is also a deep semi-supervised learning method based on embedding techniques. Unlabeled video frames are also being used to train ConvNets [26, 27]. The target of the ConvNet is calculated based on the correlations between video frames. Another notable example is semi-supervised learning with ladder networks [28] in which the sums of supervised and unsupervised loss functions are simultaneously minimized by backpropagation. In this method, a feedforward model, is assumed to be an encoder. The proposed network consists of a noisy encoder path and a clean one. A decoder is added to each layer of the noisy path. This decoder is supposed to reconstruct a clean activation of each layer. The unsupervised loss function is the difference between the output of each layer in clean path and its corresponding reconstruction from the noisy path.”.
 
Video frames and unlabeled video frames (i.e. target prediction vector and particular input vector)), and
after calculating the unsupervised component, updating the target prediction vector associated with the particular input vector to include at least a portion of the prediction vector (Sajjadi Section 3 – Method recites the following: 
“We can design batches to contain replications of training samples so we can easily optimize this transformation/stability loss function. If we use data augmentation, we put different transformed versions of an unlabeled data in the mini-batch instead of replication. This unsupervised loss function can be used with any backpropagation-based algorithm. Even though, every mini-batch contains replications of a training sample, these are used to calculate a single backpropagation signal avoiding gradient bias and not adversely affecting convergence. It is also possible to combine this loss with any supervised loss function.” 

“The proposed loss minimizes the l2-norm of the difference between predictions of multiple transformed versions of a sample, but it does not impose any restrictions on the individual elements of a single prediction vector.”

The unsupervised loss function is combined with any supervised loss function and used with any backpropagation-based algorithm, every mini-batch contains replications of a training sample, and the proposed loss minimizes the l2-norm of the difference between predictions of multiple transformed versions of a sample without imposing any restrictions on the individual elements of a single prediction vector (i.e. after calculating the unsupervised component, updating the target prediction vector associated with the particular input vector)).
Ros Sanchez and Sajjadi are both directed to machine learning and in particular deep learning. In view of the teachings of Sajjadi, it would have been obvious to one of ordinary skill in the art to apply the teachings of Sajjadi to Ros Sanchez before the effective filing date of the claimed invention in order to minimize the variations in different passes and improve accuracy with a small number of labeled data available. (cf. Sajjadi Section 6 - Conclusion, in part, recites “In this paper, we proposed an unsupervised loss function that minimizes the variations in different passes of a sample through the network caused by non-deterministic transformations and randomized dropout and max-pooling schemes. We evaluated the proposed method using two ConvNet implementations on multiple benchmark datasets. We showed that it is possible to achieve significant improvements in accuracy by using the transformation/stability loss function along with mutual-exclusivity of [30] when we have a small number of labeled data available.”).
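For illustration only, the order of operations recited in claim 1 (combining a supervised component and an unsupervised component according to a weighting function, then updating the target prediction vector) may be sketched as follows. The cross-entropy and mean-squared-difference components, the blending factor `alpha`, and all function names are assumptions of this sketch; they are not taken from the claims or from the cited references.

```python
import math


def supervised_component(prediction, label):
    # Cross-entropy between the prediction vector and a one-hot label vector.
    return -sum(l * math.log(p) for l, p in zip(label, prediction) if l > 0)


def unsupervised_component(prediction, target):
    # Mean squared difference between prediction and target prediction vector.
    return sum((p - t) ** 2 for p, t in zip(prediction, target)) / len(prediction)


def loss_and_update(prediction, label, target, weight, alpha=0.6):
    # `weight` is the value of the weighting function for the current step.
    unsup = unsupervised_component(prediction, target)
    loss = supervised_component(prediction, label) + weight * unsup
    # Only after the unsupervised component is computed is a portion of the
    # prediction folded into the target (hypothetical blending rule).
    new_target = [alpha * t + (1 - alpha) * p for t, p in zip(target, prediction)]
    return loss, new_target


loss, new_target = loss_and_update(
    prediction=[0.8, 0.2], label=[1, 0], target=[0.0, 0.0], weight=0.5
)
```

In this sketch the target is read before it is written, so the unsupervised component always compares the prediction against the target as it stood before the update.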


Regarding Claim 3.
The Ros Sanchez/Sajjadi Combination teaches the method of claim 1, wherein training the deep neural network utilizing the set of training data further includes stochastic augmentation of the plurality of input vectors prior to processing (Sajjadi Introduction Section [2] discloses “The proposed unsupervised loss function specifically regularizes the network based on the variations caused by randomized data augmentation, dropout and randomized max-pooling schemes.”. Randomized data augmentation (i.e. stochastic augmentation)).
See motivation for claim 1 above.
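For illustration only, stochastic augmentation of an input prior to processing may be sketched as follows. The particular transformations (a random horizontal flip followed by additive Gaussian noise), the noise level, and the function name are assumptions of this sketch and are not drawn from Sajjadi or from the claims.

```python
import random


def stochastic_augment(image, rng, noise_std=0.05):
    # Flip each row horizontally with probability 0.5 (hypothetical choice).
    if rng.random() < 0.5:
        rows = [row[::-1] for row in image]
    else:
        rows = [row[:] for row in image]
    # Add independent Gaussian noise to every pixel value.
    return [[v + rng.gauss(0.0, noise_std) for v in row] for row in rows]


rng = random.Random(0)
image = [[0.0, 1.0], [1.0, 0.0]]
first = stochastic_augment(image, rng)
second = stochastic_augment(image, rng)
```

Two passes of the same image yield different augmented inputs, which is the source of the prediction variation that the quoted unsupervised loss regularizes.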

Regarding Claim 4.
The Ros Sanchez/Sajjadi Combination teaches the method of claim 1, wherein the supervised component is calculated by comparing the prediction vector with the label vector associated with the particular input vector (Sajjadi Method Section [3] discloses “This loss function naturally complements the transformation/stability loss function. In supervised learning, each element of the prediction vector is pushed towards zero or one depending on the corresponding element in label vector.” Supervised learning (i.e. supervised component)).
See motivation for claim 1 above.

Regarding Claim 5.
The Ros Sanchez/Sajjadi Combination teaches the method of claim 1, wherein the supervised component and the unsupervised component are combined based on the weighting function (Sajjadi Section 1 - Introduction [2] discloses the following:
“Based on this observation, we introduce an unsupervised loss function optimized by gradient descent that takes advantage of this randomization effect and minimizes the difference in predictions of multiple passes of a data sample through the network during the training phase, which leads to better generalization in testing time. The proposed unsupervised loss function specifically regularizes the network based on the variations caused by randomized data augmentation, dropout and randomized max-pooling schemes. This loss function can be combined with any supervised loss function.”

Unsupervised loss function combined with any supervised loss function (i.e. combined supervised and unsupervised component). Additionally, Ros Sanchez [0078] recites, in part, “To achieve reasonable per-class accuracy, weighted cross-entropy (WCE) was employed in the definition of the loss function…and the weighting function is given by: 
    [Image: media_image1.png, the weighting function equation of Ros Sanchez [0078]]
”).
See motivation for claim 1 above.
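For illustration only, a weighted cross-entropy of the general kind referenced in Ros Sanchez [0078] may be sketched as follows. The weighting function of Ros Sanchez appears only in the image above and is not reproduced here, so the per-class weights in this sketch are hypothetical placeholders, as is the function name.

```python
import math


def weighted_cross_entropy(prediction, label, class_weights):
    # Each class's log-loss term is scaled by that class's weight.
    return -sum(
        w * y * math.log(p)
        for w, y, p in zip(class_weights, label, prediction)
        if y > 0
    )


# Hypothetical weights: the second class counts twice as much as the first.
wce = weighted_cross_entropy(
    prediction=[0.5, 0.5], label=[0, 1], class_weights=[1.0, 2.0]
)
```

A larger weight on a class increases the penalty for mispredicting that class, which is how a weighted loss can raise per-class accuracy for under-represented classes.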

Regarding Claim 6.
The Ros Sanchez/Sajjadi Combination teaches the method of claim 1, wherein the deep neural network is a convolution neural network, and wherein each input vector in the plurality of input vectors is an image comprising a two-dimensional array of pixel values (Ros Sanchez [0003] lines 1-22 discloses “One of the building blocks of a CNN is a “convolutional layer” which receives a two-dimensional array of values as an input. A convolutional layer comprises an integer number b of filters, defined by a respective set of numerical parameters. The input to a convolutional layer is a set of two-dimensional arrays of identical size; let us denote the number of these arrays as an integer a. Each filter is convolved with the input two-dimensional arrays simultaneously, to produce a respective two-dimensional output. During the convolution process, a given one of the filters successively receives input from successive corresponding windows (i.e. small areas) of each of the input two-dimensional arrays (the “visual field” of the filter). The size of the window may be denoted as k×k, where k is an integer; thus, the filter generates a single output value using k×k×a input values. The filter multiples these values by k×k×a respective filter values, and adds the results to give a corresponding output value. Thus, for a given filter, the corresponding small areas of each input two-dimensional image produce a single output value, which is one pixel of the respective two-dimensional output.”, CNN, input, two-dimensional array of values, and two-dimensional image (i.e. convolutional neural network, input vector, two-dimensional array of pixel values, and image)).
See motivation for claim 1 above.
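For illustration only, the single output value that the quoted passage of Ros Sanchez [0003] describes (a filter multiplying k×k×a input values by k×k×a filter values and adding the results) may be sketched as follows. The function name and the nested-list layout of the windows are assumptions of this sketch.

```python
def filter_output(windows, filter_values):
    """Multiply k*k*a window values by k*k*a filter values and sum them.

    `windows` holds one k-by-k window from each of the `a` input arrays;
    `filter_values` has the same shape (hypothetical layout).
    """
    return sum(
        windows[c][i][j] * filter_values[c][i][j]
        for c in range(len(windows))
        for i in range(len(windows[c]))
        for j in range(len(windows[c][i]))
    )


# a = 2 input arrays, window size k = 2.
windows = [[[1, 2], [3, 4]], [[0, 1], [1, 0]]]
filter_values = [[[1, 0], [0, 1]], [[2, 2], [2, 2]]]
pixel = filter_output(windows, filter_values)
```

Sliding the window across the input arrays and repeating this computation at each position produces the two-dimensional output the passage refers to.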

Regarding Claim 7.
The Ros Sanchez/Sajjadi Combination teaches the method of claim 1, wherein updating the target prediction vector associated with the particular input vector comprises normalizing the target prediction vector according to a correction factor (Ros Sanchez [0010] lines 1-6 discloses “Another common building block of a convolutional network is a Batch normalisation (BNorm) layer. This operates on a set of input values, and uses two numerical parameters A and B. Each input values is reduced by the value A, and the then divided by parameter B, to produce a respective output value.”. Set of input values as input vector, set of output values as target prediction vector, reduced by the value of A and then divided by B as correction factor).
See motivation for claim 1 above.
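For illustration only, the operation that the quoted passage of Ros Sanchez [0010] attributes to a BNorm layer (each input value reduced by parameter A and then divided by parameter B) may be sketched as follows. The function name and the example parameter values are hypothetical.

```python
def batch_norm(values, a, b):
    # Subtract parameter A from each input value, then divide by parameter B.
    return [(v - a) / b for v in values]


# Hypothetical parameters chosen so the outputs are centered on zero.
normalized = batch_norm([2.0, 4.0, 6.0], a=4.0, b=2.0)
```

With these example parameters the three inputs map to -1.0, 0.0, and 1.0.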

Regarding Claim 8.
The Ros Sanchez/Sajjadi Combination teaches the method of claim 1, wherein the target prediction vector associated with the particular input vector is initialized to a zero vector prior to a first training epoch (Ros Sanchez [0078] lines 1-21 recites the following as depicted in the screenshot below. 

    [Image: media_image2.png, screenshot of Ros Sanchez [0078] lines 1-21]

yijl equal to zero for one value of 1 and zero for all others for the first input (i.e. first input zero vector)).
	See motivation for claim 1 above.

Regarding Claims 11 and 13-18.
Claims 11 and 13-18 are directed to a system configured to perform methods substantially identical to those recited in claims 1 and 3-8, respectively. Therefore the rejections made to claims 1 and 3-8 are applied to claims 11 and 13-18.
In addition, Ros Sanchez discloses “the computer apparatus contains, or has access to, a tangible data storage device storing program instructions (in non-transitory form) operative to cause a processor of the computer apparatus, when running the program instructions, to carry out the steps of the method for generating the S-Net and T-Net” (Ros Sanchez [0045] lines 4-9, Computer apparatus and tangible data storage device (i.e. system and memory)).

Regarding Claim 20.
Claim 20 is directed to a non-transitory, computer-readable storage medium storing instructions that, when executed by a processor, cause the processor to perform the method recited in claim 1. Therefore the rejection made to claim 1 is applied to claim 20.
In addition, Ros Sanchez discloses “the computer apparatus contains, or has access to, a tangible data storage device storing program instructions (in non-transitory form) operative to cause a processor of the computer apparatus, when running the program instructions, to carry out the steps of the method for generating the S-Net and T-Net” (Ros Sanchez [0045] lines 4-9, tangible data storage (i.e. memory)).

Claims 2 and 12 are rejected under 35 U.S.C. 103 as being unpatentable over Ros Sanchez in view of Sajjadi as applied to claims 1, 3-8, 11, 13-18, and 20 above, further in view of Coleman et al. (US 2019/0113973 A1, herein Coleman). 

Regarding Claim 2.
The Ros Sanchez/Sajjadi Combination teaches the method of claim 1, wherein the plurality of input vectors are analyzed by the deep neural network over a number of training epochs (Ros Sanchez [0090] lines 3-6 discloses “In practice this means that the resulting models are biased towards the dominant classes and producing models with higher per-class accuracy requires a higher number of epochs during training.”, Epochs during training (i.e. training epochs)). 
However, the Ros Sanchez/Sajjadi Combination does not teach wherein the target prediction vector associated with the particular input vector updated during a current training epoch is utilized during computation of the unsupervised component of the loss term for the particular input vector during a subsequent training epoch.
Coleman teaches wherein the target prediction vector associated with the particular input vector updated during a current training epoch is utilized during computation of the unsupervised component of the loss term for the particular input vector during a subsequent training epoch (Coleman [0278] & [0279] lines 1-2 discloses “The performance of this pipeline is determined by looking at the error rate of predictions of each epoch in the Test data into either Busy Mind or Quiet mind. [0279] The parameters are varied and the above steps are repeated until the error rate converges.”, Each epoch, error rate and test data (i.e. Current and subsequent training epoch, loss term, and input vector)).
Coleman and the Ros Sanchez/Sajjadi Combination are both directed to machine learning. In view of the teachings of Coleman, it would have been obvious to one of ordinary skill in the art to apply the teachings of Coleman to Ros Sanchez as modified by Sajjadi before the effective filing date of the claimed invention in order to take significantly less computing resources than training the prediction model (cf. Coleman [0277], in part, recites “The epoch is classified into the category where the selected features of that epoch are closest. The prediction model takes significantly less computing resources than training the prediction model.”).
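For illustration only, the epoch-to-epoch relationship recited in claim 2 (a target prediction vector updated during a current training epoch being used to compute the unsupervised component during a subsequent training epoch) may be sketched as follows. The squared-difference comparison, the blending factor `alpha`, the zero-vector starting point, and the function name are assumptions of this sketch and are not taken from Coleman, Ros Sanchez, or Sajjadi.

```python
def train_epochs(predictions_per_epoch, alpha=0.5):
    # The target starts as a zero vector before the first epoch.
    target = [0.0] * len(predictions_per_epoch[0])
    unsupervised_history = []
    for prediction in predictions_per_epoch:
        # Compare against the target left behind by the previous epoch.
        unsup = sum((p - t) ** 2 for p, t in zip(prediction, target))
        unsupervised_history.append(unsup)
        # Update the target for use in the next epoch (hypothetical rule).
        target = [alpha * t + (1 - alpha) * p for t, p in zip(target, prediction)]
    return unsupervised_history, target


# The same input yields the same prediction in two consecutive epochs.
history, final_target = train_epochs([[1.0, 0.0], [1.0, 0.0]])
```

In the second epoch the unsupervised component is smaller because the target already contains a portion of the first epoch's prediction.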

Regarding Claim 12.
Claim 12 is directed to a system configured to perform a method substantially identical to that recited in claim 2. Therefore the rejection made to claim 2 is applied to claim 12.
In addition, Ros Sanchez discloses “the computer apparatus contains, or has access to, a tangible data storage device storing program instructions (in non-transitory form) operative to cause a processor of the computer apparatus, when running the program instructions, to carry out the steps of the method for generating the S-Net and T-Net” (Ros Sanchez [0045] lines 4-9, computer apparatus and tangible data storage device (i.e. system and memory)).

Claims 9-10 and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Ros Sanchez in view of Sajjadi as applied to claims 1, 3-8, 11, 13-18, and 20 above, further in view of Chen et al. (US 10,127,477 B2, herein Chen). 

Regarding Claim 9.
The Ros Sanchez/Sajjadi Combination teaches the method of claim 1. However, the Ros Sanchez/Sajjadi Combination does not teach wherein the deep neural network is implemented on a parallel processing unit.
	Chen teaches wherein the deep neural network is implemented on a parallel processing unit (Chen Col. 4 lines 4-10 discloses “For example, the single computing device may control execution of the plurality of threads to perform computations in parallel. Where the plurality of node devices 104 includes at least one computing device distinct from master device 102, each node device 300 may control execution of one or more threads to further perform computations in parallel.” & Fig. 3 Element 310, the examiner interprets node device processor as encompassing parallel processing unit).
Chen and the Ros Sanchez/Sajjadi Combination are both directed to machine learning. In view of the teachings of Chen, it would have been obvious to one of ordinary skill in the art to apply the teachings of Chen to Ros Sanchez as modified by Sajjadi before the effective filing date of the claimed invention in order to reduce computation time while maintaining accuracy (cf. Chen Col. 45 lines 21-24, in part, recites “Implementing some examples of the present disclosure at least in part by using the above-described machine-learning models can reduce the total number of processing iterations, time, memory, electrical power, or any combination of these consumed by a computing device when analyzing data.” Additionally, Chen Col. 36 lines 35-45 recites “By distributing the labeling task across a plurality of node devices 104, the computation time can be significantly reduced while maintaining the obtained accuracy. Master labeling application 222 in combination with local labeling application 312 perform labeling using a plurality of threads and/or a plurality of computing devices. As a result, data labeling system 100 improves an execution time significantly compared to a single thread based system. Data labeling system 100 reduces the time complexity from O(dn^2) to O(dn^2/V^2) where V represents the total number of threads.”).

Regarding Claim 10.
The Ros Sanchez/Sajjadi/Chen Combination teaches the method of claim 9, wherein each neuron in the deep neural network is implemented as a thread executed by the parallel processing unit (Chen Col. 3 line 67 & Col. 4 lines 1-4 discloses “In an alternative embodiment, data labeling system 100 may include a single computing device that supports a plurality of threads. The described processing by node device 300 is performed by master device 102 using a different thread.”, node device is interpreted as neuron).
See motivation for claim 9 above.
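For illustration only, an arrangement in which each neuron of a layer is evaluated as a separate thread task may be sketched as follows. The sketch uses CPU threads from the Python standard library rather than a parallel processing unit; the ReLU activation, the layer layout, and the function names are assumptions that are not taken from Chen or from the claims.

```python
from concurrent.futures import ThreadPoolExecutor


def neuron(weights, bias, inputs):
    # Weighted sum of the inputs plus bias, passed through a ReLU
    # activation (hypothetical choice of activation).
    return max(0.0, sum(w * x for w, x in zip(weights, inputs)) + bias)


def layer_forward(layer, inputs):
    # Each neuron is submitted as its own task to a pool that has one
    # worker thread per neuron in the layer.
    with ThreadPoolExecutor(max_workers=len(layer)) as pool:
        futures = [pool.submit(neuron, w, b, inputs) for w, b in layer]
        return [f.result() for f in futures]


# Two neurons, each given as (weights, bias).
layer = [([1.0, -1.0], 0.0), ([0.5, 0.5], 1.0)]
outputs = layer_forward(layer, [2.0, 1.0])
```

Results are collected in submission order, so the output vector keeps the same neuron ordering as the layer definition regardless of which thread finishes first.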

Regarding Claim 19.
Claim 19 is directed to a system configured to perform a method substantially identical to that recited in claims 9-10. Therefore the rejections made to claims 9-10 are applied to claim 19.
In addition, Ros Sanchez discloses “the computer apparatus contains, or has access to, a tangible data storage device storing program instructions (in non-transitory form) operative to cause a processor of the computer apparatus, when running the program instructions, to carry out the steps of the method for generating the S-Net and T-Net” (Ros Sanchez [0045] lines 4-9, computer apparatus as system, processor as processor, tangible data storage device as memory).



Conclusion
THIS ACTION IS MADE FINAL.  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action. 


Any inquiry concerning this communication or earlier communications from the examiner should be directed to LEON W CHEUNG whose telephone number is (571)272-9930.  The examiner can normally be reached on 8:30AM-5:00PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Miranda Huang can be reached on (571) 270-7092.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/LWC/Examiner, Art Unit 4142

/MIRANDA M HUANG/Supervisory Patent Examiner, Art Unit 2124