Detailed Action
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 1, 2018, is being examined under the first inventor to file provisions of the AIA.

Claims 1-20 are pending.

Information Disclosure Statement
The information disclosure statement (IDS) was submitted on 03/01/2018. The submission is in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statement is being considered by the examiner.

Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.


The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.


Claims 1-20 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention.
Regarding claim 1, the claim recites the limitation ‘employing a weighted cross-entropy loss layer for classification accounting for an imbalance between background classes and object classes’ in lines 8-9. The claim is indefinite because it introduces the indefinite term ‘weighted cross-entropy loss layer’: it is unclear what constitutes a “weighted cross-entropy loss layer”, where the layer is located, whether it is 
For purposes of examination, the claim is being interpreted as: calculating a result of a loss layer (i.e., a loss function) between an expected prediction result and an actual prediction result.
The limitation of ‘employing a boundary loss layer to enable transfer of knowledge of bounding box regression from the teacher model to the student model’ in lines 10-11 is indefinite because it introduces the unclear terms ‘boundary loss layer’ and ‘bounding box regression’: it is unclear what constitutes a “boundary loss layer” and ‘bounding box regression’, where the layer is located, whether it is located in the teacher model or the student model, what type of data the layer receives, and what type of data the layer outputs.
For purposes of examination, the claim is being interpreted as: calculating a result of a loss layer (i.e., a loss function) to transfer knowledge of the loss function from the teacher model to the student model.
The limitation of ‘confidence-weighted binary activation loss layer’ in lines 12-14 is indefinite because it is unclear what constitutes a “confidence-weighted binary activation loss layer”, where the layer is located, whether it is located in the teacher model or the student model, what type of data the layer receives, and what type of data the layer outputs. The scope of the invention would not be readily understood by a person of ordinary skill in the art given the unclear wording of the claims.
For purposes of examination, the claim is being interpreted as: calculating a result of a loss layer (i.e., a loss function) to train the student model to achieve a similar distribution of neurons as achieved by the teacher model.
Claims 2-7 depend on claim 1 and inherit the same deficiency. Therefore, they are rejected by the same reasoning as claim 1.
Claims 8 and 15 have similar limitations to method claim 1 above. Therefore, they are rejected under the same rationale as claim 1 above.

Claims 9-14 and 16-20 depend on claims 8 and 15, respectively, and inherit the same deficiency. Therefore, they are rejected by the same reasoning as claims 8 and 15.

Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.

Claims 1-20 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.

Regarding claim 1,
2A Prong 1: The limitation of ‘training by learning a student model from a teacher model’ is a mental process, because the limitation encompasses a teacher teaching a student to imitate the teacher’s calculation. The limitation of ‘employing a weighted cross-entropy loss layer for classification accounting for an imbalance between background classes and object classes’ is a mathematical concept, because a weighted cross-entropy loss layer is a loss function, which is a mathematical concept, and calculating an imbalance between two sets of data is also a mathematical concept. The limitation of ‘employing a boundary loss layer to enable transfer of knowledge of bounding box regression from the teacher model to the student model’ is an abstract idea, because the boundary loss layer is a function that performs bounding box regression, which is a mathematical concept, and the transfer of knowledge from the teacher model to the student model is a mental process because it encompasses one user learning a mathematical concept from another user. The limitation of ‘employing a confidence-weighted binary activation loss layer to train intermediate layers of the student model to achieve similar distribution of neurons as achieved by the teacher model’ is an abstract idea, because the confidence-weighted binary activation loss layer is a loss function, which is a mathematical concept, and the student model imitating the teacher model encompasses a user imitating another user’s result with pen and paper.
2A Prong 2: The judicial exception is not integrated into a practical application. In particular, the claim recites one additional element: at least one processor. The processor is recited at a high level of generality (i.e., as a generic processor performing a generic computer function of calculating a loss) such that it amounts to no more than mere instructions to apply the exception using a generic computer component. Accordingly, this additional element does not integrate the abstract idea into a practical application because it does not impose any meaningful limits on practicing the abstract idea. Furthermore, the claim recites the limitation of ‘inputting a plurality of images into the Faster R-CNN’, which is a form of insignificant extra-solution activity. The claim is directed to an abstract idea.
2B: The claim does not recite additional elements that amount to significantly more than the judicial exception. As discussed above with respect to integration of the abstract idea into a practical application, the additional element of using a processor to perform the loss function calculation amounts to no more than mere instructions to apply the exception using a generic computer component. Mere instructions to apply an exception using a generic computer component cannot provide an inventive concept. Furthermore, the limitation of ‘employing a Faster R-CNN’ merely indicates the particular technological field or environment in which the abstract idea is performed. The claim recites the limitation of ‘inputting a plurality of images into the Faster R-CNN’, which was considered to be insignificant extra-solution activity in Step 2A Prong 2, and thus it is re-evaluated in Step 2B to determine whether it is more than what is well-understood, routine, and conventional activity in the field; it is mere data gathering (MPEP 2106.05(g)). The limitation of ‘Faster R-CNN’ merely indicates the particular technological field or environment in which the abstract idea is performed (MPEP 2106.05(h)), i.e. training inputs using specific 

Regarding claim 8, the limitation of ‘a system for training fast models for real-time object detection with knowledge transfer, the system comprising: a memory; and a processor’ recites generic computer components. Claim 8 is a system claim having similar limitations to method claim 1 above. Therefore, it is directed to an abstract idea under the same rationale as claim 1 above. The claim is not patent eligible.

Regarding claim 15, the limitation of a non-transitory computer-readable medium comprising a computer-readable program for training fast models recites a generic computer component. Claim 15 is a non-transitory computer-readable medium claim having similar limitations to method claim 1 above. Therefore, it is directed to an abstract idea under the same rationale as claim 1 above. The claim is not patent eligible.

Regarding claim 2,
2A Prong 1: The limitation of hint-based learning that enables a feature representation of the student model to be similar to a feature representation of the teacher model is a mental process, because it encompasses a student trying to imitate what a teacher is doing (i.e., locating a specific object in a photo) with pen and paper.
2A Prong 2: The judicial exception is not integrated into a practical application, because there is no additional element.
2B: The claim does not recite additional elements that amount to significantly more than the judicial exception. The claim is not patent eligible.
Claim 9 is a system claim having similar limitations to method claim 2 above. Therefore, it is directed to an abstract idea under the same rationale as claim 2 above.


Regarding claim 3,
2A Prong 1: The limitation of ‘further comprising enabling the hint-based learning to provide hints to the student model for finding local minima’ is a mental process, because it encompasses a user receiving a hint from a teacher to find a local minimum of a mathematical function, which can be performed in the human mind. Finding local minima of a function also appears to be a mathematical process.
2A Prong 2: The judicial exception is not integrated into a practical application, because there is no additional element.
2B: The claim does not recite additional elements that amount to significantly more than the judicial exception. The claim is not patent eligible.
Claim 10 is a system claim having similar limitations to method claim 3 above. Therefore, it is directed to an abstract idea under the same rationale as claim 3 above.
Claim 17 is a non-transitory computer-readable medium claim having similar limitations to method claim 3 above. Therefore, it is directed to an abstract idea under the same rationale as claim 3 above.

Regarding claim 4,
2A Prong 1: The limitation of ‘applying a larger weight for the background classes and a smaller weight for the object classes in the weighted cross-entropy loss layer’ is a mathematical process, because applying a larger weight to a specific portion of the dataset is equivalent to multiplying that portion by a larger number to emphasize the background class, which is a mathematical calculation.
2A Prong 2: The judicial exception is not integrated into a practical application, because there is no additional element.
2B: The claim does not recite additional elements that amount to significantly more than the judicial exception. The claim is not patent eligible.
Claim 11 is a system claim having similar limitations to method claim 4 above. Therefore, it is directed to an abstract idea under the same rationale as claim 4 above.
Claim 18 is a non-transitory computer-readable medium claim having similar limitations to method claim 4 above. Therefore, it is directed to an abstract idea under the same rationale as claim 4 above.
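For illustration only, the per-class weighting discussed for claim 4 can be sketched as a generic weighted cross-entropy calculation. This is a minimal sketch and not the applicant's implementation: the function name, the three-class setup, and the weight values (1.5 for background, 1.0 for objects) are assumptions chosen for the example.

```python
import math

def weighted_cross_entropy(target, predicted, class_weights):
    """Return -sum_c w_c * t_c * log(p_c) for a single sample."""
    return -sum(
        w * t * math.log(p)
        for w, t, p in zip(class_weights, target, predicted)
        if t > 0  # terms with a zero target contribute nothing
    )

# Hypothetical three-class setup: index 0 is background, 1 and 2 are objects.
class_weights = [1.5, 1.0, 1.0]  # assumed values; background is weighted more
predicted = [0.5, 0.25, 0.25]    # predicted class probabilities

background_loss = weighted_cross_entropy([1, 0, 0], predicted, class_weights)
object_loss = weighted_cross_entropy([0, 1, 0], predicted, class_weights)
```

Under these assumed weights, the loss for a background sample is scaled by 1.5 relative to the unweighted cross-entropy, while the loss for an object sample is unchanged.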

Regarding claim 5,
2A Prong 1: The limitation of setting a prediction vector of the loss function to approximate a class label in the boundary loss layer is a mental process, because it encompasses a user setting initial prediction data and then approximating the result of another prediction with that initial prediction data, which can be performed in the human mind with pen and paper.
2A Prong 2: The judicial exception is not integrated into a practical application, because there is no additional element.
2B: The claim does not recite additional elements that amount to significantly more than the judicial exception. The claim is not patent eligible.
Claim 12 is a system claim having similar limitations to method claim 5 above. Therefore, it is directed to an abstract idea under the same rationale as claim 5 above.
Claim 19 is a non-transitory computer-readable medium claim having similar limitations to method claim 5 above. Therefore, it is directed to an abstract idea under the same rationale as claim 5 above.

Regarding claim 6,
2A Prong 1: The limitation of allowing the student model to learn from a bounding box location of the teacher model in a loss layer is a mental process, because it encompasses a user imitating a prediction result from a teacher, which can be performed in the human mind with pen and paper.
2A Prong 2: The judicial exception is not integrated into a practical application, because there is no additional element.
2B: The claim does not recite additional elements that amount to significantly more than the judicial exception. The claim is not patent eligible.
Claim 13 is a system claim having similar limitations to method claim 6 above. Therefore, it is directed to an abstract idea under the same rationale as claim 6 above.
Claim 20 is a non-transitory computer-readable medium claim having similar limitations to method claim 6 above. Therefore, it is directed to an abstract idea under the same rationale as claim 6 above.

Regarding claim 7,
2A Prong 1: The limitation of applying a positive gradient to the layer of the student model when a confidence of the teacher model is greater than a confidence of the student model is a mental process, because it encompasses a user judging the performance of a teacher and a student and giving a positive score if the teacher performs better than the student, which is a process that can be performed in the human mind.
2A Prong 2: The judicial exception is not integrated into a practical application, because there is no additional element.
2B: The claim does not recite additional elements that amount to significantly more than the judicial exception. The claim is not patent eligible.
Claim 14 is a system claim having similar limitations to method claim 7 above. Therefore, it is directed to an abstract idea under the same rationale as claim 7 above.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.


	Claims 1-3, 5-10, 12-17, and 19-20 are rejected under 35 U.S.C. 103 over Shen (Shen et al, 12/01/2016 “In Teacher We Trust: Learning Compressed Models for Pedestrian Detection”) in view of Li (Li et al, 2/14/2017, “Learning without Forgetting”), and further in view of Zagoruyko (Zagoruyko et al, 2/12/2017, “Paying More Attention to Attention: Improving the Performance of Convolutional Neural Networks Via Attention Transfer”).
	
Regarding claim 1, Shen teaches 
a computer-implemented method executed by at least one processor for training fast models for real-time object detection with knowledge transfer, the method comprising: 
employing a Faster Region-based Convolutional Neural Network (R-CNN) as an object detection framework for performing the real-time object detection ([Shen, 1.Introduction, 2nd paragraph] “the top three approaches for pedestrian detection as measured on the Caltech Pedestrian Dataset [10] consist of MSCNN [3], RPN+BF [30], both built upon the Faster-RCNN [25] architecture containing over 100 million parameters, and SA-FastRCNN [21] which features a network with over 30 million parameters”); 
inputting a plurality of images into the Faster R-CNN ([Shen, 5.1.Dataset, 2nd paragraph] “We follow the setup of Caltech10x in [18] and sample every 3rd frame for training. We use the Reasonable configuration when testing on the Caltech test set, which samples every 30th frame and includes only pedestrians without significant occlusion with a minimum height of 50 pixels and excludes the labels “people” and “person” ”; the Caltech10x dataset contains a plurality of images); 
and training the Faster R-CNN by learning a student model from a teacher model by ([Shen, Figure 1; 1.Introduction, 2nd paragraph; 3rd paragraph] “For instance, at the time of writing, the top three approaches for pedestrian detection as measured on the Caltech Pedestrian Dataset [10] consist of MSCNN [3], RPN+BF [30], both built upon the Faster-RCNN [25] architecture containing over 100 million parameters, and SA-FastRCNN [21] which features a network with over 30 million parameters … These large networks contain many redundant parameters [20, 7], so in theory they could be much smaller. To demonstrate this, we adopt Knowledge Distillation (KD) [17] to train a small student network to mimic the large teacher network”): 
employing a weighted cross-entropy loss layer for classification accounting for an imbalance between background classes and object classes ([Shen, 3.Knowledge Distillation, 3rd paragraph] “The loss function L used for training the student is a combination of the soft loss Lsoft, the cross-entropy loss between the soft outputs of the student and teacher, as well as the hard loss Lhard, the standard classification cross-entropy loss between the student outputs and the ground truth labels”, the paragraph discloses hard loss, which is the result of cross-entropy loss between ground truth label that corresponds to the background classes and student output that corresponds to the object class);  
employing a boundary loss layer to enable transfer of knowledge from the teacher model to the student model ([Shen, 1.1 Contributions, 1st paragraph; Figure 1 (b)] “In this paper we propose to use Knowledge Distillation to compress a large network for pedestrian classification. We explore variations on the training process by learning from the outputs of a hint layer inserted before the final fully-connected layer, introducing a loss function that takes into account output covariances”, [Shen, 2. Related Works, Transfer Learning. 2nd paragraph] “Fit-Nets [26] use Knowledge Distillation with intermediate hint layers to train a thinner but deeper student network containing fewer parameters that outperforms even the teacher network”; a layer before the final fully-connected layer, called the ‘hint layer’, transfers the knowledge from teacher to student. The term ‘boundary loss layer’ is merely a name for a loss layer); 
employing a confidence-weighted binary activation loss layer to train the student model to achieve similar distribution of neurons as achieved by the teacher model ([Shen, 2. Related Works, Transfer Learning. 2nd paragraph] “Fit-Nets [26] use Knowledge Distillation with intermediate hint layers to train a thinner but deeper student network containing fewer parameters that outperforms even the teacher network”; a layer before the final fully-connected layer, called the ‘hint layer’, transfers the knowledge from teacher to student, [Shen, 3. Knowledge Distillation; 4.2. Learning With Confidence] “The loss function L used for training the student is a combination of the soft loss Lsoft, the cross-entropy loss between the soft outputs of the student and teacher, as well as the hard loss Lhard, the standard classification cross-entropy loss between the student outputs and the ground truth labels”, “By doing so, we are fitting a multivariate Gaussian distribution to the teacher outputs, from which it is possible to measure the likelihood of the student output as being drawn from the distribution … This function is the square of the Mahalanobis distance. Compared to the mean-square distance, it is smaller along dimensions of high variability, consistent with our idea of reporting smaller gradients for outputs that the teacher is not confident in”, which discloses a plurality of different types of loss functions, including the cross-entropy loss function and the Mahalanobis distance).
Shen does not specifically teach transferring knowledge of bounding box regression from teacher model to student model, and training intermediate layers of the student model with a confidence-weighted binary activation loss layer.
Li, page 4, right column, line 37, 3 Learning Without Forgetting, 5th paragraph] “We use the Knowledge Distillation loss, which was found by Hinton et al. [11] to work well for encouraging the outputs of one network to approximate the outputs of another. This is a modified cross-entropy loss that increases the weight for smaller probabilities”).
Li teaches transferring knowledge of bounding box regression from the teacher model to the student model ([Li, page 12, Appendix A, A.1 MD-Net] “The algorithm picks the bounding box with the highest foreground score, apply a bounding box regression, and report the regression result … A bounding box regression layer is trained on top of the convolutional layers from the first frame’s data, and is kept unchanged”, [Li, page 3, 1st paragraph of 2.2 Topically relevant methods] “Our work also relates to methods that transfer knowledge between networks. Hinton et al. [11] propose Knowledge Distillation, where knowledge is transferred from a large network or a network assembly to a smaller network for efficient deployment.”, which discloses a student model learning from a teacher model).
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention, having the teachings of both Li and Shen, to use the bounding box regression of Li to calculate the loss between the teacher model and the student model of Shen. The suggestion and/or motivation for doing so is to calculate the difference between the expected bounding box and the actual bounding box, which helps improve the result of the student object detection model.
Shen in view of Li does not specifically disclose training intermediate layers of the student model with a confidence-weighted binary activation loss layer.
	
Zagoruyko teaches training intermediate layers of the student model with a confidence-weighted binary activation loss layer ([Zagoruyko, 3.1 Activation-Based Attention Transfer, 5th paragraph; Figure 5] “In attention transfer, given the spatial attention maps of a teacher network (computed using any of the above attention mapping functions), the goal is to train a student network that will not only make correct predictions but will also have attentions maps that are similar to those of the teacher. In general, one can place transfer losses w.r.t. attention maps computed across several layers. For instance, in the case of ResNet architectures, one can consider the following two cases, depending on the depth of teacher and student: Same depth: possible to have attention transfer layer after every residual block”, discloses that the transfer loss (i.e. difference between activation of student and teacher) is used to train student model to achieve similar distribution of neurons as teacher, and the loss function is calculated in several layers).
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention, having the teachings of Zagoruyko, Li, and Shen, to use the confidence-weighted binary activation loss layer of Zagoruyko, which trains intermediate layers of the student model, to calculate the loss between the teacher model and the student model of Shen and Li. The suggestion and/or motivation for doing so is to compare and reduce the difference between the intermediate layers of the teacher model and the intermediate layers of the student model, which helps the result of the student model imitate the result of the teacher model.
As per claim 8, Shen in view of Li, and further in view of Zagoruyko teaches a system for training fast models for real-time object detection with knowledge transfer, the system comprising: a memory; and a processor in communication with the memory, wherein the processor runs program code to ([Shen, 5.4 Training configuration, 4th paragraph] “The models are trained with the Torch framework on a NVIDIA Titan X GPU with 12GB memory”). Claim 8 is a system claim having similar limitations to method claim 1 above. Therefore, it is rejected under the same rationale as claim 1 above.

As per claim 15, Shen in view of Li, and further in view of Zagoruyko teaches a non-transitory computer-readable storage medium comprising a computer-readable program for training fast models for real-time object detection with knowledge transfer, wherein the computer-readable program when executed on a computer causes the computer to perform the steps of ([Shen, 5.4 Training configuration, 4th paragraph] “The models are trained with the Torch framework on a NVIDIA Titan X GPU with 12GB memory”; the 12GB memory is the non-transitory computer-readable storage medium). Claim 15 is a non-transitory computer-readable storage medium claim having similar limitations to method claim 1 above. Therefore, it is rejected under the same rationale as claim 1 above.

	Regarding claim 2, Shen in view of Li, and further in view of Zagoruyko teaches the method of claim 1, further comprising adopting hint-based learning that enables a feature representation of the student model to be similar to a feature representation of the teacher model ([Zagoruyko, 3.1 Activation-Based Attention Transfer, 5th paragraph; Figure 5] “In attention transfer, given the spatial attention maps of a teacher network (computed using any of the above attention mapping functions), the goal is to train a student network that will not only make correct predictions but will also have attentions maps that are similar to those of the teacher. In general, one can place transfer losses w.r.t. attention maps computed across several layers. For instance, in the case of ResNet architectures, one can consider the following two cases, depending on the depth of teacher and student: Same depth: possible to have attention transfer layer after every residual block”, which discloses that the transfer loss (i.e., the difference between the activations of student and teacher) is used to train the student model to achieve a similar distribution of neurons as the teacher, and thus that the student model learns to imitate the teacher model’s features, and [Zagoruyko, 4.1.2] “we experimented with FitNets-style hints using l2 losses on full activations directly, with 1 x 1 convolutional layers to match tensor shapes, and found that improvements over baseline student were minimal (see column F-ActT in table 1)”, which discloses that FitNets-style hints were used).

	As per claim 9, Shen in view of Li, and further in view of Zagoruyko teaches the system of claim 8. Claim 9 is a system claim having similar limitations to method claim 2 above. Therefore, it is rejected under the same rationale as claim 2 above.



	Regarding claim 3, Shen in view of Li, and further in view of Zagoruyko teaches the method of claim 2, further comprising enabling the hint-based learning to provide hints to the student model for finding local minima ([Shen, 4.1 Hint Layer - 4.2 Learning With Confidence] “To increase the dimensionality of the data that the student learns from, we introduce a hint layer, a fully-connected (FC) layer with 64 outputs in front of the final FC layer, and train the student to match the outputs of the hint layer instead. If the student network can perfectly match the hint layer outputs, then just by copying over the teacher’s final FC layer, the student will be able to mimic the teacher’s outputs.”, discloses the hint layer, and “Maximizing the log-likelihood of Equation 5 is equivalent to minimizing the following loss function: $L_{soft} = (Y_S - Y_T)^T \Sigma^{-1} (Y_S - Y_T)$”, discloses the finding local minima of a loss function).
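For illustration only, the loss quoted from Shen above has the form of a squared Mahalanobis distance between the student and teacher outputs, and it can be sketched as follows. This is a minimal sketch rather than Shen's implementation: the function name, the two-dimensional outputs, and the diagonal inverse covariance matrix (variances of 1 and 4) are assumptions chosen for the example.

```python
def squared_mahalanobis(y_student, y_teacher, inv_cov):
    """Return (Ys - Yt)^T * inv_cov * (Ys - Yt) for plain Python lists."""
    diff = [s - t for s, t in zip(y_student, y_teacher)]
    # Matrix-vector product inv_cov @ diff, one row at a time.
    weighted = [sum(row[j] * diff[j] for j in range(len(diff))) for row in inv_cov]
    # Dot product diff . (inv_cov @ diff).
    return sum(d * w for d, w in zip(diff, weighted))

# Hypothetical values: teacher outputs have variance 1 and 4 on the two axes,
# so the inverse covariance matrix is diag(1, 0.25).
inv_cov = [[1.0, 0.0], [0.0, 0.25]]
y_teacher = [0.0, 0.0]
y_student = [1.0, 2.0]

loss = squared_mahalanobis(y_student, y_teacher, inv_cov)
squared_euclidean = sum((s - t) ** 2 for s, t in zip(y_student, y_teacher))
```

With these assumed values the Mahalanobis form penalizes the deviation along the high-variance second dimension less than a plain squared distance would, which matches the quoted statement that the loss is smaller along dimensions of high variability.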
	As per claim 10, Shen in view of Li, and further in view of Zagoruyko teaches the system of claim 9. Claim 10 is a system claim having similar limitations to method claim 3 above. Therefore, it is rejected under the same rationale as claim 3 above.

	As per claim 17, Shen in view of Li, and further in view of Zagoruyko teaches the non-transitory computer-readable storage medium of claim 16. Claim 17 is a non-transitory computer-readable storage medium claim having similar limitations to method claim 3 above. Therefore, it is rejected under the same rationale as claim 3 above.

[Li, APPENDIX A. Tracking with MD-NET using LwF] “The task is to find the bounding box of the tracked object as each image frame is given, where the very first frame’s ground-truth bounding box is known … The algorithm picks the bounding box with the highest foreground score, apply a bounding box regression, and report the regression result”; the prediction vector is an initial value used to perform bounding box regression, and approximating a class label is the same as generating a prediction (i.e., regression) result, as shown in Figure 6 of Li, where the neural network generates a new task label).

	Claim 12 is a system claim having similar limitations to method claim 5 above. Therefore, it is rejected under the same rationale as claim 5 above.

	Claim 19 is a non-transitory computer-readable storage medium claim having similar limitations to method claim 5 above. Therefore, it is rejected under the same rationale as claim 5 above.

Regarding claim 6, Shen in view of Li, and further in view of Zagoruyko teaches the method of claim 1, further comprising allowing the student model to learn from a bounding box location of the teacher model in the boundary loss layer ([Li, page 12, Appendix A, 1st paragraph; A.1 MD-Net] “The task is to find the bounding box of the tracked object as each image frame is given, where the very first frame’s ground-truth bounding box is known … A bounding box regression layer is trained on top of the convolutional layers from the first frame’s data, and is kept unchanged”, which discloses the bounding box location and the boundary loss layer, as the task of finding the bounding box of an object encompasses finding the location of the box, [Li, page 3, 1st paragraph of 2.2 Topically relevant methods] “Our work also relates to methods that transfer knowledge between networks. Hinton et al. [11] propose Knowledge Distillation, where knowledge is transferred from a large network or a network assembly to a smaller network for efficient deployment.”, which discloses a student model learning from a teacher model).
	Claim 13 is a system claim having similar limitations to method claim 6 above. Therefore, it is rejected under the same rationale as claim 6 above.
	Claim 20 is a non-transitory computer-readable storage medium claim having similar limitations to method claim 6 above. Therefore, it is rejected under the same rationale as claim 6 above.

Regarding claim 7, Shen in view of Li, and further in view of Zagoruyko teaches the method of claim 1, further comprising applying a positive gradient to the intermediate layers of the student model when a confidence of the teacher model is greater than a confidence of the student model in the confidence-weighted binary activation loss layer ([Shen, 5th paragraph of 4.1 Hint Layer – 2nd paragraph and 6th paragraph of 4.2 Learning With Confidence] “This is because the ReLU function discards information of negative values, and also because the gradient for where the student predicts a negative value is ignored, leading to instabilities in training … Intuitively, if the teacher reports that it is very confident about its prediction, then the student should trust the teacher more, and if the teacher instead reports that it is not confident about its prediction, then the student should balance mimicking the teacher with predicting the correct label … Compared to the mean-square distance, it is smaller along dimensions of high variability, consistent with our idea of reporting smaller gradients for outputs that the teacher is not confident in”, Shen discloses reporting a smaller gradient for outputs in which the teacher is not confident in its prediction).
	Claim 14 is a system claim having limitations similar to those of method claim 7 above. Therefore, it is rejected under the same rationale as claim 7 above.

Claims 4, 11, and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Shen (Shen et al, 12/01/2016, “In Teacher We Trust: Learning Compressed Models for Pedestrian Detection”) in view of Li (Li et al, 2/14/2017, “Learning without Forgetting”), in view of Zagoruyko (Zagoruyko et al, 2/12/2017, “Paying More Attention to Attention: Improving the Performance of Convolutional Neural Networks Via Attention Transfer”), and further in view of Lee (KR 20180096164 A).

Regarding claim 4, Shen in view of Li, and further in view of Zagoruyko teaches the method of claim 1, further comprising a weighted cross-entropy loss layer ([Shen, 3. Knowledge Distillation, 3rd paragraph] “The loss function L used for training the student is a combination of the soft loss Lsoft, the cross-entropy loss between the soft outputs of the student and teacher, as well as the hard loss Lhard, the standard classification cross-entropy loss between the student outputs and the ground truth labels”, the paragraph discloses the hard loss, which is the cross-entropy loss between the ground truth labels, which correspond to the background classes, and the student outputs, which correspond to the object classes). Li also teaches a weighted cross-entropy loss layer ([Li, page 4, right column, line 37, 3 Learning Without Forgetting, 5th paragraph] “We use the Knowledge Distillation loss, which was found by Hinton et al. [11] to work well for encouraging the outputs of one network to approximate the outputs of another. This is a modified cross-entropy loss that increases the weight for smaller probabilities”, discloses multiple types of loss functions, including a cross-entropy loss function with weight adjustment).
Shen in view of Li, and further in view of Zagoruyko does not specifically teach applying different weights to background classes and object classes.
Lee teaches further comprising a larger weight for the background class and a smaller weight for the object class ([Lee, 5th page, 11th paragraph] “The area to which the low weight is applied is the area corresponding to the background of the input image, and the area to which the high weight is applied is the area corresponding to the object of the input image”).


	Claim 11 is a system claim having limitations similar to those of method claim 4 above. Therefore, it is rejected under the same rationale as claim 4 above.

	Claim 18 is a non-transitory computer-readable storage medium claim having limitations similar to those of method claim 4 above. Therefore, it is rejected under the same rationale as claim 4 above.

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant’s disclosure. The following reference relates to teacher-student models:
EP 3144859 A2

Any inquiry concerning this communication or earlier communications from the examiner should be directed to JUN KWON whose telephone number is (571)272-2072. The examiner can normally be reached on 7:30 AM - 5:30 PM. If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Abdullah Kawsar, can be reached on (571)270-3169. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300. Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR. Status information for unpublished applications is available through Private PAIR only. For more 

/JUN KWON/
Examiner, Art Unit 2127
/ABDULLAH AL KAWSAR/
Supervisory Patent Examiner, Art Unit 2127