Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection.  Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114.  Applicant's submission filed on 07/29/2022 has been entered.
 
Amendments
	Claims 1-2, 4, 5, 7, 10, 12, and 14-20 are amended. Claims 3, 6, and 9 are canceled. Claims 1-2, 4-5, 7, 10-12, and 14-20 are pending and have been considered.

Drawings
The drawings are objected to because Fig. 3 shows $L_{NLL}$ denoting a relationship between the student model and a correct answer. The specification lacks support for $L_{NLL}$ denoting a relationship between the student model and a correct answer. Instant specification paragraph [0066] states $L_{NLL}$ denotes a loss function used to calculate a loss between a correct answer and a recognition result of a teacher model.
Corrected drawing sheets in compliance with 37 CFR 1.121(d) are required in reply to the Office action to avoid abandonment of the application. Any amended replacement drawing sheet should include all of the figures appearing on the immediate prior version of the sheet, even if only one figure is being amended. The figure or figure number of an amended drawing should not be labeled as “amended.” If a drawing figure is to be canceled, the appropriate figure must be removed from the replacement sheet, and where necessary, the remaining figures must be renumbered and appropriate changes made to the brief description of the several views of the drawings for consistency. Additional replacement sheets may be necessary to show the renumbering of the remaining figures. Each drawing sheet submitted after the filing date of an application must be labeled in the top margin as either “Replacement Sheet” or “New Sheet” pursuant to 37 CFR 1.121(d). If the changes are not accepted by the examiner, the applicant will be notified and informed of any required corrective action in the next Office action. The objection to the drawings will not be held in abeyance.

Claim Objections
Claims 10, 12, and 15 are objected to because of the following informalities:  
In claim 10, on p. 4, line 2, the comma after “loss” should be removed, and on p. 4, line 5, the comma after “increases” should be removed.
In claim 12, in the third to last line, the comma after “increased” should be removed.
In claim 15, in the second to last line, the comma after “increased” should be removed. 
Appropriate correction is required.

Claim Rejections - 35 USC § 112
The following is a quotation of the first paragraph of 35 U.S.C. 112(a):
(a) IN GENERAL.—The specification shall contain a written description of the invention, and of the manner and process of making and using it, in such full, clear, concise, and exact terms as to enable any person skilled in the art to which it pertains, or with which it is most nearly connected, to make and use the same, and shall set forth the best mode contemplated by the inventor or joint inventor of carrying out the invention.

The following is a quotation of the first paragraph of pre-AIA 35 U.S.C. 112:
The specification shall contain a written description of the invention, and of the manner and process of making and using it, in such full, clear, concise, and exact terms as to enable any person skilled in the art to which it pertains, or with which it is most nearly connected, to make and use the same, and shall set forth the best mode contemplated by the inventor of carrying out his invention.

Claims 15 and 20 are rejected under 35 U.S.C. 112(a) or 35 U.S.C. 112 (pre-AIA), first paragraph, as failing to comply with the written description requirement. The claim(s) contains subject matter which was not described in the specification in such a way as to reasonably convey to one skilled in the relevant art that the inventor or a joint inventor, or for applications subject to pre-AIA 35 U.S.C. 112, the inventor(s), at the time the application was filed, had possession of the claimed invention.
The written disclosure lacks support for claim 15, especially in view of ¶ [0080] of the specification filed 03/18/2019. The “word error rate” recited by claim 15 is a word error rate between the correct answer and the second recognition result as recited by claim 12. 
The written disclosure lacks support for claim 20. The first loss is the loss between a first recognition result of the student model and a correct answer. The written disclosure does not disclose any relationship between the first loss and the word error rate.

The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.


The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.


Claims 1-2, 4-5, 7, 10-11, and 17-20 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention.
Regarding claim 1, in the last paragraph, lines 1-3 make the claim indefinite. The meaning of “decreases” is unclear. Examiner does not understand how the first factor may decrease a contribution of the teacher model when the first factor appears to be proportional to the contribution of the teacher model under the BRI of lines 16-17. For purposes of examination, examiner interprets lines 1-3 of the last paragraph to mean an increase in a training epoch of the student model causes the first factor to decrease. Additionally, the last three lines of claim 1 are difficult to understand. Examiner interprets these lines to mean the first factor decreases an amount the teacher model contributes to the training of the student model, wherein the decreased contribution amount is less than a contribution amount in a previous training epoch. Claims 2, 4-5, 7, and 10-11 are rejected for failing to cure the deficiencies of claim 1 upon which they depend.
Claim 11 is directed to a computer product and the claim explicitly recites performing the method of claim 1. Claim 11, line 2 recites the limitations “a processor” and “the processor”. Claim 1, line 3 recites the limitations “a processor” and “the processor”. The broadest reasonable interpretation of claim 11 includes the existence of two different processors. Therefore, the limitations “the processor” in both claim 11, line 2 and claim 1, line 3 are indefinite because it is unclear which of potentially two processors each limitation refers to. 
Regarding claim 17, the last paragraph implements the same features as the last paragraph of claim 1, and is therefore rejected for at least the same reasons therein. 
Regarding claim 18, the claim is directed to an apparatus that implements the same features as the method of claim 1, and is therefore rejected for at least the same reasons therein. Claims 19-20 are rejected for failing to cure the deficiencies of claim 18 upon which they depend.

Claim Rejections - 35 USC § 102
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.


Claims 1, 5, 10-11, and 17-20 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Romero et al. (“FitNets: Hints for Deep Thin Nets”, arXiv version 4, cited in the PTO-892 filed 04/29/2022).

	Regarding CLAIM 1, Romero teaches: A model training method comprising: executing computer readable code, stored in a non-transitory computer-readable storage medium, by a processor and configuring the processor, through the execution, to perform operations of: (The experimental results in §§ 3-4 on pp. 5-9 are evidence of a computer system comprising a processor, storage medium, and code. Romero teaches a GPU on p. 8, third to last line.)
calculating a loss including a first loss between a first recognition result of a student model and a correct answer and a second loss between a second recognition result of a teacher model and the first recognition result of the student model; and (Section 2.1 on pp. 2-3 discloses a teacher network T with an output probability $P_T$ and a student network S with an output probability $P_S$. According to the paragraph under equation (2) on p. 3, $L_{KD}$ is a loss function including two loss terms. $H(y_{true}, P_S)$ is a cross-entropy loss between the output of the student model and the correct answer and $H(P_T^{\tau}, P_S^{\tau})$ is a cross-entropy loss between the output of the teacher model and the output of the student model. The teacher model is further disclosed in the paragraph starting at the end of p. 4. In summary, equation (2) on p. 3 discloses the claimed loss function. 
Calculating a loss implies at least one training epoch. P. 11, § A.1.2, ¶ 3, last line teaches training on 500 epochs. On P. 13, in the paragraph above § A.3, the last 4 lines disclose stopping conditions for the MNIST experiments.)
training the student model based on the calculated loss, (P. 3, sentence after equation (1). This limitation is interpreted as a second training epoch. P. 11, § A.1.2, ¶ 3, last line teaches training on 500 epochs. On P. 13, in the paragraph above § A.3, the last 4 lines disclose stopping conditions for the MNIST experiments.)
wherein a degree to which the second loss is reflected to the loss in the calculating of the loss is determined according to a word error rate between the first recognition result and the second recognition result, (An error rate includes the cross-entropy loss $H(P_T^{\tau}, P_S^{\tau})$ in equation (2) on p. 3, and a word error rate includes any error rate involving character recognition. The experiments in §§ 3.2-3.3 on pp. 6-7 measure the student and teacher networks’ abilities to recognize digits. Therefore, the error rate in these experiments is a word error rate.)
wherein the word error rate represents information about how much the student model has learned a performance of the teacher model, (The BRI of this limitation includes knowledge distillation as taught at p. 2, § 2.1, ¶ 1.)
wherein the calculating of the loss comprises calculating the loss by applying a first factor to the word error rate, and (The BRI of this limitation includes multiplying a parameter with the cross-entropy between the teacher and student results. Equation (2) on p. 3 discloses a parameter $\lambda$ multiplied by the cross-entropy between the teacher and student results $H(P_T^{\tau}, P_S^{\tau})$. The paragraph under equation (2) discloses $\lambda$ is a tunable parameter to balance the two cross-entropies. According to p. 5, lines 4-5, $\lambda$ controls the weight given to the teacher cross-entropy.)
wherein, in the calculating of the loss in a current training epoch of the student model, and in response to an increase in a training epoch of the student model, the first factor decreases a contribution of the teacher model to the training of the student model from a previous contribution of the teacher model to a previous training of the student model in a previous training epoch of the student model. (P. 5, lines 4-7)
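
For illustration only, the two-term loss structure mapped above, $L_{KD} = H(y_{true}, P_S) + \lambda H(P_T^{\tau}, P_S^{\tau})$, can be sketched in Python as follows. The function names, the example logits, and the temperature-softened softmax used to form $P_T^{\tau}$ and $P_S^{\tau}$ are assumptions made for this sketch; they are not taken from Romero's code or from the instant application.

```python
import math


def softmax(logits, tau=1.0):
    """Softmax over logits divided by a temperature tau (assumed softening)."""
    scaled = [z / tau for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(p, q, eps=1e-12):
    """H(p, q) = -sum_i p_i * log(q_i); eps guards against log(0)."""
    return -sum(pi * math.log(qi + eps) for pi, qi in zip(p, q))


def kd_loss(student_logits, teacher_logits, true_index, lam, tau):
    """Return (total, first_loss, second_loss) for one training example.

    first_loss  : cross-entropy between the correct answer and the student output
    second_loss : cross-entropy between the softened teacher and student outputs
    total       : first_loss + lam * second_loss
    """
    n = len(student_logits)
    y_true = [1.0 if i == true_index else 0.0 for i in range(n)]
    first_loss = cross_entropy(y_true, softmax(student_logits))
    second_loss = cross_entropy(
        softmax(teacher_logits, tau), softmax(student_logits, tau)
    )
    return first_loss + lam * second_loss, first_loss, second_loss


# Hypothetical logits for a three-class example.
total, first, second = kd_loss([2.0, 1.0, 0.1], [3.0, 0.5, 0.2], 0, lam=2.0, tau=3.0)
```

In this sketch, setting `lam` to zero removes the teacher term entirely, so a smaller `lam` corresponds to a smaller contribution of the teacher output to the total loss.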

Regarding CLAIM 5, Romero teaches: The model training method of claim 1, wherein the word error rate is updated at a training epoch of the student model. (An error rate includes the cross-entropy loss $H(P_T^{\tau}, P_S^{\tau})$ in equation (2) on p. 3, and a word error rate includes any error rate involving character recognition. The experiments in §§ 3.2-3.3 on pp. 6-7 measure the student and teacher networks’ abilities to recognize digits. Therefore, the error rate in these experiments is a word error rate. The cross-entropy loss $H(P_T^{\tau}, P_S^{\tau})$ is updated for every training example.)

Regarding CLAIM 10, Romero teaches: The model training method of claim 1,
wherein respective contributions of the first loss and the second loss in the calculating of the loss, is adjusted by a second factor, (A second factor is the parameter $\lambda$ from p. 3, eq. (2). The claim does not preclude $\lambda$ in eq. (2) from being both the first factor and the second factor. The parameter $\lambda$ adjusts respective contributions of the first loss and the second loss in eq. (2).)
	wherein the second factor is controlled so that the contribution of the teacher model to the training of the student model decreases and a contribution of the correct answer increases, in response to the increase in the training epoch of the student model. (P. 5, lines 4-7)
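
As a hedged illustration of a factor that lowers the teacher's contribution as the training epoch increases, the following Python sketch decays $\lambda$ linearly over epochs and reports the resulting share of the total weight carried by the teacher term. The linear shape of the schedule, the start and end values, and the function names are assumptions chosen for this sketch; the rejection relies only on Romero, p. 5, lines 4-7, for the decrease itself.

```python
def linear_decay(lam_start, lam_end, epoch, total_epochs):
    """Linearly interpolate lambda from lam_start (epoch 0) to lam_end (last epoch)."""
    if total_epochs <= 1:
        return lam_end
    frac = min(max(epoch / (total_epochs - 1), 0.0), 1.0)
    return lam_start + (lam_end - lam_start) * frac


def teacher_share(lam):
    """Share of the total weight on the teacher term, assuming the
    correct-answer term has a fixed weight of 1 as in the two-term loss."""
    return lam / (1.0 + lam)


# Hypothetical schedule: lambda goes from 4.0 to 1.0 over 4 epochs.
schedule = [linear_decay(4.0, 1.0, e, 4) for e in range(4)]
shares = [teacher_share(lam) for lam in schedule]
```

Under these assumed values the teacher share falls from 0.8 at the first epoch to 0.5 at the last, while the share of the correct-answer term rises correspondingly.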

Regarding CLAIM 11, Romero teaches: A non-transitory computer-readable storage medium storing instructions that, when executed by a processor, cause the processor (Romero discloses a GPU processor on p. 8, § 4.2, lines 1-2. Romero also discloses experimental results in §§ 3-4 on pp. 5-9. Experiments performed by a computer are evidence of a computer-readable storage medium storing instructions which are executed by the processor.)
to perform the model training method of claim 1. (Claim 11 is a product claim that recites the same features as method claim 1. Claim 11 is rejected for the reasons set forth in the rejection of claim 1.)

Regarding CLAIM 17, Romero teaches: A data recognition method comprising: executing computer readable code, stored in a non-transitory computer-readable storage medium, by a processor and configuring the processor, through the execution, to perform operations of: (The experimental results in §§ 3-4 on pp. 5-9 are evidence of a computer system comprising a processor, storage medium, and code. Romero teaches a GPU on p. 8, third to last line.)
receiving target data to be recognized; and (P. 6, § 3.2, lines 1-3 and P. 6, § 3.3, lines 2-3.)
recognizing the target data using a student model, wherein the student model is trained based on a calculated loss including a first loss between a first recognition result of the student model and a correct answer and a second loss between a second recognition result of a teacher model and the first recognition result of the student model, (Section 2.1 on pp. 2-3 discloses a teacher network T with an output probability $P_T$ and a student network S with an output probability $P_S$. According to the paragraph under equation (2) on p. 3, $L_{KD}$ is a loss function including two loss terms. $H(y_{true}, P_S)$ is a cross-entropy loss between the output of the student model and the correct answer, and $H(P_T^{\tau}, P_S^{\tau})$ is a cross-entropy loss between the output of the teacher model and the output of the student model. The teacher model is further disclosed in the paragraph starting at the end of p. 4. In summary, equation (2) on p. 3 discloses the claimed loss function. Recognizing the target data using a student model: Executing the student model for inferencing is indicated by P. 6, Tables 3 and 4’s “Misclass” columns; P. 6, paragraph under Table 3; and P. 7, paragraph above § 3.4.)
wherein a degree to which the second loss is reflected to the loss in the calculating of the loss is determined according to a word error rate between the first recognition result and the second recognition result, (An error rate includes the cross-entropy loss $H(P_T^{\tau}, P_S^{\tau})$ in equation (2) on p. 3, and a word error rate includes any error rate involving character recognition. The experiments in §§ 3.2-3.3 on pp. 6-7 measure the student and teacher networks’ abilities to recognize digits. Therefore, the error rate in these experiments is a word error rate.)
wherein the word error rate represents information about how much the student model has learned a performance of the teacher model, (The BRI of this limitation includes knowledge distillation as taught at p. 2, § 2.1, ¶ 1.)
wherein the calculating of the loss comprises calculating the loss by applying a first factor to the word error rate, and (The BRI of this limitation includes multiplying a parameter with the cross-entropy between the teacher and student results. Equation (2) on p. 3 discloses a parameter $\lambda$ multiplied by the cross-entropy between the teacher and student results $H(P_T^{\tau}, P_S^{\tau})$. The paragraph under equation (2) discloses $\lambda$ is a tunable parameter to balance the two cross-entropies. According to p. 5, lines 4-5, $\lambda$ controls the weight given to the teacher cross-entropy.)
wherein, in the calculating of the loss in a current training epoch of the student model, and in response to an increase in a training epoch of the student model, the first factor decreases a contribution of the teacher model to the training of the student model from a previous contribution of the teacher model to a previous training of the student model in a previous training epoch of the student model. (P. 5, lines 4-7)

Claims 18-19 are apparatus claims that recite the same features as method claims 1-2. Claims 18-19 are rejected for the reasons set forth in the rejection of claim 1. 

Regarding CLAIM 20, Romero teaches: The model training apparatus of claim 18, wherein the processor is further configured to perform the calculating of the loss by reflecting a word error rate between the correct answer and the second recognition result to the first loss. (“A word error rate between the correct answer and the second recognition result” includes a training error between the teacher model and the correct answer. On P. 6, the bottom paragraph teaches that a teacher network was trained on images of digits, and training indicates a decrease in a word error rate between the correct answer and teacher recognition result. Under the BRI of “reflecting… to the first loss”, after the first iteration of knowledge distillation, the student network incorporates the error of the teacher network. The error of the teacher network is incorporated into the first addition term at Romero, p. 3, eq. (2).)

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.

Claims 2, 4, 7, 12, and 14-16 are rejected under 35 U.S.C. 103 as being unpatentable over Romero et al. (“FitNets: Hints for Deep Thin Nets”, arXiv version 4).

Regarding CLAIM 2, Romero teaches: The model training method of claim 1, 
However, Romero does not explicitly teach: wherein the calculating of the loss comprises calculating the loss so that a contribution rate of the teacher model to the training of the student model is increased in response to an increase in the word error rate.
	However, in view of Romero, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have increased the parameter λ over time instead of decreasing it over time, as disclosed by p. 5, lines 4-7, in order to increase the contribution of the teacher model to training of the student model.

Regarding CLAIM 4, Romero teaches: The model training method of claim 1, 
However, Romero does not explicitly teach: wherein the calculating of the loss comprises calculating the loss so that a contribution rate of the second loss in the calculating of the loss is increased in response to an increase in the word error rate.
A contribution rate of the second loss is λ at p. 3, eq. (2); a second loss is the cross-entropy loss H(P_T^τ, P_S^τ); and the word error rate is also the cross-entropy loss H(P_T^τ, P_S^τ) in the experiments of §§ 3.2-3.3 on p. 6. According to P. 5, lines 4-7, the parameter λ is annealed over the course of training because it is assumed the student can learn independently from the teacher later in training. In view of Romero, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have increased λ when the cross-entropy loss H(P_T^τ, P_S^τ) increases because it indicates the student has not yet learned enough from the teacher.
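
Purely as an illustration of the modification proposed above, and not as anything disclosed by Romero, a rule that raises λ when the teacher-student cross-entropy rises from one epoch to the next might look like the following sketch. The function name adjust_lambda, the step size, and the upper bound are hypothetical.

```python
def adjust_lambda(lam, prev_ce, curr_ce, step=0.1, lam_max=10.0):
    """Raise lam when the teacher-student cross-entropy has increased.

    prev_ce and curr_ce stand for H(P_T^tau, P_S^tau) measured in the
    previous and current epoch. step and lam_max are hypothetical values.
    """
    if curr_ce > prev_ce:
        # The student has drifted from the teacher: weight the teacher more.
        return min(lam + step, lam_max)
    # Otherwise leave the weight unchanged.
    return lam
```

The rule only increases λ, so it represents the proposed modification rather than the annealing that Romero describes at p. 5, lines 4-7.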

	Regarding CLAIM 7, Romero teaches: The model training method of claim 1,
However, Romero does not explicitly teach: wherein the calculating of the loss comprises calculating the loss so that a contribution rate of the teacher model to the training of the student model is increased in response to a decrease in a word error rate between the correct answer and the second recognition result.
P. 4, § 2.3, line 2 teaches a trained teacher network and P. 6, bottom paragraph teaches that a teacher network has been trained on images of digits. Training indicates a decrease in a word error rate between the correct answer and teacher network’s recognition result (e.g., the second recognition result). A contribution rate is interpreted as Romero’s parameter λ in p. 3, eq. (2). In view of Romero, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have increased λ when the teacher network’s loss is low because it means that the teacher’s outputs are very close to the correct answers.

Regarding CLAIM 12, Romero teaches: A model training method comprising: executing computer readable code, stored in a non-transitory computer-readable storage medium, by a processor and configuring the processor, through the execution, to perform operations of: (The experimental results in §§ 3-4 on pp. 5-9 are evidence of a computer system comprising a processor, storage medium, and code. Romero teaches a GPU on p. 8, third to last line.)
calculating a loss including a first loss between a first recognition result of a student model and a correct answer and a second loss between a second recognition result of a teacher model and the first recognition result of the student model; and (Section 2.1 on pp. 2-3 discloses a teacher network T with an output probability P_T and a student network S with an output probability P_S. According to the paragraph under equation (2) on p. 3, L_KD is a loss function including two loss terms. H(y_true, P_S) is a cross-entropy loss between the output of the student model and the correct answer and H(P_T^τ, P_S^τ) is a cross-entropy loss between the output of the teacher model and the output of the student model. The teacher model is further disclosed in the paragraph starting at the end of p. 4. In summary, equation (2) on p. 3 discloses the claimed loss function.
Calculating a loss implies at least one training epoch. P. 11, § A.1.2, ¶ 3, last line teaches training on 500 epochs. On P. 13, in the paragraph above § A.3, the last 4 lines disclose stopping conditions for the MNIST experiments.)
training the student model based on the calculated loss, (P. 3, sentence after equation (1). This limitation is interpreted as a second training epoch. P. 11, § A.1.2, ¶ 3, last line teaches training on 500 epochs. On P. 13, in the paragraph above § A.3, the last 4 lines disclose stopping conditions for the MNIST experiments.)
wherein a degree to which the second loss is reflected to the loss in the calculating of the loss is determined according to a word error rate between the correct answer and the second recognition result, wherein the word error rate represents information about an accuracy of the teacher model, and (“A word error rate between the correct answer and the second recognition result” includes a training error between the teacher model and the correct answer. P. 6, bottom paragraph teaches that a teacher network was trained on images of digits, and training indicates a decrease in a word error rate between the correct answer and teacher recognition result. The more accurately the teacher network is trained, the more influence it has on training a randomly initialized student network (See p. 13, paragraph above § A.3, line 1; § A.3, ¶ 3, line 1).)
However, Romero does not explicitly teach:  wherein the calculating of the loss comprises calculating the loss so that a contribution rate of the teacher model to the training of the student model is increased, in response to the word error rate being decreased from a previous word error rate between a previous correct answer and a previous second recognition result.
A contribution rate is interpreted as Romero’s parameter λ in p. 3, eq. (2). In view of Romero, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have increased the contribution parameter λ over time instead of decreasing it over time, as disclosed by p. 5, lines 4-7, to increase the contribution of the teacher model to training of the student model.
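
For reference, the two-term loss that this action maps to Romero, p. 3, eq. (2) can be sketched as follows. This is a minimal sketch under stated assumptions, not code from Romero: it takes the loss to be H(y_true, P_S) + λ·H(P_T^τ, P_S^τ) as characterized above, takes the relaxed outputs of eq. (1) to be softmax outputs computed at temperature τ, and uses hypothetical function names and input values.

```python
import math


def softmax(logits, tau=1.0):
    """Softmax of the logits divided by temperature tau (relaxed output)."""
    scaled = [z / tau for z in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(p, q, eps=1e-12):
    """H(p, q) = -sum_i p_i * log(q_i); eps guards against log(0)."""
    return -sum(pi * math.log(qi + eps) for pi, qi in zip(p, q))


def kd_loss(student_logits, teacher_logits, true_index, lam, tau):
    """Return (total, first_loss, second_loss) for one training example."""
    # First loss: student output P_S against the one-hot correct answer.
    p_s = softmax(student_logits)
    y_true = [1.0 if i == true_index else 0.0 for i in range(len(p_s))]
    first_loss = cross_entropy(y_true, p_s)
    # Second loss: relaxed teacher output against relaxed student output.
    p_t_tau = softmax(teacher_logits, tau)
    p_s_tau = softmax(student_logits, tau)
    second_loss = cross_entropy(p_t_tau, p_s_tau)
    # lam sets how strongly the second loss is reflected in the total loss.
    return first_loss + lam * second_loss, first_loss, second_loss


# Hypothetical logits for a three-class example whose correct class is index 1.
total, first, second = kd_loss([1.0, 2.0, 0.5], [0.5, 3.0, 0.2], 1, lam=2.0, tau=3.0)
```

In this sketch, a larger lam gives the teacher term more weight in the total loss, and lam = 0 reduces the total to the first loss alone.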

Regarding CLAIM 14, Romero teaches: The model training method of claim 12, wherein a contribution of the word error rate is selectively adjusted based on an error between the correct answer and the first recognition result. (P. 5, lines 4-7. The student network increases in accuracy as training progresses. The contribution of the teacher network decreases. The contribution of the word error rate between the correct answer and the teacher network’s recognition result (i.e., the second loss) also decreases.)

Regarding CLAIM 15, Romero teaches: The model training method of claim 12, wherein the calculating of the loss comprises calculating the loss so that a contribution rate of the first loss to the loss is increased, in response to a decrease in the word error rate. (Initially, the word error rate between the correct answer and the teacher network’s recognition result (i.e., the second loss) is decreased. P. 6, bottom paragraph teaches that a teacher network was trained on images of digits, and training indicates a decrease in a word error rate between the correct answer and teacher recognition result. Afterwards, the student network is trained according to P. 3, eq. (2). P. 5, lines 4-7 teaches the contribution parameter λ decreases over time, which increases the contribution rate of the student’s loss (i.e., the first loss) to the training.)

Regarding CLAIM 16, Romero teaches: The model training method of claim 12, wherein the loss is further calculated based on a word error rate between the first recognition result and the second recognition result. (P. 3, eq. (2) contains the cross-entropy between the teacher network’s output and the student network’s output).

Response to Arguments
Examiner herein responds to the claims, remarks, and the After Final Consideration Program Request filed 07/29/2022.

After Final Consideration Program Request (Remarks p. 8): Applicant has filed both an After Final Consideration Program Request and a Request for Continued Examination. This application has been examined as a Request for Continued Examination, and the After Final Consideration Program Request has not been considered.

Claim Rejections Under 35 U.S.C. 102 and 103 (Remarks pp. 16-20): Applicant's arguments have been fully considered but they are not persuasive. 
	On p. 17, Applicant argues Romero appears to disclose data classification, but not data recognition. Examiner respectfully disagrees. The experiments in the Romero reference in §§ 3.2-3.3 on pp. 6-7 test the model’s ability to recognize images of house numbers and handwritten digits, which constitutes data recognition. On pp. 17-18, Applicant presents multiple arguments against the mappings of the Romero reference to claims 1 and 18. These arguments are not persuasive because the applicant’s interpretation of these claims is narrower than the broadest reasonable interpretations. To address Applicant’s first argument on p. 17, the limitation “a second loss between a second recognition result of a teacher model and a first recognition result of the student model” is broad enough to be interpreted as H(P_T^τ, P_S^τ) in Romero, p. 3, eq. (2). A recognition result of a teacher model and a student model includes a relaxed output of a teacher network and a student network. This is taught by Romero, p. 3, eq. (1). Next, Applicant argues that a cross-entropy loss of Romero is not a difference between the second recognition result of the teacher model and the first recognition result of the student model. Claims 1 and 18 do not explicitly recite the limitation of a “difference”; they recite the limitation of a loss. Romero clearly teaches a cross-entropy loss at p. 3, § 2.1, Eq. (2). Third, Applicant argues that the cross-entropy loss H(P_T^τ, P_S^τ) in Romero cannot teach “error rate” in the claim because the Office has already asserted that this cross-entropy loss of Romero corresponds to the recited “second loss”. The cross-entropy loss of Romero may teach both a second loss and an error rate under the broadest reasonable interpretation of claims 1 and 18. Applicant has not persuasively argued against this interpretation. Applicant argues that Romero does not disclose a “word error rate”. A word error rate includes any error rate involving character recognition under the broadest reasonable interpretation of the claim. The experiments in §§ 3.2-3.3 on pp. 6-7 measure the student and teacher networks’ abilities to recognize digits. Therefore, the error rate in these experiments is a word error rate.
On p. 19, Applicant argues that Romero § 3.1 concerns image classification, not “recognizing target data” as required by claim 17. Applicant’s argument is unpersuasive because the BRI of target data includes images to be recognized. Examiner interprets this limitation as images of numbers in the experiments in Romero §§ 3.2-3.3. Lastly, on pp. 19-20 Applicant argues that independent claims 17 and 12 are allowable for the reasons given for independent claims 1 and 18, but these arguments are unpersuasive. The rejections of claims 1-2, 4-5, 7, 10-12, and 14-20 under 35 U.S.C. 102 and 103 are maintained. 

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. Li et al. (US 20160078339 A1) teaches a method of training a student neural network from a teacher neural network for an automatic speech recognition system. ¶ [0003] teaches that accuracy loss is a word error rate, and ¶ [0044] teaches better convergence means lower error signal and thus higher accuracy. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Asher H. Jablon whose telephone number is (571)270-7648. The examiner can normally be reached Monday - Friday, 9:00 am - 6:00 pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Abdullah Al Kawsar can be reached on (571)270-3169. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/A.H.J./Examiner, Art Unit 2127                                                                                                                                                                                         
/BRIAN M SMITH/Primary Examiner, Art Unit 2122