DETAILED ACTION
1.	This office action is in response to the Application No.  filed on 01/10/2019. Claims 1-20 are presented for examination and are currently pending.

Notice of Pre-AIA or AIA Status
2.	The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.



3.	Claims 1, 2, 4-9, and 11-20 are rejected under 35 U.S.C. 103 as being unpatentable over Huang et al. (US20180365564) in view of Ben Kimon et al. (US20200210849, filed 12/31/18).

	Regarding claim 1, Huang teaches a computer-implemented method (a flow chart illustrating a method for training a neural network, Fig. 1 [0029]; It should be understood that each process and/or box in the flow charts and/or the box diagrams, and a combination of processes and/or blocks in the flow charts and/or the box diagrams can be implemented by the computer program instructions [0094]) comprising:
	obtaining, with at least one processor, (a device for training a neural network, the structure of the device is illustrated in FIG. 5, which including: a processor 51 and at least one memory 52, the at least one memory storing at least one machine executable instruction, which is executed by the processor to: [0082])
	training, with at least one processor, a first model (selecting, by a training device, a teacher network performing the same functions of a student network [0038]; a device for training a neural network, the structure of the device is illustrated in FIG. 5, which including: a processor 51 and at least one memory 52, the at least one memory storing at least one machine executable instruction, which is executed by the processor to:[0082])
	training, with at least one processor, a second model, (the processor 51 executes the at least one machine executable instruction to iteratively train the student network [0085])
	using a loss function that depends on an output of an intermediate layer of the first model (In equation (5), H(ytrue, pS) refers to the cross-entropy loss function, L_MMD^2(FT, FS) refers to the distance loss function, λ refers to the weight of distance loss function, FT refers to the feature map (i.e. the features of the first middle layer) output from the first specific network layer of the teacher network given the training sample data, [0057] and [0050]-[0054]) and
	an output of the second model, (FS refers to the feature map (i.e. the features of the second middle layer) output from the second specific network layer of the student network given the training sample data, [0057])
	Huang does not explicitly teach first training data associated with a first set of features and second training data associated with a second set of features different than the first set of features; based on the first training data and the second training data; based on the second training data. 
	Ben Kimon teaches first training data associated with a first set of features and second training data associated with a second set of features different than the first set of features; (a combined historical transaction for payment may include a first sub-transaction of an initial transaction for payment performed at a first date (e.g., May 1, 2018), and a second sub-transaction of a reversion transaction performed at a second date (e.g., Jun. 1, 2018) [0022]; Referring to the example of FIG. 2B, the correlation features extracted from the combined historical transactions 218 and 220 are illustrated. As shown in FIG. 2B, the correction feature table 250 includes correction features 252 (e.g., “Total Refund Times,” “Relationship Period,” “Total Refund Amount,” “Refund Frequency”) with corresponding values 254 respectively (e.g., “2,” “8 months,” “$70,” and “0.25 times/month”). These correction features may be used as features associated with anomaly detection by the autoencoder model [0025]. 
Examiner notes: the first training data is mapped to the first sub-transaction of an initial transaction, and the second training data to the second sub-transaction of a reversion transaction. The initial transaction is different from the reversion transaction.)
training a first model based on the first training data and the second training data; (A plurality of legitimate transactions are determined from the plurality of historical reversion transactions. An autoencoder is trained using the plurality of legitimate transactions to generate a trained autoencoder capable of measuring a given transaction for similarity to the plurality of legitimate transactions [0018]; The autoencoder 300 may learn to compress an input data xi 310 (e.g., a legitimate transaction 310) [0030].
Examiner notes: legitimate transaction 310 includes the initial transaction, which is the first training data, and the reversion transaction, which is the second training data.)
training a second model based on the second training data. (The autoencoder 300 may be trained (e.g., using backpropagation and gradient descent) to minimize the reconstruction loss function [0031], Fig. 3.
Examiner notes: the output of encoder 302 in Fig. 3 is mapped to the first model, and the output of decoder 304 is mapped to the second model, which is trained on legitimate transactions that include the reversion transaction as the second training data.)
	It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Huang to incorporate the teachings of Ben Kimon for the benefit of extracting features (e.g., correlation features associated with correlations of the sub-transactions) associated with the autoencoder for anomaly detection (Ben Kimon, [0024]).
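For illustration only, the objective quoted from Huang [0057] (a cross-entropy term plus a λ-weighted distance term between teacher and student feature maps) can be sketched as below. This is a minimal sketch and not Huang's implementation: it assumes a linear kernel for the MMD term, so the squared MMD reduces to the squared distance between the mean feature vectors, and the function names (cross_entropy, mmd2_linear, student_loss) are hypothetical.

```python
import math

def cross_entropy(y_true, p_s):
    """Cross-entropy H(ytrue, pS) for a one-hot label and student probabilities."""
    eps = 1e-12  # guards against log(0)
    return -sum(t * math.log(p + eps) for t, p in zip(y_true, p_s))

def mmd2_linear(f_t, f_s):
    """Squared MMD under an assumed linear kernel.

    f_t and f_s are lists of feature vectors (one per sample) taken from the
    teacher and student middle layers; both must share the same dimension.
    """
    dim = len(f_t[0])
    mean_t = [sum(row[d] for row in f_t) / len(f_t) for d in range(dim)]
    mean_s = [sum(row[d] for row in f_s) / len(f_s) for d in range(dim)]
    return sum((a - b) ** 2 for a, b in zip(mean_t, mean_s))

def student_loss(y_true, p_s, f_t, f_s, lam):
    """Cross-entropy term plus lam times the feature-distance term."""
    return cross_entropy(y_true, p_s) + lam * mmd2_linear(f_t, f_s)
```

When the teacher and student feature maps coincide, the distance term vanishes and the loss reduces to the cross-entropy term alone.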

	Regarding claim 2, Modified Huang teaches the computer-implemented method of claim 1, Huang teaches wherein the second model includes at least one first layer and at least one second layer, (iteratively training the student network and obtaining a target network, through aligning distributions of features between a first middle layer and a second middle layer corresponding to the same training sample data [0040]. Examiner notes: student network as second model) 
	wherein the output of the second model includes an output of the at least one first layer, wherein training the second model further comprises: (the features of the second middle layer refer to feature maps output from a second specific network layer of the student network after the training sample data are provided to the student network [0041])
	modifying, using the loss function that depends on the output of the intermediate layer of the first model (In equation (5), H(ytrue, pS) refers to the cross-entropy loss function, L_MMD^2(FT, FS) refers to the distance loss function, λ refers to the weight of distance loss function, FT refers to the feature map (i.e. the features of the first middle layer) output from the first specific network layer of the teacher network given the training sample data, [0057])
	and the output of the second model including the output of the first layer, (FS refers to the feature map (i.e. the features of the second middle layer) output from the second specific network layer of the student network given the training sample data, [0057])
	one or more parameters of the at least one first layer of the second model; and training the at least one second layer based on the output of the at least one first layer (Student Network: a poor-performance single neural network with fast computation speed, which can be deployed in actual application scenes with stricter real-time requirements; it is of smaller computational cost and fewer model parameters [0018]; iteratively training the student network and obtaining a target network, through aligning distributions of features between a first middle layer and a second middle layer corresponding to the same training sample data [0021]; In some embodiments, as for the aforementioned step B, adjusting the weights of the student network according to the value of the objective function could be implemented as: adjusting the weights of the student network according to the value of the objective function with a gradient descent optimization function. [0055]. Examiner notes: second model is the student network)
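For illustration only, the two steps mapped above (adjusting first-layer weights by gradient descent on a feature-alignment objective, then training a second layer on the first layer's output) can be sketched with scalar layers. This is a simplified sketch under stated assumptions and not Huang's procedure: each layer is a single weight, a mean squared error stands in for the MMD distance term, and the helper name fit_scalar, the learning rate, and the step count are hypothetical.

```python
def fit_scalar(inputs, targets, lr=0.01, steps=500):
    """Fit w so that w * input approximates target, by plain gradient descent
    on the mean squared error."""
    w = 0.0
    n = len(inputs)
    for _ in range(steps):
        # d/dw of mean((w*x - t)^2) is mean(2 * (w*x - t) * x)
        grad = sum(2.0 * (w * x - t) * x for x, t in zip(inputs, targets)) / n
        w -= lr * grad
    return w

# Step 1: modify the first-layer weight so the student's intermediate output
# matches the teacher's intermediate-layer features (assumed toy values).
xs = [1.0, 2.0, 3.0]
teacher_feats = [2.0, 4.0, 6.0]
w1 = fit_scalar(xs, teacher_feats)

# Step 2: train the second layer on the output of the first layer.
hidden = [w1 * x for x in xs]
labels = [1.0, 2.0, 3.0]
w2 = fit_scalar(hidden, labels)
```

With these toy values the first-layer weight converges toward 2 and the second-layer weight toward 0.5.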

	Regarding claim 4, Modified Huang teaches the computer-implemented method of claim 1, Huang teaches further comprising: determining, with at least one processor, (a device for training a neural network, the structure of the device is illustrated in FIG. 5, which including: a processor 51 and at least one memory 52, the at least one memory storing at least one machine executable instruction, which is executed by the processor to: [0082])
	a plurality of information values of a plurality of intermediate layers of the first model; (the features of the first middle layer refer to feature maps output from a first specific network layer of the teacher network after the training sample data are provided to the teacher network [0041]. Examiner notes: the first middle layers are the intermediate layer. The teacher model is the first model) and
	selecting, with at least one processor, (The high-performance teacher network with the same functions of the student network could be selected from a set of preset neural network models [0039]; a device for training a neural network, the structure of the device is illustrated in FIG. 5, which including: a processor 51 and at least one memory 52, the at least one memory storing at least one machine executable instruction, which is executed by the processor to: [0082])
	the intermediate layer from the plurality of intermediate layers based on the plurality of information values. (As for the neural network training scheme provided by the embodiments of the present application, on one aspect, it can train and obtain student networks with a broader application range through aligning features of middle layers of teacher networks with those of student networks [0036]. Examiner notes: the middle layers of the teacher network (first model) are the intermediate layers)

	Regarding claim 5, Modified Huang teaches the computer-implemented method of claim 1, Ben Kimon teaches wherein the first set of features includes complex features, (The combination process 110 may extract features (e.g., correlation features associated with correlations of the sub-transactions) associated with the autoencoder for anomaly detection [0024]) and 
	wherein the second set of features includes interpretable features (the input transaction 310 may have N attributes (e.g., transaction time, transaction type, payor, payee, transaction history, adjuster, age, refund amount, refund frequency, etc.) [0030])
	The same motivation to combine as set forth for independent claim 1 applies here.

	Regarding claim 6, Modified Huang teaches the computer-implemented method of claim 1, Huang teaches wherein the first model includes a greater number of parameters than the second model (The teacher network is characterized by high performance and high accuracy; but, compared to the student network, it has some obvious disadvantages such as complex structure, a large number of parameters and weights, and low computation speed. The student network is characterized by fast computation speed, average or poor performance, and simple network structure [0066]; to train student networks (featuring a small amount of network parameters, poor performance and high-speed computation) [0005]. Examiner notes: the first model is the teacher network and the student network is the second model)

	Regarding claim 7, Modified Huang teaches the computer-implemented method of claim 1, Huang teaches further comprising: providing, with at least one processor, (a device for training a neural network, the structure of the device is illustrated in FIG. 5, which including: a processor 51 and at least one memory 52, the at least one memory storing at least one machine executable instruction, which is executed by the processor to: [0082])
	the trained second model; (the processor 51 executes the at least one machine executable instruction to iteratively train the student network by using the training sample data [0087])
	obtaining, with at least one processor, (a device for training a neural network, the structure of the device is illustrated in FIG. 5, which including: a processor 51 and at least one memory 52, the at least one memory storing at least one machine executable instruction, which is executed by the processor to: [0082])
	processing, with at least one processor and using the trained second model, (the processor 51 executes the at least one machine executable instruction to adjust the weights of the student network according to the value of the objective function, the at least one machine executable instruction being executed by the processor to: adjust the weights of the student network according to the value of the objective function with a gradient descent optimization function [0089])
	Ben Kimon teaches input data associated with at least one transaction; (The decoder 304 may uncompress that latent representation En(xi) into a reconstructed data 314 (denoted as De(En(xi))) that closely matches the input data xi 310 [0030]) 
	the input data to generate output data, wherein the output data includes a prediction of whether the at least one transaction is a fraudulent transaction. (The decoder 304 may uncompress that latent representation En(xi) into a reconstructed data 314 (denoted as De(En(xi))) that closely matches the input data xi 310. As such, the autoencoder 300 engages in dimensionality reduction, for example by learning how to ignore the noise. A reconstruction loss function may be used by the loss computation unit 308 of the autoencoder 300 to generate a reconstruction error 312 [0030], Fig. 3; the reconstruction difference 514 may be used to determine whether the first instruction is fraudulent (e.g., with a large reconstruction difference 514) or legitimate (e.g., with a small reconstruction difference 514). In the example of FIG. 5, an anomaly detector 506 receives the reconstruction error threshold for fraud 408 (e.g., from reconstruction error threshold for fraud generator 406) and the reconstruction difference 514 (e.g., from the trained autoencoder 300), and generates a fraud prediction 516 (e.g., a binary value, a probability, etc.) indicating the likelihood that the first transaction 510 is fraudulent [0038])
	The same motivation to combine as set forth for independent claim 1 applies here.
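For illustration only, the fraud determination quoted from Ben Kimon [0038] (comparing a reconstruction difference against a reconstruction error threshold) can be sketched as below. This is a minimal sketch under stated assumptions: the reconstruction is supplied directly rather than produced by a trained autoencoder, mean squared error is assumed as the reconstruction difference, and the function names and the threshold value are hypothetical.

```python
def reconstruction_error(x, x_hat):
    """Mean squared difference between an input and its reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)

def predict_fraud(x, x_hat, threshold):
    """Binary fraud prediction: True when the reconstruction error exceeds
    the threshold (large error suggests the transaction is anomalous)."""
    return reconstruction_error(x, x_hat) > threshold

# Toy transaction attributes and two assumed reconstructions.
transaction = [1.0, 2.0, 3.0]
close_reconstruction = [1.0, 2.0, 3.0]
poor_reconstruction = [2.0, 3.0, 4.0]

legit_flag = predict_fraud(transaction, close_reconstruction, threshold=0.1)
fraud_flag = predict_fraud(transaction, poor_reconstruction, threshold=0.1)
```

A small reconstruction error keeps the transaction on the legitimate side of the threshold, while a large one flags it.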

	Regarding claim 8, Huang teaches a computing system comprising: at least one processor programmed and/or configured to: (a device for training a neural network, the structure of the device is illustrated in FIG. 5, which including: a processor 51 and at least one memory 52, the at least one memory storing at least one machine executable instruction, which is executed by the processor to: [0082])
	train a first model (selecting, by a training device, a teacher network performing the same functions of a student network [0038]. Examiner notes: teacher network as first model.)
	train a second model, (the processor 51 executes the at least one machine executable instruction to iteratively train the student network [0085]. Examiner notes: student network as second model)
	using a loss function that depends on an output of an intermediate layer of the first model (In equation (5), H(ytrue, pS) refers to the cross-entropy loss function, L_MMD^2(FT, FS) refers to the distance loss function, λ refers to the weight of distance loss function, FT refers to the feature map (i.e. the features of the first middle layer) output from the first specific network layer of the teacher network given the training sample data, [0057]) and
	an output of the second model, (FS refers to the feature map (i.e. the features of the second middle layer) output from the second specific network layer of the student network given the training sample data, [0057])
	Huang does not explicitly teach obtain first training data associated with a first set of features and second training data associated with a second set of features different than the first set of features; based on the first training data and the second training data; and based on the second training data.
	Ben Kimon teaches obtain first training data associated with a first set of features and second training data associated with a second set of features different than the first set of features; (a combined historical transaction for payment may include a first sub-transaction of an initial transaction for payment performed at a first date (e.g., May 1, 2018), and a second sub-transaction of a reversion transaction performed at a second date (e.g., Jun. 1, 2018) [0022]; Referring to the example of FIG. 2B, the correlation features extracted from the combined historical transactions 218 and 220 are illustrated. As shown in FIG. 2B, the correction feature table 250 includes correction features 252 (e.g., “Total Refund Times,” “Relationship Period,” “Total Refund Amount,” “Refund Frequency”) with corresponding values 254 respectively (e.g., “2,” “8 months,” “$70,” and “0.25 times/month”). These correction features may be used as features associated with anomaly detection by the autoencoder model [0025]. 
Examiner notes: the first training data is mapped to the first sub-transaction of an initial transaction, and the second training data to the second sub-transaction of a reversion transaction. The initial transaction is different from the reversion transaction.)
train a first model based on the first training data and the second training data; (A plurality of legitimate transactions are determined from the plurality of historical reversion transactions. An autoencoder is trained using the plurality of legitimate transactions to generate a trained autoencoder capable of measuring a given transaction for similarity to the plurality of legitimate transactions [0018]; The autoencoder 300 may learn to compress an input data xi 310 (e.g., a legitimate transaction 310) [0030].
Examiner notes: legitimate transaction 310 includes the initial transaction, which is the first training data, and the reversion transaction, which is the second training data.)
train a second model based on the second training data. (The autoencoder 300 may be trained (e.g., using backpropagation and gradient descent) to minimize the reconstruction loss function [0031], Fig. 3.
Examiner notes: the output of encoder 302 in Fig. 3 is mapped to the first model, and the output of decoder 304 is mapped to the second model, which is trained on legitimate transactions that include the reversion transaction as the second training data.)
	It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Huang to incorporate the teachings of Ben Kimon for the benefit of extracting features (e.g., correlation features associated with correlations of the sub-transactions) associated with the autoencoder for anomaly detection (Ben Kimon, [0024]).

	Regarding claim 9, Modified Huang teaches the computing system of claim 8, Huang teaches wherein the second model includes at least one first layer and at least one second layer, (iteratively training the student network and obtaining a target network, through aligning distributions of features between a first middle layer and a second middle layer corresponding to the same training sample data [0040]. Examiner notes: student network as second model) 
	wherein the output of the second model includes an output of the at least one first layer, and wherein the at least one processor is further programmed and/or configured to train the second model by: (the features of the second middle layer refer to feature maps output from a second specific network layer of the student network after the training sample data are provided to the student network [0041]; the processor 51 executes the at least one machine executable instruction to iteratively train the student network by using the training sample data, the at least one machine executable instruction being executed by the processor to [0087])
	modifying, using the loss function that depends on the output of the intermediate layer of the first model (In equation (5), H(ytrue, pS) refers to the cross-entropy loss function, L_MMD^2(FT, FS) refers to the distance loss function, λ refers to the weight of distance loss function, FT refers to the feature map (i.e. the features of the first middle layer) output from the first specific network layer of the teacher network given the training sample data, [0057]) and
	the output of the second model including the output of the first layer, (FS refers to the feature map (i.e. the features of the second middle layer) output from the second specific network layer of the student network given the training sample data, [0057])
	one or more parameters of the at least one first layer of the second model; and
training the at least one second layer based on the output of the at least one first layer. (Student Network: a poor-performance single neural network with fast computation speed, which can be deployed in actual application scenes with stricter real-time requirements; it is of smaller computational cost and fewer model parameters [0018]; iteratively training the student network and obtaining a target network, through aligning distributions of features between a first middle layer and a second middle layer corresponding to the same training sample data [0021]; In some embodiments, as for the aforementioned step B, adjusting the weights of the student network according to the value of the objective function could be implemented as: adjusting the weights of the student network according to the value of the objective function with a gradient descent optimization function. [0055]. Examiner notes: second model is the student network)

	Regarding claim 11, Modified Huang teaches the computing system of claim 8, Huang teaches wherein the at least one processor is further programmed and/or configured to: (a device for training a neural network, the structure of the device is illustrated in FIG. 5, which including: a processor 51 and at least one memory 52, the at least one memory storing at least one machine executable instruction, which is executed by the processor to: [0082])
	determine a plurality of information values of a plurality of intermediate layers of the first model; (the features of the first middle layer refer to feature maps output from a first specific network layer of the teacher network after the training sample data are provided to the teacher network [0041]. Examiner notes: the first middle layers are the intermediate layer. The teacher model is the first model) and
	select the intermediate layer from the plurality of intermediate layers based on the plurality of information values. (The high-performance teacher network with the same functions of the student network could be selected from a set of preset neural network models [0039]; As for the neural network training scheme provided by the embodiments of the present application, on one aspect, it can train and obtain student networks with a broader application range through aligning features of middle layers of teacher networks with those of student networks [0036]. Examiner notes: the middle layers of the teacher network (first model) are the intermediate layers)

	Regarding claim 12, Modified Huang teaches the computing system of claim 8, Ben Kimon teaches wherein the first set of features includes complex features, (The combination process 110 may extract features (e.g., correlation features associated with correlations of the sub-transactions) associated with the autoencoder for anomaly detection [0024]) and 
	wherein the second set of features includes interpretable features (the input transaction 310 may have N attributes (e.g., transaction time, transaction type, payor, payee, transaction history, adjuster, age, refund amount, refund frequency, etc.) [0030])
	The same motivation to combine as set forth for independent claim 8 applies here.

	Regarding claim 13, Modified Huang teaches the computing system of claim 8, Huang teaches wherein the first model includes a greater number of parameters than the second model (The teacher network is characterized by high performance and high accuracy; but, compared to the student network, it has some obvious disadvantages such as complex structure, a large number of parameters and weights, and low computation speed. The student network is characterized by fast computation speed, average or poor performance, and simple network structure [0066]; to train student networks (featuring a small amount of network parameters, poor performance and high-speed computation) [0005]. Examiner notes: the first model is the teacher network and the student network is the second model)

	Regarding claim 14, Modified Huang teaches the computing system of claim 8, Huang teaches wherein the at least one processor is further programmed and/or configured to: (a device for training a neural network, the structure of the device is illustrated in FIG. 5, which including: a processor 51 and at least one memory 52, the at least one memory storing at least one machine executable instruction, which is executed by the processor to: [0082])
	provide the trained second model; (the processor 51 executes the at least one machine executable instruction to iteratively train the student network by using the training sample data [0087])
	Ben Kimon teaches obtain input data associated with at least one transaction; (The decoder 304 may uncompress that latent representation En(xi) into a reconstructed data 314 (denoted as De(En(xi))) that closely matches the input data xi 310 [0030]) and
	process, using the trained second model, the input data to generate output data, wherein the output data includes a prediction of whether the at least one transaction is a fraudulent transaction. (The decoder 304 may uncompress that latent representation En(xi) into a reconstructed data 314 (denoted as De(En(xi))) that closely matches the input data xi 310. As such, the autoencoder 300 engages in dimensionality reduction, for example by learning how to ignore the noise. A reconstruction loss function may be used by the loss computation unit 308 of the autoencoder 300 to generate a reconstruction error 312 [0030], Fig. 3; the reconstruction difference 514 may be used to determine whether the first instruction is fraudulent (e.g., with a large reconstruction difference 514) or legitimate (e.g., with a small reconstruction difference 514). In the example of FIG. 5, an anomaly detector 506 receives the reconstruction error threshold for fraud 408 (e.g., from reconstruction error threshold for fraud generator 406) and the reconstruction difference 514 (e.g., from the trained autoencoder 300), and generates a fraud prediction 516 (e.g., a binary value, a probability, etc.) indicating the likelihood that the first transaction 510 is fraudulent [0038])
	The same motivation to combine as set forth for independent claim 8 applies here.

	Regarding claim 15, Huang teaches a computer program product comprising at least one non-transitory computer-readable medium including program instructions that, when executed by at least one processor, cause the at least one processor to: (a device for training a neural network, the structure of the device is illustrated in FIG. 5, which including: a processor 51 and at least one memory 52, the at least one memory storing at least one machine executable instruction, which is executed by the processor to: [0082])
	train a first model (selecting, by a training device, a teacher network performing the same functions of a student network [0038]) 
	train a second model, (the processor 51 executes the at least one machine executable instruction to iteratively train the student network [0085])
	using a loss function that depends on an output of an intermediate layer of the first model (In equation (5), H(ytrue, pS) refers to the cross-entropy loss function, L_MMD^2(FT, FS) refers to the distance loss function, λ refers to the weight of distance loss function, FT refers to the feature map (i.e. the features of the first middle layer) output from the first specific network layer of the teacher network given the training sample data, [0057]) and
	an output of the second model, (FS refers to the feature map (i.e. the features of the second middle layer) output from the second specific network layer of the student network given the training sample data, [0057])
	Huang does not explicitly teach obtain first training data associated with a first set of features and second training data associated with a second set of features different than the first set of features; based on the first training data and the second training data; and based on the second training data.
	Ben Kimon teaches first training data associated with a first set of features and second training data associated with a second set of features different than the first set of features; (a combined historical transaction for payment may include a first sub-transaction of an initial transaction for payment performed at a first date (e.g., May 1, 2018), and a second sub-transaction of a reversion transaction performed at a second date (e.g., Jun. 1, 2018) [0022]; Referring to the example of FIG. 2B, the correlation features extracted from the combined historical transactions 218 and 220 are illustrated. As shown in FIG. 2B, the correction feature table 250 includes correction features 252 (e.g., “Total Refund Times,” “Relationship Period,” “Total Refund Amount,” “Refund Frequency”) with corresponding values 254 respectively (e.g., “2,” “8 months,” “$70,” and “0.25 times/month”). These correction features may be used as features associated with anomaly detection by the autoencoder model [0025]. 
Examiner notes: the first training data is mapped to the first sub-transaction of an initial transaction, and the second training data to the second sub-transaction of a reversion transaction. The initial transaction is different from the reversion transaction.)
train a first model based on the first training data and the second training data; (A plurality of legitimate transactions are determined from the plurality of historical reversion transactions. An autoencoder is trained using the plurality of legitimate transactions to generate a trained autoencoder capable of measuring a given transaction for similarity to the plurality of legitimate transactions [0018]; The autoencoder 300 may learn to compress an input data xi 310 (e.g., a legitimate transaction 310) [0030].
Examiner notes: legitimate transaction 310 includes initial transaction which is first training data and reversion transaction which is second training data.) 
	train a second model based on the second training data. (The autoencoder 300 may be trained (e.g., using backpropagation and gradient descent) to minimize the reconstruction loss function [0031], Fig. 3.
Examiner notes: the output of encoder 302 in Fig. 3 is treated as the first model, and the output of decoder 304 as the second model, which is based on legitimate transactions that include the reversion transaction as second training data)
	It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Huang to incorporate the teachings of Ben Kimon for the benefit of extracting features (e.g., correlation features associated with correlations of the sub-transactions) associated with the autoencoder for anomaly detection (Ben Kimon, [0024])
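
For illustration only, the correlation features quoted above from Ben Kimon [0025] can be sketched as a small computation. This is a minimal sketch, not Ben Kimon's implementation: the function name `correlation_features`, the whole-month counting convention, the two dates, and the split of the $70 total into two refunds are all assumptions chosen so that the result matches the quoted values ("2," "8 months," "$70," and "0.25 times/month").

```python
from datetime import date


def correlation_features(refund_amounts, first_date, last_date):
    """Summarize a refund history as four illustrative correlation features."""
    # Whole calendar months between the two dates (an assumed convention).
    months = (last_date.year - first_date.year) * 12 + (last_date.month - first_date.month)
    total_times = len(refund_amounts)
    return {
        "total_refund_times": total_times,
        "relationship_period_months": months,
        "total_refund_amount": sum(refund_amounts),
        # Guard against a zero-length relationship period.
        "refund_frequency_per_month": total_times / months if months else 0.0,
    }


# Hypothetical history: two refunds totalling $70 over an 8-month relationship.
features = correlation_features([30.0, 40.0], date(2018, 1, 1), date(2018, 9, 1))
print(features)
```

Under these assumptions the refund frequency is 2 refunds / 8 months = 0.25 times/month, consistent with the quoted table.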

	Regarding claim 16, Modified Huang teaches the computer program product of claim 15, Huang teaches wherein the second model includes at least one first layer and at least one second layer, (iteratively training the student network and obtaining a target network, through aligning distributions of features between a first middle layer and a second middle layer corresponding to the same training sample data [0040]. Examiner notes: student network as second model) 
	 wherein the output of the second model includes an output of the at least one first layer, and wherein the instructions further cause the at least one processor to train the second model by: (the features of the second middle layer refer to feature maps output from a second specific network layer of the student network after the training sample data are provided to the student network [0041]; the processor 51 executes the at least one machine executable instruction to iteratively train the student network by using the training sample data, the at least one machine executable instruction being executed by the processor to [0087])
	modifying, using the loss function that depends on the output of the intermediate layer of the first model (In equation (5), H(ytrue, pS) refers to the cross-entropy loss function, L_MMD^2(FT, FS) refers to the distance loss function, λ refers to the weight of distance loss function, FT refers to the feature map (i.e. the features of the first middle layer) output from the first specific network layer of the teacher network given the training sample data, [0057]) and 
	the output of the second model including the output of the first layer, (FS refers to the feature map (i.e. the features of the second middle layer) output from the second specific network layer of the student network given the training sample data, [0057])
	one or more parameters of the at least one first layer of the second model; and
training the at least one second layer based on the output of the at least one first layer. (Student Network: a poor-performance single neural network with fast computation speed, which can be deployed in actual application scenes with stricter real-time requirements; it is of smaller computational cost and fewer model parameters [0018]; iteratively training the student network and obtaining a target network, through aligning distributions of features between a first middle layer and a second middle layer corresponding to the same training sample data [0021]; In some embodiments, as for the aforementioned step B, adjusting the weights of the student network according to the value of the objective function could be implemented as: adjusting the weights of the student network according to the value of the objective function with a gradient descent optimization function. [0055]. Examiner notes: second model is the student network)
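
For illustration only, the objective cited from Huang [0057] (a cross-entropy term plus a λ-weighted MMD distance term between teacher and student feature maps) can be sketched as follows. This is a minimal sketch under stated assumptions: it uses a linear kernel, for which the squared MMD reduces to the squared distance between the two feature means; Huang's kernel choice is not stated in this action. The function names and sample values are hypothetical.

```python
import math


def cross_entropy(y_true, p_s):
    """Cross-entropy between a one-hot label and student class probabilities."""
    return -sum(t * math.log(p) for t, p in zip(y_true, p_s) if t > 0)


def mmd2_linear(f_t, f_s):
    """Squared MMD with a linear kernel: squared distance between feature means."""
    dim = len(f_t[0])
    mean_t = [sum(row[d] for row in f_t) / len(f_t) for d in range(dim)]
    mean_s = [sum(row[d] for row in f_s) / len(f_s) for d in range(dim)]
    return sum((a - b) ** 2 for a, b in zip(mean_t, mean_s))


def student_loss(y_true, p_s, f_t, f_s, lam):
    """Task loss plus the weighted distance between teacher and student features."""
    return cross_entropy(y_true, p_s) + lam * mmd2_linear(f_t, f_s)


# Hypothetical middle-layer features for the teacher (f_t) and student (f_s).
f_t = [[1.0, 0.0], [3.0, 0.0]]
f_s = [[0.0, 0.0], [2.0, 0.0]]
loss = student_loss([0, 1], [0.5, 0.5], f_t, f_s, lam=0.5)
print(loss)
```

In this sketch, only the student features and predictions would change during training; the teacher features act as a fixed target, which is the sense in which the loss depends on the intermediate layer of the first model.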

	Regarding claim 17, Modified Huang teaches the computer program product of claim 15, Huang teaches wherein the instructions further cause the at least one processor to: (a device for training a neural network, the structure of the device is illustrated in FIG. 5, which including: a processor 51 and at least one memory 52, the at least one memory storing at least one machine executable instruction, which is executed by the processor to: [0082])
	determine a plurality of information values of a plurality of intermediate layers of the first model; (the features of the first middle layer refer to feature maps output from a first specific network layer of the teacher network after the training sample data are provided to the teacher network [0041]. Examiner notes: the first middle layers are the intermediate layers. The teacher model is the first model) and
	select the intermediate layer from the plurality of intermediate layers based on the plurality of information values. (The high-performance teacher network with the same functions of the student network could be selected from a set of preset neural network models [0039]; As for the neural network training scheme provided by the embodiments of the present application, on one aspect, it can train and obtain student networks with a broader application range through aligning features of middle layers of teacher networks with those of student networks [0036]. Examiner notes: the middle layers of the teacher network (first model) are the intermediate layers)

	Regarding claim 18, Modified Huang teaches the computer program product of claim 15, Ben Kimon teaches wherein the first set of features includes complex features, (The combination process 110 may extract features (e.g., correlation features associated with correlations of the sub-transactions) associated with the autoencoder for anomaly detection [0024]) and 
	wherein the second set of features includes interpretable features (the input transaction 310 may have N attributes (e.g., transaction time, transaction type, payor, payee, transaction history, adjuster, age, refund amount, refund frequency, etc.) [0030])
	The same motivation to combine independent claim 15 applies here.

	Regarding claim 19, Modified Huang teaches the computer program product of claim 15, Huang teaches wherein the first model includes a greater number of parameters than the second model (The teacher network is characterized by high performance and high accuracy; but, compared to the student network, it has some obvious disadvantages such as complex structure, a large number of parameters and weights, and low computation speed. The student network is characterized by fast computation speed, average or poor performance, and simple network structure [0066]; to train student networks (featuring a small amount of network parameters, poor performance and high-speed computation) [0005]. Examiner notes: the first model is the teacher network and the student network is the second model)

	Regarding claim 20, Modified Huang teaches the computer program product of claim 15, Huang teaches wherein the instructions further cause the at least one processor to: (a device for training a neural network, the structure of the device is illustrated in FIG. 5, which including: a processor 51 and at least one memory 52, the at least one memory storing at least one machine executable instruction, which is executed by the processor to: [0082])
	provide the trained second model; (the processor 51 executes the at least one machine executable instruction to iteratively train the student network by using the training sample data [0087])
	process, using the trained second model, (the processor 51 executes the at least one machine executable instruction to adjust the weights of the student network according to the value of the objective function, the at least one machine executable instruction being executed by the processor to: adjust the weights of the student network according to the value of the objective function with a gradient descent optimization function [0089])
	Ben Kimon teaches obtain input data associated with at least one transaction; (The decoder 304 may uncompress that latent representation En(xi) into a reconstructed data 314 (denoted as De(En(xi))) that closely matches the input data xi 310 [0030])
	 the input data to generate output data, wherein the output data includes a prediction of whether the at least one transaction is a fraudulent transaction. (The decoder 304 may uncompress that latent representation En(xi) into a reconstructed data 314 (denoted as De(En(xi))) that closely matches the input data xi 310. As such, the autoencoder 300 engages in dimensionality reduction, for example by learning how to ignore the noise. A reconstruction loss function may be used by the loss computation unit 308 of the autoencoder 300 to generate a reconstruction error 312 [0030], Fig. 3; the reconstruction difference 514 may be used to determine whether the first instruction is fraudulent (e.g., with a large reconstruction difference 514) or legitimate (e.g., with a small reconstruction difference 514). In the example of FIG. 5, an anomaly detector 506 receives the reconstruction error threshold for fraud 408 (e.g., from reconstruction error threshold for fraud generator 406) and the reconstruction difference 514 (e.g., from the trained autoencoder 300), and generates a fraud prediction 516 (e.g., a binary value, a probability, etc.) indicating the likelihood that the first transaction 510 is fraudulent [0038])
	The same motivation to combine independent claim 15 applies here.
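
For illustration only, the anomaly detection step cited from Ben Kimon [0038] (comparing a reconstruction difference against a reconstruction error threshold to produce a fraud prediction) can be sketched as below. This is a minimal sketch under stated assumptions: the mean squared error metric, the function names, the threshold value, and the `toy_autoencoder` stand-in are hypothetical and are not taken from the reference.

```python
def reconstruction_error(x, x_reconstructed):
    """Mean squared difference between an input and its reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(x, x_reconstructed)) / len(x)


def predict_fraud(x, autoencoder, threshold):
    """Return True when the reconstruction error exceeds the threshold."""
    return reconstruction_error(x, autoencoder(x)) > threshold


def toy_autoencoder(x):
    """Hypothetical stand-in for a trained autoencoder: clips attributes to [0, 1]."""
    return [min(max(v, 0.0), 1.0) for v in x]


# An in-range transaction reconstructs exactly; an out-of-range one does not.
legit = predict_fraud([0.2, 0.9], toy_autoencoder, threshold=0.1)
fraud = predict_fraud([0.2, 3.0], toy_autoencoder, threshold=0.1)
print(legit, fraud)
```

A binary output is used here for brevity; the cited passage also contemplates a probability as the fraud prediction.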

4.	Claims 3 and 10 are rejected under 35 U.S.C. 103 as being unpatentable over Huang et al (US20180365564) in view of Ben Kimon et al. (US20200210849 filed 12/31/18) and further in view of Caelen et al (US20200257964 filed 07/13/2018)

	Regarding claim 3, Modified Huang teaches the computer-implemented method of claim 2, Huang teaches wherein the first model includes at least one of the following: a deep neural network, a recurrent neural network, an ensemble of a plurality of neural networks, or any combination thereof, (As for the neural network training scheme provided by the embodiments of the present application, on one aspect, it can train and obtain student networks with a broader application range through aligning features of middle layers of teacher networks with those of student networks [0036]. Examiner notes: teacher network as first model)
	wherein the first layer of the second model includes a regression neural network, (when the task of the student network is a regression task, the form of the task specific loss function is a distance loss function [0072]; the features of the second middle layer refer to feature maps output from a second specific network layer of the student network after the training sample data are provided to the student network [0068]; the second specific network layer is a middle network layer or the last network layer of the student network [0069])
	Modified Huang does not explicitly teach wherein the second layer of the second model includes a logistic regression model.
	Caelen teaches wherein the second layer of the second model includes a logistic regression model (The distribution over classes fraud and non-fraud given state st is modeled with a logistic regression output model [0041])
	It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Modified Huang to incorporate the teachings of Caelen for the benefit of taking into account the time elapsed between two authentication, operation or transactions (Caelen, [0017])
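
For illustration only, a logistic regression output layer fitted on the outputs of a preceding layer, as in the combination discussed above, can be sketched as follows. This is a minimal sketch under stated assumptions: the function names, the learning rate, and the sample values are hypothetical, and the sketch is not the implementation of Huang or Caelen.

```python
import math


def logistic_output(hidden, weights, bias):
    """Positive-class probability from a logistic regression output layer."""
    z = sum(w * h for w, h in zip(weights, hidden)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def sgd_step(hidden, label, weights, bias, lr):
    """One gradient step on the log loss, treating the first-layer output as fixed."""
    grad = logistic_output(hidden, weights, bias) - label  # d(log loss)/dz
    new_weights = [w - lr * grad * h for w, h in zip(weights, hidden)]
    return new_weights, bias - lr * grad


# Hypothetical first-layer output for one transaction labeled as fraud (1).
hidden = [1.0, 2.0]
weights, bias = sgd_step(hidden, 1, [0.0, 0.0], 0.0, lr=0.1)
print(weights, bias, logistic_output(hidden, weights, bias))
```

Only the output layer's weights and bias are updated in this sketch, which mirrors training a second layer based on the output of a first layer.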

	Regarding claim 10, Modified Huang teaches the computer-implemented method of claim 9, Huang teaches wherein the first model includes at least one of the following: a deep neural network, a recurrent neural network, an ensemble of a plurality of neural networks, or any combination thereof, (As for the neural network training scheme provided by the embodiments of the present application, on one aspect, it can train and obtain student networks with a broader application range through aligning features of middle layers of teacher networks with those of student networks [0036]. Examiner notes: teacher network as first model)
	wherein the first layer of the second model includes a regression neural network, (when the task of the student network is a regression task, the form of the task specific loss function is a distance loss function. [0072]; the features of the second middle layer refer to feature maps output from a second specific network layer of the student network after the training sample data are provided to the student network [0068]; the second specific network layer is a middle network layer or the last network layer of the student network [0069]) and 
	Modified Huang does not explicitly teach wherein the second layer of the second model includes a logistic regression model.
	Caelen teaches wherein the second layer of the second model includes a logistic regression model (The distribution over classes fraud and non-fraud given state st is modeled with a logistic regression output model [0041])
	It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Modified Huang to incorporate the teachings of Caelen for the benefit of taking into account the time elapsed between two authentication, operation or transactions (Caelen, [0017])

	

Conclusion
	Any inquiry concerning this communication or earlier communications from the examiner should be directed to MORIAM MOSUNMOLA GODO whose telephone number is (571)272-8670. The examiner can normally be reached Monday-Friday 7:30am-5:30pm EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Li B. Zhen, can be reached at (571)272-3768. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/M.G./Examiner, Art Unit 2121                                    

/Li B. Zhen/Supervisory Patent Examiner, Art Unit 2121