DETAILED ACTION
This Final Office Action is responsive to Applicant’s Amendment filed on 23 November 2020, in which the claim status is:
Amended: Claims 102, 104-107, 113, 116-119, and 123
Canceled: Claims 109-112, 114-115, 120-122, and 124-125
Claims 102-108, 113, 116-119, and 123 are currently pending and under examination, of which claims 102, 113, 116, and 123 are independent claims. Similar claim language is recited in method and system form within the group of claims 102/116 and, separately, within the group of claims 113/123. No claims are presently in condition for allowance.
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Information Disclosure Statement
As required by M.P.E.P. 609(c), the applicant’s submission of the Information Disclosure Statement dated 09/30/2020 is acknowledged by the examiner, and the cited references have been considered in the examination of the claims now pending. As required by M.P.E.P. 609 C(2), a copy of the PTOL-1449 initialed and dated by the examiner is attached to the instant office action.

Response to Remarks
Examiner thanks the inventor and attorney for the interview discussion of 11/18/2020.
The objections to claims 104 and 117 are withdrawn as necessitated by applicant’s amendments.
The prior rejection under Double Patenting is hereby withdrawn subsequent to express abandonment of application 16/930,085 upon which the rejection was based.
Applicant’s remarks dated 11/23/2020 regarding the prior art have been considered, but they are moot in view of the new grounds of rejection as necessitated by applicant’s amendments. Updated search and consideration is accorded to reflect present status of claims. 

Double Patenting
The nonstatutory double patenting rejection is based on a judicially created doctrine grounded in public policy (a policy reflected in the statute) so as to prevent the unjustified or improper timewise extension of the “right to exclude” granted by a patent and to prevent possible harassment by multiple assignees. A nonstatutory double patenting rejection is appropriate where the conflicting claims are not identical, but at least one examined application claim is not patentably distinct from the reference claim(s) because the examined application claim is either anticipated by, or would have been obvious over, the reference claim(s). See, e.g., In re Berg, 140 F.3d 1428, 46 USPQ2d 1226 (Fed. Cir. 1998); In re Goodman, 11 F.3d 1046, 29 USPQ2d 2010 (Fed. Cir. 1993); In re Longi, 759 F.2d 887, 225 USPQ 645 (Fed. Cir. 1985); In re Van Ornum, 686 F.2d 937, 214 USPQ 761 (CCPA 1982); In re Vogel, 422 F.2d 438, 164 USPQ 619 (CCPA 1970); In re Thorington, 418 F.2d 528, 163 USPQ 644 (CCPA 1969).
A timely filed terminal disclaimer in compliance with 37 CFR 1.321(c) or 1.321(d) may be used to overcome an actual or provisional rejection based on nonstatutory double patenting provided the reference application or patent either is shown to be commonly owned with the examined application, or claims an invention made as a result of activities undertaken within the scope of a joint research agreement. See MPEP § 717.02 for applications subject to examination under the first inventor to file provisions of the AIA as explained in MPEP § 2159. See MPEP § 2146 et seq. for applications not subject to examination under the first inventor to file provisions of the AIA. A terminal disclaimer must be signed in compliance with 37 CFR 1.321(b).

Claims 102 and 116 are rejected on the ground of nonstatutory double patenting as being unpatentable over claims 1-39 of U.S. Patent No. 10,832,137B2. Although the claims at issue are not identical, they are not patentably distinct from each other because both claim sets are directed to cover of nodes in merging of neural networks. See the following comparison table: 
Instant Application: 16/903,980
Claim 102.
A method for training a parent neural network, the method comprising:
identifying, by a computer system, a pair of nodes, nodes A and B, of the parent neural network where both node B covers node A and, in the training of the parent neural network, a magnitude of a partial derivative of a potential connection weight from node A to node B is greater than a threshold value;
training, by the computer system, through machine learning, an imitation machine leaning network to imitate node B, wherein training the imitation machine learning network to imitate node B comprises training the imitation machine learning network to make a same output as node B on training data examples in a set of training data,
adding, by the computing system, the imitation machine learning network to the parent neural network such that node A of the parent neural network covers the imitation machine learning network; and
after adding the imitation machine learning network to the parent neural network, resuming training the parent neural network.

US Patent No: 10,832,137B2
A computer-implemented method for merging first and second nodal networks […] the first and second network comprise knowledge, the method comprising:
merging, by a computer system, the first and second nodal networks into the merged network such that, prior to training of the merged network, no node in the first nodal network covers a node in the second nodal network and no node in the second nodal network covers a node in the first nodal network;
evaluating, by the computer system, potential cross-connections between the first and second nodal network in the merged network, wherein each potential cross-connection is an arc between a node in the first nodal network and a node in the second nodal network, and wherein the evaluation of the potential cross-connection is based on an estimated improvement in an objective of the merged network that includes the evaluated potential cross-connection;
adding, by the computer system, at least one potential cross-connection to the merged network based, at least in part, on the evaluation such that, after the at least one potential cross-connection is added to the merged network, a node in the second nodal network covers a node in the first nodal network; and
after adding the at least one potential cross-connection to the merged network, training, by the computer system, the merged network.
Claim 5.
The computer-implemented method of claim 4, wherein adding the at least one potential cross-connection comprises adding a potential cross-connection to the merged network upon a determination that the gradient cross-product for the e potential cross-connection exceeds a threshold value.

Instant Application: 16/903,980
Claim 116.
A computer system for training a parent neural network, the computer system comprising:
one or more processor cores; and
a memory in communication with the one or more processor cores, wherein the memory stores instructions that when executed by the one or more processor cores cause the one or more processor cores to:
identify a pair of nodes, nodes A and B, of the parent neural network where both node B covers node A and, in the training of the parent neural network, a magnitude of a partial derivative of a potential connection weight from node A to node B is greater than a threshold value;
train, through machine learning, an imitation machine leaning network to imitate node B by training the imitation machine learning network to make a same output of node B on training data examples in a set of training data,
add the imitation machine learning network to the parent neural network such that node A of the parent neural network covers the imitation machine learning network; and
after adding the imitation machine learning network to the parent neural network, resume training the parent neural network.

US Patent No: 10,832,137B2
Claim 23.
A computer system for merging first and second nodal networks to create a merged network, […] the first and second network comprise knowledge, the computer system comprising:
one or more processor cores; and a memory in communication with the one or more processor cores, wherein the memory stores software that when executed by the one or more processors cause the one or more processor cores to:
merge the first and second nodal networks into the merged network such that, prior to training of the merged network, no node in the first nodal network covers a node in the second nodal network and no node in the second nodal network covers a node in the first nodal network;
evaluate potential cross-connections between the first and second nodal network in the merged network, wherein each potential cross-connection is an arc between a node in the first nodal network and a node in the second nodal network, and wherein the evaluation of the potential cross-connection is based on an estimated improvement in an objective of the merged network that includes the evaluated potential cross-connection;
add at least one potential cross-connection to the merged network based, at least in part, on the evaluation such that, after the at least one potential cross-connection is added to the merged network, a node in the second nodal network covers a node in the first nodal network; and
after adding the at least one potential cross-connection to the merged network, train the merged network.
Claim 27.
The computer system of claim 127, wherein the software stored in the memory further causes the one or more processor cores to add the at least one potential cross-connection by adding a potential cross-connection to the merged network upon a determination that the gradient cross-product for the e potential cross-connection exceeds a threshold value.

Examiner notes identical specifications between the instant application and issued patent.


Claim 102 is provisionally rejected on the ground of nonstatutory double patenting as being unpatentable over claims 1-22 of copending Application No. 17/068,472. Although the claims at issue are not identical, they are not patentably distinct from each other because both claim sets are directed to cover of nodes in merging of neural networks.
This is a provisional nonstatutory double patenting rejection because the patentably indistinct claims have not in fact been patented. See the following comparison table: 
Instant Application: 16/903,980
Claim 102.
A method for training a parent neural network, the method comprising:
identifying, by a computer system, a pair of nodes, nodes A and B, of the parent neural network where both node B covers node A and, in the training of the parent neural network, a magnitude of a partial derivative of a potential connection weight from node A to node B is greater than a threshold value;
training, by the computer system, through machine learning, an imitation machine leaning network to imitate node B, wherein training the imitation machine learning network to imitate node B comprises training the imitation machine learning network to make a same output as node B on training data examples in a set of training data,
adding, by the computing system, the imitation machine learning network to the parent neural network such that node A of the parent neural network covers the imitation machine learning network; and
after adding the imitation machine learning network to the parent neural network, resuming training the parent neural network.

Copending Application: 17/068,472
Claim 1.
A method for training one or more neural networks, the method comprising […]
Claim 9.
[…]
merging the first and second neural networks comprises: merging, by a computer system, the first and second neural networks into the merged network such that, prior to training of the merged network, no node in the first neural network covers a node in the second neural network and no node in the second neural network covers a node in the first neural network;
evaluating, by the computer system, potential cross-connections between the first and second neural network in the merged network, wherein each potential cross-connection is an arc between a node in the first neural network and a node in the second neural network, and wherein the evaluation of the potential cross-connection is based on an estimated improvement in an objective of the merged network that includes the evaluated potential cross-connection;
adding, by the computer system, at least one potential cross-connection to the merged network based, at least in part, on the evaluation such that, after the at least one potential cross-connection is added to the merged network, a node in the second neural network covers a node in the first neural network; and
after adding the at least one potential cross-connection to the merged network, training, by the computer system, the merged network.


Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.


Claims 102-108, 113, 116-119, and 123 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. In determining whether the claims are subject matter eligible, the examiner applies eligibility analysis as set forth in MPEP 2106. 
Step 1: Is the claim to a process, machine, manufacture, or composition of matter? Yes—all claims fall within one of the four statutory categories. Claims 102-108 and 113 are method/process and claims 116-119 and 123 are system/machine.
Step 2A, prong one: Does the claim recite an abstract idea, law of nature or natural phenomenon? Yes—the claims are directed to an abstract idea of math. Specific limitations comprise: 
“magnitude of a partial derivative of a potential connection weight from node A to node B is greater than a threshold value”
“node B covers node A”. Specification [0047] describes “cover” w.r.t. partially ordered set.
“adding… the imitation machine learning network to the parent neural network”
“activations of an output node N of the imitation machine learning network have a high vector correlation with derivative vectors”. Remarks 11/23/2020 [Page 10] describe the vector correlation as “mathematical term” w.r.t. cosine of the angle between vectors.
These claim limitations amount to a process that, under its broadest reasonable interpretation, covers mathematical calculations, which fall within the grouping of mathematical concepts as one of the enumerated judicial exceptions. Therefore, the claim recites an abstract idea.
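For illustration only, the vector correlation limitation identified above reduces to a short calculation. The following is a minimal sketch, assuming the cosine-of-the-angle reading described in the Remarks; the function name and the sample values are hypothetical and are not taken from the specification or the claims.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Hypothetical values, one entry per training data item.
activations_n = [0.9, 0.1, 0.8, 0.2]   # assumed activations of output node N
derivatives_a = [1.8, 0.2, 1.6, 0.4]   # assumed derivative vector of node A
similarity = cosine_similarity(activations_n, derivatives_a)
```

In this sketch the second vector is a positive multiple of the first, so the two vectors point in the same direction and the cosine is at its maximum.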
Step 2A, prong two: Does the claim recite additional elements that integrate the judicial exception into a practical application? No: the judicial exception is not integrated into a practical application. Although the claim recites training of a computer system for imitation among neural networks, the functionality amounts to linking the judicial exception to a field of use or technological environment, see MPEP 2106.05(h). In consideration of said technological environment, examiner notes support as evidenced by US PGPub No 20160078339A1, whose cover page is Fig 3 (below), which illustrates a known environment of student/teacher neural networks at a time effectively filed nearly four years prior to the earliest priority of the instant application. Four years is a significant timeframe in the rapidly advancing field of machine learning. The scope of the claim amounts to specified math for the identified environment so as to modify training with an alternative loss.
Examiner anticipates applicant rebuttal of improvement to the functioning of a computer, and in this regard would counter by noting that a particular nexus is not identified between the claim and any alleged improvement. The specification does not reveal unexpected results such as by performance evaluation on baseline datasets. To the contrary, the specification notes [0111] “There is no immediate improvement in performance”, and examiner notes that the adding of relative networks without consideration of size may actually present an intractable solution. Regardless, the specification is relied upon for a wide family of applications and thus is not readily distilled into a specific improvement scientifically tied to the present claims. In fact, the motivation for such arrangement appears to be absent, which is largely the reason that rejection under 35 U.S.C. 101 is appropriate.
Further, the claim recites that the functionality is performed by “memory in communication with processor cores”. The memory and processors are recited at a high level of generality such that they amount to no more than mere instructions to apply the exception using generic computer components.

[Image: media_image1.png]

Step 2B: Does the claim recite additional elements that amount to significantly more than the judicial exception? No—as noted above, additional elements are identified as field of use per MPEP 2106.05(h). The courts have identified such elements as failing to add significantly more to the abstract idea. For instance, the courts have often cited Parker v. Flook, 437 U.S. at 586-90, 198 USPQ at 196-98 in finding of ineligibility for mathematical formula in a field of use. A more recent example offers insight into findings with regard to improvement to the function of a computer, whereupon courts upheld abstract ineligibility, see Simio, LLC. v. Flexsim Software Products, Inc. (Fed. Cir. 2020).
For the reasons above, independent claims 102, 113, 116, and 123 are rejected as being directed to non-patentable subject matter under §101. This rejection applies equally to dependent claims 103-108 and 117-119.
Dependent claim 103 discloses wherein training data comprises labeled, unlabeled, and generated examples “by a data generator”. Examiner finds the specification silent on what constitutes the claimed generator, which is therefore interpreted as amounting to the network. The training examples amount to non-functional descriptive matter and/or mere data gathering.
Dependent claims 104 and 117 disclose wherein parent neural network has a sub-network culminating in node B, and imitation network comprises more nodes than sub-network. This further describes the field of use, and is illustrated by supporting evidence ‘339 as distribution of nodes.
Dependent claims 105-106 and 118-119 disclose a first limitation similar to claim 104 and further that relative networks are described by “superset of capabilities” or “proper subset of capabilities”. There is no clear definition of what a capability should be interpreted as, let alone what constitutes a superset and subset thereof, or how a set is to be considered proper. Accordingly, this is considered part of the abstract idea as a mathematical relationship relating to cover among partially ordered sets.
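For illustration of the mathematical relationship referenced above, the following is a minimal sketch assuming the standard order-theoretic definition of cover (B covers A when A < B and no element lies strictly between them). The graph representation, function names, and node labels are hypothetical and are not drawn from the specification.

```python
def reachable(edges, start):
    """Return all nodes strictly above `start`, following lower -> higher edges.

    `edges` maps a node to the nodes directly connected above it; the graph is
    assumed to be acyclic so that it describes a strict partial order.
    """
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for nxt in edges.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def covers(edges, b, a):
    """True when b covers a: a < b and no node c satisfies a < c < b."""
    above_a = reachable(edges, a)
    if b not in above_a:
        return False
    # Any node above a from which b is still reachable lies strictly between.
    return not any(b in reachable(edges, c) for c in above_a if c != b)


# Hypothetical three-node order: A < C < B, plus a direct A -> B edge.
example_edges = {"A": ["C", "B"], "C": ["B"]}
```

With these hypothetical edges, C covers A and B covers C, while B does not cover A because C lies strictly between them.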
Dependent claim 107 discloses a first limitation similar to claim 104 and further “the imitation machine learning network comprises a different network topology than the sub-network”. This further describes the field of use, and is illustrated by supporting evidence ‘339 as distribution of nodes. Further, the specification is completely silent as to topology.
Dependent claim 108 discloses wherein the imitation learning network comprises a self-organizing partially ordered network. The label of network type as described amounts to non-functional descriptive matter which does not impose any limitation other than naming the network. Accordingly, the claim does not provide meaningful limitation, see MPEP 2111.05 and 2106.05(e).
Taken alone, their additional elements do not amount to significantly more than the above-identified judicial exception (the abstract idea). Looking at the limitations as an ordered combination adds nothing that is not already present when looking at the elements taken individually. There is no evidence that the combination of elements improves the functioning of a computer or improves any other technology. Their collective functions merely provide conventional computer implementation.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claims 102, 104-108, 113, and 116-119 are rejected under 35 U.S.C. 103 as being unpatentable over:
You et al., “Learning from Multiple Teacher Networks”, hereinafter You, in view of 
Zagoruyko and Komodakis, “Paying More Attention to Attention: Improving the Performance of Convolutional Neural Networks via Attention Transfer”, hereinafter Zagoruyko, in view of 
Luo et al., “Face Model Compression by Distilling Knowledge from Neurons”, hereinafter Luo, in view of 
Matsuda et al., US PG Pub No 20160110642A1, hereinafter Matsuda.
With respect to claim 102, You teaches: 
A method for training a parent neural network {You [P.1289] illustrated Figure 1 student/teacher neural networks, “method to train”}, the method comprising: 
identifying, by a computer system, a pair of nodes, nodes A and B, of the parent neural network where both node B covers node A and, 
training, by the computer system, through machine learning, an imitation machine leaning network to imitate node B, wherein training the imitation machine learning network to imitate node B comprises training the imitation machine learning network to make a same output as node B on training data examples in a set of training data, 

[Image: media_image2.png]

You Figure 1 illustrates student teacher neural network architecture with joint learning and plurality of teachers, pairwise dissimilarity, and distillation of dark knowledge which regularizes intermediate layers for transfer learning. The purpose of such an environment is to find the best teacher for a student in relative neural networks. Equation 7 establishes a combined loss which includes relative dissimilarity. 
Terminology of the claim is accorded broadest reasonable interpretation. Imitation learning is akin to “mimic”, and claimed same output is by approximation, see [P.1286 ¶1] “During training, the student not only approximates the output of the teacher…” and [P.1287 Sect2.3 ¶1] “encourage the student to mimic the outputs of the layer before the softmax layer in the teacher”. Cover in the context of You is [P.1288 Sect3.1 ¶3] “quantize the dissimilarity between two examples’ intermediate output as their distance according to some distance metric… Given a triplet (pi, pi+, pi-) where pi is the anchor point, there exists a partially ordered relation in terms of relative similarity, i.e. pi+ ≻pi pi- … determine the partially ordered relation according to the teacher network”. The pairwise dissimilarity for a point, or activation, suggests nodal cover. Generally, it is known that convolutional layers comprise nodes. For instance, [P.1287 Sect2.3 ¶2-4] details activations (as would be understood being activation of a node) where weight parameter applied to cross-entropy for label vectors between student and teacher. One would be motivated to read You in view of the claimed limitations because it considers “loss imposed on the intermediate layer of the student network” (You [P.1288 Last¶]) which further “serve as stimulations for the gradients, and lead to good convergence behavior” (You [P.1287 Sect.2 Last¶]).
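The relative-similarity relation quoted above can be illustrated with a short sketch. It assumes Euclidean distance as the distance metric (the quoted passage says only “some distance metric”), and the function names and sample vectors are hypothetical rather than taken from You.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors (assumed metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def is_more_similar(anchor, first, second, distance=euclidean):
    """True when `first` is strictly closer to `anchor` than `second` is."""
    return distance(anchor, first) < distance(anchor, second)


def student_preserves_order(teacher_triplet, student_triplet):
    """Check that the student's intermediate outputs rank the two candidates
    in the same order, relative to the anchor, as the teacher's outputs do."""
    return is_more_similar(*teacher_triplet) == is_more_similar(*student_triplet)


# Hypothetical intermediate outputs for the same three examples
# (anchor, candidate 1, candidate 2) in a teacher and in two students.
teacher = ([0.0, 0.0], [1.0, 0.0], [3.0, 0.0])
student_ok = ([0.0, 0.0], [0.5, 0.0], [2.0, 0.0])
student_bad = ([0.0, 0.0], [2.0, 0.0], [0.5, 0.0])
```

Here the teacher places candidate 1 closer to the anchor than candidate 2; the first student keeps that ordering and the second reverses it.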
However, You does not expressly disclose “magnitude of a partial derivative of a potential connection weight from node A to node B is greater than a threshold value” as most algorithms automatically compute gradient. Further support is provided in order to establish nodal discrimination.
Zagoruyko teaches: 
and, in the training of the parent neural network, a magnitude of a partial derivative of a potential connection weight from node A to node B is greater than a threshold value; 
Zagoruyko teaches student/teacher neural networks per Figs 1b and 5. Partial derivative calculation is presented for both student and teacher per [P.6] Equation 3, where notation is S=student, T=teacher, L=loss, W=weight. The objective is to describe a total loss whereby [P.5 ¶2] “transfer losses are placed between student and teacher”. Further, [P.3 Sect3.1] “the absolute value of a hidden neuron activation (that results when the network is evaluated on a given input) can be used as an indication about the importance of that neuron… max of absolute values” teaches magnitude w.r.t. a threshold (i.e., max). See also [P.6 Sect3.2 ¶2] “student gradient attention to be similar to teacher attention”.
Zagoruyko is directed to machine learning for training student/teacher models thus being analogous. A person having ordinary skill in the art would have considered it obvious prior to the effective filing date to detail the gradient of You as derived by Zagoruyko “in order to significantly improve the performance of a student CNN network by forcing it to mimic the attention maps of a powerful teacher network” (Zagoruyko [Abstract]) and/or for the reason “puts more weight to spatial locations that correspond to the neurons with the highest activations” (Zagoruyko [P.4 ¶2]).
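For illustration of the claimed threshold limitation, the following minimal sketch relies on the ordinary chain rule: for a connection from node A into node B, the partial derivative of the loss with respect to the connection weight is the sum, over training examples, of node A's activation multiplied by the derivative of the loss with respect to node B's input. This is a generic calculation offered under that assumption, not the implementation of the instant application or of any cited reference; the names and values are hypothetical.

```python
def potential_weight_gradient(activations_a, input_derivatives_b):
    """Partial derivative of the loss w.r.t. a (not yet existing) weight A -> B.

    activations_a[i]       : output of node A on training example i
    input_derivatives_b[i] : dLoss/d(input of node B) on training example i
    """
    if len(activations_a) != len(input_derivatives_b):
        raise ValueError("one value per training example is required")
    return sum(a * d for a, d in zip(activations_a, input_derivatives_b))


def exceeds_threshold(activations_a, input_derivatives_b, threshold):
    """True when the magnitude of the partial derivative exceeds the threshold."""
    gradient = potential_weight_gradient(activations_a, input_derivatives_b)
    return abs(gradient) > threshold


# Hypothetical values for three training examples.
act_a = [1.0, 0.5, 0.0]
delta_b = [0.4, 0.2, -0.3]
gradient = potential_weight_gradient(act_a, delta_b)  # 0.4 + 0.1 + 0.0
```

The same dot product, once normalized by the vector lengths, is the cosine measure discussed with respect to claim 113.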
However, the combination of You and Zagoruyko does not appear to disclose the step of adding networks at the node level.
Luo teaches: 
adding, by the computing system, the imitation machine learning network to the parent neural network such that node A of the parent neural network covers the imitation machine learning network; and 
Luo discloses student/teacher networks with key teaching whereby [P.3561 RtCol] “The merit behind our method is to select informative neurons in the top hidden layer of a teacher, and adopt the features (responses) of the chosen neurons as supervision to train a student, mimicking the teacher’s feature space… We formulate neuron selection as an inference problem on a fully-connected graph, where each node represents a neuron and each edge represents the correlation between a pair of neurons… pairwise costs of selecting neuron i”. That is, Luo selects a neuron identified upon pairwise cost for mimic learning among student/teacher networks. 
Luo is directed to machine learning with student/teacher model training thus being analogous. A person having ordinary skill in the art would have considered it obvious prior to the effective filing date to implement efficient neuron selection disclosed by Luo in combination with You, and Zagoruyko in order to “select neurons, which are more discriminative” (Luo [P.3562 ¶1]) and/or “with neuron selection, the student S is able to outperform its corresponding teacher models by using much fewer parameters and process much faster” (Luo [P.3565 ¶2]). Furthermore, the correlation between pair of neurons disclosed by Luo would have been obvious to implement as cover disclosed by You as applying a known technique to a known method to yield predictable results and/or as “pruned the unimportant neurons” (You [P.1286 Sect2.1]).
Finally, the combination of You, Zagoruyko, and Luo does not disclose training which is “resumed”, “after” the adding step, e.g., freeze/thaw parameter training.
Matsuda teaches:
after adding the imitation machine learning network to the parent neural network, resuming training the parent neural network.
	Matsuda discloses holding parameters fixed during training across networks per [0058] “training the DNN while fixing the parameters of independent sub-network 120” again at [0048], [0019], and illustrated Fig 10:1010. 
	Matsuda is directed to machine learning training optimization thus being analogous. A person having ordinary skill in the art would have considered it obvious to hold training parameters fixed as disclosed by Matsuda in combination with above for the reason that “By fixing parameters of independent sub-network, training of a sub-network for images of a category not used for learning is possible… time for learning can also be shortened” (Matsuda [0086]) and/or for consideration of computational resources per “training of dependent sub-network 234 is far less burdensome than training of DNN as a whole” (Matsuda [0049]).
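For illustration of training that resumes while selected parameters are held fixed, the following is a minimal sketch of a plain gradient-descent update that skips frozen parameters. It is an assumption-laden toy example, not Matsuda's implementation; the parameter names, gradients, and learning rate are hypothetical.

```python
def sgd_step(params, grads, frozen, learning_rate=0.1):
    """Return updated parameters, leaving every name listed in `frozen` unchanged."""
    return {
        name: value if name in frozen else value - learning_rate * grads[name]
        for name, value in params.items()
    }


# Hypothetical parameters: one belonging to a fixed sub-network, one trainable.
params = {"w_fixed_subnetwork": 1.0, "w_trainable": 2.0}
grads = {"w_fixed_subnetwork": 0.5, "w_trainable": 0.5}
updated = sgd_step(params, grads, frozen={"w_fixed_subnetwork"})
```

Only the trainable parameter moves against its gradient; the frozen parameter keeps its prior value, so fewer parameters are updated per step.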

With respect to claim 104, the combination of You, Zagoruyko, Luo and Matsuda teaches the method of claim 102, wherein: 
	the parent neural network, prior to the addition to of the imitation machine learning network, had a sub-network culminating in node B {You Fig 1 where [P.1287 ¶3] “softened outputs… output layer” of student and teacher networks, culminating node is inherent in output layer. Alternatively, see Matsuda cover page}; and the imitation machine learning network comprises more nodes than the sub-network {You [P.1287 Sect2.3 ¶3] “training a deeper student network” deeper suggests more nodes}.

With respect to claim 105, the combination of You, Zagoruyko, Luo and Matsuda teaches the method of claim 102, wherein: 
	the parent neural network, prior to the addition to of the imitation machine learning network, had a sub-network culminating in node B {You Fig 1 where [P.1287 ¶3] “softened outputs… output layer” of student and teacher networks, culminating node is inherent in output layer. Alternatively, see Matsuda cover page}; and 
capabilities of the imitation machine learning network are a superset of capabilities of the sub-network {You [P.1291 Last¶] “The smallest Student1 achieves the greatest compression rate”, Tbl1}.

With respect to claim 106, the combination of You, Zagoruyko, Luo and Matsuda teaches the method of claim 102, wherein: 
	the parent neural network, prior to the addition to of the imitation machine learning network, had a sub-network culminating in node B {You Fig 1 where [P.1287 ¶3] “softened outputs… output layer” of student and teacher networks, culminating node is inherent in output layer. Alternatively, see Matsuda cover page}; and 
capabilities of the imitation machine learning network are a proper subset of capabilities of the sub-network {You [P.1291] Table 1 where last column details student having lower accuracy}.

With respect to claim 107, the combination of You, Zagoruyko, Luo and Matsuda teaches the method of claim 102, wherein: 
	the parent neural network, prior to the addition to of the imitation machine learning network, had a sub-network culminating in node B {You Fig 1 where [P.1287 ¶3] “softened outputs… output layer” of student and teacher networks, culminating node is inherent in output layer. Alternatively, see Matsuda cover page}; and 
the imitation machine learning network comprises a different network topology than the sub-network {You [P.1291] Table 1 details student and teacher having different layer and parameter composition. See also Zagoruyko [P.9 Sect4.2.2 ¶2] “teacher and student have different architecture”}.

With respect to claim 108, the combination of You, Zagoruyko, Luo and Matsuda teaches the method of claim 102, wherein:  
the imitation machine learning network comprises a self-organizing partially ordered network {You [P.1289 Sect3.2 ¶1] “From Eq.(6) we can see the determination of the triplet’s partially ordered relationship is the core of the knowledge transfer from a teacher network into the student network”; [P.1288 Sect3.1 ¶3]; [P.1290 ¶1] “self-regulated learning”}.
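
For illustration only, the role of a triplet's partially ordered relationship in transferring knowledge from a teacher to a student can be sketched as follows. You's Eq. (6) is not reproduced in this action, so this is a generic triplet hinge loss offered under stated assumptions, not the reference's formulation: the Euclidean distance, the margin, and all function names are hypothetical.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def relative_order_loss(teacher_reps, student_reps, i, j, k, margin=1.0):
    """Hinge penalty when the student violates the teacher's ordering of (i, j, k).

    The teacher's representations decide which of examples j and k is closer
    to the anchor i; the student is penalized unless it keeps the same order
    by at least `margin`. Assumed form for illustration only.
    """
    d_ij = euclidean(teacher_reps[i], teacher_reps[j])
    d_ik = euclidean(teacher_reps[i], teacher_reps[k])
    closer, farther = (j, k) if d_ij <= d_ik else (k, j)
    d_close = euclidean(student_reps[i], student_reps[closer])
    d_far = euclidean(student_reps[i], student_reps[farther])
    return max(0.0, d_close - d_far + margin)
```

In this sketch a student whose intermediate representations preserve the teacher's ordering incurs zero loss, while a student that reverses the ordering is penalized in proportion to the violation.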

With respect to claim 113, the rejection of claim 102 is incorporated herein. Further consideration is given to the limitation: 
training, by the computer system, through machine learning, an imitation machine learning network such that, for a plurality of training data items, activations of an output node N of the imitation machine learning network have a high vector correlation with derivative vectors of node A in the parent neural network. 
Zagoruyko teaches student/teacher neural network training with distillation/transfer learning, illustrated in Figs 5 and 1b, whereby [P.5 Last¶] “the j-th pair of student and teacher attention maps in vectorized form” are compared and the correlation is expressed by the total loss of Equation 2. Derivations of the derivatives are provided at [P.6]; the special computation node is at least [P.3 Sect3.1 ¶2] “absolute value of a hidden neuron activation”; and the guiding rationale is provided at [P.4 ¶2] “puts more weight to spatial locations that correspond to the neurons with highest activations” so as to [Abstract] “significantly improve the performance of a student CNN network by forcing it to mimic the attention maps of a powerful teacher network”.
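
For illustration only, the comparison of vectorized student and teacher attention maps, and a vector correlation taken over a plurality of training data items, can be sketched as follows. This is a minimal sketch under stated assumptions and does not reproduce Zagoruyko's Equation 2: the attention map is assumed to be a channel-wise sum of powered absolute activations, the maps are assumed to be L2-normalized before comparison, the correlation is assumed to be a cosine, and all function names are hypothetical.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit L2 norm; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(a * a for a in v))
    return list(v) if norm == 0 else [a / norm for a in v]


def cosine_correlation(u, v):
    """Cosine of the angle between two vectors, e.g. one entry per training data item."""
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0 or nv == 0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)


def attention_map(activations, p=2):
    """Vectorized attention map: sum over channels of |activation| ** p.

    `activations` is a list of channels, each a flat list over spatial locations.
    """
    n_locations = len(activations[0])
    return [sum(abs(ch[i]) ** p for ch in activations) for i in range(n_locations)]


def attention_transfer_term(student_act, teacher_act, p=2):
    """L2 distance between normalized student and teacher attention maps."""
    q_s = l2_normalize(attention_map(student_act, p))
    q_t = l2_normalize(attention_map(teacher_act, p))
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(q_s, q_t)))
```

Under these assumptions, a student whose attention map is a positive multiple of the teacher's yields a transfer term of zero, and a cosine near 1 between a vector of node activations and a vector of derivative values (one entry per data item) corresponds to the "high vector correlation" recited in the claim.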

With respect to claim 116, the rejection of claim 102 is incorporated herein. You teaches:
	A computer system for training a parent neural network, the computer system comprising: 
	one or more processor cores; and 
	a memory in communication with the one or more processor cores, wherein the memory stores instructions that when executed by the one or more processor cores cause the one or more processor cores {You Fig 1, [P.1285 Sect.1 ¶1] “memory storage and computation resources… computation devices, such as smartphones and tablets”. See also Luo [P.3565 ¶2] “implementation on a Intel Core 2.0 GHz CPU… 4 megabytes storage”} to:

Claims 117-119 are rejected for the same rationale as claims 104-106, respectively.

With respect to claim 123, the rejection of claim 113 is incorporated herein. You teaches:
	A computer system for training a parent neural network, the computer system comprising: 
	one or more processor cores; and 
	a memory in communication with the one or more processor cores, wherein the memory stores instructions that when executed by the one or more processor cores cause the one or more processor cores {You Fig 1, [P.1285 Sect.1 ¶1] “memory storage and computation resources… computation devices, such as smartphones and tablets”. See also Luo [P.3565 ¶2] “implementation on a Intel Core 2.0 GHz CPU… 4 megabytes storage”} to:

Claim 103 is rejected under 35 U.S.C. 103 as being unpatentable over You, Zagoruyko, Luo, Matsuda, in view of Xu et al., “Learning Loss for Knowledge Distillation with Conditional Adversarial Networks”, hereinafter Xu.
With respect to claim 103, the combination of You, Zagoruyko, Luo and Matsuda teaches the method of claim 102, wherein the training data examples in the set of training data comprises: 
	labeled training data examples; unlabeled training data examples {You [P.1290 RtCol] “semi-supervised… unlabeled examples… labeled examples”}; and 
	However, the combination of You, Zagoruyko, Luo and Matsuda does not disclose a “generator”.
	Xu teaches:
training data examples generated by a data generator {Xu [P.3 Sect3.3] Fig 2 illustrates GAN generative adversarial network for student/teacher training, comprising [P.2 Sect.2 Last¶2] “architecture choices for our generator and discriminator”}.
	Xu is directed to machine learning for training of student/teacher models and is therefore analogous art. A person having ordinary skill in the art would have considered it obvious, prior to the effective filing date, to generate training examples via Xu’s GAN as a design/architecture choice as disclosed, and/or to produce labels that are aligned between student and teacher and that have been discriminated as real/fake (Xu [P.4 ¶3-5]).

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure: 
Huang et al. (TuSimple) disclose three documents from 2017 which provide a strong basis for the scope of the application: US PG Pub No 20180365564A1; “Like What You Like: Knowledge Distill via Neuron Selectivity Transfer”; and “Data-Driven Sparse Structure Selection for Deep Neural Network”. This combination details student/teacher mimic nets with task specific loss, convergence conditions, neuron selectivity between networks, and neuron pruning. TuSimple appears to be the closest potential corp litig., see claims 1, 5.
Tyukin et al., “Knowledge Transfer Between Artificial Intelligence Systems” discloses knowledge transfer from internal representation of student/teacher networks and with cosine correlation, see [P.10] Figure 4 and [P.3] Equation 2.
Wang et al., US PG Pub No 20190197404A1 expressly discloses magnitude of partial derivative and minimizing loss or maximizing utility function, see math [0053], [0059].
Bach et al., US PG Pub No 20180018553A1 discloses [0050] “neuron functions are parameterizable and the function parameters may differ among the neurons”.
Wang et al., “Model Distillation with Knowledge Transfer from Face Classification to Alignment and Verification” discloses [Abstract] “student network can be competitive to the teacher one in alignment and verification, and even surpass the teacher”.
Kangas et al., “Counting Linear Extensions of Sparse Posets” discloses cover, see [Sect1.2].
Anonymous, “Self-Organization Adds Application Robustness to Deep Learners” discloses internal self-organization with weight vector for hidden neuron. Hartono et Trappenberg.
Mishra et al., “Apprentice: Using Knowledge Distillation Techniques to Improve Low-Precision Network Accuracy” discloses student/teacher joint training, see [P.2] Fig 2 and [P.2 ¶2] “architecture of the student network is typically different from that of the teacher”.
Duan et al., “One-Shot Imitation Learning”, OpenAI disclosure.
Stadie et al., “Third-Person Imitation Learning” see Fig 2.
Chen et al., “On Sampling Strategies for Neural Network-based Collaborative Filtering” discloses pointwise and pairwise loss defined over links, see [P.770] Fig 3.
Ravi et al., US Patent No 10,748,066B2 Google patent discloses projection neural networks with alternative loss, see Fig 3 and [Col4 Line35] “mimic the predictions”.
Torkamani et al., US PG Pub No 20190114531A1 discloses differential equations network to learn activation function for a neuron including cosine activation function.
Audiffren et al., “Bandits Dueling on Partially Ordered Sets” discloses poset with regret bounds and transitivity relaxation.
Zhuo et al., “Deep Unsupervised Convolutional Domain Adaptation” discloses correlation alignment loss as building on Zagoruyko, see Fig 2 and Algorithm 1.
Ruder et al., “Knowledge Adaptation: Teaching to Adapt” discloses multiple teacher-student model with cosine correlation, see Equations 4 and 10.
Zhan et al., “Theoretically-Grounded Policy Advice from Multiple Teachers in Reinforcement Learning Settings with Applications to Negative Transfer” discloses grand-teacher.
Tarvainen et Valpola, “Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results” Curious AI Company discloses mean teacher method, see Fig 2.

Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Chase P Hinckley whose telephone number is (571)272-7935.  The examiner can normally be reached on M-F 9:00 - 5:00.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Miranda M. Huang can be reached on 571-270-7092.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/CHASE P. HINCKLEY/Examiner, Art Unit 2124
/MIRANDA M HUANG/Supervisory Patent Examiner, Art Unit 2124