Detailed Action
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
	Claims 1-13 are pending.

Claim Interpretation
The following is a quotation of 35 U.S.C. 112(f):
(f) Element in Claim for a Combination. – An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof. 

The following is a quotation of pre-AIA  35 U.S.C. 112, sixth paragraph:
An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof.

The claims in this application are given their broadest reasonable interpretation using the plain meaning of the claim language in light of the specification as it would be understood by one of ordinary skill in the art.  The broadest reasonable interpretation of a claim element (also commonly referred to as a claim limitation) is limited by the description in the specification when 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is invoked. 
As explained in MPEP § 2181, subsection I, claim limitations that meet the following three-prong test will be interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph:
(A)	the claim limitation uses the term “means” or “step” or a term used as a substitute for “means” that is a generic placeholder (also called a nonce term or a non-structural term having no specific structural meaning) for performing the claimed function; 
(B)	the term “means” or “step” or the generic placeholder is modified by functional language, typically, but not always linked by the transition word “for” (e.g., “means for”) or another linking word or phrase, such as “configured to” or “so that”; and 
(C)	the term “means” or “step” or the generic placeholder is not modified by sufficient structure, material, or acts for performing the claimed function. 
Use of the word “means” (or “step”) in a claim with functional language creates a rebuttable presumption that the claim limitation is to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites sufficient structure, material, or acts to entirely perform the recited function. 
Absence of the word “means” (or “step”) in a claim creates a rebuttable presumption that the claim limitation is not to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is not interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites function without reciting sufficient structure, material or acts to entirely perform the recited function. 
Claim limitations in this application that use the word “means” (or “step”) are being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action. Conversely, claim limitations in this application that do not use the word “means” (or “step”) are not being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action.
This application includes one or more claim limitations that do not use the word “means,” but are nonetheless being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, because the claim limitation(s) uses a generic placeholder that is coupled with functional language without reciting sufficient structure to perform the recited function and the generic placeholder is not preceded by a structural modifier.  Such claim limitation(s) is/are: ‘a reading unit’ and ‘a determining unit’ in claim 12.
Because this/these claim limitation(s) is/are being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, it/they is/are being interpreted to cover the corresponding structure described in the specification as performing the claimed function, and equivalents thereof.
If applicant does not intend to have this/these limitation(s) interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, applicant may:  (1) amend the claim limitation(s) to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph (e.g., by reciting sufficient structure to perform the claimed function); or (2) present a sufficient showing that the claim limitation(s) recite(s) sufficient structure to perform the claimed function so as to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph.

Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.


Claims 1-13 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.

Regarding claim 1, 
2A Prong 1: The limitation of reading a layer structure and parameters of layers from each of models of two neural networks is a mental process, because the limitation merely recites a process of reading data. The limitation of determining a degree of matching between the models of the two neural networks, by comparing layers, that are configured as a graph-like form in respective hidden layers, in order from an input layer based on similarities between respective layers is a mental process, as the limitation recites a process of comparing similarities between two different neural networks by comparing each of the layers. The limitation of using breadth first search or depth first search is also a mental process, because breadth first search and depth first search are methods of searching through a neural network structure or a tree.
2A Prong 2: This judicial exception is not integrated into a practical application. 
2B: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. The respective models of the two neural networks are a form of field of use and technological environment (MPEP 2106.05(h)). 
Claim 12 is an apparatus claim having limitations similar to those of method claim 1. Therefore, it is rejected using the same rationale as claim 1 above.
Regarding claim 13, the limitation of a computer readable storage medium storing a program, the program causes, upon being executed by one or more processors of a computer is a generic computer component, because a computer readable storage medium storing a program, without any further detail, is a component of a common personal computer system. Claim 13 is a computer readable storage medium claim having limitations similar to those of method claim 1. Therefore, it is rejected using the same rationale as claim 1 above.

Regarding claim 2, the limitation of wherein the determining the degree of matching between the models of the two neural networks includes, if types of layers to be compared are different, setting a similarity to 0, and not performing comparison in layers in a later stage than the layers subjected to the comparison is a mental process, because the limitation merely recites the process of assigning a similarity of 0 and not comparing later layers when the types of the layers being compared differ.
This judicial exception is not integrated into a practical application. The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception.

	Regarding claim 5, the limitation of wherein the determining the degree of matching between the models of the two neural networks includes, when full-connected layers are compared is a mental process, because the limitation merely recites a process of comparing two models and determining how similar the models are, which can be done in the human mind or with the aid of pen and paper. 
The limitation of expressing each of the full-connected layers, by regarding weights of each of the full-connected layers as a feature vector, using a vector set is a mental process, because the limitation merely recites converting the data from a layer to a vector. 
The limitation of setting a similarity between the vector sets of the respective full-connected layers to be compared as a similarity between the full-connected layers is also a mental process, because the limitation merely recites setting the result of the calculation as a final similarity value.
This judicial exception is not integrated into a practical application. The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception.

	Regarding claim 8, the limitation of wherein the determining the degree of matching between the models of the two neural networks includes, when activation layers are compared, setting, if types of the activation layers are the same, and a distance between parameters is a predetermined threshold value or less, the distance as a similarity, and in other cases, 0 to the similarity is a mental process, because the limitation merely recites a process of comparing two models, determining how similar the models are, and setting the similarity value using the determined values, which can be done in the human mind or with the aid of pen and paper.
This judicial exception is not integrated into a practical application. The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception.
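For illustration only, the comparison rule recited in claim 8 can be sketched as follows. This is a minimal sketch, not the applicant's implementation: the dictionary layout (`type`, `params`), the function name `activation_similarity`, and the choice of Euclidean distance are assumptions introduced here, since the claim does not specify the distance measure.

```python
import math

def activation_similarity(act_a, act_b, threshold):
    """Hypothetical reading of the claim 8 rule for two activation layers."""
    # Different activation types: the similarity is set to 0.
    if act_a["type"] != act_b["type"]:
        return 0.0
    # Euclidean distance between parameter vectors (an assumed distance measure).
    distance = math.dist(act_a["params"], act_b["params"])
    # Claim language: the distance itself is set as the similarity when it is
    # at or below the threshold; in all other cases the similarity is 0.
    return distance if distance <= threshold else 0.0

# Example layers with a single hypothetical slope parameter.
leaky_a = {"type": "leaky_relu", "params": [0.1]}
leaky_b = {"type": "leaky_relu", "params": [0.4]}
leaky_c = {"type": "leaky_relu", "params": [0.9]}
tanh_a = {"type": "tanh", "params": [0.1]}
```

Under these assumptions, `activation_similarity(leaky_a, leaky_b, 0.5)` returns the distance between the two parameter vectors, while a type mismatch or a distance above the threshold returns 0.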

	Regarding claim 9, the limitation of wherein the type of the activation layer is a linear connection, a sigmoid function, a hard sigmoid function, a tanh function (hyperbolic tangent function), a softsign function, a softplus function, or a ReLU (Rectified Linear Unit) is a field of use and technological environment (MPEP 2106.05(h)), as it merely specifies the type of function it uses.
This judicial exception is not integrated into a practical application. The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception.

	Regarding claim 10, the limitation of wherein the determining the degree of matching between the models of the two neural networks includes, when pooling layers are compared, setting, if types of the pooling layers are the same, and a distance between parameters is a predetermined threshold value or less, the distance as a similarity, and in other cases, 0 to the similarity is a mental process, because the limitation merely recites a process of comparing two models, determining how similar the models are, and setting the similarity value using the determined values, which can be done in the human mind or with the aid of pen and paper.
This judicial exception is not integrated into a practical application. The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception.

	Regarding claim 11, the limitation of wherein the type of the pooling layer is max pooling or average pooling, and the parameters are a filter size and an interval at which a filter is applied is a field of use and technological environment (MPEP 2106.05(h)), as it merely specifies the type of function and the parameters it uses.
This judicial exception is not integrated into a practical application. The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception.

Claim 13 is rejected under 35 U.S.C. 101 because the claimed invention is directed to non-statutory subject matter.  The claim(s) does/do not fall within at least one of the four categories of patent eligible subject matter because the specification does not disclose what constitutes a computer readable storage medium and under broadest reasonable interpretation the claim limitation can include both statutory and non-statutory elements. Therefore the claim is not patent eligible.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.


Claims 1, 12, and 13 are rejected under 35 U.S.C. 103 over Ashmore (Ashmore, 2015, “Evaluating the Intrinsic Similarity between Neural Networks”) in view of Cruz-Albrecht (US 8977578 B1).

Regarding claim 1, Ashmore teaches an information processing method comprising: reading a layer structure and parameters of layers from each of models of two neural networks ([Ashmore, entire page 9], figure 4 “When two models are aligned, the elements of that model may be compared in a pair-wise manner to evaluate the similarity (or dissimilarity) between the two models … First, assuming the activation functions are antisymmetric, the output of any hidden unit may be negated if the weights into which it feeds are also negated … Second, any two hidden units, ua, and ub may be swapped if the corresponding weights and activation functions are also swapped”); and 
determining a degree of matching between the models of the two neural networks, by comparing layers, of the respective models of the two neural networks, that are configured as a graph-like form in respective hidden layers, comparing layers in order from an input layer, based on similarities between respective layers ([Ashmore, page 8, 3.2 Distance Metric, second paragraph] “Because the networks are made up of layers, we could calculate a distance for each pair of layers instead of just a single distance for the entire network. This would give us a distance tuple of the size of the number of layers in the networks. This tuple can be more difficult for a human to analyze but algorithms could analyze the distance to cluster similar networks together or examine the relationships. This tuple is helpful because of the nature of forward bipartite alignment. FBA aligns layer by layer, and analyzing the distance layer by layer can be helpful”, [Ashmore, page 11, Figure 4.2] shows the detailed algorithm for bipartite matching, [Ashmore, page 10, Figure 4.1] “This figure shows how FBA analyzes weights. The align network will be aligned to the target network. Neurons A and D are similar, however neuron D should be negated”, the neural network and its hidden layers can be represented as a graph structure, as shown in the Figure 4.1. The hidden nodes A, B, C, and D, E, F are all connected to the input and output nodes using edges, which is a graph structure).
Ashmore does not specifically teach using breadth first search or depth first search to search through the neural network.
Cruz-Albrecht teaches using breadth first search or depth first search to search through the neural network ([Cruz-Albrecht, column 18, line 58-67 – column 19, line 1-4] “In particular, the Manhattan distance between each pair of nodes of the neural network is determined during the relative distance ranking. The determined Manhattan distance is then used to rank the pair of nodes relative to distances of other pairs of nodes. The ranked node pairs are arranged in ascending order and routing is applied to determine a path between the nodes beginning with the lowest ranked node pair (i.e., nodes separated by a shortest Manhattan distance) and proceeding sequentially to a highest ranked pair of nodes (i.e., a node pair having a longest Manhattan distance). In some embodiments, a queue of the A* search algorithm is managed as a last-in-first-out (LIFO) manner for a partial path cost and the A* search algorithm may behave substantially similar to the depth-first search (DFS)” shows that the depth-first search can be used to search through the neural network structure).
Before the effective filing date of the invention, it would have been obvious to a person of ordinary skill in the art, having the teachings of Ashmore and Cruz-Albrecht, to use the depth-first search method of Cruz-Albrecht to implement the method of comparing two neural networks of Ashmore. The suggestion and/or motivation for doing so is to efficiently search through each of the nodes of the neural network.
Claim 12 is an apparatus claim having limitations similar to those of method claim 1. Therefore, it is rejected using the same rationale as claim 1 above.
Regarding claim 13, Ashmore in view of Cruz-Albrecht teaches a computer readable storage medium storing a program, the program causes, upon being executed by one or more processors of a computer ([Ashmore, page 4, Background, entire page] “For example, it could not be used to distribution computation across a cluster of separate machines without a very high-speed link between them. It certainly would not suffice for allowing arbitrary machines connected to the Internet to participate in a distributed training effort …”, shows that the method is performed in a machine (computer). It is obvious that a computer has a computer readable storage medium that stores a program). Claim 13 is a computer readable storage medium claim having limitations similar to those of method claim 1. Therefore, it is rejected using the same rationale as claim 1 above.
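For illustration only, the claim 1 limitation of comparing layers in order from an input layer using a breadth first search can be sketched as follows. This is a minimal sketch and is neither the applicant's nor Ashmore's implementation: the model layout (`graph`, `input`, `layers`), the function names, and the averaging of per-layer similarities into a single degree of matching are assumptions introduced here.

```python
from collections import deque

def bfs_layer_order(graph, input_layer):
    """Return layer names in breadth-first order, starting at the input layer."""
    order, seen, queue = [], {input_layer}, deque([input_layer])
    while queue:
        layer = queue.popleft()
        order.append(layer)
        for nxt in graph.get(layer, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return order

def degree_of_matching(model_a, model_b, layer_similarity):
    """Average per-layer similarity over layers paired in BFS order (assumed aggregate)."""
    order_a = bfs_layer_order(model_a["graph"], model_a["input"])
    order_b = bfs_layer_order(model_b["graph"], model_b["input"])
    pairs = list(zip(order_a, order_b))
    if not pairs:
        return 0.0
    total = sum(
        layer_similarity(model_a["layers"][a], model_b["layers"][b])
        for a, b in pairs
    )
    # Unpaired layers in the larger model count as similarity 0.
    return total / max(len(order_a), len(order_b))

def same_type(layer_a, layer_b):
    """Toy similarity: 1.0 when the layer types agree, otherwise 0.0."""
    return 1.0 if layer_a["type"] == layer_b["type"] else 0.0

# Hypothetical model whose hidden layers branch, forming a graph-like structure.
model = {
    "input": "in",
    "graph": {"in": ["conv1", "conv2"], "conv1": ["out"], "conv2": ["out"]},
    "layers": {
        "in": {"type": "input"},
        "conv1": {"type": "conv"},
        "conv2": {"type": "conv"},
        "out": {"type": "dense"},
    },
}
# Same structure, but one branch uses a different layer type.
other = {**model, "layers": {**model["layers"], "conv2": {"type": "pool"}}}
```

A depth first variant would differ only in taking the next layer from the end of the queue instead of the front.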

Claim 2 is rejected under 35 U.S.C. 103 over Ashmore (Ashmore, 2015, “Evaluating the Intrinsic Similarity between Neural Networks”) in view of Cruz-Albrecht (US 8977578 B1), and further in view of Williams (US 5240009 A).

Regarding claim 2, Ashmore in view of Cruz-Albrecht teaches wherein the determining the degree of matching between the models of the two neural networks includes ([Ashmore, entire page 9] “When two models are aligned, the elements of that model may be compared in a pair-wise manner to evaluate the similarity (or dissimilarity) between the two models … First, assuming the activation functions are antisymmetric, the output of any hidden unit may be negated if the weights into which it feeds are also negated … Second, any two hidden units, ua, and ub may be swapped if the corresponding weights and activation functions are also swapped”), and comparing layers of the neural network ([Ashmore, page 8, 3.2 Distance Metric, second paragraph] “Because the networks are made up of layers, we could calculate a distance for each pair of layers instead of just a single distance for the entire network. This would give us a distance tuple of the size of the number of layers in the networks. This tuple can be more difficult for a human to analyze but algorithms could analyze the distance to cluster similar networks together or examine the relationships. This tuple is helpful because of the nature of forward bipartite alignment. FBA aligns layer by layer, and analyzing the distance layer by layer can be helpful”).
Ashmore in view of Cruz-Albrecht does not specifically teach, if types of data to be compared are different, setting a similarity to 0, and not performing comparison in data in a later stage than the data subjected to the comparison.
Williams teaches if types of data to be compared are different, setting a similarity to 0, and not performing comparison in data in a later stage than the data subjected to the comparison ([Williams, column 5, line 55-64] “Comparison of two complexes is accomplished in three stages. In the first stage the gain setting of the subject complex is checked against the gain setting of the standard complex. If they differ, the comparison process is terminated and the subject complex is assigned a score of zero. The second step adjusts the complexes so their major peaks are aligned. A peak-by-peak comparison is then done to produce a score which reflects the similarity of the subject complex to the standard complex”, [Williams, column 4, line 1-4] “Standard and subject complexes are also stored in the RAM 16 by the CPU 12. The CPU 12 performs the logical operations directed by the program code in the ROM 14”, the complexes are information stored in computer-readable form. The layers of the neural networks are also information that can be stored in computer-readable form; therefore, it would have been obvious to apply the information comparison method to implement the layer comparison method of Ashmore).
Before the effective filing date of the invention, it would have been obvious to a person of ordinary skill in the art, having the teachings of Ashmore, Cruz-Albrecht, and Williams, to use the technique of Williams, which assigns similarity scores between data and terminates the comparison process when the score is 0, to implement the information processing method of Ashmore and Cruz-Albrecht. The suggestion and/or motivation for doing so is to evaluate the similarity between two data more objectively.
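For illustration only, the early-termination behavior discussed for claim 2 (a type mismatch yields a similarity of 0 and later layers are not compared) can be sketched as follows. This is a minimal sketch under stated assumptions: the function name `compare_in_order`, the `type` key, and the placeholder per-layer similarity are hypothetical and are not taken from the application, Ashmore, or Williams.

```python
def compare_in_order(layers_a, layers_b, same_type_similarity):
    """Compare paired layers in order; stop at the first layer-type mismatch."""
    scores = []
    for layer_a, layer_b in zip(layers_a, layers_b):
        if layer_a["type"] != layer_b["type"]:
            # Mismatch: similarity is 0 and later layers are not compared.
            scores.append(0.0)
            break
        scores.append(same_type_similarity(layer_a, layer_b))
    return scores

# Hypothetical layer sequences that diverge at the second layer.
layers_a = [{"type": "conv"}, {"type": "relu"}, {"type": "dense"}]
layers_b = [{"type": "conv"}, {"type": "pool"}, {"type": "dense"}]

# Placeholder similarity for same-type layers (always 1.0 in this sketch).
scores = compare_in_order(layers_a, layers_b, lambda a, b: 1.0)
```

In this example the third pair of layers is never compared, even though both are of the same type.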

Claims 3 and 6 are rejected under 35 U.S.C. 103 over Ashmore (Ashmore, 2015, “Evaluating the Intrinsic Similarity between Neural Networks”) in view of Cruz-Albrecht (US 8977578 B1), and further in view of Durdanovic (US 20170337472 A1).

Regarding claim 3, Ashmore teaches wherein the determining the degree of matching between the models of the two neural networks includes, when layers are compared ([Ashmore, page 8, 3.2 Distance Metric, second paragraph] “Because the networks are made up of layers, we could calculate a distance for each pair of layers instead of just a single distance for the entire network. This would give us a distance tuple of the size of the number of layers in the networks. This tuple can be more difficult for a human to analyze but algorithms could analyze the distance to cluster similar networks together or examine the relationships. This tuple is helpful because of the nature of forward bipartite alignment. FBA aligns layer by layer, and analyzing the distance layer by layer can be helpful”, it would be obvious to apply to any layer including a convolution layer, since convolutional layers and layers in Ashmore are both matrices). 
modifying parameters of weight filters to be compared to respective filter sizes ([Ashmore, page 10, Figure 4.1] “This figure shows how FBA analyzes weights. The align network will be aligned to the target network. Neurons A and D are similar, however neuron D should be negated. Neuron E is similar to neuron C, because of how close the weights are; likewise neuron F is similar to neuron B. FBA finds the optimal matching that minimizes the difference between the weights, but does not require matching weights to be identical. A plot of the weights feeding into hidden units is given, there it can be see that the similar neurons are grouped together. The final aligned network is shown in the bottom right”, shows how Ashmore changes the weight of the Align Network to match the Target Network). 
expressing the layers by vector sets by regarding parameters of each weight filter as one vector, setting a similarity between the vector sets of the respective convolutional layers to be compared as a similarity between the layers ([Ashmore, page 11, Figure 4.2] The pseudocode of Figure 4.2 contains ‘S:=set of weight-vectors that feed into l, for network X, R:=set of weight-vectors that feed into l, for network Y, plus the n negations of each weight in R’, which is a process of expressing the convolutional layers (weight) by vector sets (set of weight-vectors that feed into l). [Ashmore, page 8, 3.2 Distance Metric, second paragraph] “Because the networks are made up of layers, we could calculate a distance for each pair of layers instead of just a single distance for the entire network. This would give us a distance tuple of the size of the number of layers in the networks. This tuple can be more difficult for a human to analyze but algorithms could analyze the distance to cluster similar networks together or examine the relationships. This tuple is helpful because of the nature of forward bipartite alignment. FBA aligns layer by layer, and analyzing the distance layer by layer can be helpful” teaches the similarity measurement between layers).
Ashmore in view of Cruz-Albrecht does not specifically teach estimating a true filter size with respect to each weight filter of the convolutional layers. 
Durdanovic teaches estimating a true filter size with respect to each weight filter of the convolutional layers ([Durdanovic, 0018] “During training, the values of the weights 104 are multiplied by an attrition factor a that is less than 1 (e.g., a=0.9999). Thus, during each iteration of training, those weights which are not enhanced by the training process will eventually decrease in magnitude until they fall below a threshold. In this example, a column 106 has fallen below the threshold, representing weights which do not contribute to the accuracy of the output. This column 106 is pruned from the first array of weights 104”. [Durdanovic, 0019] “The first array of weights 104 provides its output to a layer of hidden neurons 108. The pruned column 106 corresponds to one filter 110 that is pruned from the layer of hidden neurons 108. The layer of hidden neurons 108 perform a computational function and provide an output to a second array of weights 112”, Durdanovic teaches pruning the weight matrix of the neural network and finding out pruned weight (true weight), [Durdanovic, Fig 1] shows the process of pruning 106 and 114 and giving the output rectangular matrix).
Before the effective filing date of the invention to a person of ordinary skill in the art, it would have been obvious, having the teachings of Ashmore, Cruz-Albrecht, and Durdanovic to use the pruning the weight matrix of Durdanovic to implement the information processing method of Ashmore and Cruz-Albrecht. The suggestion and/or motivation for doing so is to enhance the efficiency of the system, as pruning the weight of neural network reduces the amount of memory to store the network.
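For illustration only, the attrition and pruning behavior quoted from Durdanovic can be sketched as follows. This is a minimal sketch, not Durdanovic's implementation: the function names, the list-of-rows weight layout, and the criterion that a column is pruned when its largest weight magnitude falls below the threshold are assumptions introduced here.

```python
def apply_attrition(weights, attrition=0.9999):
    """Multiply every weight by an attrition factor smaller than 1."""
    return [[w * attrition for w in row] for row in weights]

def prune_columns(weights, threshold):
    """Drop columns whose largest weight magnitude is below the threshold.

    Returns the pruned weight array and the indices of the kept columns.
    """
    n_cols = len(weights[0])
    keep = [
        c for c in range(n_cols)
        if max(abs(row[c]) for row in weights) >= threshold
    ]
    return [[row[c] for c in keep] for row in weights], keep

# Hypothetical 2 x 3 weight array; the middle column has decayed toward zero.
weights = [[0.5, 0.001, -0.3],
           [0.2, -0.002, 0.4]]
pruned, kept = prune_columns(weights, threshold=0.01)
```

In this example the middle column is removed and the remaining columns form a smaller rectangular array.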
Neither Ashmore, Cruz-Albrecht, nor Durdanovic explicitly teaches comparing convolutional layers, but it would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to substitute convolutional layers for the layers compared in Ashmore. The modification would have been obvious because the method of comparing layers can be applied to any layer, including a convolutional layer, since convolutional layers and the layers in Ashmore are both matrices. The suggestion and/or motivation to do so is to compare two neural networks more objectively.

Regarding claim 6, Ashmore in view of Cruz-Albrecht, and further in view of Durdanovic teaches wherein the similarity between the vector sets is obtained by configuring a bipartite graph by obtaining pairs of feature vectors whose distance is a predetermined threshold value or less, and calculating a maximum number of matches by solving a maximum matching problem from the bipartite graph ([Ashmore, page 11, Figure 4.2: Forward Bipartite Alignment Pseudocode] The 9th line of the code contains ‘K:=maximum bipartite matching between S and R’, ‘if i is a matching negated weight in K then …’ teaches the threshold value. [Ashmore, page 11, second paragraph] “An unbiased approach for aligning neural networks requires finding the optimal bipartite matches between the nodes in the corresponding layers of a multilayer perceptron. Fortunately, bipartite matching can reduce to a graph cutting problem with efficient known solutions [28]. If swapping network units were the only function-invariant operation with neural networks, then bipartite matching algorithms would provide a straight-forward solution to identifying the best way to swap the nodes. Unfortunately, negation is also a function-invariant operation. FBA addresses this complication by including both positive and negated representations of the weights of the units in one of the neural networks, such that n units are matched against 2n units in the other network. When one of the negated points is found to be optimal for the bipartite matching, this indicates that the weights of that unit need to be negated. In figure 4.1 the weight-vectors are plotted in a graph, including the negation of each weight. The similar neurons are closer together, and these would be the matching weight vectors that bipartite matching would choose. The final aligned network can be seen in the bottom right of the figure”).
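For illustration only, the claim 6 limitation (forming a bipartite graph from pairs of feature vectors within a threshold distance and computing a maximum matching) can be sketched as follows. This is a minimal sketch under stated assumptions: the function name, the use of Euclidean distance, and the augmenting-path matching routine are choices made here and are not taken from the application or from Ashmore's Figure 4.2.

```python
import math

def max_matching_count(vectors_a, vectors_b, threshold):
    """Maximum number of one-to-one pairs whose distance is <= threshold."""
    # Bipartite graph: an edge joins vector i of A and vector j of B
    # when their Euclidean distance is at or below the threshold.
    adjacency = [
        [j for j, vb in enumerate(vectors_b) if math.dist(va, vb) <= threshold]
        for va in vectors_a
    ]
    match_of_b = [-1] * len(vectors_b)  # index in A matched to each vector of B

    def try_augment(i, visited):
        # Augmenting-path search: try to match i, rematching others if needed.
        for j in adjacency[i]:
            if j in visited:
                continue
            visited.add(j)
            if match_of_b[j] == -1 or try_augment(match_of_b[j], visited):
                match_of_b[j] = i
                return True
        return False

    return sum(1 for i in range(len(vectors_a)) if try_augment(i, set()))

# Hypothetical weight vectors for two layers.
set_a = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
set_b = [(0.1, 0.0), (1.0, 1.1), (9.0, 9.0)]
matches = max_matching_count(set_a, set_b, threshold=0.5)
```

The resulting count could be normalized by the number of vectors to give a similarity between the vector sets; that normalization is likewise an assumption.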

	Claim 7 is rejected under 35 U.S.C. 103 over Ashmore (Ashmore, 2015, “Evaluating the Intrinsic Similarity between Neural Networks”) in view of Cruz-Albrecht (US 8977578 B1), in view of Durdanovic (US 20170337472 A1), and further in view of Chen (US 20060072679 A1).

Regarding claim 7, Ashmore in view of Cruz-Albrecht, in view of Durdanovic teaches the information processing method. 
Ashmore in view of Cruz-Albrecht, and further in view of Durdanovic does not teach wherein the similarity between the vector sets is obtained by quantizing respective feature vectors, and obtaining a similarity between quantization histograms.
Chen teaches wherein the similarity between the vector sets is obtained by quantizing respective feature vectors, and obtaining a similarity between quantization histograms ([Chen, 0024] Let the magnitude |r[n]| of the received object signal r[n], n=0, 1, . . . , N-1, be uniformly quantized by using the same quantization size d to establish the statistic histogram of |r[n]|'s distribution (probability versus quantized magnitudes). The statistic histogram can also be expressed with an object vector having the length L. As the conventional linear pattern recognition technique, comparing the object vector of the statistic histogram with the feature patterns of various modulation types that have been established in advance by off-line processing, the received object signal can be classified as belonging to the modulation type whose feature pattern is the most similar to the object vector. However, when the noise and interference severely distort the received signal, the recognition technique depending on a single feature for each modulation type may cause erroneous recognition and classification. This is especially true for the modulation recognition among certain levels of QAM signals).
Before the effective filing date of the invention, it would have been obvious to a person of ordinary skill in the art, having the teachings of Ashmore, Cruz-Albrecht, Durdanovic, and Chen, to use the method of Chen of quantizing vectors and creating a histogram to compare similarity to implement the method of comparing two neural networks of Ashmore, Cruz-Albrecht, and Durdanovic. The suggestion and/or motivation for doing so is to compare the similarity of layers easily.
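For illustration only, the claim 7 limitation (quantizing feature vectors and obtaining a similarity between quantization histograms) can be sketched as follows. This is a minimal sketch under stated assumptions: uniform quantization with a single step size follows the Chen passage quoted above, while the function names and the use of histogram intersection as the similarity measure are choices made here and are not taken from the application or from Chen.

```python
from collections import Counter

def quantization_histogram(vectors, step):
    """Map each vector to a uniform grid cell and return cell probabilities."""
    if not vectors:
        return {}
    counts = Counter(tuple(int(x // step) for x in v) for v in vectors)
    total = len(vectors)
    return {cell: n / total for cell, n in counts.items()}

def histogram_intersection(hist_a, hist_b):
    """Similarity in [0, 1]: probability mass shared by the two histograms."""
    return sum(min(p, hist_b.get(cell, 0.0)) for cell, p in hist_a.items())

# Hypothetical weight vectors for two layers, quantized with step 1.0.
layer_a = [(0.1, 0.1), (0.2, 0.3), (1.2, 1.4)]
layer_b = [(0.4, 0.4), (1.1, 1.9), (1.5, 1.5)]
hist_a = quantization_histogram(layer_a, step=1.0)
hist_b = quantization_histogram(layer_b, step=1.0)
similarity = histogram_intersection(hist_a, hist_b)
```

Two identical vector sets give a similarity of 1 under this measure, and sets occupying disjoint grid cells give 0.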

	Claim 5 is rejected under 35 U.S.C. 103 over Ashmore (Ashmore, 2015, “Evaluating the Intrinsic Similarity between Neural Networks”) in view of Cruz-Albrecht (US 8977578 B1), and further in view of Yamamoto (US 20090077132 A1).

Regarding claim 5, Ashmore in view of Cruz-Albrecht teaches wherein the determining the degree of matching between the models of the two neural networks ([Ashmore, page 8, 3.2 Distance Metric, second paragraph] “Because the networks are made up of layers, we could calculate a distance for each pair of layers instead of just a single distance for the entire network. This would give us a distance tuple of the size of the number of layers in the networks. This tuple can be more difficult for a human to analyze but algorithms could analyze the distance to cluster similar networks together or examine the relationships. This tuple is helpful because of the nature of forward bipartite alignment. FBA aligns layer by layer, and analyzing the distance layer by layer can be helpful”). 
Ashmore in view of Cruz-Albrecht does not specifically teach when full-connected layers are compared, expressing each of the full-connected layers, by regarding weights of each of the full-connected layers as a feature vector, using a vector set, and setting a similarity between the vector sets of the respective full-connected layers to be compared as a similarity between the full-connected layers.
Yamamoto teaches when data are compared, expressing each of the data, by regarding weights of each of the full-connected layers as a feature vector, using a vector set, and setting a similarity between the vector sets of the respective data to be compared as a similarity between the data ([Yamamoto, 0128] “The similar user detecting unit 203 detects another user having a similar preference vector to that of a user to whom to recommend a musical piece by comparing the preference vector of each user, which preference vector is retained in the user history information DB 17. More specifically, the similar user detecting unit 203 normalizes preference vectors as an example of user preference information, calculates the weight of each layer for each user from the normalized preference vector of each user, calculates a degree of similarity indicating a degree of similarity of preferences between users from the weight of each layer and the preference vector, and detects a second user having similar preferences to those of a first user”).
Before the effective filing date of the invention, it would have been obvious to a person of ordinary skill in the art, having the teachings of Ashmore, Cruz-Albrecht, and Yamamoto, to use Yamamoto's method of using vector sets to calculate the similarity between two data in implementing the method of comparing two neural networks of Ashmore and Cruz-Albrecht. The suggestion and/or motivation for doing so is to enhance the efficiency of computation, as vector sets are easier for a computer to compute.
Neither Ashmore, Cruz-Albrecht, nor Yamamoto explicitly teaches comparing full-connected layers, but it would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to apply the data comparison method of Yamamoto to the method of comparing neural networks of Ashmore and Cruz-Albrecht. The modification would have been obvious because comparing layers from two different neural networks is the same as comparing two different data, as neural network layers are also a type of computer readable data. The suggestion and/or motivation to do so is to compare two neural networks more objectively.
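For illustration only, a comparison of two full-connected layers expressed as vector sets can be sketched as follows. This is a minimal sketch under stated assumptions, not the method of Yamamoto or of the application: each row of a weight matrix is treated as one feature vector, and the set-to-set similarity is assumed to be the average best-match cosine similarity. All names are hypothetical.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def vector_set_similarity(set_a, set_b):
    """Average, over vectors in set_a, of the best cosine match found in set_b (assumed measure)."""
    if not set_a or not set_b:
        return 0.0
    return sum(max(cosine(u, v) for v in set_b) for u in set_a) / len(set_a)


# Hypothetical weight rows of three full-connected layers.
fc_1 = [[1.0, 0.0], [0.0, 2.0]]
fc_2 = [[0.0, 3.0], [4.0, 0.0]]    # same directions as fc_1, different order and scale
fc_3 = [[0.0, -1.0], [-1.0, 0.0]]  # no row points in the same direction as a row of fc_1

score_matching = vector_set_similarity(fc_1, fc_2)
score_different = vector_set_similarity(fc_1, fc_3)
```

The assumed measure ignores the ordering of the vectors within each set, and it is not symmetric in its two arguments; a symmetric variant would average the scores taken in both directions.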

Claims 8-10 are rejected under 35 U.S.C. 103 over Ashmore (Ashmore, 2015, “Evaluating the Intrinsic Similarity between Neural Networks”) in view of Cruz-Albrecht (US 8977578 B1), and further in view of Szeto (US 20180018590 A1).

Regarding claim 8, Ashmore in view of Cruz-Albrecht teaches wherein the determining the degree of matching between the models of the two neural networks includes ([Ashmore, page 8, 3.2 Distance Metric, second paragraph] “Because the networks are made up of layers, we could calculate a distance for each pair of layers instead of just a single distance for the entire network. This would give us a distance tuple of the size of the number of layers in the networks. This tuple can be more difficult for a human to analyze but algorithms could analyze the distance to cluster similar networks together or examine the relationships. This tuple is helpful because of the nature of forward bipartite alignment. FBA aligns layer by layer, and analyzing the distance layer by layer can be helpful”).
Ashmore in view of Cruz-Albrecht does not specifically teach determining degree of matching when activation layers are compared, setting, if types of the activation layers are the same, and a distance between parameters is a predetermined threshold value or less, the distance as a similarity, and in other cases, 0 to the similarity.
Szeto teaches determining degree of matching when data are compared, setting, if types of data are the same, and a distance between parameters is a predetermined threshold value or less, the distance as a similarity, and in other cases, 0 to the similarity ([Szeto, 0094] “In the example shown, the comparison is represented by difference parameters 480 where the parameter-wise difference is presented. If the trained proxy model 470 were completely identical to trained actual model 440, difference parameters 480 would all be zero. However, considering that the trained proxy model 470 is built on proxy data 460, non-zero differences are expected. Therefore, the two trained models can be compared, at least in the example shown, by calculating similarity score 490 as a function of the values of actual model parameters (P.sub.a) 445 and proxy model parameters (P.sub.p) 475, wherein in similarity score 490, N corresponds to the number of parameters and i corresponds to the i.sup.th parameter”. [Szeto, 0045] “In some embodiments, the global model server 130 analyzes sets of proxy related information (including for example proxy data 260, proxy data distributions 362, proxy model parameters 475, other proxy related data combined with seeds, etc.) to determine whether the proxy related information from one of private data server 124 has the same shape and/or overall properties as the proxy related data from another private data server 124, prior to combining such information” discloses checking if the shape of two different data are the same which corresponds to the process of comparing data types).
Before the effective filing date of the invention, it would have been obvious to a person of ordinary skill in the art, having the teachings of Ashmore, Cruz-Albrecht, and Szeto, to use Szeto's detailed method of calculating similarity between two data in implementing the method of comparing two neural networks of Ashmore and Cruz-Albrecht. The suggestion and/or motivation for doing so is to evaluate the similarity between two data more objectively.
Neither Ashmore, Cruz-Albrecht, nor Szeto explicitly teaches the condition of if the types of activation layers are the same, but it would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to add the condition of ‘if the types of data are the same’ of Szeto to the method of comparing neural networks of Ashmore. The modification would have been obvious because comparing layers from two different neural networks is the same as comparing two different data, as neural network layers are also a type of computer readable data. The suggestion and/or motivation to do so is to compare two neural networks more objectively.
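For illustration only, the conditional similarity recited for activation layers (same type, and parameter distance at or below a predetermined threshold; otherwise 0) can be sketched as follows. This is a minimal sketch and not code from Szeto or from the application: the function name, the representation of a layer as a type string plus a parameter list, and the choice of Euclidean distance are assumptions.

```python
from math import sqrt


def activation_layer_similarity(type_a, params_a, type_b, params_b, threshold):
    """Return the parameter distance when types match and the distance is within
    the threshold; return 0.0 in every other case."""
    if type_a != type_b or len(params_a) != len(params_b):
        return 0.0
    # Euclidean distance between parameter lists (assumed distance measure).
    distance = sqrt(sum((p - q) ** 2 for p, q in zip(params_a, params_b)))
    return distance if distance <= threshold else 0.0


# Hypothetical layers: a type name and a single numeric parameter each.
within = activation_layer_similarity("hard_sigmoid", [0.1], "hard_sigmoid", [0.4], threshold=0.5)
too_far = activation_layer_similarity("hard_sigmoid", [0.1], "hard_sigmoid", [0.9], threshold=0.5)
type_mismatch = activation_layer_similarity("tanh", [0.1], "hard_sigmoid", [0.1], threshold=0.5)
```

Here only the first comparison returns a non-zero value (a distance of about 0.3); the second exceeds the threshold and the third compares different types.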

Regarding claim 9, Ashmore in view of Cruz-Albrecht, and further in view of Szeto teaches wherein the type of the activation layer is a linear connection, a sigmoid function, a hard sigmoid function, a tanh function (hyperbolic tangent function), a softsign function, a softplus function, or a ReLU (Rectified Linear Unit) ([Ashmore, page 9, second paragraph] “First, assuming the activation functions are antisymmetric, the output of any hidden unit may be negated if the weights into which it feeds are also negated. If the hidden unit has an activation function, a, which is antisymmetric about the input 0, then the output of this unit may be negated by adding Σ_i 2a(0)w_i to its bias, and negating all of the other incoming weights. In cases where a(0) = 0, such as tanh, the biases will not be changed”, teaches the activation function being a tanh function. The reference need not teach every listed alternative, such as ReLU, softsign, and sigmoid, because the claim recites them in the alternative using ‘or’).

Regarding claim 10, Ashmore in view of Cruz-Albrecht, and further in view of Szeto teaches wherein the determining the degree of matching between the models of the two neural networks includes ([Ashmore, page 8, 3.2 Distance Metric, second paragraph] “Because the networks are made up of layers, we could calculate a distance for each pair of layers instead of just a single distance for the entire network. This would give us a distance tuple of the size of the number of layers in the networks. This tuple can be more difficult for a human to analyze but algorithms could analyze the distance to cluster similar networks together or examine the relationships. This tuple is helpful because of the nature of forward bipartite alignment. FBA aligns layer by layer, and analyzing the distance layer by layer can be helpful”).
Ashmore in view of Cruz-Albrecht does not specifically teach when data are compared, setting, if types of the data are the same, and a distance between parameters is a predetermined threshold value or less, the distance as a similarity, and in other cases, 0 to the similarity. 
Szeto teaches when data are compared, setting, if types of the data are the same, and a distance between parameters is a predetermined threshold value or less, the distance as a similarity, and in other cases, 0 to the similarity ([Szeto, 0079] “The similarity between trained proxy model 270 and trained actual model 240 can be measured through various techniques by modeling engine 226 calculating model similarity score 280 as a function of proxy model parameters 275 and actual model parameters 245. The resulting model similarity score 280 is a representation of how similar the two models are, at least to within similarity criteria ... In some embodiments, similarity score 280 can be a single value (e.g., a difference in accuracy, sum of squared errors, etc.) that can then be compared to a threshold value. In other embodiments, similarity score 280 can be multivalued ... In embodiments where similarity score 280 does include multiple values, then the values within similarity score 280 can be compared to similarity criteria (i.e., multiple criterion). Techniques for measuring similarity score 280 are discussed further with respect to FIG. 4”, teaches comparing similarity value to threshold value, [Szeto, 0094] “In the example shown, the comparison is represented by difference parameters 480 where the parameter-wise difference is presented. If the trained proxy model 470 were completely identical to trained actual model 440, difference parameters 480 would all be zero. However, considering that the trained proxy model 470 is built on proxy data 460, non-zero differences are expected. Therefore, the two trained models can be compared, at least in the example shown, by calculating similarity score 490 as a function of the values of actual model parameters (P.sub.a) 445 and proxy model parameters (P.sub.p) 475, wherein in similarity score 490, N corresponds to the number of parameters and i corresponds to the i.sup.th parameter”. 
[Szeto, 0045] “In some embodiments, the global model server 130 analyzes sets of proxy related information (including for example proxy data 260, proxy data distributions 362, proxy model parameters 475, other proxy related data combined with seeds, etc.) to determine whether the proxy related information from one of private data server 124 has the same shape and/or overall properties as the proxy related data from another private data server 124, prior to combining such information” discloses checking if the shape of two different data are the same which corresponds to the process of comparing data types). 
Neither Ashmore, Cruz-Albrecht, nor Szeto explicitly teaches the condition of if types of pooling layers are the same and giving 0 to the similarity if two data types are different, but it would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to use the similarity scoring method of Szeto to score the similarity between two different models of Ashmore and Cruz-Albrecht. The modification would have been obvious because giving 0 to the similarity score if two data are completely different is a common practice in the art of scoring similarity between two data. The suggestion and/or motivation to do so is to compare the similarity of two different models more objectively.

	Claim 11 is rejected under 35 U.S.C. 103 over Ashmore (Ashmore, 2015, “Evaluating the Intrinsic Similarity between Neural Networks”) in view of Cruz-Albrecht (US 8977578 B1), in view of Szeto (US 20180018590 A1), and further in view of Ravindran (US 20160259994 A1).

Regarding claim 11, Ashmore in view of Cruz-Albrecht, and further in view of Szeto teaches the information processing method of claim 10. 
Ashmore in view of Cruz-Albrecht, and further in view of Szeto does not specifically teach wherein the type of the pooling layer is max pooling or average pooling, and the parameters are a filter size and an interval at which a filter is applied.
Ravindran teaches wherein the type of the pooling layer is max pooling or average pooling, and the parameters are a filter size and an interval at which a filter is applied ([Ravindran, 0036] "Examples of convolution and sub-sampling parameters include the convolutional filter size, the number of feature maps at each layer of the CNN, and the sub-sampling pool size. The convolutional filter size parameter is the size of the filters in a convolution layer. According to an example, the range for the convolutional filter size parameter is between 2×2 pixels and 114×114 pixels. The number of feature maps parameter is the number of feature maps output from the number of filters or kernels in each convolution layer. According to an example, the range for the number of feature maps parameter is between 60 to 512 feature maps for a first convolutional layer. The sub-sampling pool size parameter is the size of a square patch of pixels in the image down-sampled into, and replaced by, one pixel after the operation via maximum pooling, which sets the value of the resulting pixel as the maximum value of the pixels in the initial square patch of pixels. According to an example, the range of values for the sub-sampling pool size parameter includes, but is not limited to, a range between 2×2 to 4×4. The parameters of the network of the convolutional layers are selected to reduce the input image size into 1×1 pixel value on the output of the final convolutional layer according to an example”).
Before the effective filing date of the invention, it would have been obvious to a person of ordinary skill in the art, having the teachings of Ashmore, Cruz-Albrecht, Szeto, and Ravindran, to use the max pooling or average pooling layer type of Ravindran in implementing the method of comparing two neural networks of Ashmore, Cruz-Albrecht, and Szeto. The suggestion and/or motivation for doing so is to reduce the number of parameters of the weight of the neural network, which can be done by using a max-pooling layer.
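For illustration only, the two pooling types and the two parameters discussed above (a filter size and an interval at which the filter is applied) can be sketched as follows. This is a minimal sketch and not code from Ravindran or from the application: it operates on a one-dimensional list for brevity, whereas Ravindran describes square patches of pixels, and the function and parameter names are hypothetical.

```python
def pool_1d(values, filter_size, stride, kind="max"):
    """Slide a window of `filter_size` over `values`, advancing by `stride`,
    and reduce each window with max or average pooling."""
    if kind not in ("max", "average"):
        raise ValueError("kind must be 'max' or 'average'")
    pooled = []
    for start in range(0, len(values) - filter_size + 1, stride):
        window = values[start:start + filter_size]
        pooled.append(max(window) if kind == "max" else sum(window) / filter_size)
    return pooled


signal = [1.0, 3.0, 2.0, 8.0, 4.0, 6.0]
max_pooled = pool_1d(signal, filter_size=2, stride=2, kind="max")      # [3.0, 8.0, 6.0]
avg_pooled = pool_1d(signal, filter_size=2, stride=2, kind="average")  # [2.0, 5.0, 5.0]
```

With a filter size of 2 and a stride of 2, each output replaces one non-overlapping pair of inputs, which halves the length of the signal.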

Allowable Subject Matter
Claim 4 is objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims upon overcoming the 101 rejection.

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant’s disclosure.
Regarding Comparing similarities between two or more neural networks.
Ashmore, 2015, “A Method for Finding Similarity between Multi-Layer Perceptrons by Forward Bipartite Alignment” 
US 20170004399 A1
US 20160041982 A1
US 5166539 A
Any inquiry concerning this communication or earlier communications from the examiner should be directed to JUN KWON whose telephone number is (571)272-2072. The examiner can normally be reached on 7:30 AM - 5:30 PM. If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Abdullah Kawsar can be reached on (571)270-3169. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR. Status information for unpublished applications is available through Private PAIR only. For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free).

/JUN KWON/
Examiner, Art Unit 2127
/ABDULLAH AL KAWSAR/
Supervisory Patent Examiner, Art Unit 2127