DETAILED ACTION
Response to Amendment
The amendment was received 7/30/21. Claims 1-7 and 9-16 are pending.
Claim Interpretation
The following is a quotation of 35 U.S.C. 112(f):
(f) Element in Claim for a Combination. – An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof. 

The following is a quotation of pre-AIA  35 U.S.C. 112, sixth paragraph:
An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof.

The claims in this application are given their broadest reasonable interpretation using the plain meaning of the claim language in light of the specification as it would be understood by one of ordinary skill in the art.  The broadest reasonable interpretation of a claim element (also commonly referred to as a claim limitation) is limited by the description in the specification when 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is invoked. 







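As explained in MPEP § 2181, subsection I, claim limitations that meet the following three-prong test will be interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph:
(A)	the claim limitation uses the term “means” or “step” or a term used as a substitute for “means” that is a generic placeholder (also called a nonce term or a non-structural term having no specific structural meaning) for performing the claimed function; 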
(B)	the term “means” or “step” or the generic placeholder is modified by functional language, typically, but not always linked by the transition word “for” (e.g., “means for”) or another linking word or phrase, such as “configured to” or “so that”; and 
(C)	the term “means” or “step” or the generic placeholder is not modified by sufficient structure, material, or acts for performing the claimed function. 
Use of the word “means” (or “step”) in a claim with functional language creates a rebuttable presumption that the claim limitation is to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites sufficient structure, material, or acts to entirely perform the recited function. 




Claim limitations in this application that use the word “means” (or “step”) are being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action. Conversely, claim limitations in this application that do not use the word “means” (or “step”) are not being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action.











This application includes one or more claim limitations that use the word “means” or “step” but are nonetheless not being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph because the claim limitation(s) recite(s) sufficient structure, materials, or acts to entirely perform the recited function.  Such claim limitation(s) is/are:
A.	“testing…by using…to determine” in claim 1, lines 3,4 because “to determine” is modified by the sufficient acts of said “testing” and “by using”;
B.	“performing spectral clustering analysis …to obtain” in claim 4 (similarly including claim 2) because “to obtain” is modified by the sufficient acts of said “clustering” and “analysis”;
C.	“testing…by using…to determine” in claim 9, lines 8,9 because “to determine” is modified by the sufficient acts of said “testing” and “by using”;
D.	“testing…by using…to determine” in claim 10, lines 5,6 because “to determine” is modified by the sufficient acts of said “testing” and “by using”;
E.	“performing clustering analysis…to determine” in claim 11, line 6 because “to determine” is modified by sufficient acts of said “clustering” and “analysis”; and
F.	“performing spectral clustering analysis…to obtain” in claim 13, line 3 because “to obtain” is modified by sufficient acts of said “clustering” and “analysis”.
Because this/these claim limitation(s) is/are not being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, it/they is/are not being interpreted to cover only the corresponding structure, material, or acts described in the specification as performing the claimed function, and equivalents thereof.


If applicant intends to have this/these limitation(s) interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, applicant may:  (1) amend the claim limitation(s) to remove the structure, materials, or acts that performs the claimed function; or (2) present a sufficient showing that the claim limitation(s) does/do not recite sufficient structure, materials, or acts to perform the claimed function.
Accordingly the following definitions are “taken” via MPEP 2111.01 III. "PLAIN MEANING" REFERS TO THE ORDINARY AND CUSTOMARY MEANING GIVEN TO THE TERM BY THOSE OF ORDINARY SKILL IN THE ART, 3rd paragraph, emphasis added:
“It is also appropriate to look to how the claim term is used in the prior art, which includes prior art patents, published applications, trade publications, and dictionaries. Any meaning of a claim term taken from the prior art must be consistent with the use of the claim term in the specification and drawings. Moreover, when the specification is clear about the scope and content of a claim term, there is no need to turn to extrinsic evidence for claim interpretation. 3M Innovative Props. Co. v. Tredegar Corp., 725 F.3d 1315, 1326-28, 107 USPQ2d 1717, 1726-27 (Fed. Cir. 2013) (holding that "continuous microtextured skin layer over substantially the entire laminate" was clearly defined in the written description, and therefore, there was no need to turn to extrinsic evidence to construe the claim).”

Accordingly:
The claimed “method” (as in “An image feature acquisition method” in claim 1, line 1) is interpreted in light of applicant’s disclosure and definition thereof via Dictionary.com, wherein definition 2 is “taken”:
BRITISH DICTIONARY DEFINITIONS FOR METHOD (1 OF 2)
noun
1	a way of proceeding or doing something, esp a systematic or regular one
2	orderliness of thought, action, etc

The claimed “training” (as in “training a siamese network-based classification model” in claim 1, line 2) is interpreted in light of applicant’s disclosure and definition thereof via Dictionary.com, wherein definition 2 is “taken”:
BRITISH DICTIONARY DEFINITIONS FOR TRAIN
train
verb
1	(tr) to guide or teach (to do something), as by subjecting to various exercises or experiences: to train a man to fight
2	(tr) to control or guide towards a specific goal: to train a plant up a wall

	The claimed “model” (as in “training a siamese network-based classification model” in claim 1, line 2) is interpreted in light of applicant’s disclosure and definition thereof via Dictionary.com:
SCIENTIFIC DEFINITIONS FOR MODEL
model
A systematic description of an object or phenomenon that shares important characteristics with the object or phenomenon. Scientific models can be material, visual, mathematical, or computational and are often used in the construction of scientific theories. See also hypothesis theory.

The claimed “obtain” (as in “performing spectral clustering analysis on the confusion matrix to obtain a plurality of clusters” as in claim 4, lines 3,4: not interpreted under 35 USC 112(f)) is interpreted in light of applicant’s disclosure and definition thereof via Dictionary.com, wherein “get” is “taken”:
obtain
verb (used with object)
1	to come into possession of; get, acquire, or procure, as through an effort or by a request:
to obtain permission; to obtain a better income.

wherein “get” is defined:

get1
verb (used with object), got or (Archaic) gat;got or got·ten;get·ting.
7	to acquire a mental grasp or command of; learn:
to get a lesson.
The claimed “operation” (as in “the processor executes the computer programs to implement operations comprising:” in claim 9, lines 5,6) is interpreted in light of applicant’s disclosure and definition thereof via Dictionary.com, wherein “Computer Science An action resulting from a single instruction” is “taken”:
SCIENTIFIC DEFINITIONS FOR OPERATION
operation
Medicine A surgical procedure for remedying an injury, ailment, defect, or dysfunction.
Mathematics A process or action, such as addition, substitution, transposition, or differentiation, performed in a specified sequence and in accordance with specific rules.
A logical operation.
Computer Science An action resulting from a single instruction.

The claimed “A non-transitory computer readable storage medium” in claim 10 is interpreted, as one of skill in the art would, in light of applicant’s disclosure via:
“[0075] Through the above description of the implementation, it is clear to persons skilled in the art that in the foregoing implementations may be accomplished through software plus a necessary general-purpose hardware platform or may be certainly implemented through hardware. Based on this, the technical solutions of the present disclosure essentially or the part that contributes to the prior art may be embodied in the form of a software product. The computer software product may be stored in a computer readable storage medium such as a read-only memory (ROM)/random access memory (RAM), a magnetic disk or an optical disc) and contain several instructions adapted to instruct a computer device (for example, a personal computer, a server, or a network device) to perform the method according to the embodiments or some of the embodiments.”













Response to Arguments
Applicant’s arguments, see remarks, page 7, filed 7/30/21, with respect to the claim objection and 35 USC 101 have been fully considered and are persuasive:
The claim objection of claims 11-16 has been withdrawn. 
The 35 USC 101 rejection of claim 10 has been withdrawn. 
Applicant’s arguments, see remarks, page 9, emphasis added:
“In Faktor, if two images share at least one large region, the two images are "partial similar". Please continue to see Section 4 of Faktor, it can be seen that the detection of shared regions is not associated with the siamese network-based classification model. However, as recited in claim 1, nonsimilar image pairs are determined by testing classification results from the siamese network-based classification model with verification images. Thus, the disclosure of Faktor neither involve to "training a siamese network-based classification model (feature 1)" nor "determining nonsimilar image pairs by testing classification results from the siamese network- based classification model by using verification images (feature 2)" recited in claim 1.”

These arguments, filed 7/30/21, with respect to the rejection(s) of claim(s) 1-4 and 7 under 35 USC 102 and claims 9/16 and 10 under 35 USC 103 have been fully considered and are persuasive.  Therefore, these rejections have been withdrawn.  









However, upon further consideration, a new ground of rejection is made under 35 USC 103 in view of Isola et al. (Learning Visual Groups from Co-Occurrences in Space and Time), which teaches a twin CNN as shown in fig. 2 and a rider with horse in fig. 1: “Object proposals”, comprising a group of patches that corresponds to Faktor’s “shared regions”, which comprise groups of “patches” and correspond to Faktor’s “Figs. 7 and 10” of dancing, horse and rider, via Faktor, page 1094:
“Our randomized search algorithm is inspired by “PatchMatch” [1], [2], but searches for similar regions (as opposed to similar patches or descriptors). We represent an image by computing N patches (or descriptors) at some dense image grid and consider a region in the image to be an ensemble of patches (or descriptors) along with their relative positions within the region. We show that when randomly sampling descriptors across a pair of images, and propagating good matches between neighboring descriptors, large shared regions can be detected in linear time O(N). In fact, the larger the region, the faster it will be found, and with higher probability. We refer to this collaboration between descriptors as exploiting the “wisdom of crowds of pixels” for efficient detection of shared regions between two images. Section 4 explains the randomized region search and provides analysis of its complexity. Examples of detected shared regions can be found in Figs. 7 and 10.”
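For illustration only, the quoted idea of randomly sampling descriptors and propagating good matches between neighboring descriptors can be sketched as follows. This is a simplified, one-dimensional, PatchMatch-style sketch and is not Faktor's region-search algorithm: the function names, the scalar descriptors, the absolute-difference cost, and the halving search radius are all assumptions made for the example.

```python
import random

def match_cost(a, b, i, j):
    # Hypothetical descriptor distance: absolute difference of scalar descriptors.
    return abs(a[i] - b[j])

def randomized_match(a, b, iterations=5, seed=0):
    """For each descriptor a[i], search for a well-matching b[j].

    Uses random sampling plus propagation from the left neighbor.
    A candidate replaces the current match only if it lowers the cost,
    so the total cost can never increase between iterations.
    """
    rng = random.Random(seed)
    n, m = len(a), len(b)
    nnf = [rng.randrange(m) for _ in range(n)]           # random initial matches
    cost = [match_cost(a, b, i, nnf[i]) for i in range(n)]
    history = [sum(cost)]                                 # total cost per iteration
    for _ in range(iterations):
        for i in range(n):
            candidates = []
            # Propagation: if a[i-1] matched b[j-1], try b[j] for a[i].
            if i > 0 and nnf[i - 1] + 1 < m:
                candidates.append(nnf[i - 1] + 1)
            # Random search around the current match with a shrinking radius.
            radius = m
            while radius >= 1:
                lo = max(0, nnf[i] - radius)
                hi = min(m - 1, nnf[i] + radius)
                candidates.append(rng.randint(lo, hi))
                radius //= 2
            for j in candidates:
                c = match_cost(a, b, i, j)
                if c < cost[i]:
                    nnf[i], cost[i] = j, c
        history.append(sum(cost))
    return nnf, history

# Hypothetical data: b contains a shifted copy of part of a (a "shared region").
a = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
b = [9.0, 9.0, 2.0, 3.0, 4.0, 5.0, 9.0]
nnf, history = randomized_match(a, b, iterations=4, seed=1)
```

In this sketch, a run of consecutive indices whose matches advance together (each match one position after its left neighbor's) would play the role of a shared region; the propagation step is what lets one good match spread to its neighbors.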

Thus it can be seen that the detection of shared regions of Faktor is associated with the siamese network-based classification model in view of Isola: Isola’s siamese network-based classification model of figure 2 is able to spot the horse and rider (as shown in Isola’s fig. 1:“Object proposals”: bottom image) and the dancers via patches.
Thus, the combination of cited Faktor (Clustering by Composition-Unsupervised Discovery of Image Categories) with said Isola (Learning Visual Groups from Co-Occurrences in Space and Time) is relied upon to teach the claimed “siamese network-based classification model” in claim 1 when considered as a whole under 35 USC 103. 



Accordingly, Isola teaches:
a)	training (understood to one of ordinary skill in the art given that fig. 2, on page 3, is a convolutional neural network, CNN) a siamese network-based classification (via said fig. 2: “Co-occurrence classifier”) model (feature 1) (or equations (1) or (2) in pages 3,4 each of which is a model of affinity that is made via the verb-form of model); or 
b)	determining nonsimilar image pairs (represented in figs. 1 and 2 as “A” and “B” each being representative of grass in the horse and rider image and turkey image wherein the grass “A” and the adjacent grass “B” is nonsimilar or “distinct” to the horse and rider and turkey each to be comprised of patches as shown in fig. 1:bottom-left image of horse and rider patches and in fig. 4:center:three turkey patch groups and also represented in fig. 1: “Movie segmentation”: each “A …B” pair corresponding to “features far apart in time are dissimilar” to other AB pairs in the movie) by testing (via “test”) classification results (output of fig. 2: “Co-occurrence classifier”) from the siamese network-based classification model (said made model via fig. 2: “Co-occurrence classifier”) by using verification images (feature 2) (or “the Pascal VOC 2012 validation set” via:
page 1, section 1 INTRODUCTION
	“Here we probe the former hypothesis. Because the physical world is highly structured, adjacent locations are usually semantically related, whereas far apart locations are more often semantically distinct. By modeling spatial and temporal dependencies, we may therefore learn something about semantic relatedness.”; and







pages 2,3:
“A recent line of work in representation learning has taken a similar tack, training discriminative models to predict one aspect of raw sensory data from another. This work may be termed self-supervised learning and has a number of flavors. The common theme is exploiting spatial and/or temporal structure as supervisory signals. Mobahi et al. (2009) learn a feature embedding such that features adjacent in time are similar and features far apart in time are dissimilar. Srivastava et al. (2015) predict future frames in a video, and rely on strict temporal ordering; extension to spatial or unordered data is unclear. Wang & Gupta (2015) use a siamese triplet loss to learn a representation that can track patches through a video. They rely on training input from a separate tracking algorithm. Agrawal et al. (2015) as well as Jayaraman & Grauman (2015) regress on egomotion signals to learn a representation. Finally, Doersch et al. (2015) learn features by predicting the relative orientation between patches within an image.”

page 5:
“To evaluate performance, we sample 10,000 patches from the Pascal VOC 2012 validation set, 50% with C = 1 and 50% with C = 0. In Table 1 we measure the Average Precision of using several affinity measures as a binary classifier of either C or Q. In this case, we defined Q to indicate whether or not the center pixel of the two patches lies on the same labeled object instance. To test Q independently from C we create the Q test set by only sampling from patch pairs for which C = 1 (so the net cannot do well at predicting Q simply by doing well at predicting C). Our network performs well relative to the baseline affinity metrics, although color histogram similarity does reach a similar performance on predicting Q.”).
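For illustration only, the quoted evaluation (scoring an affinity measure as a binary classifier by Average Precision) can be sketched as follows. Isola does not state which Average Precision variant was used; the sketch assumes the common non-interpolated definition, and the function name, scores, and labels are hypothetical.

```python
def average_precision(scores, labels):
    """Non-interpolated Average Precision.

    Ranks examples by descending score and averages the precision
    measured at the rank of each positive example (label == 1).
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precisions = []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

# Hypothetical affinity scores for six patch pairs, with C = 1 (positive)
# for half of the pairs and C = 0 (negative) for the other half.
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]
ap = average_precision(scores, labels)  # positives at ranks 1, 2 and 4
```

Here the positives fall at ranks 1, 2 and 4, so the value is (1/1 + 2/2 + 3/4) / 3 = 11/12.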












Claim Rejections - 35 USC § 103

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Regarding inquiry 4, see Suggestions regarding claim 2.
Claims 1-4 and 7 is/are rejected under 35 U.S.C. 103 as being unpatentable over Faktor et al. (“Clustering by Composition”-Unsupervised Discovery of Image Categories) in view of Isola et al. (Learning Visual Groups from Co-Occurrences in Space and Time).
Regarding claim 1, Faktor teaches an image feature acquisition method, comprising: 



training (via “guiding it where to sample in the next iteration”, cited below: page 1098, right column, last paragraph) a siamese network-based classification model (or “patches (or descriptors)” comprising “by describing an image”, cited below: pages 1093,1094,  resulting in “descriptors for classification”, page 1096, left/right column, represented in fig. 3: “Guided sampling”) by using preset classes (or “difficult classes”) of training (said via “guiding it where to sample in the next iteration”) images (or “189 images” represented in fig. 3: “Image collection”); 
testing (via “We tested”) classification (said via “by describing an image” resulting in “descriptors for classification”) results (as shown in fig. 4) from the siamese network-based classification model (said or “patches (or descriptors)” comprising “by describing an image” resulting in “descriptors for classification” represented in fig. 3: “Guided sampling”) by using verification (via “ground truth labels of the other images”) images (said or “189 images” represented in fig. 3: “Image collection”) to determine nonsimilar (or “ ‘partially similar’ ” thus the other part is nonsimilar) image pairs (via “a pair of images”); 
determining similar (said or “ ‘partially similar’ ”) image pairs (said via “a pair of images”) based on the training (said via “guiding it where to sample in the next iteration”) images (said or “189 images” represented in fig. 3: “Image collection”); 




optimizing (via “max” as shown in page 1099, left column: “The algorithm”, algorithm lines 9,12: “max”) the siamese network-based classification model (said or “patches (or descriptors)” comprising “by describing an image” resulting in “descriptors for classification” represented in fig. 3: “Guided sampling”) based on the similar (said or “ ‘partially similar’ ”) image pairs (via “a pair of images” or “pairs”) and the nonsimilar (said or “ ‘partially similar’ ” thus the other part is nonsimilar) image pairs (via “a pair of images” or “pairs”); and 
acquiring (said via “The algorithm”, algorithm line 7: “matrix B” at fig. 3: “Iteration 2,…T” based on fig. 3: “Iteration 1”) image features (or “local features...across the images” corresponding to a “distinguishing feature” via said “The algorithm”, algorithm line 7: “matrix B” comprising said “descriptors” “dk” as shown in equation “Bij” in page 1099, right column, 3rd bullet comprising said a “distinguishing feature”) by using the optimized (said via “max” as shown in page 1099, left column: “The algorithm”, algorithm lines 9,12: “max” at said fig. 3: “iteration 1”) classification model (said or “patches (or descriptors)” comprising “by describing an image” resulting in “descriptors for classification” represented in fig. 3: “Guided sampling” via:
page 1092, right column:
“The first family of approaches is based on computing pairwise affinities between images. An example for this is the Pyramid Match Kernel of [7], which measures similarity between images according to the subset of matching local features which is discovered across the images. Other examples of commonly used pairwise affinities can be found in the comparison made by [20]. These affinities are typically based on a global “Bag of Words” representation of the images.”;







pages 1093,1094:
“1. Image affinities by composition: Our image affinities are based on “Similarity by Composition” [3]. The notion of composition is illustrated in Fig. 1. The Ballet image I0 is composed of a few large (irregularly shaped) regions from the ballet images I1 and I2. This induces strong affinities between I0 and I1, I2. The larger and more statistically significant those regions are (i.e., have low chance of occurring at random), the stronger the affinities. The ballet image I0 could probably be composed of Yoga images as well.
However, while the composition of I0 from other ballet images is very simple (a ‘toddler puzzle’ with few large pieces), the composition of I0 from yoga images is more complicated (a complex ‘adult puzzle’ with many tiny pieces), resulting in low affinities. These affinities are quantified in Section 3 in terms of the “number of bits saved” by describing an image using the composition, as opposed to generating it ‘from scratch’ at random. To obtain reliable clustering, each image should have ‘good compositions’ from multiple images in its cluster, resulting in high affinity to many images in the cluster. Fig. 1 illustrates two different ‘good compositions’ of I0.”;

page 1094, left column, 4th paragraph:
“2. Randomized detection of shared regions: When describing our “affinity by composition”, we assumed the shared regions are known, but in practice these shared regions have to be automatically detected between the different images. However, since the regions can be of arbitrary size and shape, the region detection is in principle a hard problem even between a pair of images (let alone in a large image collection).  Therefore, we propose a randomized search algorithm which ensures that shared regions between two images will be detected efficiently with a high probability.
Our randomized search algorithm is inspired by “PatchMatch” [1], [2], but searches for similar regions (as opposed to similar patches or descriptors). We represent an image by computing N patches (or descriptors) at some dense image grid and consider a region in the image to be an ensemble of patches (or descriptors) along with their relative positions within the region. We show that when randomly sampling descriptors across a pair of images, and propagating good matches between neighboring descriptors, large shared regions can be detected in linear time O(N). In fact, the larger the region, the faster it will be found, and with higher probability. We refer to this collaboration between descriptors as exploiting the “wisdom of crowds of pixels” for efficient detection of shared regions between two images. Section 4 explains the randomized region search and provides analysis of its complexity. Examples of detected shared regions can be found in Figs. 7 and 10.”;

page 1094, right column:
“3. Efficient “collaborative” multi-image composition: Clustering a collection of M images, should in principle require computing “affinity by composition” between all pairs of images—i.e. a complexity of O(NM²), where N is the number of densely sampled patches (or descriptors) in each image. However, we show that when all the images in the collection are composed simultaneously from each other, they can collaborate to iteratively generate with very high probability the most statistically significant compositions in the image collection. Moreover this can be achieved in runtime almost linear in the size of the collection (without having to go over all the image pairs).”;
page 1096, left/right column:
“Fig. 5 displays log p(d|H0) for a few images of the Ballet/Yoga, Animals and PASCAL data sets. Red marks descriptors (HOG of size 15X15) with high error Δd(H0), i.e., high statistical significance. Image regions R containing many such descriptors have high statistical significance (low p(R|H0)). Statistically significant regions in Fig. 5a appear to coincide with body gestures that are unique and informative to the separation between Ballet and Yoga. Recurrence of such regions across images will induce strong and reliable affinities for clustering. Observe also that the long horizontal edges (between the ground and sky in the Yoga image, or between the floor and wall in the Ballet images) are not statistically significant, since they are composed of short horizontal edges which occur abundantly in many images. Similarly, statistically significant regions in Figs. 5b and 5c coincide with parts of the animals/objects that are unique and informative for their separation (e.g., the Monkey’s face and hands, the Elk’s horns, the bicycle’s wheels, etc.). This is similar to the observation of [4] that the most informative descriptors for classification tend to have the highest quantization error.”

wherein “descriptors” is defined via Dictionary.com:
descriptor
noun
1	a significant word or phrase used to categorize or describe text or other material, especially when indexing or in an information retrieval system.
2	Computers. a data item that stores the attributes of some other datum:
a task descriptor.

wherein “attributes” is defined:
attribute
noun
5	something attributed as belonging to a person, thing, group, etc.; a quality, character, characteristic, or property:
Sensitivity is one of his attributes.

wherein “characteristic” is defined:
noun
2	a distinguishing feature or quality:
Generosity is his chief characteristic.;

page 1098:
“4.2 Shared Regions within an Image Collection 
We now consider the case of detecting a shared region between a query image and at least one other image in a large collection of M images. For simplicity, let us first examine the case where all the images in the collection are “partially similar” to the query image. We say that two images are “partial similar” if they share at least one large region (say, at least 10 percent of the image size). The shared regions Ri between the query image and each image Ii in the collection may be possibly different (Ri ≠ Rj).”;
page 1098, right column, last paragraph:
“In a nut-shell, our algorithm starts with uniform random sampling across the entire collection. The connections created between images (via detected shared regions) induce affinities between images (see Section 3). At each iteration, the sampling density distribution of each image is re-estimated according to ‘suggestions’ made by other images (guiding it where to sample in the next iteration).”;

page 1100:
“6.1 Experiments on Benchmark Evaluation Data Sets 
We used existing benchmark data sets (Caltech, ETHZshape) to compare results against [11], [12], [13] using their experimental setting and measures. Results are reported in Table 1. The four data sets generated by [11] consist of difficult classes from Caltech-101 with non-rigid objects and cluttery background (such as leopards and hedgehogs), from four classes (189 images) up to 20 classes (1,230 images). Example images are shown in Fig. 8. The ETHZ-shape data sets consists of five classes: Applelogos, Bottles, Giraffes, Mugs and Swans. For the ETHZ data set, we followed the experimental setting of [12] (which crops the images so that the objects are 25 percent of the image size). For both Benchmarks, our algorithm obtains state-of-the-art results (see Table 1). Note that for the case of 10 and 20 Caltech classes, our algorithm obtains more than 30 percent relative improvement over current state-of-the-art.”;

page 1101:
	“We tested our algorithm on this subset, clustering it to four clusters, and obtained a mean purity of 68.5 percent. In this case, restricting the search range did not yield better results since the object locations are scattered across the entire image. Fig. 11 shows example images which were clustered correctly along with example images which were mis-clustered. Notice that mis-clustered images can be conceptually ‘confusing’, like a horse with a carriage (which is confused for a car or bicycle due to the wheels). Fig. 10 show examples of shared regions detected by our algorithm in this PASCAL subset.”; and

page 1102, left column, 2nd full paragraph:
“Precision-Recall of our affinity matrix: Finally, to measure the quality of the affinity matrix generated by our unsupervised algorithm, we conduct the following experiment. For each image we compute its resulting average affinity to the images within each of the classes (using the ground truth labels of the other images). We define the following classification confidence for each image Ii per class c: score(i, c) = Σ_{j∈c, j≠i} A(i, j) / Σ_{j≠i} A(i, j), where A denotes our affinity matrix. Namely, score(i, c) is the affinity of Ii to class c divided by the total affinity of Ii to all other images. We then compute precision-recall curves using the scores of each of the classes. For a given class and a given score threshold, the precision measures the percentage of class images among all the images which passed the threshold, and the recall counts the percentage of these class images with respect to the total number of class images. We then compared our precision-recall curves to that obtained using the affinities of the spatial pyramid match kernel.”).  
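For illustration only, the classification confidence quoted from Faktor, page 1102 (the affinity of an image to a class divided by its total affinity to all other images) and the thresholded precision and recall can be sketched as follows. The function names, the 4-image affinity matrix, and the labels are hypothetical; treating "passed the threshold" as score >= threshold, and reporting a precision of 1.0 when no image passes, are assumptions of the sketch.

```python
def classification_score(A, labels, i, c):
    """score(i, c): affinity of image i to class c over its total affinity,
    excluding image i itself (j != i) in both sums."""
    n = len(A)
    to_class = sum(A[i][j] for j in range(n) if j != i and labels[j] == c)
    to_all = sum(A[i][j] for j in range(n) if j != i)
    return to_class / to_all if to_all else 0.0

def precision_recall(A, labels, c, threshold):
    """Precision and recall for class c at one score threshold."""
    n = len(A)
    passed = [i for i in range(n)
              if classification_score(A, labels, i, c) >= threshold]
    true_pos = sum(1 for i in passed if labels[i] == c)
    class_total = sum(1 for label in labels if label == c)
    precision = true_pos / len(passed) if passed else 1.0  # assumed convention
    recall = true_pos / class_total if class_total else 0.0
    return precision, recall

# Hypothetical symmetric affinity matrix for two ballet and two yoga images.
labels = ["ballet", "ballet", "yoga", "yoga"]
A = [
    [0.0, 0.8, 0.1, 0.1],
    [0.8, 0.0, 0.2, 0.0],
    [0.1, 0.2, 0.0, 0.7],
    [0.1, 0.0, 0.7, 0.0],
]
s0 = classification_score(A, labels, 0, "ballet")   # 0.8 / (0.8 + 0.1 + 0.1)
p_hi, r_hi = precision_recall(A, labels, "ballet", 0.5)
p_lo, r_lo = precision_recall(A, labels, "ballet", 0.2)
```

Sweeping the threshold over the range of scores and plotting the resulting (recall, precision) pairs would give the precision-recall curve described in the passage.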

	Thus, Faktor does not teach the claimed “siamese network-based classification model”. Accordingly, Isola teaches the claimed:
siamese network (via “a Convolutional Neural Net (CNN) with a Siamese-style architecture (Figure 2, Chopra et al. (2005))”)-based classification (via fig. 2: “Co-occurrence classifier”: “our co-occurrence classifier”) model (via “To model w(A, B)” via:
“3.1 PREDICTING CO-OCCURRENCES WITH A CNN 

To model w(A, B) we use a Convolutional Neural Net (CNN) with a Siamese-style architecture (Figure 2, Chopra et al. (2005)), which we implement in Caffe (Jia et al. (2014)). The network has two convolutional branches, one to process A and the other to process B, with shared weights. These branches can be regarded as feature extractors. The features are then concatenated and fed to a set of fully connected layers that compare the features and try to predict C. We use a logistic loss over C and train all models with stochastic gradient descent. Our objective can be expressed as 

E(A, B, C; θ) = −(1/N) Σ_{i=1}^{N} [ C_i log(σ(f(A_i, B_i; θ))) + (1 − C_i) log(1 − σ(f(A_i, B_i; θ))) ]   (2) 

where θ are the net parameters we optimize over (weights and biases), N is the number of training examples, σ is the logistic function, and f is a neural net. For each of our experiments, N = 500,000 training examples, 50% of which are positive (C = 1) and 50% negative (C = 0). We examine three domains: 1) learning to group patches based on their spatial adjacency in images, 2) learning to group video frames based on their temporal adjacency in movies, and 3) learning to group photos based on their geospatial proximity. 

Each task corresponds to a different choice of A, B, C, and Q. In each case, we analyze performance at predicting C and at predicting Q, comparing our CNN to baseline grouping cues. Each baseline corresponds to a measure of the similarity between the primitives. Similarity measures like these are commonly used in visual grouping algorithms Arbelaez et al. (2011); Faktor & Irani (2012). Full results of this analysis are given in Table 1. In all cases, our co-occurrence classifier matches or outperforms the baselines.”).
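
For illustration only, the logistic loss of Isola’s equation (2) can be sketched as below. This is a minimal sketch under stated assumptions: the network output f(A_i, B_i; θ) is represented by a plain list of numbers (`logits`, a hypothetical name), no network is actually trained, and the clipping constant `eps` is added here only for numerical safety.

```python
import math


def sigmoid(x):
    """Logistic function sigma(x)."""
    return 1.0 / (1.0 + math.exp(-x))


def cooccurrence_loss(logits, targets, eps=1e-12):
    """Mean logistic loss over N examples, in the form of equation (2).

    logits  : values standing in for f(A_i, B_i; theta) (assumed inputs)
    targets : co-occurrence labels C_i, each 0 or 1
    """
    n = len(logits)
    total = 0.0
    for z, c in zip(logits, targets):
        p = min(max(sigmoid(z), eps), 1.0 - eps)  # clip to avoid log(0)
        total += c * math.log(p) + (1 - c) * math.log(1.0 - p)
    return -total / n


# An uninformative predictor (logit 0) gives a loss of log(2) per example.
uninformed = cooccurrence_loss([0.0, 0.0], [1, 0])
# Confident, correct predictions give a much smaller loss.
confident = cooccurrence_loss([6.0, -6.0], [1, 0])
```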

	


Thus, one of skill in the art of image affinities as taught by both references can modify Faktor’s said “by describing an image”, cited: pages 1093,1094, resulting in “descriptors for classification”, page 1096, left/right column, represented in fig. 3: “Guided sampling” with Isola’s teaching of said “To model w(A, B)” by:
a)	installing the Siamese CNN between Faktor’s fig. 3: “Iteration 1” and “Iteration 2,…,T”, comprising guided sampling, and “Sparse set of meaningful affinities”;
b)	training the CNN:
b1)	modeling Faktor’s affinities as shown in Isola’s optimization modeling equation “(2)”, cited above, wherein “E(A, B, C; θ)” is the modeled affinity, based on the iterations 1 thru T;
c)	outputting a group of affinity patches (such as being representative of a horse with rider or turkey) via the trained Siamese CNN into Faktor’s fig. 3: “Sparse set of meaningful affinities” to find shared regions: represented: Faktor, fig. 3: “Update sampling distribution ‘Wisdom of crowds of images’ ”;
d)	inputting the shared regions back into the CNN to model more affinities; and
e)	recognizing that the modification is predictable or looked forward to because the modification “outperforms the baselines”, Isola, cited above, corresponding to Isola’s Table 1 in page 4: “Affinity measure” wherein “Co-occurrence classifier” is first in patches, frames and photos (as shown in Isola’s fig. 3, page 5: “Patches”; “Frames”; and “Photos”) and then “Color histogram” is second in patches (as shown in Isola’s fig. 3, page 5: “Patches”; “Frames”; and “Photos”).
	

Regarding claim 2, Faktor as combined teaches the method according to claim 1, wherein testing classification results from the siamese network-based classification model (via “by describing an image” resulting in “descriptors for classification” represented in fig. 3: “Guided sampling”, as modified via the combination) by using verification (said via “ground truth labels of the other images”) images (said or “189 images” represented in fig. 3: “Image collection”) to determine nonsimilar (said or “ ‘partially similar’ ” thus the other part is nonsimilar) image pairs (said via “a pair of images” or “pairs”) comprises: 
classifying the verification (said via “ground truth labels of the other images”) images (said or “189 images” represented in fig. 3: “Image collection”) by using the siamese network-based classification model (said via “by describing an image” resulting in “descriptors for classification” represented in fig. 3: “Guided sampling”, as modified via the combination) to obtain a confusion matrix (or “a confusion matrix”); 
performing clustering (obtaining “clustering results”) analysis (in terms of “each row” and “sepa-ration” “parts”, cited in the rejection of claim 1: page 1096) on the confusion matrix (said or “a confusion matrix”) to determine confusable classes (or “category”); and 





constructing (via fig. 3: “Sparse set of meaningful affinities” in the context of pairs as indicated by the dashed arrows in said fig. 3: “Sparse set of meaningful affinities”) the nonsimilar (said or “ ‘partially similar’ ” thus the other part is nonsimilar) image pairs (said via “a pair of images” or “pairs”) based on verification (said via “ground truth labels of the other images”) images (said or “189 images” represented in fig. 3: “Image collection”) that belong to the confusable classes (said or “category” via pages 1101,1102:
“To understand the sources for confusion in our clustering results, we computed a confusion matrix of the generated clusters (see Fig. 12a). The different values in each row represent the distribution of images within that cluster. For example, the car cluster contains 72 percent cars, 8 percent bicycles, 17 percent horses and 3 percent chairs. The identity of each cluster was determined by the category which got the most images in the cluster. Ideally, we would like the values on the diagonal to be 100 percent and the off-diagonal values to be 0 percent.”).  













Regarding claim 3, Faktor as combined teaches the method according to claim 2, wherein classifying the verification (said via “ground truth labels of the other images”) images (said or “189 images” represented in fig. 3: “Image collection”) by using the siamese network-based classification model (said via “by describing an image” resulting in “descriptors for classification” represented in fig. 3: “Guided sampling”) to obtain a confusion matrix (said or “a confusion matrix”) comprises:
classifying the verification (said via “ground truth labels of the other images”) images (said or “189 images” represented in fig. 3: “Image collection”) by using the siamese network-based classification model (said via “by describing an image” resulting in “descriptors for classification” represented in fig. 3: “Guided sampling”) to obtain a predicted class (said or “category”) of each of the verification (said via “ground truth labels of the other images”) images (said or “189 images” represented in fig. 3: “Image collection”); and 
constructing (said via fig. 3: “Sparse set of meaningful affinities” in the context of pairs as indicated by the dashed arrows in said fig. 3: “Sparse set of meaningful affinities”) the confusion matrix (said or “a confusion matrix”) according to a genuine class (said or “category”) and the predicted class (said or “category”) of each of the verification (said via “ground truth labels of the other images”) images (said or “189 images” represented in fig. 3: “Image collection”), 




wherein for each of rows (said in terms of “each row”) in the confusion matrix (said or “a confusion matrix”), a value of each column in the row (said in terms of “each row”) is a quantity of verification (said via “ground truth labels of the other images”) images (said or “189 images” represented in fig. 3: “Image collection”) which are in a class corresponding to the row (said in terms of “each row”) and classified into different classes (said or “category”).  
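
For illustration only, the claimed construction of a confusion matrix from the genuine class and the predicted class of each verification image can be sketched as follows. The function name, the class names, and the label lists are hypothetical and are not taken from the cited references.

```python
def confusion_matrix(genuine, predicted, classes):
    """Build a confusion matrix of counts.

    Row r corresponds to the genuine class classes[r]; column k holds the
    number of verification images of that genuine class that were
    classified into classes[k].
    """
    index = {c: k for k, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for g, p in zip(genuine, predicted):
        matrix[index[g]][index[p]] += 1
    return matrix


# Toy verification set (assumed labels).
classes = ["car", "bicycle", "horse"]
genuine = ["car", "car", "car", "bicycle", "bicycle", "horse", "horse"]
predicted = ["car", "car", "horse", "bicycle", "horse", "horse", "bicycle"]

cm = confusion_matrix(genuine, predicted, classes)
```

Dividing each row by its row sum would give percentages comparable in form to the row distributions quoted from Faktor.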
Regarding claim 4 (not interpreted under 35 USC 112(f)), Faktor as combined teaches the method according to claim 2, wherein performing clustering analysis (said in terms of “each row” and “sepa-ration” “parts”, cited in the rejection of claim 1) on the confusion matrix (said or “a confusion matrix”) to determine confusable classes (said or “category”) comprises: 
performing spectral (as indicated in fig. 4:green trees and fig. 9:red car) clustering analysis (said in terms of “each row” and “sepa-ration” “parts”, cited in the rejection of claim 1) on the confusion matrix (said or “a confusion matrix”) to obtain (comprising “to acquire a mental grasp” via fig. 12) a plurality of clusters (said via fig. 3: “Image Clusters” resulting in two underperforming bicycle and horse clusters), 
wherein each of the plurality of the clusters (said via fig. 3: “Image Clusters”) comprises at least one class (said or “category”); and
determining classes (via “the car cluster contains 72 percent cars, 8 percent bicycles, 17 percent horses and 3 percent chairs” cited in the rejection of claim 2) in a cluster (said via fig. 3: “Image Clusters”) that comprises at least two classes as the confusable classes (said or “category”).  
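
For illustration only, the step of determining confusable classes from clusters of classes can be sketched as below. Note the substitution: this sketch does not perform spectral clustering; it groups classes by connected components of a thresholded confusion graph, which is a simpler stand-in chosen to keep the example self-contained. The threshold `min_confusions`, the function names, and the toy matrix are all assumptions.

```python
def cluster_classes(confusion, classes, min_confusions=1):
    """Group classes that are mutually confused (connected components).

    Two classes are linked when the number of images confused between
    them, in either direction, is at least min_confusions.
    """
    n = len(classes)
    linked = {a: set() for a in range(n)}
    for a in range(n):
        for b in range(n):
            if a != b and confusion[a][b] + confusion[b][a] >= min_confusions:
                linked[a].add(b)

    seen, clusters = set(), []
    for start in range(n):
        if start in seen:
            continue
        seen.add(start)
        stack, members = [start], []
        while stack:
            k = stack.pop()
            members.append(classes[k])
            for nb in linked[k]:
                if nb not in seen:
                    seen.add(nb)
                    stack.append(nb)
        clusters.append(sorted(members))
    return clusters


def confusable_classes(clusters):
    """Classes in any cluster that comprises at least two classes."""
    return sorted(c for cluster in clusters if len(cluster) >= 2 for c in cluster)


# Toy confusion matrix (assumed counts): bicycle and horse are confused.
classes = ["car", "bicycle", "horse", "chair"]
confusion = [
    [18, 0, 0, 0],
    [0, 10, 5, 0],
    [0, 4, 12, 0],
    [0, 0, 0, 9],
]

clusters = cluster_classes(confusion, classes, min_confusions=3)
confusable = confusable_classes(clusters)
```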

Regarding claim 7, Faktor as combined teaches the method according to claim 1, wherein acquiring image features (said or “local features...across the images” corresponding to a “distinguishing feature” via said “The algorithm”, algorithm line 7: “matrix B” comprising said “descriptors” “dk” as shown in equation “Bij” in page 1099, right column, 3rd bullet comprising said a “distinguishing feature”) by using the optimized classification model (said via “by describing an image” resulting in “descriptors for classification” represented in fig. 3: “Guided sampling”) comprises: 
using images to be processed as an input (represented in fig. 3 as arrows) of the optimized classification model (said via “by describing an image” resulting in “descriptors for classification” represented in fig. 3: “Guided sampling”), 
acquiring an output (represented in fig. 3 as arrows) of a layer (with “overlap”) with feature expressiveness (or meaningfulness via fig. 3: “Sparse set of meaningful affinities”) in the optimized classification model (said via “by describing an image” resulting in “descriptors for classification” represented in fig. 3: “Guided sampling”), and 
using the output (said represented in fig. 3 as arrows) as image features (said or “local features...across the images” corresponding to a “distinguishing feature” via said “The algorithm”, algorithm line 7: “matrix B” comprising said “descriptors” “dk” as shown in equation “Bij” in page 1099, right column, 3rd bullet comprising said a “distinguishing feature”) of the images to be processed (via:
page 1098:
“ Claim 3 (Shared regions within an image collection). Let $I_0$ be a query image, and let $I_1, \ldots, I_M$ be images of size $N$ which are “partially similar” to $I_0$. Let $R_1, \ldots, R_M$ be regions of size $|R_i| \geq \alpha N$ such that $R_i$ is shared by $I_0$ and $I_i$ (the regions $R_i$ may overlap in $I_0$). Using $S = \frac{1}{\alpha} \log\left(\frac{1}{\delta}\right)$ samples per descriptor in $I_0$, distributed randomly across $I_1, \ldots, I_M$, guarantees with probability $p \geq (1 - \delta)$ to detect at least one of the regions $R_i$.”).  
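
For illustration only, the sample count in the quoted Claim 3 of Faktor can be worked through numerically. This sketch assumes the logarithm is natural and uses a simplified model in which each random sample independently lands in a shared region with probability at least α; the function names and the values α = 0.1 and δ = 0.05 are hypothetical.

```python
import math


def samples_per_descriptor(alpha, delta):
    """S = (1/alpha) * log(1/delta), rounded up to a whole sample count."""
    return math.ceil(math.log(1.0 / delta) / alpha)


def miss_probability(alpha, samples):
    """Chance that none of the samples hits a region covering a fraction
    alpha of an image, under the independence assumption stated above."""
    return (1.0 - alpha) ** samples


alpha, delta = 0.1, 0.05  # assumed values for the worked example
S = samples_per_descriptor(alpha, delta)
p_miss = miss_probability(alpha, S)
```

Under these assumptions the miss probability stays at or below δ because (1 − α)^S ≤ exp(−αS), so the detection probability is at least 1 − δ.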

Claim 5 is rejected under 35 U.S.C. 103 as being unpatentable over Faktor et al. (“Clustering by Composition”-Unsupervised Discovery of Image Categories) in view of Isola et al. (Learning Visual Groups from Co-Occurrences in Space and Time) as applied above in claims 1-4 and 7 further in view of Balntas et al. (BOLD - Binary online learned descriptor for efficient image matching).
Regarding claim 5, Faktor as combined teaches the method according to claim 1, wherein optimizing (said via “max” as shown in page 1099, left column: “The algorithm”, algorithm lines 9,12: “max”) the siamese network-based classification model (said via “by describing an image” resulting in “descriptors for classification” represented in fig. 3: “Guided sampling”) based on the similar (said or “ ‘partially similar’ ”) image pairs (said via “a pair of images” or “pairs”) and the nonsimilar (said or “ ‘partially similar’ ” thus the other part is nonsimilar) image pairs (said via “a pair of images” or “pairs”) comprises: 









optimizing (said via “max” as shown in page 1099, left column: “The algorithm”, algorithm lines 9,12: “max”) the siamese network-based classification model (said via “by describing an image” resulting in “descriptors for classification” represented in fig. 3: “Guided sampling” as modified via the combination) based on inter (via “strong intra-cluster connections, and very few inter-cluster connections”)-class (or “the same semantic category”) variance (or “variability”) maximization (said via “max” as shown in page 1099, left column: “The algorithm”, algorithm lines 9,12: “max”) and intra (said via “strong intra-cluster connections, and very few inter-cluster connections”)-class (said or “the same semantic category”) variance (said or “variability”) minimization (via “the lower matching error…so far”) and by using the similar (said or “ ‘partially similar’ ”) image pairs (said via “a pair of images” or “pairs”) and the nonsimilar (said or “ ‘partially similar’ ” thus the other part is nonsimilar) image pairs (said via “a pair of images” or “pairs” via:
page 1092, left/right column:
“In this work, we deal with the problem of unsupervised discovery of visual categories within an image collection. The goal here is to group the images into meaningful clusters of images which belong to the same semantic category. Existing work on this problem can be broadly classified to two main families of approaches.”;

page 1093, description of fig. 2:
“Fig. 2. Clustering Results on our Ballet-Yoga data set. This data set contains 20 Ballet and 20 Yoga images (all shown here). Images assigned to the wrong cluster are marked in red. We obtain mean purity of 92.5 percent (37 out of 40 images are correctly clustered). Note there seems to be no single (nor even few) ‘common model(s)’ (e.g., common shapes or segments) shared by all images of the same category. Therefore, methods for unsupervised ‘learning’ of a shared ‘cluster model’ will most likely fail (not only due to the large variability within each category, but also due to the small number of images per category).”; 





page 1097, left/right column:
“(ii) Repeat several times:
a.	Propagation: Each descriptor chooses between its best match so far, and the match proposed by its spatial neighbors (with appropriate shift)—whichever has the lower matching error. For example, each descriptor suggests to its neighbor on the right the location which is just on the right from the location of its own match. The propagation through the entire image is achieved quickly via two image sweeps (once from top down, and once from bottom up). The complexity of this step is $O(N)$.
b.	Local search: Each descriptor $d \in I_1$ randomly samples $h$ (typically a small number) locations in a small neighborhood around its current best match so far, and checks if one of the new locations improves its best match. This allows the regions to grow in a nonrigid fashion. The complexity of this step is $O(N)$.”; and

page 1099:
“Note that N-Cut algorithm (and other graph partitioning algorithms) implicitly rely on two assumptions: (i) that there are enough strong affinities within each cluster, and
(ii) that the affinity matrix is relatively sparse (with the hope that there are not too many connections across clusters). The sparsity assumption is important both for computational reasons, as well as to guarantee the quality of the clustering. This is often obtained by sparsifying the affinity matrix (e.g., by keeping only the top $10 \log_{10} M$ values in each row [9]). The advantage of our algorithm is that it implicitly achieves both conditions via the ‘scholarly’ multi-image collaborative search. The ‘suggestions’ made by images to each other quickly generate (within a few iterations) strong intra-cluster connections, and very few inter-cluster connections.”).  
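
For illustration only, the conventional sparsification mentioned in the quoted passage (keeping only the largest affinities in each row) can be sketched as follows. The function name, the rounding of 10 log10 M up to a whole number, the cap at the row length, and the toy matrix are assumptions made for this sketch.

```python
import math


def sparsify_rows(affinity, keep=None):
    """Zero out all but the `keep` largest entries of each row.

    When `keep` is not given it defaults to ceil(10 * log10(M)) for an
    M x M matrix, capped at M (an assumed reading of the quoted rule).
    """
    m = len(affinity)
    if keep is None:
        keep = min(m, max(1, math.ceil(10 * math.log10(m)))) if m > 1 else 1
    sparse = []
    for row in affinity:
        order = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        kept = set(order[:keep])
        sparse.append([v if j in kept else 0.0 for j, v in enumerate(row)])
    return sparse


# Toy affinity matrix (assumed values); keep the top 2 entries per row.
A = [
    [0.0, 0.9, 0.2, 0.1],
    [0.9, 0.0, 0.3, 0.05],
    [0.2, 0.3, 0.0, 0.8],
    [0.1, 0.05, 0.8, 0.0],
]
A_sparse = sparsify_rows(A, keep=2)
```

Row-wise truncation can leave the result asymmetric, so a practical pipeline would typically symmetrize it before graph partitioning.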










	Thus, Faktor as combined does not teach, as indicated in bold above, claim 5 as a whole. Accordingly, Balntas teaches claim 5 of:
optimizing the siamese network-based classification model (or “optimize a binary descriptor”) based on inter-class variance maximization and intra-class variance minimization (via “maximize the inter-class distances and then a subset is selected online for each patch to minimize the intra-class distances”) and by using the similar image pairs (via “patches…similar” comprising “pairs” as shown in fig. 1: “query patch”) and the nonsimilar image pairs (via “patches…dissimilar” comprising “pairs” as shown in fig. 1: “query patch” via:
page 2368, left column, 1st full paragraph:
“In this paper we propose an approach which combines the advantages of efficient binary descriptors with the improved performance of learning-based descriptors. We demonstrate that there is no single set of measurements that
is globally optimal for all patches in a dataset and significant improvement can be gained by adapting the binary tests to the content of each patch. The measurements are first designed to maximize the inter-class distances and then a subset is selected online for each patch to minimize the intra-class distances. This is done efficiently in such a way that the extraction time is comparable to other binary descriptors. The proposed online selection of discriminative binary tests can be applied to other techniques such as decision trees or ferns. Nearest neighbour matching of descriptors is also efficient by calculating a modified Hamming distance. We evaluate the proposed descriptor on different benchmarks and demonstrate performance that matches that of SIFT, with computational efficiency that matches that of BRIEF.”;














page 2369:
“2.2. Learning discriminative descriptors
It has been frequently demonstrated that descriptors perform better when the separation between the intra-class distances and the inter-class distances is maximized. Given a set of labelled matching and non-matching image patches, methods like [2, 10] seek to find a projection $w^*$ s.t. $w^* = \arg\max_w \frac{w^T A w}{w^T B w}$, which is the ratio of the inter $A$ to intra-class $B$ covariance along the direction $w$. Intuitively, such methods seek to minimize the expected distance between patches annotated as similar and maximize the expected distance between patches annotated as dissimilar. This has been done globally for real-valued descriptors in [2, 10, 17] with the use of a large set of negative and positive pairs of patches in an offline learning process.
In the following we propose an approach that exploits this idea to optimize a binary descriptor for each patch independently.”).  
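
For illustration only, the ratio criterion quoted from Balntas (inter-class covariance A over intra-class covariance B along a direction w) can be evaluated on a small example. This sketch uses a brute-force search over unit directions in two dimensions rather than any method from the cited references; the function names and the two covariance matrices are hypothetical.

```python
import math


def quadratic_form(w, matrix):
    """Compute w^T M w for a small dense matrix."""
    n = len(w)
    return sum(w[r] * matrix[r][c] * w[c] for r in range(n) for c in range(n))


def variance_ratio(w, inter, intra):
    """Ratio of inter-class to intra-class covariance along direction w."""
    return quadratic_form(w, inter) / quadratic_form(w, intra)


def best_direction(inter, intra, steps=3600):
    """Grid search over unit vectors in the plane for the largest ratio."""
    best_ratio, best_w = None, None
    for k in range(steps):
        angle = math.pi * k / steps
        w = (math.cos(angle), math.sin(angle))
        r = variance_ratio(w, inter, intra)
        if best_ratio is None or r > best_ratio:
            best_ratio, best_w = r, w
    return best_ratio, best_w


# Assumed covariances: classes are well separated along the first axis.
A_inter = [[4.0, 0.0], [0.0, 1.0]]
B_intra = [[1.0, 0.0], [0.0, 2.0]]

ratio, direction = best_direction(A_inter, B_intra)
```

In higher dimensions the same maximizer is usually obtained from a generalized eigenvalue problem rather than a search.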

	Thus, one of ordinary skill in the art of classification descriptors and image patches, as taught in both teachings, can modify Faktor’s teaching said “max” as shown in page 1099, left column: “The algorithm”, algorithm lines 9,12: “max” and said “by describing an image” resulting in “descriptors for classification”, as modified via the combination, with Balntas’ teaching of said “optimize a binary descriptor” by modifying Faktor’s said “by describing an image” resulting in “descriptors for classification” by describing the image using Balntas’ binary descriptor resulting in binary “descriptors for classification” and recognize that the modification is predictable or looked forward to because Balntas’ descriptor “outperforms SIFT” via Balntas, page 2372, left/right column:
“In Figure 7 (top) we plot the results for a pair of images from each sequence from [11] that represents a significant transformation. Results of other image pairs are
consistent. Interestingly, SIFT gives the best results overall. However, BOLD outperforms SIFT for high precision part of the curves in Boat, Bikes and Bark sequences. It is worth noting that although BinBoost performs well in the patch dataset, it is ranked third in the matching experiment behind SIFT and BOLD. This may be due to a different training data used to optimize BinBoost and different feature points.”


Claim 6 is rejected under 35 U.S.C. 103 as being unpatentable over Faktor et al. (“Clustering by Composition”-Unsupervised Discovery of Image Categories) in view of Isola et al. (Learning Visual Groups from Co-Occurrences in Space and Time) as applied in claims 1-4 and 7 further in view of Liong et al. (Deep Coupled Metric Learning for Cross-Modal Matching) and Boiman et al. (Similarity by Composition).
Regarding claim 6, Faktor as combined teaches the method according to claim 1, wherein training the siamese network-based classification model (said via “by describing an image” resulting in “descriptors for classification” represented in fig. 3: “Guided sampling”, as modified via the combination) by using preset classes (said or “difficult classes”) of training images comprises: 
training (said via “guiding it where to sample in the next iteration”) a deep convolutional neural network-based classification model (said via “by describing an image” resulting in “descriptors for classification” represented in fig. 3: “Guided sampling”) based on inter-class variance maximization (said via “max” as shown in page 1099, left column: “The algorithm”, algorithm lines 9,12: “max”) and by using the preset classes (said or “difficult classes”) of training images.  
	





Thus, Faktor as combined does not teach, as indicated in bold above, claim 6 as a whole. Accordingly, Liong teaches claim 6 of:
training (via “new training data”) a deep convolutional neural network (via “deep…neural networks”)-based classification (via “intra-class variation” and “inter-class variation”) model (via “more recent deep learning models” as indicated in fig. 1:Deep Coupled Metric Learning (DCML)) based on inter-class variance maximization (via “the inter-class variation is maximized”) and by using the preset classes (said via “intra-class variation” and “inter-class variation” via:
page 1234:
“A variety of cross-modal matching methods [9] have been proposed in recent years, and the typical approach is to seek one common semantic space to reduce the modality gap. For example, canonical correlation analysis (said in terms of “each row” and “sepa-ration” “parts”, cited in the rejection of claim 1) (CCA) [3] was applied for
cross-modal matching where it projects two sets of features of different modalities into one common space where their correlation is maximized. Similarly, partial least square (PLS) [20] and semantic correlation matching (SCM) [4] used a similar idea to reduce the modality gap by using different statistical techniques and formulations. While these cross-modal matching methods have achieved encouraging performance, most of them
employ direct projections from the original feature representations, which usually cannot truly capture the high-level semantics from nonlinear real-world data. While there are studies that provide nonlinear transformations based on kernels [21], [22], these models are not scalable for new training data. While more recent deep learning models have provided scalable nonlinear hierarchical transformations for discriminant feature representations, only few of them have been implemented particularly for cross-modal matching [23]–[25]. Hence, how to learn a model which can extract high-level semantic representations efficiently from nonlinear relationships across different modalities remains a challenging problem in cross-modal matching.”; and











pages 1234,1235:
“In this paper, we propose a new deep coupled metric learning (DCML) method for cross-modal matching. Unlike most existing methods modal-invariant feature learning methods such as CCA and PLS which learn a single linear latent space to reduce the modality gap, our DCML designs two neural networks to learn two sets of hierarchical nonlinear transformations (one set for each modality) to nonlinearly map data samples into a shared feature subspace, under which the intra-class variation is minimized and the inter-class variation is maximized, and the difference of each sample pair captured from two modalities of the same class is minimized, respectively. Fig. 1 illustrates the basic idea of the proposed approach. Experimental results on four different cross-modal matching applications demonstrate the effectiveness of the proposed method.”).

Thus, one of ordinary skill in the art of matching images can modify Faktor’s said  “guiding it where to sample in the next iteration”, as modified via the combination, with Liong’s teaching of “new training data” with said Liong’s fig. 1:Deep Coupled Metric Learning (DCML) by:
a.	inserting Liong’s fig. 1:DCML after Faktor’s fig. 3: “Iteration 1 Uniform Sampling across all images”:
a1)	implementing the Siamese CNN as the DCML;
b.	inputting Faktor’s said “difficult classes” into the DCML;
c.	outputting the “difficult classes” results of DCML into Faktor’s fig. 3: “Sparse set of meaningful affinities”; and
d.	feeding back the DCML affinities results.

and recognize that the modification is predictable or looked forward to because Liong’s teaching achieves “effective” “matching” (Liong, cited above) and Boiman et al. (which shares an author with Faktor) teaches “We propose a new approach for measuring similarity between two signals, which is applicable to many machine learning tasks, and to many signal types… (images, video, audio, biological data, etc.)” via equation (1) in the 3rd page, referring to Faktor’s same equation (3) in page 1095 via Boiman, Abstract:





“We propose a new approach for measuring similarity between two signals, which is applicable to many machine learning tasks, and to many signal types. We say that a signal S1 is “similar” to a signal S2 if it is “easy” to compose S1 from few large contiguous chunks of S2. Obviously, if we use small enough pieces, then any signal can be composed of any other. Therefore, the larger those pieces are, the more similar S1 is to S2. This induces a local similarity score at every point in the signal, based on the size of its supported surrounding region. These local scores can in turn be accumulated in a principled information-theoretic way into a global similarity score of the entire S1 to S2. “Similarity by Composition” can be applied between pairs of signals, between groups of signals, and also between different portions of the same signal. It can therefore be employed in a wide variety of machine learning problems (clustering, classification, retrieval, segmentation, attention, saliency, labelling, etc.), and can be applied to a wide range of signal types (images, video, audio, biological data, etc.). We show a few such examples.”
















Claims 9, 16 and 10 are rejected under 35 U.S.C. 103 as being unpatentable over Faktor et al. (“Clustering by Composition”-Unsupervised Discovery of Image Categories) in view of Isola et al. (Learning Visual Groups from Co-Occurrences in Space and Time) as applied in claims 1-4 and 7 above further in view of Faktor et al. (“Clustering by Composition”-Unsupervised Discovery of Image Categories).
Regarding claim 9, claim 9 is rejected the same as claim 1. Thus, argument presented in claim 1 is equally applicable to claim 9. Accordingly, Faktor teaches claim 9 of an electronic device, comprising: 
a memory (via “Memory” or “memory”), 
a processor (via “computation” comprising “the use of a computer”), and   
computer programs stored in the memory and executable by the processor, wherein the processor executes the computer programs to implement operations comprising: 
training a siamese network-based classification model (said via “by describing an image” resulting in “descriptors for classification” represented in fig. 3: “Guided sampling”) by using preset classes (said or “difficult classes”) of training images; 
testing classification results from the siamese network-based classification model (said via “by describing an image” resulting in “descriptors for classification” represented in fig. 3: “Guided sampling”) by using verification (said via “ground truth labels of the other images”) images (said or “189 images” represented in fig. 3: “Image collection”) to determine nonsimilar (said or “ ‘partially similar’ ” thus the other part is nonsimilar) image pairs (said via “a pair of images”); 
determining similar (said or “ ‘partially similar’ ”) image pairs (said via “a pair of images” or “pairs”) based on the training images; 
optimizing (said via “max” as shown in page 1099, left column: “The algorithm”, algorithm lines 9,12: “max”) the siamese network-based classification model (said via “by describing an image” resulting in “descriptors for classification” represented in fig. 3: “Guided sampling”) based on the similar (said or “ ‘partially similar’ ”) image pairs (said via “a pair of images” or “pairs”) and the nonsimilar (said or “ ‘partially similar’ ” thus the other part is nonsimilar) image pairs (said via “a pair of images”); and 
acquiring image features (said or “local features...across the images” corresponding to a “distinguishing feature” via said “The algorithm”, algorithm line 7: “matrix B” comprising said “descriptors” “dk” as shown in equation “Bij” in page 1099, right column, 3rd bullet comprising said a “distinguishing feature”) by using the optimized classification model (said via “by describing an image” resulting in “descriptors for classification” represented in fig. 3: “Guided sampling” via:
page 1100:
“5.3 Complexity (Time and Memory)
All matrix computations and updates (max(A, B), B^2, update P, etc.) are efficient, both in terms of memory and computation, since the matrix B is sparse. Its only non-zero
entries correspond to the image connections generated in the current iteration.”

wherein “computation” is defined via Dictionary.com:
computation
noun
1	an act, process, or method of computing; calculation.

wherein “computing” is defined:
computing
noun
1	the use of a computer to process data or perform calculations.).  

	
Thus, Faktor does not teach, as indicated in bold above, the claimed:
A.	“computer programs stored in the memory and executable by the processor, wherein the processor executes the computer programs to implement operations”; and
B.	a siamese network-based classification model.
Regarding A., one of skill in the art of computers can modify Faktor’s said “The algorithm” by creating a program based on said “The algorithm” and execute the program and recognize that the modification is predictable or looked forward to because the modification results in “high-speed processing” such that any processing of numbers as taught in Faktor is at high-speed or faster than a common measured speed via: 
SCIENTIFIC DEFINITIONS FOR COMPUTER
computer
A programmable machine that performs high-speed processing of numbers, as well as of text, graphics, symbols, and sound. All computers contain a central processing unit that interprets and executes instructions; input devices, such as a keyboard and a mouse, through which data and commands enter the computer; memory that enables the computer to store programs and data; and output devices, such as printers and display screens, that show the results after the computer has processed data.

Thus, the combination does not teach B.
However as discussed in the rejection of claim 1, Isola teaches B of:
siamese network (via “a Convolutional Neural Net (CNN) with a Siamese-style architecture (Figure 2, Chopra et al. (2005))”)-based classification (via fig. 2: “Co-occurrence classifier”: “our co-occurrence classifier”) model (via “To model w(A, B)” via:
Isola).

	

Thus, one of skill in the art of image affinities as taught by both references can modify the combination of Faktor’s said “by describing an image”, cited: pages 1093,1094, resulting in “descriptors for classification”, page 1096, left/right column, represented in fig. 3: “Guided sampling” with Isola’s teaching of said  “To model w(A, B)” by:
a)	obtaining affinities, as shown in Faktor’s fig. 3: “Sparse set of meaningful affinities” via said describing an image;
b)	modeling the affinities as shown in equation “(2)”, Isola, cited above, wherein “E(A, B, C; θ)” is the modeled affinity;
c)	clustering/classifying, as shown in Faktor’s fig. 3: “Image Clusters”, based on the CNN with Siamese-style equation (2); and
d)	recognizing that the modification is predictable or looked forward to because the modification “outperforms the baselines”, Isola, cited above, corresponding to Isola’s Table 1 in page 4: “Affinity measure” wherein “Co-occurrence classifier” is first in patches, frames and photos (as shown in Isola’s fig. 3: “Patches”; “Frames”; and “Photos”, page 5) and then “Color histogram” is second in patches (as shown in Isola’s fig. 3 “Patches”; “Frames”; and “Photos”, page 5).
Regarding claim 16, claim 16 is rejected the same as claim 7. Thus argument presented in claim 7 is equally applicable to claim 16.




	Regarding claim 10, claim 10 is rejected the same as claims 1 and 9. Thus, argument presented in claims 1 and 9 is equally applicable to claim 10. Accordingly, Faktor teaches claim 10 of a non-transitory computer readable storage medium (said memory) storing computer programs, wherein the programs are executed by a processor (said computer) to implement operations comprising: 
training a siamese network-based classification model (said via “by describing an image” resulting in “descriptors for classification” represented in fig. 3: “Guided sampling”) by using preset classes (said or “difficult classes”) of training images; 
testing classification results from the siamese network-based classification model (said via “by describing an image” resulting in “descriptors for classification” represented in fig. 3: “Guided sampling”) by using verification (said via “ground truth labels of the other images”) images (said or “189 images” represented in fig. 3: “Image collection”) to determine nonsimilar (said or “ ‘partially similar’ ” thus the other part is nonsimilar) image pairs (said via “a pair of images”); 
determining similar (said or “ ‘partially similar’ ”) image pairs (said via “a pair of images” or “pairs”) based on the training images; 
optimizing (said via “max” as shown in page 1099, left column: “The algorithm”, algorithm lines 9,12: “max”) the siamese network-based classification model (said via “by describing an image” resulting in “descriptors for classification” represented in fig. 3: “Guided sampling”) based on the similar (said or “ ‘partially similar’ ”) image pairs (said via “a pair of images” or “pairs”) and the nonsimilar (said or “ ‘partially similar’ ”) image pairs (said via “a pair of images”); and 
acquiring image features (said or “local features...across the images” corresponding to a “distinguishing feature” via said “The algorithm”, algorithm line 7: “matrix B” comprising said “descriptors” “dk” as shown in equation “Bij” in page 1099, right column, 3rd bullet comprising said a “distinguishing feature”) by using the optimized classification model (said via “by describing an image” resulting in “descriptors for classification” represented in fig. 3: “Guided sampling”).  
Thus, Faktor does not teach, as indicated in bold above, the claimed:
A.	“a computer readable storage medium storing computer programs, wherein the programs are executed by a processor to implement operations.”; and
B.	“a siamese network-based classification model”.
Thus as discussed in claim 9, one of skill in the art of computers can modify Faktor’s said “The algorithm” by creating a program based on said “The algorithm” and execute the program and recognize that the modification is predictable or looked forward to because the modification results in “high-speed processing” such that any processing of numbers as taught in Faktor is at high-speed or faster than a common measured speed.
Thus, the combination does not teach limitation B.
However as discussed in the rejection of claims 1 and 9, Isola teaches limitation B of:
siamese network (via “a Convolutional Neural Net (CNN) with a Siamese-style architecture (Figure 2, Chopra et al. (2005))”)-based classification (via fig. 2: Co-occurrence classifier”: “our co-occurrence classifier”) model (via “To model w(A, B)” via:
Isola).

	Thus as discussed above in the rejection of claims 1 and 9, one of skill in the art of image affinities as taught by both references can modify the combination of Faktor’s said “by describing an image”, cited: pages 1093,1094, resulting in “descriptors for classification”, page 1096, left/right column, represented in fig. 3: “Guided sampling” with Isola’s teaching of said  “To model w(A, B)” by:
a)	obtaining affinities, as shown in Faktor’s fig. 3: “Sparse set of meaningful affinities” via said describing an image;
b)	modeling the affinities as shown in equation “(2)”, Isola, cited above, wherein “E(A, B, C; θ)” is the modeled affinity;
c)	clustering/classifying, as shown in Faktor’s fig. 3: “Image Clusters”, based on the CNN with Siamese-style equation (2); and
d)	recognizing that the modification is predictable or looked forward to because the modification “outperforms the baselines”, Isola, cited above, corresponding to Isola’s Table 1 in page 4: “Affinity measure” wherein “Co-occurrence classifier” is first in patches, frames and photos (as shown in Isola’s fig. 3: “Patches”; “Frames”; and “Photos”, page 5) and then “Color histogram” is second in patches (as shown in Isola’s fig. 3: “Patches”; “Frames”; and “Photos”, page 5).






Claims 11-13 are rejected under 35 U.S.C. 103 as being unpatentable over Faktor et al. (“Clustering by Composition”-Unsupervised Discovery of Image Categories) in view of Isola et al. (Learning Visual Groups from Co-Occurrences in Space and Time) as applied in claims 1-4 and 7 above further in view of Faktor et al. (“Clustering by Composition”-Unsupervised Discovery of Image Categories) as applied in claims 9,16 and 10 above further in view of Chennupati (Hierarchical Decomposition of Large Deep Networks).
Regarding claim 11, Faktor as modified teaches the electronic device according to claim 9, wherein the operation of testing classification results from the siamese network-based classification model (said via “by describing an image” resulting in “descriptors for classification” represented in fig. 3: “Guided sampling” as modified via the combination) by using verification (said via “ground truth labels of the other images”) images (said or “189 images” represented in fig. 3: “Image collection”) to determine nonsimilar (said or “ ‘partially similar’ ” thus the other part is nonsimilar) image pairs (said via “a pair of images” or “pairs”) further comprises:
classifying the verification (said via “ground truth labels of the other images”) images (said or “189 images” represented in fig. 3: “Image collection”) by using the siamese network-based classification model (said via “by describing an image” resulting in “descriptors for classification” represented in fig. 3: “Guided sampling”, as modified via the combination) to obtain a confusion matrix (said or “a confusion matrix”); 


performing clustering analysis (said in terms of “each row” and “sepa-ration” “parts”, cited in the rejection of claim 1) on the confusion matrix (said or “a confusion matrix”) to determine confusable (via horse confused with bike) classes (said or “category”); and 
constructing (said via fig. 2: “Sparse set of meaningful affinities” in the context of pairs as indicated by the dashed arrows in said fig. 2: “Sparse set of meaningful affinities”) the nonsimilar (said or “ ‘partially similar’ ” thus the other part is nonsimilar) image pairs (said via “a pair of images” or “pairs”) based on verification (said via “ground truth labels of the other images”) images (said or “189 images” represented in fig. 3: “Image collection”) that belong to the confusable classes (said or “category”).  
Thus, Faktor as combined does not teach as indicated in bold above, the claimed computer operation “performing clustering analysis (said in terms of “each row” and “sepa-ration” “parts”, cited in the rejection of claim 1) on the confusion matrix”. 
Accordingly, Chennupati teaches:
performing clustering (via page 57, fig. 35: “Spectral Density Clustering”) analysis (via “spectral clustering…analysis of the confusion matrix” to “extract hidden correlations” “from the class-to-class confusion matrix”) on the confusion matrix (said  “the confusion matrix” via:
pages 36,37:
“In this thesis a novel method is proposed to alleviate the computational complexity involved in training larger networks for datasets with higher number of discrete classes or concepts. Our approach uses a high-level classifier to initially determine which sub-class a sample belongs to, then passes that sample into the corresponding sub-class network to make a final class assignment. Our method automatically determines the optimal number of subclasses, then trains each sub-class in an independent fashion. The first stage of determining the number of sub-classes is called Hierarchy Clustering. In this stage by exploiting the rich information from the class-to-class confusion matrix (generated using a simplified conventional neural network mapping to all classes or concepts) to extract hidden correlations amongst classes. During training, a Hierarchy Classifier predicts which sub-network a sample belongs. This sample is then passed into one of C Smaller Class Assignment Classifiers, each which is only concerned with a subset of classes to make a final classification estimate.

5.1. Hierarchy Clustering
To tackle problems with a large number of classes, a hierarchical approach for clustering similar classes into sub-groups is used. This requires the training of a handful of much simpler neural networks where the number of overall parameters has been reduced. The intuition behind using hierarchical clustering is the presence of coarse categories or super classes which contain a higher number of finer classes. To categorize the given set of classes into super classes, spectral clustering of the confusion matrix is used to generate a given number of clusters. The main challenge with the hierarchical clustering scheme is the selection of an optimum merge or split breakpoints, which if done improperly, can lead to low quality clusters. To address this challenge, a multi-phase technique that is based on the analysis of the confusion matrix of the classifier in the parent stage is proposed.”
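For illustration only, the spectral clustering of a class-to-class confusion matrix described in the quoted passage can be sketched as follows. This is a minimal two-cluster sketch and is not Chennupati’s “Algorithm 2”. The function name spectral_bipartition, the symmetrized off-diagonal affinity, the sample matrix, and the use of the sign of the second eigenvector to split the classes are all assumptions introduced for this example.

```python
import numpy as np

def spectral_bipartition(confusion):
    """Split classes into two groups from a class-to-class confusion matrix.

    Hypothetical helper: a two-way normalized spectral partition, used here
    only as a simplified stand-in for multi-cluster spectral clustering.
    """
    c = np.asarray(confusion, dtype=float)
    # Assumed affinity between classes: symmetrized off-diagonal confusions.
    w = (c + c.T) / 2.0
    np.fill_diagonal(w, 0.0)
    degree = w.sum(axis=1)
    # Guard against classes that are never confused with any other class.
    inv_sqrt = 1.0 / np.sqrt(np.maximum(degree, 1e-12))
    # Symmetric normalized Laplacian: I - D^{-1/2} W D^{-1/2}.
    lap = np.eye(len(w)) - inv_sqrt[:, None] * w * inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(lap)  # eigenvalues returned in ascending order
    # Eigenvector of the second-smallest eigenvalue, rescaled by D^{-1/2}.
    fiedler = vecs[:, 1] * inv_sqrt
    return (fiedler > 0).astype(int)

# Made-up 4-class confusion matrix: classes 0 and 1 are mostly confused with
# each other, as are classes 2 and 3.
example = [[50, 8, 1, 1],
           [9, 48, 2, 1],
           [1, 1, 45, 13],
           [0, 2, 12, 46]]
labels = spectral_bipartition(example)
print(labels)  # classes sharing a label fall in the same cluster
```

Which group receives label 0 or 1 depends on the sign of the eigenvector, so only the grouping, not the label value, is meaningful.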
Thus, said one of ordinary skill in the art of confusion matrices and computers can modify Faktor’s teaching of said “a confusion matrix” with Chennupati’s teaching of said “extract hidden correlations” “from the class-to-class confusion matrix” by:
a.	creating a program corresponding to Chennupati’s page 38, “Algorithm 2: Hierarchy class clustering” comprising “Input: “Confusion matrix Cp”; 
b.	executing the program; and
c.	inputting said Faktor’s teaching of said “a confusion matrix” into Chennupati’s “Algorithm 2” comprising “Input: “Confusion matrix Cp”;
and recognize that the combination is predictable or looked forward to because Chennupati’s matrix separation is “exploiting the rich information from the class-to-class confusion matrix” resulting in “finer classes”, Chennupati, cited above, as shown in Chennupati’s page 39, fig. 24: “Classes”, regarding the bike and horse clusters being in finer classes.
Regarding claim 12, claim 12 is rejected the same as claim 3. Thus argument presented in claim 3 is equally applicable to claim 12.








Regarding claim 13, Faktor as combined teaches the electronic device according to claim 11, wherein the operation of performing clustering analysis (said in terms of “each row” and “sepa-ration” “parts”, cited in the rejection of claim 1) on the confusion matrix (said or “a confusion matrix” as modified via the combination) to determine confusable classes (said or “category”) further comprises: 
performing spectral clustering analysis (said in terms of “each row” and “sepa-ration” “parts”, cited in the rejection of claim 1) on the confusion matrix (said or “a confusion matrix” as modified via the combination) to obtain a plurality of clusters (said via fig. 3: “Image Clusters”), wherein each of the plurality of the clusters (said via fig. 3: “Image Clusters”) comprises at least one class; and 
determining classes (said via “the car cluster contains 72 percent cars, 8 percent bicycles, 17 percent horses and 3 percent chairs”) in a cluster (said via fig. 3: “Image Clusters”) that comprises at least two classes as the confusable classes (said or “category”).
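For illustration only, the final limitation of claim 13 (treating the classes in any cluster that comprises at least two classes as the confusable classes) can be sketched as follows. The function name confusable_classes and the sample class names and cluster labels are hypothetical; they are not taken from the claims or from the cited references.

```python
from collections import defaultdict

def confusable_classes(class_names, cluster_labels):
    """Return the groups of classes that share a cluster with another class.

    Hypothetical helper: cluster_labels[i] is the cluster assigned to
    class_names[i], e.g. by spectral clustering of a confusion matrix.
    """
    clusters = defaultdict(list)
    for name, label in zip(class_names, cluster_labels):
        clusters[label].append(name)
    # A cluster holding a single class has nothing to be confused with.
    return [members for members in clusters.values() if len(members) >= 2]

# Made-up labels: "car" and "bicycle" share cluster 0; the rest are alone.
groups = confusable_classes(["car", "bicycle", "horse", "chair"], [0, 0, 1, 2])
print(groups)  # [['car', 'bicycle']]
```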









Claim 14 is rejected under 35 U.S.C. 103 as being unpatentable over Faktor et al. (“Clustering by Composition”-Unsupervised Discovery of Image Categories) in view of Isola et al. (Learning Visual Groups from Co-Occurrences in Space and Time) as applied in claims 1-4 and 7 above in view of Faktor et al. (“Clustering by Composition”-Unsupervised Discovery of Image Categories) as applied in claims 9,16 and 10 above further in view of Balntas et al. (BOLD - Binary online learned descriptor for efficient image matching).
Regarding claim 14, claim 14 is rejected the same as claim 5. Thus argument presented in claim 5 is equally applicable to claim 14.
Claim 15 is rejected under 35 U.S.C. 103 as being unpatentable over Faktor et al. (“Clustering by Composition”-Unsupervised Discovery of Image Categories) in view of Isola et al. (Learning Visual Groups from Co-Occurrences in Space and Time) as applied in claims 1-4 and 7 above further in view of Faktor et al. (“Clustering by Composition”-Unsupervised Discovery of Image Categories) as applied in claims 9,16 and 10 above further in view of Liong et al. (Deep Coupled Metric Learning for Cross-Modal Matching) and Boiman et al. (Similarity by Composition).
Regarding claim 15, claim 15 is rejected the same as claim 6. Thus argument presented in claim 6 is equally applicable to claim 15.





Suggestions
Applicant’s disclosure states, corresponding to suggested claim 2, below:
“[0009] In the image feature acquisition method disclosed in the embodiments of the present application, a classification model is trained by using preset classes of training images, and similar image pairs are determined by using the training images; classification results from the classification model are tested by using verification images to determine nonsimilar image pairs relatively confusable to the classification model; and the classification model is optimized based on the similar image pairs and the nonsimilar image pairs, and image features are acquired by using the optimized classification model, so that image expressiveness of the acquired image features can be effectively improved. Confusable product image classes are determined based on classification results of verification images from an initially trained classification model, and nonsimilar image pairs are constructed based on the confusable product image classes, so that similar image pairs and the nonsimilar image pairs may be used together as training samples to optimize the initially trained classification model, thereby obtaining more accurate feature expression of product images.”
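For illustration only, the construction of nonsimilar image pairs from confusable classes described in paragraph [0009] can be sketched as follows. The disclosure as quoted does not state the exact pairing rule, so this sketch assumes that a nonsimilar pair is any two verification images drawn from different classes within the same confusable group. The function name nonsimilar_pairs, the (image_id, class_label) tuple layout, and the sample identifiers are hypothetical.

```python
from itertools import combinations

def nonsimilar_pairs(verification_images, confusable_groups):
    """Build nonsimilar image pairs from verification images.

    Hypothetical helper. verification_images is a list of
    (image_id, class_label) tuples; confusable_groups is a list of groups of
    class labels found to be confusable. Assumed rule: pair two images only
    when they belong to different classes of the same confusable group.
    """
    pairs = []
    for group in confusable_groups:
        members = [(img, cls) for img, cls in verification_images if cls in group]
        for (id_a, cls_a), (id_b, cls_b) in combinations(members, 2):
            if cls_a != cls_b:
                pairs.append((id_a, id_b))
    return pairs

# Made-up verification set in which "horse" and "bike" are confusable.
images = [("v1", "horse"), ("v2", "bike"), ("v3", "horse"), ("v4", "chair")]
print(nonsimilar_pairs(images, [["horse", "bike"]]))
# [('v1', 'v2'), ('v2', 'v3')]
```

Under this assumed rule the pairs (v1, v2) and (v2, v3) are retained because each joins a horse image with a bike image, while (v1, v3) is skipped because both images carry the same class label.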

	In contrast, claim 2, last two lines states, “constructing the nonsimilar image pairs based on verification images”.
	In contrast, said Faktor (“Clustering by Composition”—Unsupervised Discovery of Image Categories) uses the confusion matrix for classifier evaluation and said Chennupati (Hierarchical Decomposition of Large Deep Networks) teaches image cluster “unweight pair group” (page 37, 3rd to last line),  Dp (Ci,Cj), based on the confusion matrix C, wherein Ci or Cj is a cluster. Thus applicant’s disclosed solution, as shown in fig. 4:310, to the problem of reduced expressiveness is an indication of non-obviousness in view of the cited art in the above rejections. 
Note that these suggestions are not provided with respect to overcoming 35 USC 101,112,102 and/or 103. These suggestions are mainly provided to seek out advantages in the disclosure regardless of 35 USC 101,112,102 and/or 103.

2. (Suggested) The method according to claim 1, wherein testing classification results from the siamese network-based classification model by using verification images to determine nonsimilar image pairs comprises:
classifying the verification images by using the siamese network-based classification model to obtain a confusion matrix;
performing clustering analysis on the confusion matrix to determine confusable classes; and
constructing the nonsimilar image pairs based on .
Conclusion
THIS ACTION IS MADE FINAL.  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action. 



Any inquiry concerning this communication or earlier communications from the examiner should be directed to DENNIS ROSARIO whose telephone number is (571)272-7397. The examiner can normally be reached Monday-Friday, 9AM-5PM EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Matthew Bella can be reached on (571)272-7778. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/DENNIS ROSARIO/Examiner, Art Unit 2667 

/MATTHEW C BELLA/Supervisory Patent Examiner, Art Unit 2667