DETAILED ACTION
Response to Amendment
The amendment was received on 4/28/21. Claims 1-8 are pending.
Claim Objections
Claim 7 is objected to because of the following informalities:  
Regarding claim 7, the last two lines are objected to for containing a double comma: “k-means algorithm ,, and a”.
Appropriate correction is required.
Claim Interpretation
The following is a quotation of 35 U.S.C. 112(f):
(f) Element in Claim for a Combination. – An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof. 

The following is a quotation of pre-AIA  35 U.S.C. 112, sixth paragraph:
An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof.

The claims in this application are given their broadest reasonable interpretation using the plain meaning of the claim language in light of the specification as it would be understood by one of ordinary skill in the art.  The broadest reasonable interpretation of a claim element (also commonly referred to as a claim limitation) is limited by the description in the specification when 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is invoked. 
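As explained in MPEP § 2181, subsection I, claim limitations that meet the following three-prong test will be interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph: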
(A)	the claim limitation uses the term “means” or “step” or a term used as a substitute for “means” that is a generic placeholder (also called a nonce term or a non-structural term having no specific structural meaning) for performing the claimed function; 
(B)	the term “means” or “step” or the generic placeholder is modified by functional language, typically, but not always linked by the transition word “for” (e.g., “means for”) or another linking word or phrase, such as “configured to” or “so that”; and 
(C)	the term “means” or “step” or the generic placeholder is not modified by sufficient structure, material, or acts for performing the claimed function. 
Use of the word “means” (or “step”) in a claim with functional language creates a rebuttable presumption that the claim limitation is to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites sufficient structure, material, or acts to entirely perform the recited function. 

Claim limitations in this application that use the word “means” (or “step”) are being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action. Conversely, claim limitations in this application that do not use the word “means” (or “step”) are not being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action.
This application includes one or more claim limitations that use the word “means” or “step” but are nonetheless not being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph because the claim limitation(s) recite(s) sufficient structure, materials, or acts to entirely perform the recited function.  Such claim limitation(s) is/are: 
“receiving…classifying…detecting…identifying…and identifying…by the computing device” in claim 1;

“determining…generating…and presenting…by the computing device” in claim 2;

“receiving…and generating…by the computing device” in claim 3; and

“receiving…verifying…and sending….by the computer device” in claim 4.


Because this/these claim limitation(s) is/are not being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, it/they is/are not being interpreted to cover only the corresponding structure, material, or acts described in the specification as performing the claimed function, and equivalents thereof.
If applicant intends to have this/these limitation(s) interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, applicant may:  (1) amend the claim limitation(s) to remove the structure, materials, or acts that performs the claimed function; or (2) present a sufficient showing that the claim limitation(s) does/do not recite sufficient structure, materials, or acts to perform the claimed function. Accordingly:
The claimed “by” (as in “receiving, by a computing device, an image”) is given its broadest reasonable interpretation, as one of ordinary skill in the art would understand it in light of applicant’s disclosure and the Dictionary.com definition reproduced below, wherein definitions 1-24 are equally applicable:
by	
preposition
1	near to or next to:
a home by a lake.
2	over the surface of, through the medium of, along, or using as a route:
He came by the highway. She arrived by air.
3	on, as a means of conveyance:
They arrived by ship.
4	to and beyond the vicinity of; past:
He went by the church.
5	within the extent or period of; during:
by day; by night.
6	not later than; at or before:
I usually finish work by five o'clock.
7	to the extent or amount of:
The new house is larger than the old one by a great deal. He's taller than his sister by three inches.
8	from the opinion, evidence, or authority of:
By his own account he was in Chicago at the time. I know him by sight.

9	according to; in conformity with:
This is a bad movie by any standards.
10	with (something) at stake; on:
to swear by all that is sacred.
11	through the agency, efficacy, work, participation, or authority of:
The book was published by Random House.
12	from the hand, mind, invention, or creativity of:
She read a poem by Emily Dickinson. The phonograph was invented by Thomas Edison.
13	in consequence, as a result, or on the basis of:
We met by chance. We won the game by forfeit.
14	accompanied with or in the atmosphere of:
Lovers walk by moonlight.
15	in treatment or support of; for:
He did well by his children.
16	after; next after, as of the same items in a series:
piece by piece; little by little.
17	(in multiplication) taken the number of times as that specified by the second number, or multiplier:
Multiply 18 by 57.
18	(in measuring shapes) having an adjoining side of, as a width relative to a length:
a room 10 feet by 12 feet.
19	(in division) separated into the number of equal parts as that specified by the second number, or divisor:
Divide 99 by 33.
20	in terms or amounts of; in measuring units of:
Apples are sold by the bushel. I'm paid by the week.
21	begot or born of:
Eve had two sons by Adam.
22	(of quadrupeds) having as a sire:
Equipoise II by Equipoise.
23	Navigation. (as used in the names of the 16 smallest points on the compass) one point toward the east, west, north, or south of N, NE, E, SE, S, SW, W, or NW, respectively:
He sailed NE by N from Pago Pago.
24	into, at, or to:
Come by my office this afternoon.

The claimed “screenshot”, used as an adjective (as in “the image is screenshot captured by the user device from a display” of claim 1), is interpreted as one of ordinary skill in the art would understand it in light of applicant’s disclosure and the Dictionary.com definition thereof:
screenshot
noun
1	Also called screen cap·ture, screen·cap.
a copy or image of what is seen on a computer monitor or other screen at a given time:
Save the screenshot as a graphics file.
verb (used with object) screen·shot or screen·shot·ted, screen·shot·ting.
2	to take a screenshot of:
You can screenshot the error message and send it to me.

BRITISH DICTIONARY DEFINITIONS FOR SCREENSHOT
screenshot
noun
1	an image created by copying part or all of the display on a computer screen at a particular moment, for example in order to demonstrate the use of a piece of software

The claimed “the object[[s]]” in claim 2, last line, is given its broadest reasonable interpretation in light of applicant’s disclosure, in the context of “multiple objects to be identified and located within the same image”, per applicant’s disclosure:
[0002] Humans are capable of looking at an image or watching a video and readily identifying, people, objects, scenes, and other visual details. Object recognition has become an ever increasingly important facet of modern technology. Object recognition, with respect to technology, is a computer vision technique for identifying objects in images or videos. Object recognition techniques may use various means to identify objects such as deep learning and machine learning algorithms. Further, object recognition techniques may be combined with object detection techniques. Object detection and object recognition are similar techniques for identifying objects, but they vary in their execution. Object detection is the process of finding instances of objects in images. In the case of deep learning, object detection is a subset of object recognition, where the object is not only identified but also located in an image. This allows for multiple objects to be identified and located within the same image.


The claimed “request” (as in “a request to acquire the object from the user device from one of the one or more sources” in claim 4) is interpreted in light of applicant’s disclosure and the Dictionary.com definition thereof, wherein definitions 1-5 are equally applicable:
request, noun
1	the act of asking for something to be given or done, especially as a favor or courtesy; solicitation or petition:
At his request, they left.
2	an instance of this:
There have been many requests for the product.
3	a written statement of petition:
If you need supplies, send in a request.
4	something asked for:
to obtain one's request.
5	the state of being asked for; demand.

Response to Arguments
Applicant’s arguments, see remarks, pages 5-7, filed 4/28/21, with respect to the claim objection and the 35 USC 112(b) rejection have been fully considered and are persuasive.  The objection to claims 2-4 and 8 set forth in the Office action of 1/29/21, page 2, has been withdrawn. Likewise, the 35 USC 112(b) rejection of claim 2 set forth in the Office action of 1/29/21, page 10, has been withdrawn.
Applicant states on page 6:
“Applicant respectfully requests this provisional rejection be held in abeyance until no other rejections remain, since the instant case as of the time of this paper does not include allowable claims, and since the claims in the instant case may be amended prior to allowance in such a way to obviate any such rejection.”

In response, the examiner notes MPEP 804 I B. Between Copending Applications—Provisional Rejections, 2nd paragraph:
“A provisional double patenting rejection should continue to be made by the examiner until the rejection has been obviated or is no longer applicable except as noted below.”

Thus, the examiner continues to make the double patenting rejection, as shown below, until the rejection has been obviated or is no longer applicable.

Applicant’s arguments, see remarks, pages 7-9, filed 4/28/21, with respect to the rejections of claims 1, 2, 5 and 6 under 35 USC 102 and of claims 3, 4, 6, 7 and 8 under 35 USC 103 in the Office action of 1/29/21, pages 15-36, have been fully considered and are persuasive.  Therefore, the rejections have been withdrawn.
However, upon further consideration, a new ground of rejection is made under 35 USC 103 in view of Oramas et al. (MULTI-LABEL MUSIC GENRE CLASSIFICATION FROM AUDIO, TEXT, AND IMAGES USING DEEP FEATURES) and further in view of Xue et al. (Deep Texture Manifold for Ground Terrain Recognition). Oramas teaches inputting images, text and audio (different modalities, or ways of inputting, like eyes, ears, receptors, smell, a light sensor, a radiation detector, an x-ray detector, a UV CCD, an IR CCD, a photodetector, an optical sensor, etc.) into a learning classifier (one that, by analogy, has eyes and ears and can read). The inner, brain-like processing of the “eyes” can be viewed via an accelerated television, Barnes-Hut t-SNE as taught by Xue (a screenshot of which is shown in Oramas: fig. 2: fast version after modification), instead of via a television with lag, t-SNE (a screenshot of which is shown in Oramas: fig. 2: slow version before modification).

Double Patenting
The nonstatutory double patenting rejection is based on a judicially created doctrine grounded in public policy (a policy reflected in the statute) so as to prevent the unjustified or improper timewise extension of the “right to exclude” granted by a patent and to prevent possible harassment by multiple assignees. A nonstatutory double patenting rejection is appropriate where the conflicting claims are not identical, but at least one examined application claim is not patentably distinct from the reference claim(s) because the examined application claim is either anticipated by, or would have been obvious over, the reference claim(s). See, e.g., In re Berg, 140 F.3d 1428, 46 USPQ2d 1226 (Fed. Cir. 1998); In re Goodman, 11 F.3d 1046, 29 USPQ2d 2010 (Fed. Cir. 1993); In re Longi, 759 F.2d 887, 225 USPQ 645 (Fed. Cir. 1985); In re Van Ornum, 686 F.2d 937, 214 USPQ 761 (CCPA 1982); In re Vogel, 422 F.2d 438, 164 USPQ 619 (CCPA 1970); In re Thorington, 418 F.2d 528, 163 USPQ 644 (CCPA 1969).
A timely filed terminal disclaimer in compliance with 37 CFR 1.321(c) or 1.321(d) may be used to overcome an actual or provisional rejection based on nonstatutory double patenting provided the reference application or patent either is shown to be commonly owned with the examined application, or claims an invention made as a result of activities undertaken within the scope of a joint research agreement. See MPEP § 717.02 for applications subject to examination under the first inventor to file provisions of the AIA as explained in MPEP § 2159. See MPEP § 2146 et seq. for applications not subject to examination under the first inventor to file provisions of the AIA. A terminal disclaimer must be signed in compliance with 37 CFR 1.321(b).
The USPTO Internet website contains terminal disclaimer forms which may be used. Please visit www.uspto.gov/patent/patents-forms. The filing date of the application in which the form is filed determines what form (e.g., PTO/SB/25, PTO/SB/26, PTO/AIA/25, or PTO/AIA/26) should be used. A web-based eTerminal Disclaimer may be filled out completely online using web-screens. An eTerminal Disclaimer that meets all requirements is auto-processed and approved immediately upon submission. For more information about eTerminal Disclaimers, refer to www.uspto.gov/patents/process/file/efs/guidance/eTD-info-I.jsp.
Claims 1-5, 7 and 8 are provisionally rejected on the ground of nonstatutory double patenting as being unpatentable over claims 9, 10, 12-17, 19 and 20 of copending Application No. 16/354,240 (Mathada et al., US Patent App. Pub. No.: US 2020/0293819 A1).
Specifically, the more specific pending claims 9, 10, 12-17, 19 and 20 of copending Application No. 16/354,240 anticipate claims 1-5, 7 and 8 of application 16/460,286.
For example, all limitations in claim 1 of application 16/460,286 are present in, and thus anticipated by, the more specific pending claim 9 of copending Application No. 16/354,240.
Thus, claims 2-5, 7 and 8 are rejected under an analysis similar to that for claim 1.
This is a provisional nonstatutory double patenting rejection.

Claim 6 is provisionally rejected on the ground of nonstatutory double patenting as being unpatentable over claims 9, 10, 12-17, 19 and 20 of copending Application No. 16/354,240 (Mathada et al., US Patent App. Pub. No.: US 2020/0293819 A1) in view of Yang et al. (US Patent App. Pub. No.: US 2014/0250110 A1).
Regarding claim 6, claims 9, 10, 12-17, 19 and 20 of copending Application No. 16/354,240 do not teach claim 6’s “a saliency detection algorithm”. However, Yang teaches the saliency detection algorithm (“a saliency detection algorithm”) via:
“[0048] The visual analysis component 110 may analyze the perceptual quality of the labeled image by determining the brightness, the contrast, the colorfulness, the sharpness, and/or the blur of the labeled image 308.  In an example implementation, to determine the brightness and the contrast, the mean (brightness) and standard deviation (contrast) of pixel intensity in gray are analyzed, though other conventional techniques may also be employed.  Colorfulness may be determined by analyzing the mean and standard deviation of saturation and hue, or a contrast of colors, for example.  Meanwhile, sharpness may be determined by, for example, a mean and standard deviation of a Laplacian image normalized by local average luminance.  Blur may be determined by, for example, frequency distribution of an image transformed according to a Fast Fourier Transform (FFT).  In addition to analyzing perceptual quality features such as brightness, colorfulness, sharpness, and blur, the visual analysis component 100 may apply a saliency detection algorithm to the labeled image 308.  Saliency detection extracts features of objects in images that are distinct and representative.  For instance, the visual analysis component 100 may apply the saliency detection algorithm to extract features over the whole image with pixel values reweighted by a saliency map (e.g., an image of extracted saliency features indicating a saliency of a corresponding region or point).  Alternatively, the visual analysis component 110 may apply the saliency detection algorithm over a subject region in the image.  For instance, the subject region may be detected by a minimal bounding box that contains 90% mass of all saliency weights in order to determine lighting, color, and sharpness of the saliency map reweighted image.”
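For illustration only, and not as a characterization of Yang’s actual implementation, two of the operations described in the quoted paragraph [0048] can be sketched as follows: brightness and contrast as the mean and standard deviation of gray-level intensity, and a subject region as the smallest box holding 90% of the saliency mass. The function names, the nested-list image representation, and the exhaustive search (practical only for small maps) are assumptions made for this sketch.

```python
from statistics import mean, pstdev


def brightness_and_contrast(gray):
    """Return (mean, population std) of the gray levels in a 2-D list."""
    pixels = [p for row in gray for p in row]
    return mean(pixels), pstdev(pixels)


def salient_box(saliency, mass=0.9):
    """Smallest-area box (top, left, bottom, right), inclusive, whose
    saliency sum is at least `mass` of the total. Exhaustive search."""
    h, w = len(saliency), len(saliency[0])
    # Integral image: integral[r][c] = sum of saliency[0..r-1][0..c-1].
    integral = [[0.0] * (w + 1) for _ in range(h + 1)]
    for r in range(h):
        for c in range(w):
            integral[r + 1][c + 1] = (saliency[r][c] + integral[r][c + 1]
                                      + integral[r + 1][c] - integral[r][c])
    target = mass * integral[h][w]
    best = None
    for top in range(h):
        for bottom in range(top, h):
            for left in range(w):
                for right in range(left, w):
                    box_sum = (integral[bottom + 1][right + 1]
                               - integral[top][right + 1]
                               - integral[bottom + 1][left]
                               + integral[top][left])
                    area = (bottom - top + 1) * (right - left + 1)
                    if box_sum >= target - 1e-12 and (best is None or area < best[0]):
                        best = (area, (top, left, bottom, right))
    return best[1]
```

For a 4x4 saliency map whose weight is concentrated in the central 2x2 block, `salient_box` returns the coordinates of that block.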
	
	Thus, one of ordinary skill in the art of saliency can modify the “salient object” of copending claims 9 and 16 with Yang’s teaching of the “saliency detection algorithm” by applying the algorithm to the image of claims 9 and 16, and recognize that the modification is predictable (or looked forward to) because it results in detecting a clear/sharp/definite (i.e., salient) image with “simplicity” (Yang, cited below in another rejection of claim 6), such that superposing the images of claims 9 and 16 is easier than with non-clear/sharp/definite images.
This is a provisional nonstatutory double patenting rejection.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Regarding inquiry 4, see Suggestions.
Claims 1, 2, 5, 6 and 7 are rejected under 35 U.S.C. 103 as being unpatentable over IDS-cited Rhoads et al. (US Patent App. Pub. No.: US 2014/0080428 A1) in view of Oramas et al. (MULTI-LABEL MUSIC GENRE CLASSIFICATION FROM AUDIO, TEXT, AND IMAGES USING DEEP FEATURES) and further in view of Xue et al. (Deep Texture Manifold for Ground Terrain Recognition). Claim 6 is rejected twice.

Regarding claim 1, Rhoads teaches a method for object detection and identification, the method comprising:
receiving (“received”, cited below: [0566], as indicated in figs. 1 and 3: zig-zag lines), by (as indicated in fig. 3) a computing device (fig. 3: antenna: “CROWN CASTLE AMERICAN TOWER SBA COMM…”), an image (the “image” of “Godzilla”, cited below: [0015]) from a user device (“FIG. 0”: box with buttons), wherein the image is screenshot captured by the user device (said fig. 0: box with buttons) from a display (comprised by said fig. 0: box with buttons);
classifying (resulting in “categories”), by (as indicated in said fig. 3) the computing device (said fig. 3: antenna: “CROWN CASTLE AMERICAN TOWER SBA COMM…”), the image (said “image” of “Godzilla”), wherein the image (said “image” of “Godzilla”) is classified (into said “categories”) based on features (“image features/characteristics/metrics”) present in the image (said “image” of “Godzilla”);
detecting, by (as indicated in said fig. 3) the computing device (said fig. 3: antenna: “CROWN CASTLE AMERICAN TOWER SBA COMM…”), an object (comprised by “objects in the captured frame”) contained within the image (said “image” of “Godzilla”), wherein the object (said object comprised by “objects in the captured frame”) is a salient (via “salient” “feature” “metrics”) object (comprised by said “objects in the captured frame”, as measured via said “salient” “feature” “metrics”);
identifying (such that “each object is identified”), by (as indicated in said fig. 3) the computing device (said fig. 3: antenna: “CROWN CASTLE AMERICAN TOWER SBA COMM…”), the object (comprised by said “objects in the captured frame”) in the image (said “image” of “Godzilla”), wherein the object is identified (such that said “each object is identified”) using multi-modal learning techniques, and wherein the multi-modal learning techniques comprise a Barnes-Hut approximation; and
identifying (via either an “addressing” “scheme” or “identified” “sources”), by (as indicated in said fig. 3) the computing device (said fig. 3: antenna: “CROWN CASTLE AMERICAN TOWER SBA COMM…”), one or more sources (“the cloud resource” via said “addressing” “scheme”, represented in fig. 10A: “oval on the left”, or “Collections…and other content…resources can also serve as” “identified” “sources”) of the object (comprised by said “objects in the captured frame”) in the image (said “image” of “Godzilla”, via:
“[0014] Certain aspects of the technology detailed herein are introduced in FIG. 0.  A user's mobile phone captures imagery (either in response to user command, or autonomously), and objects within the scene are recognized.  Information associated with each object is identified, and made available to the user through a scene-registered interactive visual "bauble" that is graphically overlaid on the imagery.  The bauble may itself present information, or may simply be an indicia that the user can tap at the indicated location to obtain a lengthier listing of related information, or launch a related function/application.”;

“[0015] In the illustrated scene, the camera has recognized the face in the foreground as "Bob" and annotated the image accordingly.  A billboard promoting the Godzilla movie has been recognized, and a bauble saying "Show Times" has been blitted onto the display--inviting the user to tap for screening information.”;

“[0105] Relatedly, it seems that there should be a common denominator set of "device-side" operations performed on visual data that will serve all cloud processes, including certain formatting, elemental graphic processing, and other rote operations.  Similarly, it seems there should be a standardized basic header and addressing scheme for the resulting communication traffic (typically packetized) back and forth with the cloud.”
“[0125] Elements of the foregoing are distilled in FIG. 10A, showing an implementation of aspects of the technology as a physical matter of (usually) software components.  The two ovals in the figure highlight the symmetric pair of software components which are involved in setting up a "human real-time" visual recognition session between a mobile device and the generic cloud or service providers, data associations and visual query results.  The oval on the left refers to "keyvectors" and more specifically "visual keyvectors." As noted, this term can encompass everything from simple JPEG compressed blocks all the way through log-polar transformed facial feature vectors and anything in between and beyond.  The point of a keyvector is that the essential raw information of some given visual recognition task has been optimally pre-processed and packaged (possibly compressed).  The oval on the left assembles these packets, and typically inserts some addressing information by which they will be routed.  (Final addressing may not be possible, as the packet may ultimately be routed to remote service providers--the details of which may not yet be known.) Desirably, this processing is performed as close to the raw sensor data as possible, such as by processing circuitry integrated on the same substrate as the image sensor, which is responsive to software instructions stored in memory or provided from another stage in packet form.”

“[0451] In turn, the cloud resource may alert the cell phone of any information it expects might be requested from the phone in performance of the expected operation, or action it might request the cell phone to perform, so that the cell phone can similarly anticipate its own forthcoming actions and prepare accordingly.  For example, the cloud process may, under certain conditions, request a further set of input data, such as if it assesses that data originally provided is not sufficient for the intended purpose (e.g., the input data may be an image without sufficient focus resolution, or not enough contrast, or needing further filtering).  Knowing, in advance, that the cloud process may request such further data can allow the cell phone to consider this possibility in its own operation, e.g., keeping processing modules configured in a certain filter manner longer than may otherwise be the case, reserving an interval of sensor time to possibly capture a replacement image, etc.”

“[0472] Collections of publicly-available imagery and other content are becoming more prevalent.  Flickr, YouTube, Photobucket (MySpace), Picasa, Zooomr, FaceBook, Webshots and Google Images are just a few.  Often, these resources can also serve as sources of metadata--either expressly identified as such, or inferred from data such as file names, descriptions, etc. Sometimes geo-location data is also available.”;

“[0477] After feature metrics for the image are determined, a search is conducted through one or more publicly-accessible image repositories for images with similar metrics, thereby identifying apparently similar images.  (As part of its image ingest process, Flickr and other such repositories may calculate eigenvectors, color histograms, keypoint descriptors, FFTs, or other classification data on images at the time they are uploaded by users, and collect same in an index for public search.) The search may yield the collection of apparently similar telephone images found in Flickr, depicted in FIG. 22.”;

“[0296] She touches the virtual shutter button, capturing a frame of high resolution imagery, and image analysis gets underway--trying to recognize what's in the field of view, so that the camera application can overlay graphical links related to objects in the captured frame.  (Or this may happen without user action--the camera may be watching proactively.)”;

“[0566] An illustrative usage model is as follows.  A system responds to an image 128 (either optically captured or wirelessly received) by displaying a collection of related images to the user, on the cell phone display.  For example, the user captures an image and submits it to a remote service.  The service determines image metrics for the submitted image (possibly after pre-processing, as detailed above), and searches (e.g., Flickr) for visually similar images.  These images are transmitted to the cell phone (e.g., by the service, or directly from Flickr), and they are buffered for display.  The service can prompt the user, e.g., by instructions presented on the display, to repeatedly press the right-arrow button 116b on the four-way controller (or press-and-hold) to view a sequence of pattern-similar images (130, FIG. 45A).  Each time the button is pressed, another one of the buffered apparently-similar images is displayed.”; 
and

“[0663] A fixed set of image assessment criteria can be applied to distinguish images in the three categories.  However, the detailed embodiment determines such criteria adaptively.  In particular, this embodiment examines the set of images and determines which image features/characteristics/metrics most reliably (1) group like-categorized images together (similarity); and (2) distinguish differently-categorized images from each other (difference).  Among the attributes that may be measured and checked for similarity/difference behavior within the set of images are dominant color; color diversity; color histogram; dominant texture; texture diversity; texture histogram; edginess; wavelet-domain transform coefficient histograms, and dominant wavelet coefficients; frequency domain transfer coefficient histograms and dominant frequency coefficients (which may be calculated in different color channels); eigenvalues; keypoint descriptors; geometric class probabilities; symmetry; percentage of image area identified as facial; image autocorrelation; low-dimensional "gists" of image; etc. (Combinations of such metrics may be more reliable than the characteristics individually.)
[0664] One way to determine which metrics are most salient for these purposes is to compute a variety of different image metrics for the reference images.  If the results within a category of images for a particular metric are clustered (e.g., if, for place-centric images, the color histogram results are clustered around particular output values), and if images in other categories have few or no output values near that clustered result, then that metric would appear well suited for use as an image assessment criteria.  (Clustering is commonly performed using an implementation of a k-means algorithm.)”).
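For illustration only, the metric-selection idea quoted from Rhoads [0663]-[0664] can be sketched as follows: each candidate metric is clustered with a small k-means routine, and the metric whose clusters line up best with the known image categories is treated as the most salient. The function names, the scalar (one value per image) metrics, the deterministic centroid initialization, and the purity score are assumptions made for this sketch and are not taken from Rhoads.

```python
from collections import Counter


def kmeans_1d(values, k=2, iters=20):
    """Lloyd's k-means on scalars (k >= 2); centroids start evenly
    spaced between the minimum and maximum value."""
    lo, hi = min(values), max(values)
    centroids = [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    labels = []
    for _ in range(iters):
        # Assignment step: nearest centroid for every value.
        labels = [min(range(k), key=lambda j: abs(v - centroids[j]))
                  for v in values]
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [v for v, lab in zip(values, labels) if lab == j]
            if members:
                centroids[j] = sum(members) / len(members)
    return labels


def purity(labels, categories):
    """Fraction of items that belong to the majority category of their cluster."""
    hits = 0
    for cluster in set(labels):
        cats = [c for lab, c in zip(labels, categories) if lab == cluster]
        hits += Counter(cats).most_common(1)[0][1]
    return hits / len(categories)


def most_salient_metric(metrics, categories, k=2):
    """metrics maps a metric name to one value per image; returns the
    name whose k-means clusters agree best with the categories."""
    return max(metrics,
               key=lambda name: purity(kmeans_1d(metrics[name], k), categories))
```

A metric whose values separate cleanly by category (purity 1.0) is preferred over one whose values interleave across categories.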

Thus, Rhoads does not teach, as indicated in bold above, the claimed:
using multi-modal learning techniques, and wherein the multi-modal learning techniques comprise a Barnes-Hut approximation.

Oramas, however, teaches the following limitation of claim 1:
using (or “exploit”) multi-modal learning techniques (via a “multimodal…learning approach”), and wherein the multi-modal learning techniques (said “multimodal…learning approach”) comprise a Barnes-Hut approximation (via:
section 1, INTRODUCTION, 2nd paragraph:
“To this end, we present MuMu, a new large-scale multimodal dataset for multi-label music genre classification. MuMu contains information of roughly 31k albums classified into one or more 250 genre classes. For every album we analyze the cover image, text reviews, and audio tracks, with a total number of approximately 147k audio tracks and 447k album reviews. Furthermore, we exploit this dataset with a novel deep learning approach to learn multiple genre labels for every album using different data modalities (i.e., audio, text, and image). In addition, we combine these modalities to study how the different combinations behave.”).
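For illustration only, combining “different data modalities (i.e., audio, text, and image)” for multi-label classification can be sketched as a simple feature-level fusion followed by a per-label threshold. This is not Oramas’ network architecture; the function names, the L2 normalization, the concatenation order, and the 0.5 threshold are all assumptions made for this sketch.

```python
import math


def l2_normalize(vec):
    """Scale a feature vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def fuse(modalities):
    """Concatenate the normalized feature vectors of every modality,
    visiting modality names in sorted order for reproducibility."""
    fused = []
    for name in sorted(modalities):
        fused.extend(l2_normalize(modalities[name]))
    return fused


def predict_labels(scores, threshold=0.5):
    """Multi-label decision: keep every label whose score reaches the threshold."""
    return sorted(label for label, s in scores.items() if s >= threshold)
```

Because each label is thresholded independently, one album can receive several genre labels at once, which is what distinguishes multi-label from single-label classification.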
	
Thus, one of ordinary skill in the art of vector classification and of image/video/audio metrics used in recognizing patterns, as indicated in Rhoads:
“[0476] (Uses of vector characterizations/classifications and other image/video/audio metrics in recognizing faces, imagery, video, audio and other patterns are well known and suited for use in connection with certain embodiments of the present technology.  See, e.g., patent publications 20060020630 and 20040243567 (Digimarc), 20070239756 and 20020037083 (Microsoft), 20070237364 (Fuji Photo Film), U.S. Pat. No. 7,359,889 and U.S. Pat. No. 6,990,453 (Shazam), 20050180635 (Corel), U.S. Pat. No. 6,430,306, U.S. Pat. No. 6,681,032 and 20030059124 (L-1 Corp.), U.S. Pat. No. 7,194,752 and U.S. Pat. No. 7,174,293 (Iceberg), U.S. Pat. No. 7,130,466 (Cobion), U.S. Pat. No. 6,553,136 (Hewlett-Packard), and U.S. Pat. No. 6,430,307 (Matsushita), and the journal references cited at the end of this disclosure.  When used in conjunction with recognition of entertainment content such as audio and video, such features are sometimes termed content ‘fingerprints’ or ‘hashes.’)”

can modify Rhoads’ said “each object is identified” with Oramas’ said “multimodal…learning approach” by:
a)	having “Jane” (Rhoads: cited below and fig. 62) and BOB (Rhoads: fig. 0: “BOB” and fig. 62) go see Godzilla! and attend a “Paul Simon” (Rhoads: cited below) concert;

b)	making Rhoads’ “object” of said “each object is identified” be the “image” of the “different data modalities (i.e., audio, text, and image)” of Oramas;
c)	making Rhoads’ “ ‘Jane's review: Pretty Good!’ ” be the “text” of said “different data modalities (i.e., audio, text, and image)” of Oramas via Rhoads:

“[0016] The phone has recognized the user's car from the scene, and has also identified--by make and year--another vehicle in the picture.  Both are noted by overlaid text.  A restaurant has also been identified, and an initial review from a collection of reviews ("Jane's review: Pretty Good!") is shown.  Tapping brings up more reviews.”

d)	making Rhoads’ fig. 20A: “Image Classification” or Rhoads’ fig. 20A: “Image/Facial Recognition” be as Oramas’ “classification from these images” via Oramas:
“5.3 Image-based Approach 
Every album in the dataset has an associated cover art image. To perform music genre classification from these images, we use Deep Residual Networks (ResNets) [11]. They are the state-of-the-art in various image classification tasks like Imagnet [35] and Microsoft COCO [19]. ResNet is a common feed-forward CNN with residual learning, which consists on bypassing two or more convolution layers. We employ a slightly modified version of the original ResNet 5 : the scaling and aspect ratio augmentation are obtained from [41], the photometric distortions from [12], and weight decay is applied to all weights and biases. The network we use is composed of 101 layers (ResNet101), initialized with pretrained parameters learned on ImageNet. This is our starting point to finetune the network on the genre classification task. Our ResNet implementation has a logistic regression final layer with sigmoid activations and uses the binary cross entropy loss.”;

e)	inputting Rhoads’ “ ‘Jane's review: Pretty Good!’ ” to “genre classification from text” via Oramas:

“5.2 Text-based Approach 
In the presented dataset, each album has a variable number of customer reviews. We use an approach similar to [13, 29] for genre classification from text, where all reviews from the same album are aggregated into a single text. The aggregated result is truncated at 1000 characters, thus balancing the amount of text per album, as more popular artists tend to have a higher number of reviews. Then we apply a Vector Space Model approach (VSM) with tfidf weighting [47] to create a feature vector for each album. Although word embeddings [25] with CNNs are state-of-the-art in many text classification tasks [15], a traditional VSM approach is used instead, as it seems to perform better when dealing with large texts [31]. The vocabulary size is limited to 10k as it was a good balance of network complexity and accuracy.”

f)	classifying or recognizing via said “multimodal…learning approach” based on the image of the “object” and Rhoads’ “ ‘Jane's review: Pretty Good!’ ”; and
g)	making a similar modification regarding “collection of reviews” (Rhoads: cited above [0016]) and “iTunes” and “image of Paul Simon” via Rhoads:
“[0151] As another example, consider a Facebook user who has earned, or paid for, or otherwise received credit that can be applied to certain services--such as for downloading songs from iTunes, or for music recognition services, or for identifying clothes that go with particular shoes (for which an image has been submitted), etc. These services may be associated with the particular Facebook page, so that friends can invoke the services from that page--essentially spending the host's credit (again, with suitable authorization or invitation by that hosting user).  Likewise, friends may submit images to a facial recognition service accessible through an application associated with the user's Facebook page.  Images submitted in such fashion are analyzed for faces of the host's friends, and identification information is returned to the submitter, e.g., through a user interface presented on the originating Facebook page.  Again, the host may be assessed a fee for each such operation, but may allow authorized friends to avail themselves of such service at no cost.”; and

“[0750] In another example, a first user snaps an image of Paul Simon at a concert.  The system automatically posts the image to the user's Flickr account--together with metadata inferred by the procedures detailed above.  (The name of the artist may have been found in a search of Google for the user's geolocation; e.g., a Ticketmaster web page revealed that Paul Simon was playing that venue that night.) The first user's picture, a moment later, is encountered by a system processing a second concert-goer's photo of the same event, from a different vantage.  The second user is shown the first user's photo as one of the system's responses to the second photo.  The system may also alert the first user that another picture of the same event--from a different viewpoint--is available for review on his cell phone, if he'll press a certain button twice.”; 

and

h)	recognizing that the combination is predictable or looked forward to because the modification “improves the results” or achieves the “best” results (as shown in Oramas’ Table 2, in section 6.1 Audio Classification, showing different types of audio, text and image classifications) regarding “how accurate the classification is” and is more accurate than image classification or recognition alone, thus providing improved classification accuracy, with respect to “single modality approaches”, of the “object”, such as Rhoads’ fig. 0: “GODZILLA!” or Paul Simon, in the context of Rhoads’ “ ‘Jane's review: Pretty Good!’ ” via Oramas: 
section 4.2 Evaluation Metrics, 2nd paragraph:
“The output of a multi-label classifier is a label-item matrix. Thus, it can be evaluated either from the labels or the items perspective. We can measure how accurate the classification is for every label, or how well the labels are ranked for every item. In this work, the former point of view is evaluated with the AUC measure, which is computed for every label and then averaged. We are interested in classification models that strengthen the diversity of label assignments. As the taxonomy is composed of broad genres which are over-represented in the dataset (see Table 1), and more specific subgenres (e.g., Vocal Jazz, Britpop), we want to measure whether the classifier is focusing only on over-represented genres, or on more fine-grained ones. To this end, catalog coverage (also known as aggregated diversity) is an evaluation measure used in the extreme multi-label classification [14] and the recommender systems [32] communities. Coverage@k measures the percentage of normalized unique labels present in the top k predictions made by an algorithm across all test items. Values of k = 1, 3, 5 are typically employed in multi-label classification.”; and


section 6.4 Multimodal Classification, 2nd paragraph:
“Results suggest that the combination of modalities outperforms single modality approaches. As image features are learned using a LOGISTIC configuration, they seem to improve multimodal approaches with LOGISTIC configuration only. Multimodal approaches that include text features tend to improve the results. Nevertheless, the best approaches are those that exploit the three modalities of MuMu. COSINE approaches have similar AUC than LOGISTIC approaches but a much better catalog coverage, thanks to the spatial properties of the factor space.”.
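For illustration only, the Coverage@k measure described in the quoted Oramas section 4.2 can be sketched as the fraction of the label catalog that appears in at least one item's top-k predictions. The function name, the score matrix, and the choice to normalize by the total number of labels are assumptions made here and are not taken from Oramas.

```python
def coverage_at_k(scores, num_labels, k):
    """Fraction of the label catalog appearing in any item's top-k predictions.

    scores: one list of per-label scores for each test item.
    """
    seen = set()
    for item_scores in scores:
        # Rank label indices by descending score and keep the k best.
        ranked = sorted(range(num_labels), key=lambda j: item_scores[j], reverse=True)
        seen.update(ranked[:k])
    return len(seen) / num_labels


# Hypothetical scores for three test items over a four-label catalog.
example_scores = [
    [0.9, 0.1, 0.3, 0.2],
    [0.8, 0.2, 0.1, 0.7],
    [0.6, 0.5, 0.1, 0.2],
]
coverage_top1 = coverage_at_k(example_scores, num_labels=4, k=1)
coverage_top2 = coverage_at_k(example_scores, num_labels=4, k=2)
```

In this assumed data every item's top prediction is the same label, so coverage at k = 1 is low, which corresponds to the quoted concern about a classifier focusing only on over-represented genres.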

	Thus, the combination does not teach, as indicated in bold above, the claimed “comprise a Barnes-Hut approximation”. Accordingly, Xue teaches:
comprise a Barnes-Hut approximation (or “Barnes-Hut tSNE” “to approximate the embedded distribution” via:
pages 2,3:
“The t-Distributed Stochastic Neighbor Embedding (tSNE) [20] provides a 2D embedding and Barnes-Hut tSNE [33] accelerates the original t-SNE from O(n^2) to O(n log n). Both t-SNE and and Barnes-Hut t-SNE are non-parametric embedding algorithms, so there is no natural way to perform out-of-sample extension. Parametric t-SNE [32] and supervised t-SNE [23, 24] introduce deep neural networks into data embedding and realize non-linear parametric embedding. Inspired by this work, we introduce a method for texture manifolds that treats the embedded distribution from non-parametric embedding algorithms as an output, and use a deep neural network to predict the manifold coordinates of a texture image directly. This texture manifold uses the features of the DEP network and is referred to as DEP-manifold.”

page 7:
“5. Texture Manifold
Inspired by Parametric t-SNE [32] and supervised t-SNE [23, 24], we introduce a parametric texture manifold approach that learns to approximate the embedded distribution of non-parametric embedding algorithms [20, 33] using a deep neural network to directly predict the 2D manifold coordinates for the texture images. We refer to this manifold learning method using DEP feature embedding as DEP-manifold. Following prior work [24,32], the deep neural network structure is depicted in Figure 6. Input features are the feature maps before the classification layer of DEP, which means each image is represented by a 128 dimensional vector. Unlike the experiment in [24, 32], we add non-linear functions (Batch Normalization and ReLU) before fully connected layers, and we do not pre-train the network with a stack of Restricted Boltzmann Machines (RBMs) [13]. We train the embedding network from scratch instead of the three-stage training procedure (pre-training, construction and fine-tuning) in parametric t-SNE and supervised t-SNE. We randomly choose 60000 images from the multi-scale GTOS dataset for the experiment. We experiment with DEP-parametric t-SNE, and DEP-manifold based on outputs from the last fully connected layer of DEP.”).
	Thus, one of ordinary skill in the art of t-SNE can modify Rhoads’ said “each object is identified” as modified via the combination with Xue’s teaching of Barnes-Hut t-SNE by:
a)	using said “Barnes-Hut t-SNE” instead of “the original t-SNE”; and
b)	recognizing that the modification is predictable or looked forward to because the modification “accelerates the original t-SNE” (Xue: cited above), thus providing an “informative” “visual style” that gives information or is instructive, as shown in Oramas’ fig. 2: “Particular of the t-SNE…”, regarding visual style, faster than originally “using t-SNE” via Oramas, section 6.3 Image Classification, 2nd paragraph:
“In Figure 2 a set of cover images of five of the most frequent genres in the dataset is shown using t-SNE over the obtained image feature vectors. In the left top corner the ResNet recognizes women faces on the foreground, which seems to be common in Country albums (red). The jazz albums (green) on the right are all clustered together probably thanks to the uniform type of clothing worn by the people of their covers. Therefore, the visual style of the cover seems to be informative when recognizing the album genre. For instance, many classical music albums include an instrument in the cover, and Dance & Electronics covers are often abstract images with bright colors, rarely including human faces.”.
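For illustration only, the Barnes-Hut idea behind the acceleration quoted from Xue can be sketched as follows: a quadtree summarizes distant groups of embedded points by their center of mass, so that a sum over all points of the Student-t kernel 1 / (1 + d^2) used in t-SNE is approximated without visiting every point. This is a simplified sketch and is not the implementation of Xue, Oramas, or any cited reference; the class and function names, the unit-square domain, the assumption of distinct points, and the threshold parameter `theta` are hypothetical.

```python
import random


class Node:
    """Quadtree cell storing the point count and center of mass of its contents."""

    def __init__(self, cx, cy, half):
        self.cx, self.cy, self.half = cx, cy, half  # cell center and half-width
        self.count = 0
        self.mx = self.my = 0.0  # running center of mass
        self.point = None        # the single point held by a leaf
        self.children = None

    def insert(self, p):
        # Update the center of mass incrementally, then place the point.
        self.mx = (self.mx * self.count + p[0]) / (self.count + 1)
        self.my = (self.my * self.count + p[1]) / (self.count + 1)
        self.count += 1
        if self.count == 1:
            self.point = p
            return
        if self.children is None:
            # Split the cell and push the previously stored point down.
            h = self.half / 2
            self.children = [Node(self.cx + dx * h, self.cy + dy * h, h)
                             for dx in (-1, 1) for dy in (-1, 1)]
            old, self.point = self.point, None
            self._child_for(old).insert(old)
        self._child_for(p).insert(p)

    def _child_for(self, p):
        return self.children[2 * (p[0] >= self.cx) + (p[1] >= self.cy)]


def kernel_sum(node, q, theta):
    """Approximate the sum over stored points p != q of 1 / (1 + |q - p|^2)."""
    if node.count == 0:
        return 0.0
    dx, dy = q[0] - node.mx, q[1] - node.my
    d2 = dx * dx + dy * dy
    if node.children is None:
        # Leaf cell: skip the query point itself.
        return 0.0 if node.point == q else 1.0 / (1.0 + d2)
    # Barnes-Hut criterion: a far, small cell is treated as one weighted point.
    if (2 * node.half) ** 2 < theta * theta * d2:
        return node.count / (1.0 + d2)
    return sum(kernel_sum(child, q, theta) for child in node.children)


def exact_sum(points, q):
    """Brute-force reference that visits every point."""
    return sum(1.0 / (1.0 + (q[0] - p[0]) ** 2 + (q[1] - p[1]) ** 2)
               for p in points if p != q)


random.seed(0)
points = [(random.random(), random.random()) for _ in range(200)]
root = Node(0.5, 0.5, 0.5)  # root cell covers the unit square
for p in points:
    root.insert(p)
approximate = kernel_sum(root, points[0], theta=0.5)
reference = exact_sum(points, points[0])
```

With `theta` set to 0 the traversal never summarizes a cell and reduces to the brute-force sum; larger values trade accuracy for fewer visited cells, which is the source of the speed-up the quoted passage attributes to Barnes-Hut t-SNE.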

Regarding claim 2, Rhoads as combined teaches the method as in claim 1, wherein identifying, by the computing device (said fig. 3:antenna: “CROWN CASTLE AMERICAN TOWER SBA COMM…”), one or more sources (said such that “resources can also serve as” “identified” “sources”) of the object (said comprised by “objects in the captured frame”) in the image (said or “image” of “Godzilla”) further comprises: 
determining, by the computing device (said fig. 3:antenna: “CROWN CASTLE AMERICAN TOWER SBA COMM…”), a location (or “Geolocation”) of the user device (said or “FIG 0”: box with buttons: “the cell phone”); 
generating, by the computing device (said fig. 3:antenna: “CROWN CASTLE AMERICAN TOWER SBA COMM…”), a list (via “a ranked list”) of sources (or “the different data sources” as shown in fig. 47 such as “GOOGLE” “FLICKR” “FACEBOOK” “PICASA”) of the object (said comprised by “objects in the captured frame”) based on the location (said or “Geolocation”) of the user device (said or “FIG 0”: box with buttons); and 
presenting, by the computing device (said fig. 3:antenna: “CROWN CASTLE AMERICAN TOWER SBA COMM…”), the list (said via “a ranked list”) of sources (said or “the different data sources” as shown in fig. 47 such as “GOOGLE” “FLICKR” “FACEBOOK” “PICASA”) of the object (said comprised by “objects in the captured frame”) to a user (fig. 0: “BOB”) on the user device (said or “FIG 0”: box with buttons), thereby allowing the user (said fig. 0: “BOB”) to compare (comprised in BOB via said “a ranked list”) respective (rust) conditions (of cars in fig. 0) and locations (parked in the same side of the road in fig. 0) for each of the objects (said comprised by “objects in the captured frame” via:
“[0062] FIG. 47 shows some of the different data sources that may be consulted in processing imagery according to aspects of the present technology.”

“[0171] Image 44 is a snapshot of friends.  Facial detection and recognition may be employed (i.e., to indicate that there are faces in the image, and to identify particular faces and annotate the image with metadata accordingly, e.g., by reference to user-associated data maintained by Apple's iPhoto service, Google's Picasa service, Facebook, etc.) Some facial recognition applications can be trained for non-human faces, e.g., cats, dogs animated characters including avatars, etc. Geolocation and date/time information from the cell phone may also provide useful information.”

“[0297] In one particular arrangement, visual "baubles" (FIG. 0) are overlaid on the captured imagery.  Tapping on any of the baubles pulls up a screen of information, such as a ranked list of links. Unlike Google web search--which ranks search results in an order based on aggregate user data, the camera application attempts a ranking customized to the user's profile.  If a Starbucks sign or logo is found in the frame, the Starbucks link gets top position for this user.”).

Regarding claim 5, Rhoads as combined teaches the method as in claim 1, wherein the screenshot is captured from the display displaying at least one of the group consisting of: a movie (or “films” starring or comprising said Godzilla), a television program, and a commercial (wherein Godzilla is defined via Dictionary.com:
Godzilla
Trademark.
1	a science-fiction monster that resembles an enormous bipedal lizard, featured in Japanese and American films, television, and comic books).

Regarding claim 6, Rhoads as combined teaches the method as in claim 1, wherein the object (said comprised by “objects in the captured frame”) is detected (via “edge detection”) using a saliency (via said “objects in the captured frame” comprising said “salient” “feature” “metrics”) detection (via said “edge detection”) algorithm (comprised by “elemental image processing routines” via:
“[0114] FIG. 6 takes a major step toward the concrete, sacrificing simplicity in the process.  Here we see a top portion labeled "Resident Call-Up Visual Processing Services," which represents all of the possible list of applications from FIG. 2 that a given mobile device may be aware of, or downright enabled to perform.  The idea is that not all of these applications have to be active all of the time, and hence some sub-set of services is actually "turned on" at any given moment.  The turned on applications, as a one-time configuration activity, negotiate to identify their common component tasks, labeled the "Common Processes Sorter"--first generating an overall common list of pixel processing routines available for on-device processing, chosen from a library of these elemental image processing routines (e.g., FFT, filtering, edge detection, resampling, color histogramming, log-polar transform, etc.).  Generation of corresponding Flow Gate Configuration/Software Programming information follows, which literally loads library elements into properly ordered places in a field programmable gate array set-up, or otherwise configures a suitable processor to perform the required component tasks.”).

Regarding claim 7, Rhoads as combined teaches the method as in claim 1, wherein further comprise at least one of the group consisting of: a neural network, a convolutional neural network (CNN), a background subtraction technique, a k-means algorithm (or “k-means algorithm”), .

Claim 3 is/are rejected under 35 U.S.C. 103 as being unpatentable over IDS cited Rhoads et al. (US 2014/0080428 A1) in view of Oramas et al. (MULTI-LABEL MUSIC GENRE CLASSIFICATION FROM AUDIO, TEXT, AND IMAGES USING DEEP FEATURES) further in view of Xue et al. (Deep Texture Manifold for Ground Terrain Recognition) as applied above further in view of IDS cited Lester (US Patent 10,475,145).
Regarding claim 3, Rhoads as combined teaches the method as in claim 1, further comprising: 
receiving, by the computing device (said fig. 3:antenna: “CROWN CASTLE AMERICAN TOWER SBA COMM…”), a second image (via fig. 4: “PIXELS” each of which is an image), the second image (via fig. 4: “PIXELS” each of which is an image) being an image of the user (or fig. 0: “BOB”), from the user device (said or “FIG 0”: box with buttons); and 
generating, by the computing device (said fig. 3:antenna: “CROWN CASTLE AMERICAN TOWER SBA COMM…”), a second image (via fig. 4: “PIXELS” each of which is an image) of [[the]] a user (said or fig. 0: “BOB”) with the object (said comprised by “objects in the captured frame”), wherein the second image (via fig. 4: “PIXELS” each of which is an image) is generated using at least one convolutional neural network.
Thus, Rhoads does not teach, as indicated in bold above, the claimed “the second image is generated using at least one convolutional neural network”. 


Accordingly, Lester teaches:
wherein the second image (as shown in fig. 4B:400B relative to fig. 4A:400A) is generated (“to generate a watermarked image” as shown in said fig. 4B:400B) using at least one convolutional neural network (via fig. 2:240: “CONVOLUTIONAL NEURAL NETWORK” “identifying…a salient region”, a water-house, via c.1, ll. 36-52:
“According to certain aspects of the present disclosure, a method for providing a watermark on an image is provided.  The method includes generating a saliency map for a user-provided image where the saliency map includes a saliency value of a plurality of pixels in the user-provided image.  The method also includes identifying, based on the saliency map, a salient region of the user-provided image having a highest saliency value and a non-salient region of the user-provided image having a lowest saliency value where a saliency value is indicative of the likelihood that a pixel within the user-provided image is important.  The method further includes determining a level of aggressiveness of a watermark to use with the user-provided image based on a weight model.  The method includes configuring the watermark to overlap with at least one of the identified salient region or the non-salient region based on the determined level of aggressiveness to generate a watermarked image.”).

Thus, one of ordinary skill in the art of data-hiding can modify Rhoads’ image of “BOB” with a watermark as taught in Lester by modifying said Rhoads’ box with buttons in view of Lester’s fig. 2:110:CLIENT and recognize that the modification is predictable or looked forward to because Lester’s watermarking-client of fig. 2:110 enables “balance between the security and the enjoyment of image” via Lester, c.14,l. 60 to c.15,l. 2:
“In some embodiments, to balance between the security and the enjoyment of image, the system may be configured to overlap a more intrusive watermark on a less salient region or object and overlap a less intrusive watermark on a more salient region or object.  A watermark that can modify the actual image itself can be considered as a more intrusive watermark (e.g., blurring, blacking out).  Alternatively, a watermark that enables the users to view the content behind the watermark can be considered as a less intrusive watermark (e.g., overlaying text over an image).”



Claim 4 is/are rejected under 35 U.S.C. 103 as being unpatentable over IDS cited Rhoads et al. (US 2014/0080428 A1) in view of Oramas et al. (MULTI-LABEL MUSIC GENRE CLASSIFICATION FROM AUDIO, TEXT, AND IMAGES USING DEEP FEATURES) further in view of Xue et al. (Deep Texture Manifold for Ground Terrain Recognition) as applied above further in view of IDS cited Hudson et al. (US Patent 9,195,819).
Regarding claim 4, Rhoads as combined teaches a method as in claim 1, further comprising:
receiving (via said “received” as indicated in figs.1&3: zig-zag lines), by the computing device (said fig. 3:antenna: “CROWN CASTLE AMERICAN TOWER SBA COMM…”), a request (via “camera set-up module may request images” or “the cloud process may…request a further set of input data”) to acquire the object (said comprised by “objects in the captured frame”) from the user device (said or “FIG 0”: box with buttons) from one of the one or more sources (via said or “the cloud resource” via an “addressing” “scheme”, represented in fig. 10A: “oval on the left”, and “Collections… and other content…resources can also serve as” “identified” “sources” as shown in fig. 47 such as “GOOGLE” “FLICKR” “FACEBOOK” “PICASA”); 
verifying, by the computing device (said fig. 3:antenna: “CROWN CASTLE AMERICAN TOWER SBA COMM…”), the request (either of said “request”) to acquire the object (said comprised by “objects in the captured frame”); and 



sending (via said “received” as indicated in figs.1&3: zig-zag lines), by the computing device (said fig. 3:antenna: “CROWN CASTLE AMERICAN TOWER SBA COMM…”), the request (either of said “request”) to acquire the object (said comprised by “objects in the captured frame”) to the source (via said or “the cloud resource” via an “addressing” “scheme”, represented in fig. 10A: “oval on the left”, and “Collections… and other content…resources can also serve as” “identified” “sources” as shown in fig. 47 such as “GOOGLE” “FLICKR” “FACEBOOK” “PICASA”) via:
“[0233] Camera control can also be responsive to spatial coordinate information.  By using geolocation data, and orientation (e.g., by magnetometer), the camera can check that it is capturing an intended target.  The camera set-up module may request images of not just certain exposure parameters, but also of certain subjects, or locations.  When a camera is in the correct position to capture a specific subject (which may have been previously user-specified, or identified by a computer process), one or more frames of image data automatically can be captured.  (In some arrangements, the orientation of the camera is controlled by stepper motors or other electromechanical arrangements, so that the camera can autonomously set the azimuth and elevation to capture image data from a particular direction, to capture a desired subject.  Electronic or fluid steering of the lens direction can also be utilized.”

“[0451] In turn, the cloud resource may alert the cell phone of any information it expects might be requested from the phone in performance of the expected operation, or action it might request the cell phone to perform, so that the cell phone can similarly anticipate its own forthcoming actions and prepare accordingly.  For example, the cloud process may, under certain conditions, request a further set of input data, such as if it assesses that data originally provided is not sufficient for the intended purpose (e.g., the input data may be an image without sufficient focus resolution, or not enough contrast, or needing further filtering).  Knowing, in advance, that the cloud process may request such further data can allow the cell phone to consider this possibility in its own operation, e.g., keeping processing modules configured in a certain filter manner longer than may otherwise be the case, reserving an interval of sensor time to possibly capture a replacement image, etc.”).

Thus, Rhoads does not teach, as shown in bold above, “verifying… the request”.
Accordingly, Hudson teaches:

verifying (via “verifying ownership” via fig. 8:815: “determine if…genuine”), by (as indicated in fig. 1) the computer device (via fig. 1:100 and 115)…the request (or “name”, such as fig. 5: “John Doe” or said “BOB”, as “requested by the server” represented in fig. 8 as back-arrows between fig.8:810 back-to 805, a signature marking step, and fig.8:840 back-to said 805 via c.14,ll. 32-58:
“The next step in verifying ownership of a physical book is that the server (115) sends a message to the client (100) to instruct the client operator (120) to mark their physical book in a specific way.  The marking is typically made in a permanent manner, for example using permanent ink.  In one embodiment of the invention, the server instructs the client to instruct the user to write their name on the physical book's copyright page (505).  Once the user has written his or her name on the physical book in the place requested by the server (e.g. the copyright page), the user is required to capture an image of the mark using the personal electronic device.  For example, using the capture image button (510) illustrated in FIG. 5, the user may capture an image of their name written on the copyright page.  The captured image may include the entire copyright page and the page facing the copyright page (525), for example.  Onscreen guidelines (515) and a live preview (525) from the smartphone's image sensor and flash (520) are provided to aid the user in aligning the physical book with the user's mark visible with the angle of imaging requested by the server.  This image of the physical book's copyright page and facing page with the user's hand written name on the copyright page is transmitted from the client (100) to the server (115).  Additional images captured while the user is aligning the book in the onscreen preview (525) with the on-screen guidelines (515) may also be sent to the server for analysis and/or human review.  The images are processed by the server in the following ways:”).

Thus, one of ordinary skill in the art of computer requests and book-stores (or an “Amazon” “bookstore” via Rhoads:
“[0302] Consider a user located in a small bookstore who snaps a picture of the Warren Buffet biography Snowball.  The book is quickly recognized, but rather than presenting a corresponding Amazon link atop the list (as may occur with a regular Google search), the cell phone recognizes that the user is located in an independent bookstore.  Context-based rules consequently dictate that it present a non-commercial link first.  Top ranked of this type is a Wall Street Journal review of the book, which goes to the top of the presented list of links. Decorum, however, only goes so far.  The cell phone passes the book title or ISBN (or the image itself) to Google AdSense or AdWords, which identifies sponsored links to be associated with that object.  (Google may independently perform its own image analysis on any provided imagery.  In some cases it may pay for such cell phone-submitted imagery--since Google has a knack for exploiting data from diverse sources.) Per Google, Barnes and Noble has the top sponsored position, followed by alldiscountbooks-dot-net.  The cell phone application may present these sponsored links in a graphically distinct manner to indicate their origin (e.g., in a different part of the display, or presented in a different color), or it may insert them alternately with non-commercial search results, i.e., at positions two and four.  The AdSense revenue collected by Google can again be shared with the user, or with the user's carrier.”) 

can modify Rhoads’ request from the cloud resource to include the signature-marking verification to determine genuineness as shown in Hudson’s fig. 8:815: “determine if…genuine” and recognize that the modification is predictable or looked forward to because Hudson’s signature-marking verification is in response to market forces/piracy as represented in Hudson’s fig. 1:120: “consumers” (one of which is said “BOB”) that “resent the need to re-buy at full price” thus allowing “BOB” to purchase e-books at discount instead of being pirate-“BOB” via Hudson:
“Digital media content consumers (e.g. readers of eBooks or digital music listeners) generally resent the need to re-buy at full price an electronic copy of a physical work that they already own.  This resentment is evident in the profusion of "format shifting" of digital music from CDs to digital files (e.g. MP3s) for use on portable music players.”




Claim 6 is/are rejected under 35 U.S.C. 103 as being unpatentable over IDS cited Rhoads et al. (US 2014/0080428 A1) in view of Oramas et al. (MULTI-LABEL MUSIC GENRE CLASSIFICATION FROM AUDIO, TEXT, AND IMAGES USING DEEP FEATURES) further in view of Xue et al. (Deep Texture Manifold for Ground Terrain Recognition) as applied above further in view of Yang et al. (US Patent App. Pub. No.: US 2014/0250110 A1).
Regarding claim 6, Rhoads as combined discloses the method as in claim 1, wherein the object (said comprised by “objects in the captured frame”) is detected (said via “edge detection”) using a saliency (via said “objects in the captured frame” comprising said “salient” “feature” “metrics”) detection (via said “edge detection”) algorithm (said comprised by “elemental image processing routines”).
Rhoads does not teach, as indicated in bold above, the words “saliency detection algorithm”. Accordingly, Yang teaches the words “saliency detection algorithm” (via “saliency detection algorithm” via:
“[0016] Another visual feature component that contributes to image attractiveness estimation includes aesthetic sensitivity.  Aesthetic sensitivity represents a degree with which an image is said to be beautiful, clear, or appealing.  Aesthetic sensitivity of the image may be determined, for instance, by applying well know photography rules such as "the rule of thirds", simplicity, and visual weight.  The "rule of thirds" may be, for instance, extracted from an image by analyzing a subject's location relative to the overall image.  Meanwhile simplicity (i.e., achieving the effect of singling out an item from a surrounding) may be determined by analyzing a hue count of an image.  As an example, visual weight of an image may be captured by contrasting clarity of a subject region with a non-subject portion of the image.”

“[0048] The visual analysis component 110 may analyze the perceptual quality of the labeled image by determining the brightness, the contrast, the colorfulness, the sharpness, and/or the blur of the labeled image 308.  In an example implementation, to determine the brightness and the contrast, the mean (brightness) and standard deviation (contrast) of pixel intensity in gray are analyzed, though other conventional techniques may also be employed.  Colorfulness may be determined by analyzing the mean and standard deviation of saturation and hue, or a contrast of colors, for example.  Meanwhile, sharpness may be determined by, for example, a mean and standard deviation of a Laplacian image normalized by local average luminance.  Blur may be determined by, for example, frequency distribution of an image transformed according to a Fast Fourier Transform (FFT).  In addition to analyzing perceptual quality features such as brightness, colorfulness, sharpness, and blur, the visual analysis component 100 may apply a saliency detection algorithm to the labeled image 308.  Saliency detection extracts features of objects in images that are distinct and representative.  For instance, the visual analysis component 100 may apply the saliency detection algorithm to extract features over the whole image with pixel values reweighted by a saliency map (e.g., an image of extracted saliency features indicating a saliency of a corresponding region or point).  Alternatively, the visual analysis component 110 may apply the saliency detection algorithm over a subject region in the image.  For instance, the subject region may be detected by a minimal bounding box that contains 90% mass of all saliency weights in order to determine lighting, color, and sharpness of the saliency map reweighted image.”).

Thus, said one of ordinary skill in the art of saliency can modify Rhoads’ teaching of the saliency feature metrics with Yang’s saliency detection algorithm by applying the saliency detection algorithm to the image of Godzilla, and recognize that the modification is predictable or looked forward to because “simplicity (i.e., achieving the effect of singling out an item from a surrounding)” (Yang, cited above) is achieved via the saliency detection algorithm; thus, Godzilla can be clearly singled out from an image.

Claim 8 is rejected under 35 U.S.C. 103 as being unpatentable over IDS cited Rhoads et al. (US 2014/0080428 A1) in view of Oramas et al. (MULTI-LABEL MUSIC GENRE CLASSIFICATION FROM AUDIO, TEXT, AND IMAGES USING DEEP FEATURES), further in view of Xue et al. (Deep Texture Manifold for Ground Terrain Recognition) as applied above, further in view of IDS cited Hudson et al. (US Patent 9,195,819) as applied above, further in view of IDS cited Kim et al. (US Patent 10,091,654).
Regarding claim 8, Rhoads as combined teaches the method as in claim 4, wherein the request (said via “request images” using a “camera set-up module” or “request a further set of input data” as modified via the combination) to acquire the object (said comprised by “objects in the captured frame”) is verified using a biometric sensor on the user device (said or “FIG 0”: box with buttons).
Rhoads as combined does not teach, as shown in bold above, “the object is verified using a biometric sensor on the user device”. Kim, however, teaches:
the object is verified using a biometric sensor (or “a second ECG sensor 1122 as an external biometric sensor”) on (as shown in fig. 11:1122 on 1150) the user device (or “an external authentication device 1150” via Kim, c.13,ll. 22-43:

“In another example, as shown in FIG. 11, a user authentication apparatus 1100 includes a first ECG sensor 1121 as a biometric sensor, and an external authentication device 1150, such as a smartphone or a tablet computer, includes a second ECG sensor 1122 as an external biometric sensor.  When an electrical contact between a touch display 1160 of the external user authentication apparatus 1150 and a body 1190 of the user authentication apparatus 1100 is formed, and a user touches the external biometric sensor 1122 and the body 1190 of the user authentication apparatus 1100 with both hands 1109, respectively, an electrical path passing through a heart of the user is formed.  A biometric sensor, for example, the first ECG sensor 1121, of the user authentication apparatus 1100 measures an ECG signal in response to the contact between the external biometric sensor 1122 and the body 1190 being sensed.  A processor of the user authentication apparatus 1100 verifies an identity of the user based on the ECG signal measured by the external biometric sensor 1122 and the biometric sensor 1121, and authenticates the user based on an identified signature and the verified identity.”).

Thus, one of ordinary skill in the art of hand-signatures can modify Rhoads’ said fig. 0:box with buttons and “BOB” and “Show Times” and said “request” as modified via the combination with Kim’s teaching of fig. 11 by attaching Kim’s fig. 11:1122 to Rhoads’ said fig. 0:box with buttons and connecting Kim’s fig. 11:1190 to Rhoads’ fig. 0:box with buttons with the display of “BOB” and “Show Times”, and recognize that the modification is predictable or looked forward to because the modification results in a “user authentication apparatus…having a relatively high security level by combining identification results” via Kim, c.14,ll. 4-15, thus providing a secure transaction when re-buying books, such as “the Warren Buffet biography Snowball”, Rhoads cited above in the rejection of claim 4, at said discount:
“For example, while the user writes with the body 1290 of the user authentication apparatus 1200, the user authentication apparatus 1200 performs signature identification 1210, fingerprint identification 1221, ECG identification 1222, and PPG identification 1223, and the external authentication device 1280 performs the voice identification 1281 and the face identification 1282.  The user authentication apparatus 1200 may provide an authentication solution having a relatively high security level by combining identification results.  In this example, the user authentication apparatus comprehensively utilizes the identifications, and assigns a weight to a situation with respect to each identification result.”

Suggestions
The suggestions in co-pending application 16/354,240 are no longer suggested in view of the above 35 USC 103 rejection: 
Claims 1, 2, 5, 6 and 7 are rejected under 35 U.S.C. 103 as being unpatentable over IDS cited Rhoads et al. (US Patent App. Pub. No.: US 2014/0080428 A1) in view of Oramas et al. (MULTI-LABEL MUSIC GENRE CLASSIFICATION FROM AUDIO, TEXT, AND IMAGES USING DEEP FEATURES) further in view of Xue et al. (Deep Texture Manifold for Ground Terrain Recognition).
Accordingly, Oramas teaches the disclosed delaminating (applicant’s disclosed advantage) or clustering as shown in fig. 2: “Particular of the t-SNE of randomly selected image vectors” (appears to be all retail) separating customer’s albums by genre or by face (country music) or by color (dance music) or by clothing (jazz) or by instrument (classical music); however, the difference is that the disclosed delamination or clustering is with respect to retail and non-retail instead of clustering by genre, face, color, clothing and instrument. Thus, applicant’s disclosed delaminating appears as an indication of non-obviousness in view of said Rhoads, Oramas and Xue.
Note that these suggestions are not provided with respect to overcoming 35 USC 101, 112, 102 and/or 103. These suggestions are mainly provided to seek out advantages in the disclosure regardless of 35 USC 101, 112, 102 and/or 103.
Conclusion
THIS ACTION IS MADE FINAL.  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
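
For illustration only, the timing rule stated in the preceding paragraph can be sketched as date arithmetic. This is a minimal sketch under stated assumptions: the function names are hypothetical, month offsets are taken as calendar months with the day clamped to the end of shorter months, weekends and holidays are ignored, and extensions of time under 37 CFR 1.136(a) are not modeled.

```python
import calendar
from datetime import date
from typing import Optional


def add_months(d: date, months: int) -> date:
    """Shift a date forward by whole calendar months, clamping the day."""
    index = d.month - 1 + months
    year = d.year + index // 12
    month = index % 12 + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def reply_period_expiry(
    mailed: date,
    first_reply: Optional[date] = None,
    advisory_mailed: Optional[date] = None,
) -> date:
    """Return the expiry of the shortened statutory period (hypothetical helper).

    Default: three months from the mailing date of the final action.
    If a first reply is filed within two months and the advisory action
    is mailed after the three-month date, the period expires on the
    advisory mailing date. The result is never later than six months
    from the mailing date.
    """
    three_months = add_months(mailed, 3)
    six_months = add_months(mailed, 6)
    expiry = three_months
    if (
        first_reply is not None
        and advisory_mailed is not None
        and first_reply <= add_months(mailed, 2)
        and advisory_mailed > three_months
    ):
        expiry = advisory_mailed
    return min(expiry, six_months)


# Usage with made-up dates (not the dates of this action):
example = reply_period_expiry(date(2021, 6, 1), date(2021, 7, 15), date(2021, 9, 10))
```

The example dates are invented solely to exercise the rule; the actual dates depend on the mailing date of this action.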
Any inquiry concerning this communication or earlier communications from the examiner should be directed to DENNIS ROSARIO whose telephone number is (571)272-7397.  The examiner can normally be reached on Monday-Friday, 9AM-5PM EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Matthew Bella, can be reached on (571)272-7778.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/DENNIS ROSARIO/Examiner, Art Unit 2667 

/MATTHEW C BELLA/Supervisory Patent Examiner, Art Unit 2667