DETAILED ACTION
Response to Amendment
The amendment was received 9/13/21. Claims 9,10,12-17,19 and 20 are pending.
Claim Interpretation

The following is a quotation of 35 U.S.C. 112(f):
(f) Element in Claim for a Combination. – An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof. 

The following is a quotation of pre-AIA  35 U.S.C. 112, sixth paragraph:
An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof.

The claims in this application are given their broadest reasonable interpretation using the plain meaning of the claim language in light of the specification as it would be understood by one of ordinary skill in the art.  The broadest reasonable interpretation of a claim element (also commonly referred to as a claim limitation) is limited by the description in the specification when 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is invoked. 
Accordingly, 35 USC 112(f) is NOT invoked in claims 9,10,12-17,19 and 20. 



Accordingly the following definitions are “taken” via MPEP 2111.01 III. "PLAIN MEANING" REFERS TO THE ORDINARY AND CUSTOMARY MEANING GIVEN TO THE TERM BY THOSE OF ORDINARY SKILL IN THE ART, 3rd paragraph, emphasis added:
“It is also appropriate to look to how the claim term is used in the prior art, which includes prior art patents, published applications, trade publications, and dictionaries. Any meaning of a claim term taken from the prior art must be consistent with the use of the claim term in the specification and drawings. Moreover, when the specification is clear about the scope and content of a claim term, there is no need to turn to extrinsic evidence for claim interpretation. 3M Innovative Props. Co. v. Tredegar Corp., 725 F.3d 1315, 1326-28, 107 USPQ2d 1717, 1726-27 (Fed. Cir. 2013) (holding that "continuous microtextured skin layer over substantially the entire laminate" was clearly defined in the written description, and therefore, there was no need to turn to extrinsic evidence to construe the claim).”

The claimed “screenshot” (as in “the image is a screenshot captured by the user device from a display” of claim 9, 2nd limitation) is interpreted as one of skill in the art would in light of applicant’s disclosure under the broadest reasonable interpretation and definition thereof via Dictionary.com:
screenshot
noun
1	Also called screen cap·ture , screen·cap. 
a copy or image of what is seen on a computer monitor or other screen at a given time:
Save the screenshot as a graphics file.
verb (used with object) screen·shot or screen·shot·ted, screen·shot·ting.
2	to take a screenshot of:
You can screenshot the error message and send it to me.

BRITISH DICTIONARY DEFINITIONS FOR SCREENSHOT
screenshot
noun
1	an image created by copying part or all of the display on a computer screen at a particular moment, for example in order to demonstrate the use of a piece of software


The claimed “modal” (as in “the object to be identified is identified using multimodal learning techniques” in claim 9, 5th limitation) is interpreted in light of applicant’s disclosure and definition thereof via Dictionary.com where “of or relating to…a particular type…of something” is “taken” as the meaning of the claim “modal” under MPEP 2111.01 III:
modal
adjective
1	of or relating to mode, manner, or form.

wherein “mode” is defined:
mode
noun
2	a particular type or form of something:
Heat is a mode of motion.

The claimed “learning” (as in “the object to be identified is identified using multimodal learning techniques” in claim 9, 5th limitation) is interpreted in light of applicant’s disclosure and definition thereof via Dictionary.com where definition 5 is “taken” under MPEP 2111.01 III:
learn
verb (used with object), learned  [lurnd] or learnt  [lurnt], learn·ing  [lur-ning].
5	(of a device or machine, especially a computer) to perform an analogue of human learning with artificial intelligence.

wherein “artificial intelligence” is defined:
artificial intelligence
noun Computers.
1	a	the capacity of a computer, robot, or other programmed mechanical device to perform operations and tasks analogous to learning and decision making in humans, as speech recognition or question answering.
b	a computer, robot, or other programmed mechanical device having this humanlike capacity:
teaching human values to artificial intelligences.

The claimed “identified” (as in “superimposing the identified object onto the user” in claim 9, last limitation) is interpreted under the broadest reasonable interpretation in light of applicant’s disclosure such as via, emphasis added:…
“[0031] The object detection module 154 detects objects contained within the image data 114 received by the image capture module 150. The object identification module 156 detects identifiable objects contained within the image data 114. Further, the object identification module 156 may categorize the identifiable objects such as, but not limited to automobiles, consumer electronics, clothing, personal accessories, shoes, jewelry, and food, etc. The object identification module 156 may use one or more object recognition techniques such as, but not limited to, saliency detection and/or visual quantification to detect and categorize objects contained within the image data 114 received by the image capture module 150. For example, the object recognition technology may be, but not limited to, a trained object detection model. The trained object detection model may be generated using neural networks, including, but not limited to, deep convolutional neural networks, and deep recurrent neural networks. Deep convolutional neural networks are a class of deep, feed-forward artificial neural networks consisting of an input layer, an output layer, and multiple hidden layers used to analyze images. Deep recurrent neural networks are artificial neural networks wherein the connections between the nodes of the network form a directed graph along a sequence used for analyzing linguistic data. The object detection module 154 may input the image data 114 into the convolutional neural networks to generate the trained object detection model. The trained object detection model detects unique objects contained within the image data 114. As another example, the object recognition technology may include, but it not limited to, a saliency detection algorithm such as SalNet. SalNet is a deep learning algorithm which automatically detects salients for a given image such as an object contained within the image data 114. 
The saliency of an image is the state or quality by which it stands out relative to its neighbors, i.e. localizing what people see when they view the image. Saliency detection is considered to be a key attentional mechanism that facilitates learning and survival by enabling organisms to focus their limited perceptual and cognitive resources on the most pertinent subset of the available sensory data. Saliency detection stresses on four types of features, namely color, luminance, texture, and depth. In embodiments of the present invention, saliency detection concentrates primarily on static saliency and objectness. Static saliency detection algorithms use different image features that allow detecting salient object of a non-dynamic image and objectness estimation seeks to propose a small set of bounding boxes according to the possibility of a complete object existing around a region. 
[0032] The object identification module 156 identifies one or more individual objects detected by the object detection module 154. For example, the image data 114 may be, for example, but not limited to, a screenshot of a movie depicting an actor and the object identification module 156 may identify the individual pieces of clothing, jewelry, and/or accessories the actor is wearing or using in the image. The object identification module 156 may identify the one or more individual objects detected by the object detection module 154 by delaminating, i.e. separating, the image data 114 into retail, e.g. clothing, jewelry, personal electronics, and furniture, etc., and non-retail objects, e.g. people, animals, public and commercial services or facilities, etc. The object identification module 156 may utilize multi-modal learning to identify the one or more individual objects. For example, the multi-modal learning may include, but is not limited to, neural networks, background subtraction techniques, k-means algorithms, Barnes-Hut approximations, and/or t-Distributed Stochastic Neighbor Embedding (t-SNE), etc.”
and definition thereof via Dictionary.com, wherein definitions 1-6 are equally applicable:
identify
verb (used with object), i·den·ti·fied, i·den·ti·fy·ing.
1	to recognize or establish as being a particular person or thing; verify the identity of:
to identify handwriting; to identify the bearer of a check.
2	to serve as a means of identification for:
His gruff voice quickly identified him.
3	to make, represent to be, or regard or treat as the same or identical:
They identified Jones with the progress of the company.
4	to associate in name, feeling, interest, action, etc. (usually followed by with):
He preferred not to identify himself with that group.
5	Biology. to determine to what group (a given specimen) belongs.
6	Psychology. to associate (one or oneself) with another person or a group of persons by identification.

wherein “serve” is defined:
serve
verb (used without object), served, serv·ing.
1	to act as a servant.
2	to wait on table, as a waiter.
3	to offer or have a meal or refreshments available, as for patrons or guests:
Come early, we're serving at six.
4	to offer or distribute a portion or portions of food or a beverage, as a host or hostess:
It was her turn to serve at the faculty tea.
5	to render assistance; be of use; help.


The claimed “each object” in claim 10, last line is interpreted under the broadest reasonable interpretation in light of applicant’s disclosure in the context of “multiple objects to be identified and located within the same image” via applicant’s disclosure:
[0002] Humans are capable of looking at an image or watching a video and readily identifying, people, objects, scenes, and other visual details. Object recognition has become an ever increasingly important facet of modern technology. Object recognition, with respect to technology, is a computer vision technique for identifying objects in images or videos. Object recognition techniques may use various means to identify objects such as deep learning and machine learning algorithms. Further, object recognition techniques may be combined with object detection techniques. Object detection and object recognition are similar techniques for identifying objects, but they vary in their execution. Object detection is the process of finding instances of objects in images. In the case of deep learning, object detection is a subset of object recognition, where the object is not only identified but also located in an image. This allows for multiple objects to be identified and located within the same image.

The claimed “request” (as in “a request to acquire the object from the device…from…the…source” in claim 12) is interpreted in light of applicant’s disclosure and definition thereof via Dictionary.com, definitions 1-5 are equally applicable:
request, noun
1	the act of asking for something to be given or done, especially as a favor or courtesy; solicitation or petition:
At his request, they left.
2	an instance of this:
There have been many requests for the product.
3	a written statement of petition:
If you need supplies, send in a request.
4	something asked for:
to obtain one's request.
5	the state of being asked for; demand.

Response to Arguments
CLAIM OBJECTIONS
Applicant’s arguments, see remarks, page 7, filed 9/13/21, with respect to the claim objection have been fully considered and are persuasive.  The claim objection of claims 14 and 20 has been withdrawn. 
DOUBLE PATENTING
Applicants state in page 8:
“Applicant respectfully reiterates its request that this provisional rejection be held in abeyance until no other rejections remain, since the instant case as of the time of this paper does not include allowable claims, and since the claims in the instant case may be amended prior to allowance in such a way to obviate any such rejection.”

	In response, the double patenting rejection is maintained.

RESPONSE TO REJECTIONS UNDER 35 USC 103

Applicant’s arguments, see remarks, pages 9-10, filed 9/13/21, with respect to the rejection of claims 9,10,13,14,16,17 and 20 under 35 USC 103 have been fully considered and are persuasive.  Therefore, the rejection has been withdrawn.  However, upon further consideration, a new ground of rejection is made under 35 USC 103 in view of Liu et al. (Transductive Centroid Projection for Semi-supervised Large-Scale Recognition), which teaches using Barnes-Hut t-SNE, as shown on page 76, fig. 2, as the inspiration to “boost” “recognition” via Liu, page 74:
“(2) A novel Transductive Centroid Projection layer - Based on the observation above, we propose an innovative un/semi-supervised learning mechanism to wisely integrate the unlabelled data into the recognition to boost its discriminative ability by introducing a new layer named as Transductive Centroid Projection (TCP). Without any iterative processing like self-training and label propagation, the proposed TCP can be simply trained and steadily embedded into arbitrary CNN structure with any classification loss.”

Double Patenting
The nonstatutory double patenting rejection is based on a judicially created doctrine grounded in public policy (a policy reflected in the statute) so as to prevent the unjustified or improper timewise extension of the “right to exclude” granted by a patent and to prevent possible harassment by multiple assignees. A nonstatutory double patenting rejection is appropriate where the conflicting claims are not identical, but at least one examined application claim is not patentably distinct from the reference claim(s) because the examined application claim is either anticipated by, or would have been obvious over, the reference claim(s). See, e.g., In re Berg, 140 F.3d 1428, 46 USPQ2d 1226 (Fed. Cir. 1998); In re Goodman, 11 F.3d 1046, 29 USPQ2d 2010 (Fed. Cir. 1993); In re Longi, 759 F.2d 887, 225 USPQ 645 (Fed. Cir. 1985); In re Van Ornum, 686 F.2d 937, 214 USPQ 761 (CCPA 1982); In re Vogel, 422 F.2d 438, 164 USPQ 619 (CCPA 1970); In re Thorington, 418 F.2d 528, 163 USPQ 644 (CCPA 1969).
A timely filed terminal disclaimer in compliance with 37 CFR 1.321(c) or 1.321(d) may be used to overcome an actual or provisional rejection based on nonstatutory double patenting provided the reference application or patent either is shown to be commonly owned with the examined application, or claims an invention made as a result of activities undertaken within the scope of a joint research agreement. See MPEP § 717.02 for applications subject to examination under the first inventor to file provisions of the AIA  as explained in MPEP § 2159. See MPEP § 2146 et seq. for applications not subject to examination under the first inventor to file provisions of the AIA . A terminal disclaimer must be signed in compliance with 37 CFR 1.321(b). 
The USPTO Internet website contains terminal disclaimer forms which may be used. Please visit www.uspto.gov/patent/patents-forms. The filing date of the application in which the form is filed determines what form (e.g., PTO/SB/25, PTO/SB/26, PTO/AIA /25, or PTO/AIA /26) should be used. A web-based eTerminal Disclaimer may be filled out completely online using web-screens. An eTerminal Disclaimer that meets all requirements is auto-processed and approved immediately upon submission. For more information about eTerminal Disclaimers, refer to www.uspto.gov/patents/process/file/efs/guidance/eTD-info-I.jsp.
Claims 9,10,12-17,19 and 20 are provisionally rejected on the ground of nonstatutory double patenting as being unpatentable over claims 1-8 of copending Application No. 16/460,286 (corresponding to IDS cited US Patent App. Pub. No.: US 2020/0293820 A1) in view of Rhoads et al. (US Patent App. Pub. No.: US 2014/0080428 A1) and Ueda et al. (US Patent App. Pub. No.: US 2020/0027244 A1) and Liu et al. (Transductive Centroid Projection for Semi-supervised Large-Scale Recognition). 
For example, claim 9 claims “identify, by the computing device, one or more sources of the object in the image” and “generate, by the computing device, a third image” and co-pending claims 1 and 3 claim, respectively:
“identifying, by the computing device, one or more sources of the object in the image”; and
“generating, by the computing device, a second image”
as shown in bold below: 


1 (co-pending: 16/460,286). A method for object detection and identification, the method comprising:
receiving, by a computing device, an image from a user device, wherein the image is screenshot captured by the user device from a display;
classifying, by the computing device, the image, wherein the image is classified based on features present in the image;
detecting, by the computing device, one or more objects contained within the image, wherein the one or more objects are delaminated into retail objects and non-retail objects; and  wherein each of the one or more the objects is a salient object;
identifying, by the computing device, each of the one or more objects in the image, wherein each of the one or more objects is identified using multimodal-learning techniques, and wherein the multi-modal learning techniques comprise a Barnes-Hut approximation; and
identifying, by the computing device, one or more sources of each of the one or more objects in the image.
3 (co-pending: 16/460,286). [[A]] The method as in claim 1, further comprising: 
receiving, by the computing device, a second image, the second image being an image of the user, from the user device; and 
generating, by the computing device, a second image of a user with at least one of the one or more objects, wherein the second image is generated using at least one convolutional neural network.

Co-pending: 16/460,286 claim 1 does not teach the claimed:
A.	“computer program product comprising:
a computer-readable storage medium having program instructions”; 
B.	“program instructions to receive, by the computing device, a second image, the second image being an image of the user, from the user device; 
program instructions to generate, by the computing device, a third image, wherein the third image is generated by superimposing the identified object onto the user depicted in the received second image, and wherein the third image is generated using at least one convolutional neural network.”; and
C.	“wherein the multi-modal learning techniques comprise using a Barnes-Hut approximation to identify the object”.
	Accordingly, Rhoads teaches:
A.	a computer-readable storage medium (fig. 81:546: “MEMORY”) having program instructions (via fig. 81: “Op. sys. UI SW Modules Etc.”).
	Thus one of ordinary skill in the art of computing devices can modify co-pending claim 1:16/460,286 with Rhoads’ teaching of fig. 81:546:“MEMORY: Op. sys. UI SW Modules Etc.” and recognize that the modification is predictable or looked forward to because “memory” is “all well and good” and favorably “memory…on…‘the cloud’ ” can “do…heavy lifting” via Rhoads:  
“[0104] It is all well and good to get better CPUs and GPUs, and more memory, on mobile devices.  However, cost, weight and power considerations seem to favor getting "the cloud" to do as much of the "intelligence" heavy lifting as possible.”



Thus, the combination does not teach limitations B and C:
B.	“program instructions to receive, by the computing device, a second image, the second image being an image of the user, from the user device; and 
program instructions to generate, by the computing device, a third image, wherein the third image is generated by superimposing the identified object onto the user depicted in the received second image, and wherein the third image is generated using at least one convolutional neural network.”; and
C.	“wherein the multi-modal learning techniques comprise using a Barnes-Hut approximation to identify the object”.

Accordingly, Ueda teaches:
B.	program instructions (via fig. 8) to receive, by the computing device (fig. 1:12), a second image (via fig. 3A-C:30: “FIRST SUBJECT IMAGE”), the second image (said via fig. 3A-C:30: “FIRST SUBJECT IMAGE”) being an image of the user (or “a user”), from the user device (said fig. 1:12); and 
program instructions (said via fig. 8) to generate, by the computing device (said fig. 1:12), a third image (via fig. 5D:60), wherein the third image (said via fig. 5D:60) is generated by superimposing (such that “the second subject image 40 is superimposed on the first subject image 30”) the identified (via “identification information”) object (said “the second subject image 40”) onto the user (said or “a user”) depicted in the received second image (said via fig. 3A-C:30: “FIRST SUBJECT IMAGE”), and wherein the third image (said via fig. 5D:60) is generated (via fig. 2:26(21): “DISPLAY UNIT”) using at least one convolutional neural network (or “a convolutional neural network (CNN)” represented in fig. 2:35: “LEARNING MODEL” via:
“[0034] The terminal device 12 is a terminal device operated by a user.  Examples of the terminal device 12 include a known portable terminal and a smartphone.  In the present embodiment, the terminal device 12 is operated by a first subject.”;

“[0060] The second subject image 40 is a photographed image of the second subject.  The second subject is a user different from the first subject.  The second subject image 40 is preferably an image including the face and clothing of the second subject.  Similarly to the first subject, the second subject may be an organism or a non-organism such as a mannequin.  The present embodiment will be described on assumption that the second subject is a person, for example.”;

“[0075] The supplementary information 46 is information related to the corresponding second subject image 40.  Examples of the supplementary information 46 include identification information of the second subject of the second subject image 40, words indicating hairstyle of the hair site of the second subject, information indicating a hairdresser capable of providing the hairstyle, the name of the item worn by the second subject, and the information indicating the shop that can provide the item.  An example of identification information of the second subject is a user name of the second subject.  These pieces of information may be information indicating a location (Uniform Resource Locator (URL)) on the Internet in which these pieces of information are stored.”;

“[0090] The learning model 35 may be learned by the learning unit 20E and stored beforehand in the storage unit 24.  In the present embodiment, the learning unit 20E learns the learning model 35 by machine learning using the training data 31.  Known methods may be used for machine learning.  For example, the learning unit 20E learns the learning model 35 by using deep learning using algorithms such as a convolutional neural network (CNN) and a recurrent neural network (RNN).”; and

“[0222] Moreover, it is assumed that the trial target site 41 is the clothing worn by the second subject, and the combining region 32 and the target region 42 are entire regions of the first subject image 30 and the second subject image 40 other than the trial target site 41.  In this case, the generation unit 20C can execute the generation processing to generate the combined image 60 in which the clothing of the second subject image 40 is superimposed on the first subject image 30.”).

Thus, one of ordinary skill in the art of generating images and CNNs can modify the convolutional neural network of co-pending 16/460,286 claims 1 and 3 with Ueda’s teaching of fig. 8 and the CNN by programming a computer accordingly and recognize that the modification is predictable or looked forward to for the same reasons regarding high quality as discussed in the below rejection of claim 9 under 35 USC 103. 

Thus, the combination does not teach limitation C:
C.	“wherein the multi-modal learning techniques comprise using a Barnes-Hut approximation to identify the object”.
Accordingly, Liu teaches C as shown in the below 35 USC 103 rejection of claim 9.
Thus, as similarly discussed in the below 35 USC 103 rejection of claim 9, one of ordinary skill in the art of data visualization can modify co-pending claim 1’s Barnes-Hut with Liu’s fig. 4: “CNN” by:
a)	making co-pending claim 1’s identification of objects be as Liu’s fig. 4: “CNN”;

b)	visualizing or observing a facial recognition task inside Liu’s fig. 4: “CNN”:
b1)	running Barnes-Hut t-SNE regarding visualizing the recognition of faces via Liu’s fig. 2(c): “MS1M” comprising 1 million faces and 100,000 modes;

c)	being “Inspired by the observation” of 1 million faces and the 100,000 classes, Liu, cited below;	

d)	congregating “the unlabeled data into the recognition system” after being inspired; and

e)	recognizing that the modification is predictable or looked forward to because the modification is used “to enhance its discriminative ability” or enhance the recognition system’s (as shown in Liu fig. 4(a):“semi-supervised learning”) ability to discriminate between faces via multiple modes/classes/clusters so as to identify a face via Liu.
Thus, claims 10,12-17,19 and 20 are rejected under a similar analysis as done for claim 9.
This is a provisional nonstatutory double patenting rejection.
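
For orientation, the “Barnes-Hut approximation” recited in limitation C can be illustrated with the following minimal Python sketch. It is not the method of Liu, of the co-pending application, or of any cited reference; the names `build`, `approx_sum`, `exact_sum`, the threshold `theta`, and the random points are assumptions chosen for illustration only. The sketch approximates a sum of 1/(1 + d²) terms over many 2-D points, which is the kind of pairwise sum that makes t-SNE expensive, by treating a distant quadtree cell as a single point at its centre of mass whenever the cell width divided by the distance falls below `theta` (the classic Barnes-Hut acceptance test).

```python
import math
import random

MAX_DEPTH = 20  # guard against endless splitting on coincident points


def build(points, cx, cy, half, depth=0):
    """Build a quadtree cell centred at (cx, cy) with half-width `half`."""
    n = len(points)
    cell = {
        "n": n,  # number of points summarised by this cell
        "com": (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n),
        "width": 2.0 * half,
        "children": [],
    }
    if n > 1 and depth < MAX_DEPTH:
        quadrants = {}
        for p in points:
            quadrants.setdefault((p[0] >= cx, p[1] >= cy), []).append(p)
        q = half / 2.0
        for (right, top), pts in quadrants.items():
            child_cx = cx + (q if right else -q)
            child_cy = cy + (q if top else -q)
            cell["children"].append(build(pts, child_cx, child_cy, q, depth + 1))
    return cell


def approx_sum(cell, y, theta):
    """Approximate the sum of 1 / (1 + |y - p|^2) over all points p in `cell`."""
    dx = y[0] - cell["com"][0]
    dy = y[1] - cell["com"][1]
    d2 = dx * dx + dy * dy
    dist = math.sqrt(d2)
    is_leaf = not cell["children"]
    # Acceptance test: a small, distant cell is summarised by its centre of mass.
    far_enough = dist > 0.0 and cell["width"] / dist < theta
    if is_leaf or far_enough:
        return cell["n"] / (1.0 + d2)
    return sum(approx_sum(child, y, theta) for child in cell["children"])


def exact_sum(points, y):
    """Brute-force reference: visit every point individually."""
    return sum(1.0 / (1.0 + (y[0] - p[0]) ** 2 + (y[1] - p[1]) ** 2) for p in points)


random.seed(0)
points = [(random.random(), random.random()) for _ in range(200)]  # hypothetical data
tree = build(points, 0.5, 0.5, 0.5)
query = (0.5, 0.5)
exact = exact_sum(points, query)
approx = approx_sum(tree, query, theta=0.5)
```

Setting `theta` to 0 disables the shortcut, so every point is visited individually and the result matches the brute-force sum; larger values of `theta` trade accuracy for fewer visited cells.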



Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Regarding inquiry 4, see Suggestions.
Claims 9,10,13,14,16, 17 and 20 is/are rejected under 35 U.S.C. 103 as being unpatentable over Rhoads et al. (US 2014/0080428 A1) in view of Ueda et al. (US Patent App. Pub. No.: US 2020/0027244 A1) and Elliot et al. (MULTILINGUAL IMAGE DESCRIPTION WITH NEURAL SEQUENCE MODELS) and van der Maaten (Accelerating t-SNE using Tree-Based Algorithms) and Liu et al. (Transductive Centroid Projection for Semi-supervised Large-Scale Recognition).



Regarding claim 9, Rhoads teaches a computer program product for object detection and identification, the computer program product comprising: 
a computer-readable storage medium having program instructions embodied therewith, wherein the computer readable storage medium is not a transitory signal per se, the program instructions comprising: 
program instructions (for fig. 3: “CPU” or fig. 81:542: “Processor”) to receive, by a computing device (said fig. 3: “CPU” or fig. 81:542: “Processor”), an image (via fig. 0: a sign of “GODZILLA!”) from a user device (fig. 0:box with buttons and “BOB” and “Show Times”), wherein the image (via said fig. 0: a sign of “GODZILLA!”) is a screenshot captured (as shown in fig. 0:box with buttons and “BOB” and “Show Times”) by the user device (said fig. 0:box with buttons and “BOB”) from a display (as indicated in fig. 0: “BOB” and “My Car” and “Show Times” for “GODZILLA!” and fig. 81: “Display”);
program instructions (for fig. 3: “CPU” or fig. 81:542: “Processor”) to classify (via “classification”), by the computing device (said fig. 3: “CPU” or fig. 81:542: “Processor”), the image (said fig. 0:a sign of “GODZILLA!”), wherein the image (said fig. 0: sign of “GODZILLA!”) is classified based on features (via “features…by…classification”) present in the image (via said fig. 0: a sign of “GODZILLA!” via:
“[0543] Although GPS is gaining in camera-metadata-deployment, most imagery presently in Flickr and other public databases is missing geolocation info.  But GPS info can be automatically propagated across a collection of imagery that share visible features (by image metrics such as eigenvectors, color histograms, keypoint descriptors, FFTs, or other classification techniques), or that have a metadata match.”); 

program instructions (for fig. 3: “CPU” or fig. 81:542: “Processor”) to detect (via “edge detection”), by the computing device (said fig. 3: “CPU” or fig. 81:542: “Processor”), an object (or edge said via “edge detection”) contained within the image (said fig. 0: a sign of “GODZILLA!”), wherein the object (said edge said via “edge detection”) is a salient (via a “salient” “edginess…metric”) object (said edge said via “edge detection”:
“[0114] FIG. 6 takes a major step toward the concrete, sacrificing simplicity in the process.  Here we see a top portion labeled "Resident Call-Up Visual Processing Services," which represents all of the possible list of applications from FIG. 2 that a given mobile device may be aware of, or downright enabled to perform.  The idea is that not all of these applications have to be active all of the time, and hence some sub-set of services is actually "turned on" at any given moment.  The turned on applications, as a one-time configuration activity, negotiate to identify their common component tasks, labeled the "Common Processes Sorter"--first generating an overall common list of pixel processing routines available for on-device processing, chosen from a library of these elemental image processing routines (e.g., FFT, filtering, edge detection, resampling, color histogramming, log-polar transform, etc.).  Generation of corresponding Flow Gate Configuration/Software Programming information follows, which literally loads library elements into properly ordered places in a field programmable gate array set-up, or otherwise configures a suitable processor to perform the required component tasks.”;

“[0663] A fixed set of image assessment criteria can be applied to distinguish images in the three categories.  However, the detailed embodiment determines such criteria adaptively.  In particular, this embodiment examines the set of images and determines which image features/characteristics/metrics most reliably (1) group like-categorized images together (similarity); and (2) distinguish differently-categorized images from each other (difference).  Among the attributes that may be measured and checked for similarity/difference behavior within the set of images are dominant color; color diversity; color histogram; dominant texture; texture diversity; texture histogram; edginess; wavelet-domain transform coefficient histograms, and dominant wavelet coefficients; frequency domain transfer coefficient histograms and dominant frequency coefficients (which may be calculated in different color channels); eigenvalues; keypoint descriptors; geometric class probabilities; symmetry; percentage of image area identified as facial; image autocorrelation; low-dimensional "gists" of image; etc. (Combinations of such metrics may be more reliable than the characteristics individually.”; and

“[0664] One way to determine which metrics are most salient for these purposes is to compute a variety of different image metrics for the reference images.  If the results within a category of images for a particular metric are clustered (e.g., if, for place-centric images, the color histogram results are clustered around particular output values), and if images in other categories have few or no output values near that clustered result, then that metric would appear well suited for use as an image assessment criteria.  (Clustering is commonly performed using an implementation of a k-means algorithm.)”);


program instructions (for fig. 3: “CPU” or fig. 81:542: “Processor”) to identify (such that “each object is identified”, cited below [0014]), by the computing device (said fig. 3: “CPU” or fig. 81:542: “Processor”), the object (said edge said via “edge detection”, cited above [0114]) in the image (said fig. 0: a sign of “GODZILLA!”), wherein the object (said edge said via “edge detection”) to be identified is identified (via said such that “each object is identified”) using multi-modal (via “different image-processing modes” comprising “through facial recognition”, cited below [0269]) learning techniques (via “Artificial intelligence techniques” and “Known artificial intelligence systems and techniques” represented in fig. 14:14: “PROCESSING FOR OBJECT IDENTIFICATION” such as fig. 2: “FACIAL RECOGNITION” comprised by said “different image-processing modes”), and wherein the multi-modal learning techniques (said via fig. 14:14: “PROCESSING FOR OBJECT IDENTIFICATION” such as fig. 2: “FACIAL RECOGNITION” comprised by said “different image-processing modes”) comprise using a Barnes-Hut approximation to identify the object (via:
“[0014] Certain aspects of the technology detailed herein are introduced in FIG. 0.  A user's mobile phone captures imagery (either in response to user command, or autonomously), and objects within the scene are recognized.  Information associated with each object is identified, and made available to the user through a scene-registered interactive visual "bauble" that is graphically overlaid on the imagery.  The bauble may itself present information, or may simply be an indicia that the user can tap at the indicated location to obtain a lengthier listing of related information, or launch a related function/application.”); 

program instructions (for fig. 3: “CPU” or fig. 81:542: “Processor”) to identify (via a “source…file name”), by the computing device (said fig. 3: “CPU” or fig. 81:542: “Processor”), one or more sources (i.e. “resources...serve as sources” such that “Object identification events will… associate public domain information and social-web connections to” said “Show Times”) of the object (said edge said via “edge detection”) in the image (said fig. 0: a sign of “GODZILLA!”); and 
program instructions (said for fig. 3: “CPU” or fig. 81:542: “Processor”) to receive, by the computing device (said fig. 3: “CPU” or fig. 81:542: “Processor”), a second image (or “a subsequent frame”), the second image (or “a subsequent frame”) being an image of a user (said fig. 0: “BOB”), from the user device (said fig. 0:box with buttons and “BOB”); and  
program instructions (said for fig. 3: “CPU” or fig. 81:542: “Processor”) to generate, by the computing device (said fig. 3: “CPU” or fig. 81:542: “Processor”), a third image (comprised via said or “a subsequent frame”), wherein the third image (said comprised via said or “a subsequent frame”) is generated by superimposing (resulting in a “Superimposed… substitute image”) the identified object (said edge said via “edge detection”) onto the user (said fig. 0: “BOB”) depicted in the received second image (said or “a subsequent frame”), and wherein the third image (said comprised via said or “a subsequent frame”) is generated using at least one convolutional neural network (via:
“[0018] In early roll-out, the class of recognizable objects will be limited but useful.  Object identification events will primarily fetch and associate public domain information and social-web connections to the baubles.  Applications employing barcodes, digital watermarks, facial recognition, OCR, etc., can help support initial deployment of the technology.”;
“[0207] The camera stage can be incorporated into an iterative processing loop.  For example, to gain focus-lock, a packet may be passed from the camera to a processing module that assesses focus.  (Examples may include an FFT stage--looking for high frequency image components; an edge detector stage--looking for strong edges; etc. Sample edge detection algorithms include Canny, Sobel, and differential.  Edge detection is also useful for object tracking.) An output from such a processing module can loop back to the camera's controller module and vary a focus signal.  The camera captures a subsequent frame with the varied focus signal, and the resulting image is again provided to the processing module that assesses focus.  This loop continues until the processing module reports focus within a threshold range is achieved.  (The packet header, or a parameter in memory, can specify an iteration limit, e.g., specifying that the iterating should terminate and output an error signal if no focus meeting the specified requirement is met within ten iterations.)”; 
“[0269] While current cameras have picture-taking modes based on lens/exposure profiles (e.g., close-up, nighttime, beach, landscape, snow scenes, etc), imaging devices may additionally (or alternatively) have different image-processing modes.  One mode may be selected by the user to obtain names of people depicted in a photo (e.g., through facial recognition).  Another mode may be selected to perform optical character recognition of text found in an image frame.  Another may trigger operations relating to purchasing a depicted item.  Ditto for selling a depicted item.  Ditto for obtaining information about a depicted object, scene or person (e.g., from Wikipedia, a social network, a manufacturer's web site), etc. Ditto for establishing a ThinkPipe session with the item, or a related system.  Etc.”;

“[0472] Collections of publicly-available imagery and other content are becoming more prevalent.  Flickr, YouTube, Photobucket (MySpace), Picasa, Zooomr, FaceBook, Webshots and Google Images are just a few.  Often, these resources can also serve as sources of metadata--either expressly identified as such, or inferred from data such as file names, descriptions, etc. Sometimes geo-location data is also available.”; and
“[0707] Artificial intelligence techniques can be applied to the data-mining task.  One class of such techniques is natural language processing (NLP), a science that has made significant advancements recently.”;
“[0887] As shown in FIG. 71, the alpha channel in this example conveys an edge-detected version of the user's image.  Superimposed over the child's head is a substitute image of the child's face.  This substitute image can be selected for its composition (e.g., depicting two eyes, nose and mouth) and better contrast.”; and



“[1153] Software instructions for implementing the detailed functionality can be readily authored by artisans, from the descriptions provided herein, e.g., written in C, C++, Visual Basic, Java, Python, Tcl, Perl, Scheme, Ruby, etc. Cell phones and other devices according to certain implementations of the present technology can include software modules for performing the different functions and acts.  Known artificial intelligence systems and techniques can be employed to make the inferences, conclusions, and other determinations noted above.”).  
	Thus, Rhoads does not teach, as indicated in bold above, the claimed:
A.	“superimposing the identified object”;
B.	“the third image is generated using at least one convolutional neural network”; and
C.	“wherein the multi-modal learning techniques comprise using a Barnes-Hut approximation to identify the object”.
Accordingly, Ueda teaches, with respect to claim 9:
A.	superimposing (via fig. 8:S110: “GENERATE COMBINED IMAGE” resulting in the image of fig. 5D:60 that “is superimposed”) the identified (via “identification information”) object (or “the second subject image 40 is superimposed on the first subject image 30”); and
B.	the third image (said the image of fig. 5D:60) is generated using (via fig. 1:35: “learning model”) at least one convolutional neural network (or “a convolutional neural network (CNN)” via Ueda:
“[0075] The supplementary information 46 is information related to the corresponding second subject image 40.  Examples of the supplementary information 46 include identification information of the second subject of the second subject image 40, words indicating hairstyle of the hair site of the second subject, information indicating a hairdresser capable of providing the hairstyle, the name of the item worn by the second subject, and the information indicating the shop that can provide the item.  An example of identification information of the second subject is a user name of the second subject.  These pieces of information may be information indicating a location (Uniform Resource Locator (URL)) on the Internet in which these pieces of information are stored.”;

“[0090] The learning model 35 may be learned by the learning unit 20E and stored beforehand in the storage unit 24.  In the present embodiment, the learning unit 20E learns the learning model 35 by machine learning using the training data 31.  Known methods may be used for machine learning.  For example, the learning unit 20E learns the learning model 35 by using deep learning using algorithms such as a convolutional neural network (CNN) and a recurrent neural network (RNN).”; and
“[0222] Moreover, it is assumed that the trial target site 41 is the clothing worn by the second subject, and the combining region 32 and the target region 42 are entire regions of the first subject image 30 and the second subject image 40 other than the trial target site 41.  In this case, the generation unit 20C can execute the generation processing to generate the combined image 60 in which the clothing of the second subject image 40 is superimposed on the first subject image 30.”).
Thus, one of ordinary skill in the art of computers and superposing images comprising a “composition” (Rhoads, cited above) can modify Rhoads’ teaching of superimposing face images, such as in BOB’s “subsequent frame”, with Ueda’s teaching by superposing, aligning, and resizing or scaling Rhoads’ edges of said fig. 0: a sign of “GODZILLA!” onto the image of said BOB, and recognize that the modification is predictable or looked forward to, because Ueda’s superposed or combined image is a “high-quality combined image 60” that adds “high-quality” or superior quality to Rhoads’ “composition” subsequent frame image, showing a work of high-quality art that can be entitled or filed as “GODZILLA! BOB”, via Ueda:
“[0205] This processing enables the information processing apparatus 10 to generate the high-quality corrected image 44 and store it in the storage unit 24.  Furthermore, with a capability of generating the combined image 60 using such a corrected image 44, the information processing apparatus 10 can provide the high-quality combined image 60.”

Thus, the combination does not teach said:

C.	“wherein the multi-modal learning techniques comprise using a Barnes-Hut approximation to identify the object”.
	Accordingly, Elliot teaches:
C.	using (for “translation”) multi-modal learning techniques, and wherein the multi-modal (via “a multilingual multimodal image description model”) learning (via “the learned embedding matrix Weh (Eqn 1)”) techniques (or methods via starting page 4, section “3 METHODOLOGY” and ending with section “3.5 TRAINING AND OPTIMIZATION”) comprise (via a showing of the methodology via “illustrate”) using a Barnes-Hut approximation (or “Barnes-Hut t-SNE projections” that show the METHODOLOGY) to identify the object (via:
page 1, section 1 INTRODUCTION, 2nd paragraph:
“We introduce multilingual image description and present a multilingual multimodal image description model for this task. Multilingual image description is a form of visually-grounded machine translation, in which parallel sentences are grounded against features from an image. This grounding can be particularly useful when the source sentence contains ambiguities that need to be resolved in the target sentence. For example, in the German sentence “Ein Rad steht neben dem Haus”, “Rad” could refer to either “bicycle” or “wheel”, but with visual context the intended meaning can be more easily translated into English. In other cases, source language features can be more precise than noisy image features, e.g. in identifying the difference between a river and a harbor.”;

pages 2,3:

“2.1 RECURRENT LANGUAGE MODEL (LM)

The core of our model is a Recurrent Neural Network model over word sequences, i.e., a neural language model (LM) (Mikolov et al., 2010). The model is trained to predict the next word in the sequence, given the current sequence seen so far. At each timestep i for input sequence w0...n, the input word wi, represented as a one-hot vector over the vocabulary, is embedded into a high-dimensional continuous vector using the learned embedding matrix Weh (Eqn 1). A nonlinear function f is applied to the embedding combined with the previous hidden state to generate the hidden state hi (Eqn 2). At the output layer, the next word oi is predicted via the softmax function over the vocabulary (Eqn 3).”; and

page 7, section 5 DISCUSSION, 2nd paragraph:

“Qualitatively, we can illustrate this effect using Barnes-Hut t-SNE projections of the initial hidden representations of our models (van der Maaten, 2014). Figure 4 shows the t-SNE projection of the example from Figure 7 using the initial hidden state of an En MLM (left) and the target side of the De MLM → En MLM (right). In the monolingual example, the nearest neighbours of the target image are desert scenes with groups of people. Adding the transferred source features results in a representation that places importance on the background, due to the fact that it is consistently mentioned in the descriptions. Now the nearest neighbours are images of mountainous snow regions with groups of people.”).


Thus, one of ordinary skill in the art of describing images and translating languages as indicated in Rhoads’:
1)	fig. 46A: “A”: such as “Top of the rock (4)” and fig. 46B: “B” “C” and “D”; and
2)	“[0248] Another service is digital watermark reading.  Another is optical character recognition (OCR).  An OCR service provider may further offer translation services, e.g., converting processed image data into ASCII symbols, and then submitting the ASCII words to a translation engine to render them in a different language.  Other services are sampled in FIG. 2.  (Practicality prevents enumeration of the myriad other services, and component operations, that may also be provided.)”

can modify Rhoads’ teaching of said such that “each object is identified” with Elliot’s teaching of said methods via starting page 4, section “3 METHODOLOGY” and ending with section “3.5 TRAINING AND OPTIMIZATION” by:
a)	performing Rhoads’ different-mode-artificial intelligence-object-identification/recognition as shown in fig. 2: “OBJECT RECOGNITION” and fig. 2: “TEXT OCR” and fig. 2: “FACIAL RECOGNITION”;
b)	making said Rhoads’ description of “Top of the rock (4)” be as Elliot’s fig. 1: “children sitting in a classroom” based on the identification/recognition;
c)	making a corresponding image of “Top of the rock (4)”, as shown in Rhoads’ fig. 46A:top-left image: Rockefeller Center skating rink, be as Elliot’s image of fig. 1:kids sitting at desks in school;
d)	executing the multilingual multimodal language model of Elliot’s fig. 1 based on said “Top of the rock (4)” and the corresponding skating image; and



e)	recognizing that the modification is predictable or looked forward to because the modification results in a translation engine of “English and German…models that…outperform target monolingual image description models” with illustrations, as shown in the projections of Elliot’s fig. 4: “t-SNE embeddings illustrate the positive effect…”, that
1)	“allows for the development of…methods” (van der Maaten, cited below) resulting in a more effective state, such as Elliot’s methodology, and also  
2)	are “high…quality…whilst at the same time requiring few…computational resources” (van der Maaten, cited below) 
thus allowing said one of skill in the art to clearly, via the projection-embedding illustrations, understand “Qualitatively” (Elliot, cited above) the quality of the methodology corresponding to “improve mainly lower-quality sentences, indicating that our best models successfully combine multiple noisy input modalities” (via Elliot, cited below) via:
Elliot, page 2, 1st full paragraph:
“In a series of experiments on the IAPR-TC12 dataset of images described in English and German, we find that models that incorporate source language features substantially outperform target monolingual image description models. The best English-language model improves upon the state-of-the art by 2.3 BLEU4 points for this dataset. In the first results reported on German image description, our model achieves a 8.8 Meteor point improvement compared to a monolingual image description baseline. The implication is that linguistic and visual features offer orthogonal improvements in multimodal modelling (a point also made by Silberer & Lapata (2014) and Kiela & Bottou (2014)). The models that include visual features also improve over our translation baselines, although to a lesser extent; we attribute this to the dataset being exact translations rather than independently elicited descriptions, leading to high performance for the translation baseline. Our analyses show that the additional features improve mainly lower-quality sentences, indicating that our best models successfully combine multiple noisy input modalities.”; and
van der Maaten:

page 3221:

“1. Introduction

Visual exploration is an essential component of data analysis, as it allows for the development of intuitions and hypotheses for the processes that generated the data. Visual analytics provides and develops approaches to obtain such understanding from complex data: it aims to develop methods that allow analysts to examine the processes underlying the data (Keim et al., 2010). Unfortunately, modern visual-analytics approaches are still largely based on traditional visualization techniques such as histograms, scatter plots, and parallel coordinate plots; see, e.g., Heer et al. (2010) for an overview of visualization techniques. The drawback of these visualization techniques is that they only facilitate the visualization of one or a few data variables at a time, which prohibits their use on large, high-dimensional data sets. In order to develop hypotheses about processes that generate a large number of variables, it is therefore necessary to perform an automatic analysis of the data before making visualizations. A popular way to perform such an automatic analysis is by learning a low-dimensional embedding of the data. In a low-dimensional embedding, each (high-dimensional) object is represented by a low-dimensional point in such a way, that nearby points correspond to similar objects and that distant points correspond to dissimilar objects. The low-dimensional embedding can readily be visualized in, e.g., a scatter plot or a parallel coordinate plot, or it can be used as the basis for the construction of more advanced visualizations, such as class-conditional density maps (van Eck and Waltman, 2010) or hierarchical visualizations (Tiño and Nabney, 2002).”; and

page 3234, section: Experiment 1, last paragraph:

“The results presented in the figure highlight the merits of using tree-based t-SNE algorithms. In particular, the results show that Barnes-Hut t-SNE with θ = 0.5 and dual-tree t-SNE with θ = 0.2 lead to embeddings that are of the same quality as those obtained with standard t-SNE (when quality is measured in terms of nearest-neighbor errors in the embedding). At the same time, increasing the value of θ to these values leads to very substantial improvements in terms of the amount of computation required to construct the embedding: for example, Barnes-Hut t-SNE requires only 751 seconds to embed all 70,000 MNIST digits when θ = 0.5, whereas the original t-SNE algorithm would have taken many days to complete. The results presented in the figure also suggest that dual-tree t-SNE has a slightly worse speed-accuracy trade-off than Barnes-Hut t-SNE: Barnes-Hut t-SNE with θ = 0.5 leads to an embedding of slightly higher quality than dual-tree t-SNE with θ = 0.2, whilst at the same time requiring fewer computational resources.”



However, the combination still does not teach said:

C.	“wherein the multi-modal learning techniques comprise using a Barnes-Hut approximation to identify the object”.
Accordingly, Liu teaches:
wherein the multi (via “10” or “100” or “100,000”)-modal (comprising a particular type or “class”) learning (via figs.1,4: “learning”) techniques (i.e., clustering as shown in fig. 2) comprise using a Barnes-Hut approximation (via fig. 2:“(c) MS1M, where the features of CIFAR-100 and MS1M are visualized by Barnes-Hut t-SNE”) to identify the object (via fig. 4(a): right-side: “labeled data” via page 75:
“2.1 Toy Examples
To investigate the aforementioned observation from small-scale to large-scale tasks and from low dimensional to high dimensional latent space, we empirically analyze three tasks with different data scales, feature dimension and network structure, i.e. character classification on MNIST [33] with 10 classes, object classification on CIFAR-100 [34] with 100 classes, and face recognition on MS1M [35] with 100,000 classes. Table 1 records the detailed settings for these experiments. To each task, there are two FC layers after its backbone structure, in which FC1 learns an internal feature vector f and FC2 acts as the projection onto the class space. All tasks employ the softmax loss. Figure 2 depicts the feature spaces extracted from different datasets, in which the 2-D features in MNIST are directly plotted and the 128-D features in CIFAR-100 and MS1M are compressed by Barnes-Hut t-SNE [36].”).

Thus, one of ordinary skill in the art of recognition as taught by Rhoads and Liu can modify Rhoads’ fig. 2: “FACIAL RECOGNITION”, serving as the basis for language translation in the previous combination of Elliot and comprised by an image processing mode of multiple image processing modes such as a name from image mode and text recognition mode, with Liu’s fig. 4: “CNN” by:
a)	making Rhoads’ language translation of fig. 2: “FACIAL RECOGNITION”, comprised by an image processing mode of multiple modes, be as Liu’s fig. 4: “CNN”, comprised by an image processing mode of said multiple image processing modes;

b)	performing language translation via said fig. 2: “FACIAL RECOGNITION” being as Liu’s fig. 4: “CNN”;

c)	visualizing or observing the facial recognition task inside Liu’s fig. 4: “CNN”:
c1)	running Barnes-Hut t-SNE regarding visualizing the recognition of faces via Liu’s fig. 2(c): “MS1M” comprising 1 million faces and 100,000 modes;

d)	being “Inspired by the observation” of 1 million faces and the 100,000 classes, Liu, cited below;	

e)	congregating “the unlabeled data into the recognition system” after being inspired; and

f)	recognizing that the modification is predictable or looked forward to because the modification is used “to enhance its discriminative ability” or enhance the recognition system’s (as shown in Liu fig. 4(a):“semi-supervised learning”) ability to discriminate between faces via multiple modes/classes/clusters so as to identify a face during an image processing mode of multiple image processing modes and during language translation regarding people: Elliot’s page 14: people at beach via Liu, page 78, section 3 Approach:

“Inspired by the observation stated in the previous section, we propose a novel learning mechanism to wisely congregate the unlabelled data into the recognition system to enhance its discriminative ability. Let X^L denote the labelled dataset with M classes and X^U the unlabelled dataset. We first cluster the X^U by [24] and get N clusters. According to the property w_n ≈ c_n discussed in the previous section, the ad hoc centroid c^U from an unlabelled cluster can be used to build up the corresponding anchor vector w^U, which means that it is possible to utilize the ad hoc centroid for a faithful classification of the unlabelled cluster.”


Regarding claim 10, Rhoads as combined teaches the computer program product as in claim 9, wherein the program instructions to identify, by the computing device, one or more sources of the object in the image further comprise: 
program instructions (for fig. 3: “CPU” or fig. 81:542: “Processor”) to determine, by the computing device (said fig. 3: “CPU” or fig. 81:542: “Processor”), a location (via fig. 2: left side: “WHAT’S NEARBY?”) of the user device (fig. 0:box with buttons and “BOB” and “Show Times”); 
program instructions (for fig. 3: “CPU” or fig. 81:542: “Processor”) to generate, by the computing device (said fig. 3: “CPU” or fig. 81:542: “Processor”), a list (or “a ranked list”) of sources (said i.e. “resources...serve as sources” such that “Object identification events will… associate public domain information and social-web connections to” said “Show Times” represented as “links”) of the object (said edge said via “edge detection”) based on the location (said via fig. 2: left side: “WHAT’S NEARBY?”) of the user device (fig. 0:box with buttons and “BOB” and “Show Times” via:
“[0297] In one particular arrangement, visual "baubles" (FIG. 0) are overlaid on the captured imagery.  Tapping on any of the baubles pulls up a screen of information, such as a ranked list of links.  Unlike Google web search--which ranks search results in an order based on aggregate user data, the camera application attempts a ranking customized to the user's profile.  If a Starbucks sign or logo is found in the frame, the Starbucks link gets top position for this user.”); and 

program instructions (for fig. 3: “CPU” or fig. 81:542: “Processor”) to present (via “a screen of information”, cited above), by the computing device (said fig. 3: “CPU” or fig. 81:542: “Processor”), the list (or “a ranked list”) of sources (said i.e. “resources... serve as sources” such that “Object identification events will… associate public domain information and social-web connections to” said “Show Times” represented as “links”) of the object (said edge said via “edge detection”) to a user (said “BOB”) on the user device (fig. 0:box with buttons and “BOB” and “Show Times”), thereby allowing the user (said “BOB”) to compare respective conditions (said “Show Times”) and locations (said via fig. 2: left side: “WHAT’S NEARBY?”) for each object identified within the received image (or an object as shown by any one object in fig. 0 via said edge said via “edge detection”).  
Regarding claim 13, Rhoads as combined teaches the computer program product as in claim 9, wherein the screenshot is captured (as shown in fig. 0:box with buttons and “BOB” and “Show Times”) from the display (said as indicated in fig. 0: “BOB” and “My Car” and “Show Times” for “GODZILLA!” and fig. 81: “Display”) displaying at least one of the group consisting of: a movie (said “GODZILLA!”), a television program, and a commercial.  

Regarding claim 14, Rhoads as combined teaches the computer program product as in claim 9, wherein the multi-modal learning techniques (said via “Artificial intelligence techniques” and “Known artificial intelligence systems and techniques” represented in fig. 14:14: “PROCESSING FOR OBJECT IDENTIFICATION” such as fig. 2: “FACIAL RECOGNITION” comprised by said “different image-processing modes” as modified via the combination) further comprise at least one of the group consisting of: a neural network, a convolutional neural network (CNN), a background subtraction technique, a k-means algorithm (or “k-means algorithm”), and a t-Distributed Stochastic Neighbor Embedding (t-SNE) (said via “Artificial intelligence techniques” and “Known artificial intelligence systems and techniques” represented in fig. 14:14: “PROCESSING FOR OBJECT IDENTIFICATION” such as fig. 2: “FACIAL RECOGNITION” comprised by said “different image-processing modes” as modified via the combination via:
“[0664] One way to determine which metrics are most salient for these purposes is to compute a variety of different image metrics for the reference images.  If the results within a category of images for a particular metric are clustered (e.g., if, for place-centric images, the color histogram results are clustered around particular output values), and if images in other categories have few or no output values near that clustered result, then that metric would appear well suited for use as an image assessment criteria.  (Clustering is commonly performed using an implementation of a k-means algorithm.)”).  

Regarding claim 16, claim 16 is rejected on the same basis as claim 9. Thus, the argument presented for claim 9 is equally applicable to claim 16.
Regarding claim 17, claim 17 is rejected on the same basis as claim 10. Thus, the argument presented for claim 10 is equally applicable to claim 17.
Regarding claim 20, claim 20 is rejected on the same basis as claim 14. Thus, the argument presented for claim 14 is equally applicable to claim 20.

Claim 12 is rejected under 35 U.S.C. 103 as being unpatentable over Rhoads et al. (US 2014/0080428 A1) in view of Ueda et al. (US Patent App. Pub. No.: US 2020/0027244 A1) and Elliot et al. (MULTILINGUAL IMAGE DESCRIPTION WITH NEURAL SEQUENCE MODELS) and van der Maaten (Accelerating t-SNE using Tree-Based Algorithms) and Liu et al. (Transductive Centroid Projection for Semi-supervised Large-Scale Recognition) as applied above further in view of Hudson et al. (US Patent 9,195,819).
Regarding claim 12, Rhoads as combined teaches the computer program product as in claim 9, further comprising:
program instructions (for fig. 3: “CPU” or fig. 81:542: “Processor”) to receive, by the computing device (said fig. 3: “CPU” or fig. 81:542: “Processor”), a request (or “request”) to acquire the object (said object via said edge said via “edge detection” or “a further set of input data”) from the user device (said fig. 0:box with buttons and “BOB” and “Show Times”) from one (via “the cloud resource”) of the one or more sources (said i.e. “resources... serve as sources” such that “Object identification events will… associate public domain information and social-web connections to” said “Show Times” represented as “links” via:
“[0451] In turn, the cloud resource may alert the cell phone of any information it expects might be requested from the phone in performance of the expected operation, or action it might request the cell phone to perform, so that the cell phone can similarly anticipate its own forthcoming actions and prepare accordingly.  For example, the cloud process may, under certain conditions, request a further set of input data, such as if it assesses that data originally provided is not sufficient for the intended purpose (e.g., the input data may be an image without sufficient focus resolution, or not enough contrast, or needing further filtering).  Knowing, in advance, that the cloud process may request such further data can allow the cell phone to consider this possibility in its own operation, e.g., keeping processing modules configured in a certain filter manner longer than may otherwise be the case, reserving an interval of sensor time to possibly capture a replacement image, etc.”);
program instructions (for fig. 3: “CPU” or fig. 81:542: “Processor”) to verify, by the computer device (said fig. 3: “CPU” or fig. 81:542: “Processor”), the request (said “request”) to acquire the object (said object via said edge said via “edge detection” or “a further set of input data”, cited above); and 
program instructions (for fig. 3: “CPU” or fig. 81:542: “Processor”) to send, by the computer device (said fig. 3: “CPU” or fig. 81:542: “Processor”), the request (said “request”) to acquire the object (said object via said edge said via “edge detection”) to the source (said “the cloud resource”).  
Thus, Rhoads does not teach, as shown in bold above:
“program instructions to verify, by the computer device, the request”.

Accordingly, Hudson teaches:
program instructions (as shown in fig. 8) to verify (via “verifying ownership” via fig. 8:815: “determine if…genuine”), by the computer device (via fig. 1:100 and 115), the request (as “requested by the server”, represented in fig. 8 as back-arrows from fig. 8:810 back to 805, a signature-marking step, and from fig. 8:840 back to said 805, via c. 14, ll. 32-58:
“The next step in verifying ownership of a physical book is that the server (115) sends a message to the client (100) to instruct the client operator (120) to mark their physical book in a specific way.  The marking is typically made in a permanent manner, for example using permanent ink.  In one embodiment of the invention, the server instructs the client to instruct the user to write their name on the physical book's copyright page (505).  Once the user has written his or her name on the physical book in the place requested by the server (e.g. the copyright page), the user is required to capture an image of the mark using the personal electronic device.  For example, using the capture image button (510) illustrated in FIG. 5, the user may capture an image of their name written on the copyright page.  The captured image may include the entire copyright page and the page facing the copyright page (525), for example.  Onscreen guidelines (515) and a live preview (525) from the smartphone's image sensor and flash (520) are provided to aid the user in aligning the physical book with the user's mark visible with the angle of imaging requested by the server.  This image of the physical book's copyright page and facing page with the user's hand written name on the copyright page is transmitted from the client (100) to the server (115).  Additional images captured while the user is aligning the book in the onscreen preview (525) with the on-screen guidelines (515) may also be sent to the server for analysis and/or human review.  The images are processed by the server in the following ways:”).

Thus, one of ordinary skill in the art of computer requests and book-stores (or an “Amazon” “bookstore” via Rhoads:
“[0302] Consider a user located in a small bookstore who snaps a picture of the Warren Buffet biography Snowball.  The book is quickly recognized, but rather than presenting a corresponding Amazon link atop the list (as may occur with a regular Google search), the cell phone recognizes that the user is located in an independent bookstore.  Context-based rules consequently dictate that it present a non-commercial link first.  Top ranked of this type is a Wall Street Journal review of the book, which goes to the top of the presented list of links.  Decorum, however, only goes so far.  The cell phone passes the book title or ISBN (or the image itself) to Google AdSense or AdWords, which identifies sponsored links to be associated with that object.  (Google may independently perform its own image analysis on any provided imagery.  In some cases it may pay for such cell phone-submitted imagery--since Google has a knack for exploiting data from diverse sources.) Per Google, Barnes and Noble has the top sponsored position, followed by alldiscountbooks-dot-net.  The cell phone application may present these sponsored links in a graphically distinct manner to indicate their origin (e.g., in a different part of the display, or presented in a different color), or it may insert them alternately with non-commercial search results, i.e., at positions two and four.  The AdSense revenue collected by Google can again be shared with the user, or with the user's carrier.”)

can modify Rhoads’ request from the cloud resource to include the signature-marking verification that determines genuineness, as shown in Hudson’s fig. 8:815: “determine if…genuine”, and recognize that the modification is predictable or looked forward to because Hudson’s signature-marking verification responds to market forces and piracy, as represented by Hudson’s fig. 1:120: “consumers” (one of which is said “BOB”) that “resent the need to re-buy at full price”, thus allowing “BOB” to purchase e-books at a discount instead of pirating them, via Hudson:
“Digital media content consumers (e.g. readers of eBooks or digital music listeners) generally resent the need to re-buy at full price an electronic copy of a physical work that they already own.  This resentment is evident in the profusion of "format shifting" of digital music from CDs to digital files (e.g. MP3s) for use on portable music players.”

Claims 15 and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Rhoads et al. (US 2014/0080428 A1) in view of Ueda et al. (US Patent App. Pub. No. US 2020/0027244 A1), Elliot et al. (MULTILINGUAL IMAGE DESCRIPTION WITH NEURAL SEQUENCE MODELS), van der Maaten (Accelerating t-SNE using Tree-Based Algorithms), and Liu et al. (Transductive Centroid Projection for Semi-supervised Large-Scale Recognition) as applied above, further in view of Hudson et al. (US Patent 9,195,819) as applied above, and further in view of Kim et al. (US Patent 10,091,654).
Regarding claim 15, Rhoads as combined teaches the computer program product as in claim 12, wherein the request (said “request”) to acquire the object (said object via said edge said via “edge detection” or “a further set of input data”) is verified (via said “request” as modified via the combination) using a biometric sensor on the user device (said fig. 0:box with buttons and “BOB” and “Show Times”).  
Thus, Rhoads as combined does not teach, as shown in bold above:
“using a biometric sensor on the user device”.
Accordingly, Kim teaches:
using a biometric sensor (or “a second ECG sensor 1122 as an external biometric sensor”) on (as shown in fig. 11:1122 on 1150) the user device (or “an external authentication device 1150” via Kim, c. 13, ll. 22-43:
“In another example, as shown in FIG. 11, a user authentication apparatus 1100 includes a first ECG sensor 1121 as a biometric sensor, and an external authentication device 1150, such as a smartphone or a tablet computer, includes a second ECG sensor 1122 as an external biometric sensor.  When an electrical contact between a touch display 1160 of the external user authentication apparatus 1150 and a body 1190 of the user authentication apparatus 1100 is formed, and a user touches the external biometric sensor 1122 and the body 1190 of the user authentication apparatus 1100 with both hands 1109, respectively, an electrical path passing through a heart of the user is formed.  A biometric sensor, for example, the first ECG sensor 1121, of the user authentication apparatus 1100 measures an ECG signal in response to the contact between the external biometric sensor 1122 and the body 1190 being sensed.  A processor of the user authentication apparatus 1100 verifies an identity of the user based on the ECG signal measured by the external biometric sensor 1122 and the biometric sensor 1121, and authenticates the user based on an identified signature and the verified identity.”).

Thus, one of ordinary skill in the art of hand-signatures can modify Rhoads’ said fig. 0:box with buttons and “BOB” and “Show Times”, and said “request” as modified via the combination, with Kim’s teaching of fig. 11 by attaching Kim’s fig. 11:1122 to Rhoads’ said fig. 0:box with buttons and connecting Kim’s fig. 11:1190 to Rhoads’ fig. 0:box with buttons with the display of “BOB” and “Show Times”, and recognize that the modification is predictable or looked forward to because the modification results in a “user authentication apparatus…having a relatively high security level by combining identification results” via Kim, c. 14, ll. 4-15:
“For example, while the user writes with the body 1290 of the user authentication apparatus 1200, the user authentication apparatus 1200 performs signature identification 1210, fingerprint identification 1221, ECG identification 1222, and PPG identification 1223, and the external authentication device 1280 performs the voice identification 1281 and the face identification 1282.  The user authentication apparatus 1200 may provide an authentication solution having a relatively high security level by combining identification results.  In this example, the user authentication apparatus comprehensively utilizes the identifications, and assigns a weight to a situation with respect to each identification result.”
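For illustration only, the weighted combination of identification results described in the Kim passage quoted above can be sketched as follows. This is a minimal sketch and not Kim's disclosed implementation: the function names, the score range of 0 to 1, the weighted-average rule, and the threshold value are all assumptions made for this example.

```python
from typing import Dict


def fuse_identification_scores(scores: Dict[str, float],
                               weights: Dict[str, float]) -> float:
    """Return the weighted average of per-modality match scores.

    Assumption: each score is a match confidence in [0, 1] and each
    modality named in `scores` has a non-negative entry in `weights`.
    """
    missing = [name for name in scores if name not in weights]
    if missing:
        raise KeyError(f"no weight given for: {missing}")
    total_weight = sum(weights[name] for name in scores)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted_sum = sum(scores[name] * weights[name] for name in scores)
    return weighted_sum / total_weight


def is_authenticated(scores: Dict[str, float],
                     weights: Dict[str, float],
                     threshold: float = 0.8) -> bool:
    """Accept the user when the fused score reaches the (assumed) threshold."""
    return fuse_identification_scores(scores, weights) >= threshold


# Hypothetical scores for the modalities named in the quoted passage.
example_scores = {"signature": 1.0, "ecg": 0.5}
example_weights = {"signature": 1.0, "ecg": 1.0}
fused = fuse_identification_scores(example_scores, example_weights)
```

With equal weights the fused score is the plain mean of the individual scores. Raising the weight of one modality makes its result count for more in the final decision, which is one way to read "assigns a weight ... with respect to each identification result".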

Regarding claim 19, claim 19 is rejected for the same reasons as claims 12 and 15. Thus, the arguments presented for claims 12 and 15 are equally applicable to claim 19.
Suggestions

Applicant’s disclosure states, emphasis added:
“[0016] Embodiments of the present invention provide a method, computer program, and computer system for detecting an object contained within an image and identifying the object along with sources of the object and/or sources related to the object. Embodiments of the present invention also provide a method, computer program, and computer system for displaying an object on an image of the user and providing a means for purchasing, renting, borrowing, or otherwise acquiring the object. More particularly, embodiments of the present invention receive an image from a display, analyze the image for retail or non-retail objects and generate a list of sources of the retail or non-retail objects for presentation to a user. Advantages of the invention over current technology include saliency detection of objects within images, delamination of retail objects and non-retail objects, multiple source identification based on location of a user, image creation of the user with the identified objects, and object acquisition verification using a biometric sensor.”

	Thus, as a whole (as indicated in suggested claim 9 below), the “Advantages” ought to be apparent. In contrast, Rhoads uses filenames to identify sources of metadata of a truck website at [0472] and identifies a multiple-classified vehicle/truck image (a form of multi-modal classification wherein vehicle is one mode and truck is the other mode), via recognized letters in fig. 68: “GMC”, from a truck website source/resource [0875]. Thus, the truck website source/resource is used to specifically identify, by make and model, the multiple-classified truck after identifying the letters “GMC”.
	In contrast, suggested claim 9 states the opposite:
“program instructions to identify, by the computing device, one or more sources of the identified object in the classified image”. 

Note that these suggestions are not provided with respect to overcoming 35 USC 101, 112, 102 and/or 103. These suggestions are mainly provided to seek out advantages in the disclosure regardless of 35 USC 101, 112, 102 and/or 103:

9. (Suggested & not searched: also see co-pending application 16/460,286: Office Action Appendix: 9/20/21: showing a similar suggested claim and corresponding interview summary of 9/20/21 regarding making any differences clear under 35 USC 103 as an indication of non-obviousness regarding the above opposite limitation) A computer program product for object detection and identification, the computer program product comprising: 

a computer-readable storage medium having program instructions embodied therewith, wherein the computer readable storage medium is not a transitory signal per se, the program instructions comprising: 

program instructions to receive, by a computing device, an image from a user device, wherein the image is a screenshot captured by the user device from a display; 

program instructions to classify, by the computing device, the screenshot image, wherein the screenshot image is classified based on features present in the screenshot image; 

program instructions to detect, by the computing device, an object contained within the classified image, wherein the object is a salient object; 

program instructions to identify, by the computing device, the detected object in the classified image, wherein the detected object to be identified is identified using multi-modal learning techniques, and wherein the multi-modal learning techniques comprise using a Barnes-Hut approximation to identify the detected object; 

program instructions to identify, by the computing device, one or more sources of the identified object in the classified image;

program instructions to receive, by the computing device, a second image, the second image being an image of a user, from the user device; and 

program instructions to generate, by the computing device, a third image, wherein the third image is generated by superimposing the identified object onto the user depicted in the received second image, and wherein the third image is generated using at least one convolutional neural network.  

Conclusion
THIS ACTION IS MADE FINAL.  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
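As an informal aid only, the date arithmetic in the preceding paragraph can be sketched as follows. The function names are hypothetical, the dates are illustrative and are not the dates of this action, and the sketch assumes that a month-end mailing date rolls to the last day of a shorter target month. It does not model extension fees, weekends, or holidays, and it is not a substitute for 37 CFR 1.136(a).

```python
import calendar
from datetime import date
from typing import Optional


def add_months(start: date, months: int) -> date:
    """Add calendar months, clamping the day to the target month's length."""
    month_index = start.month - 1 + months
    year = start.year + month_index // 12
    month = month_index % 12 + 1
    day = min(start.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def reply_period_end(mailing: date,
                     first_reply: Optional[date] = None,
                     advisory_mailed: Optional[date] = None) -> date:
    """Return the end of the shortened statutory period for a final action.

    Default: three months from the mailing date. If a first reply was filed
    within two months and the advisory action was mailed after the
    three-month date, the period ends on the advisory mailing date, but
    never later than six months from the mailing date.
    """
    two_months = add_months(mailing, 2)
    three_months = add_months(mailing, 3)
    six_months = add_months(mailing, 6)
    if (first_reply is not None and advisory_mailed is not None
            and first_reply <= two_months
            and advisory_mailed > three_months):
        return min(advisory_mailed, six_months)
    return three_months


# Hypothetical dates for illustration.
default_end = reply_period_end(date(2021, 10, 1))
extended_end = reply_period_end(date(2021, 10, 1),
                                first_reply=date(2021, 11, 15),
                                advisory_mailed=date(2022, 1, 20))
```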
Any inquiry concerning this communication or earlier communications from the examiner should be directed to DENNIS ROSARIO whose telephone number is (571)272-7397. The examiner can normally be reached Monday-Friday, 9AM-5PM EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Matthew Bella, can be reached at (571)272-7778. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/DENNIS ROSARIO/Examiner, Art Unit 2667 

/MATTHEW C BELLA/Supervisory Patent Examiner, Art Unit 2667