Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
DETAILED ACTION

Claim Rejections - 35 USC § 102
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.

Claims 20-21 are rejected under 35 U.S.C. 102(a)(2) as being anticipated by US 20160350336 A1; Checka; Neal et al. (hereinafter Checka).
Regarding claim 20, Checka teaches A method of searching a collection of objects based on visual and semantic similarity of unified representations of the collection of objects comprising the steps of: determining a unified descriptor for the search query, where the search query comprises both image 3D shape or object data and word or tag data; (Checka [0031] To design a deep learning architecture, the present methods and systems may implement various transfer learning strategies. Examples of such strategies include, but are not limited to: [0032] Treating the convolutional neural network as a fixed feature extractor: Given a convolutional neural network pre-trained on ImageNet, the last fully connected layer may be removed, then the convolutional neural network may be treated as a fixed feature extractor for the new dataset. ImageNet is a publicly available image dataset including 14,197,122 annotated images (disclosed in J. Deng, W. Dong, R. Socher, L. Li, and F-F. Li, "ImageNet: A Large Scale Hierarchical Image Database", IEEE Conference on Computer Vision and Pattern Recognition, 2009), which publication is hereby incorporated by reference in its entirety. The result may be an N-D vector, known as a convolutional neural network code, which contains the activations of the hidden layer immediately before the classifier/output layer. The convolutional neural network code may then be applied to image classification or search tasks as described further below. [0033] Fine-tuning the convolutional neural network: Given an already learned model, the architecture may be adapted and backpropagation training may be resumed from the already learned model weights. One can fine-tune all the layers of the convolutional neural network, or keep some of the earlier layers fixed (due to overfitting concerns) and then fine-tune some higher-level portion of the convolutional neural network. [0036] Prior to analyzing images represented by an image dataset, each image may be resized (or sub-window cropped) to a canonical size (e.g., 224×224). Each cropped image ...)
determining one or more objects in the collection of objects having a spatially close vector representation to the unified descriptor for the search query. (Checka [0031] To design a deep learning architecture, the present methods and systems may implement various transfer learning strategies. Examples of such ...)
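For illustration only, the "spatially close vector representation" limitation, as mapped above to Checka's convolutional neural network code, can be pictured as a nearest-neighbour lookup over descriptor vectors. The Python sketch below is not taken from Checka; the function name `nearest_objects`, the choice of Euclidean distance, and the toy vectors are all assumptions made for the example.

```python
import numpy as np


def nearest_objects(query, collection, k=3):
    """Return the indices of the k collection vectors closest to the query.

    Hypothetical helper: Euclidean distance is assumed as the notion of
    "spatially close"; the cited reference does not fix a particular metric.
    """
    collection = np.asarray(collection, dtype=float)
    query = np.asarray(query, dtype=float)
    # Distance from the query to every stored descriptor.
    distances = np.linalg.norm(collection - query, axis=1)
    # Smallest distances first.
    return np.argsort(distances)[:k].tolist()


# Toy descriptors standing in for N-D convolutional neural network codes.
codes = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [0.9, 1.2]])
closest = nearest_objects([1.0, 1.0], codes, k=2)
```

In this toy collection the two closest entries to the query are the identical vector at index 1 and the nearby vector at index 3.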
Regarding claim 21, Checka teaches The method of claim 20, wherein the unified descriptor or vector representation of a shape is an average of one or more rendered views of the shape. (Checka [0016] FIGS. 7 and 8 are screenshots of re-ranking search results based on color and shape. [0053] The product discovery process enables a user (e.g., the customer) to visually browse a product inventory based on attributes computed directly from a specimen image. The process employs an algorithm that describes images with a multi-feature representation using visual qualities (e.g., image descriptors) such as color, shape and texture. Each visual 


Claim Rejections - 35 USC § 103

 The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-9, 15 and 17 are rejected under 35 U.S.C. 103 as being unpatentable over US 20030004966 A1; Bolle, Rudolf M. et al. (hereinafter Bolle) in view of US 20170337690 A1; ARTH; Clemens et al. (hereinafter Arth).
Regarding claim 1, Bolle teaches A method for combining image data … and tag data into a unified representation, comprising the steps of: determining a vector representation for the image data in a vector space of words; (Bolle [0028] The representation of a video segment is a vector of representations of the constituent frames in the form of an ordinal measure of a reduced intensity image of each frame. Before matching, the database is prepared for video sequence by computing the ordinal measure for each frame in each video segment in the database. Finding a match between some given action video sequence and the databases then amounts to ...)
determining a vector representation for the tag data in the vector space of words; (Bolle [0014] After text feature extraction, a new vector representation of each text item associated with the training data is then extracted in terms of how frequently each selected feature occurs in that item. The vector representation may be binary, simply indicating the presence or absence of each feature, or it may be numeric in which each numeric value is derived from a count of the number of occurrences of each feature. [0080] In the learning phase, off-line supervised or unsupervised learning, using a training set of labeled or unlabeled multimedia items as appropriate, is employed to induce a classifier based on patterns found in a unified representation as a single feature vector of disparate kinds of features, linguistic and visual, found in a media item under consideration. The unified representation of disparate features in a single feature vector will enable a classifier to make use of more complicated patterns for categorization, patterns that simultaneously involve linguistic and visual aspects of the media, resulting in superior performance as compared with other less sophisticated techniques. This allows for better designed and more precisily performing business processes. First, for each media item, the accompanying text is represented by a ...)
and combining the vector representations by performing vector calculus. (Bolle [0055] There are feature-based approaches, too, that do not rely on word co-occurrence or correspondences, for example, Litman and Passoneau. Here a set of word features is developed. These features are derived from multiple knowledge sources: prosodic features, cue phrase features, noun phrase features, combined ...)
one or more 3D shapes, determining a vector representation for the one or more 3D shapes in the vector space of words (Arth [0035] At block 160, the method determines a dynamic representation from the camera pose estimate of the input image from block 115 and the 2.5D or 3D map/model from block 125. In one embodiment, the dynamic representation is compatible with the selected one or more static representations of block 130. For example, if the static representation is a depth map (e.g., depth map 139) the dynamic representation may be created as a matrix of depth values representing the distance of the objects in the model of block 125 to the camera pose from block 115. In one embodiment, when correlating to a static representation depth map or normal vector map, the dynamic ...)
Regarding claim 2, Bolle and Arth teach The method of claim 1, wherein semantically close words are mapped to spatially close vectors in the vector space of words. (Bolle [0013] The resolution of this issue can depend on the details of the supervised learning technique employed, but, in applications related to text, local dictionaries generally give better performance. There are a variety of criteria for judging relevance during feature extraction. A simple one is to use absolute or normalized frequency to compile a list of a fixed number n of the most frequent features for each category, taking into account the fact that small categories may be so underpopulated that the total number of features in them may be less than n. More sophisticated techniques for judging relevance involve the use of information-theoretic measures such as entropy or the use of statistical methods such as principal component analysis.  [0054] Ponte and Croft, use a similar technique, except that they "expand" each word in a partition by looking it up in a "thesaurus" and taking all of the words in the same concept group that the seed word was in. (This is an attempt to overcome co-ocurrence, or correspondence, failures due to the use of synonyms or hypernyms, when really the same underlying concept is being referenced.) Ponte and Croft bootstrap the correspondences by developing a document-specific thesaurus, using "local context analysis" of labeled documents. Then, to find the best co-occurence sub-matrices, instead of exhaustively considering all possibilities, they use a dynamic programming technique, minimizing a cost function. Kozima et al. perform a similar word "expansion," by means of "spreading activation" in a linguistic semantic net. Two words are considered to be co-occurrences of, or corresponding to, each other if and only if each can be 
Regarding claim 3, Bolle and Arth teach The method of claim 1, wherein the vector representations are determined by: determining a vector for each of the image data in the vector space of words; (Bolle [0028] Mohan defines that there ...)
and embedding the image data and tag data in the vector space of words. (Bolle [0080] Hence, each key interval is encoded by a sparse textual feature vector and a sparse visual feature vector. The sparse textual feature vectors and the sparse visual feature vectors may optionally need to be further transformed to assure their compatibility in various ways, such as (1) with respect to range of the values appearing in the two kinds of vectors or (2) with respect to the competitive sizes of the two kinds of vectors with respect to some norm or measure. The textual feature vector and the visual feature vector are combined by concatenation to produce a unified representation in a single vector of disparate kinds of features. Having created the unified representation of the training data, standard methods of classifier induction are then used, followed by appropriate business processes and business decisions. [0106] FIG. 15 shows the process of combining visual and textual feature vectors to obtain a vector representing the disparate sources of information in the media item. [0126] In 430, the system constructs a single vector representation of the visual features extracted or associated with each media item in D. In 440, for each labeled media item in the data set D, the system constructs a training set T(D) by combining the two vector representations of that media segment (constructed in 420 and 430) into a single composite feature vector, with the resulting vector labeled by the same set of classes used to label the media item. Optionally, in 440, before combining the vector representations ... [0160] the visual feature of average optical flow in key intervals can be ordered from high to low in ...)
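For illustration only, the combination described in the cited Bolle [0080] passage (a textual feature vector and a visual feature vector, optionally transformed for compatibility and then concatenated into a single vector) might be sketched as follows. This is not Bolle's implementation; the function name `unify` and the use of L2 normalization as the compatibility transform are assumptions made for the example.

```python
import numpy as np


def unify(textual, visual):
    """Concatenate a textual and a visual feature vector into one vector.

    Assumption: each vector is scaled to unit L2 norm first, as one possible
    way of making the two kinds of vectors comparable in size.
    """
    def l2_normalize(x):
        norm = np.linalg.norm(x)
        # Leave an all-zero vector untouched to avoid dividing by zero.
        return x / norm if norm > 0 else x

    t = l2_normalize(np.asarray(textual, dtype=float))
    v = l2_normalize(np.asarray(visual, dtype=float))
    return np.concatenate([t, v])


# Toy sparse vectors: two textual features and three visual features.
unified = unify([3, 4], [0, 0, 2])
```

The resulting vector has one entry per textual feature followed by one entry per visual feature, so a downstream classifier sees both modalities at once.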
Regarding claim 4, Bolle and Arth teach The method of claim 1, wherein the step of combining the vector representations by performing vector calculus comprises determining linear combinations of the vector representations in the vector space of words. (Bolle [0008] To develop a procedure for identifying media items as belonging to particular classes or categories, (or for any classification or pattern recognition task, for that matter) supervised learning technology can be based on decision trees, on logical rules, or on other mathematical techniques such as linear discriminant methods (including perceptron, support vector machines, and related variants), nearest neighbor methods, Bayesian inference, etc. We can generically refer to the output of such supervised learning systems as classifiers.   [0187] Given the combined feature vectors F(t), i.e., the vector representing the visual information F.sub.v(t) combined with the vector representing the textual information F.sub.t(t), each block can be classified into a category. One way to achieve this is to use a classifier to categorize every block independently using the combined feature vector of the block. A series of heuristic 
Regarding claim 5, Bolle and Arth teach The method of claim 1, wherein the image data or tag data comprises one or more weights. (Bolle [0001] This invention relates to the business of handling multimedia information (media items), such as video and images that have audio associated with it or possibly have text associated with it in the form of captions. More specifically, the invention relates to the business of handling video and audio by processing the video and audio for supervised and unsupervised machine learning of categorization techniques based on disparate information sources such as visual information and speech transcript. The invention also relates to combining these disparate information sources in a coherent fashion to make business decisions. [0011] From these feature vectors, the computer induces classifiers based on patterns or properties that characterize when a media segment belongs to a particular category. The term "pattern" is meant to be very general. These patterns or properties may be presented as rules, which may sometimes be easily understood by a human being, or in other, less accessible formats, such as a weight vector and threshold used to partition a vector space with a hyperplane. Exactly what constitutes a pattern or property in a classifier depends on the particular machine learning technology employed....[0039] Reference Smith et al. is incorporated herein in its entirety. A sophisticated video database browsing systems is described, the authors refer to browsing as "skimming." Much emphasis is placed on visual analysis for video interpretation and video summarization (the 
Regarding claim 6, Bolle and Arth teach The method of claim 1, further comprising performing the step of determining a vector representation for the image data using a neural network, optionally a convolutional neural network. (Bolle [0001] This invention relates to the business of handling multimedia information (media items), such as video and images that have audio associated with it or possibly have text associated with it in the form of captions. More specifically, the invention relates to the business of handling video and audio by processing the video and audio for supervised and unsupervised machine learning of categorization techniques based on disparate information sources such as visual information and speech transcript. The invention also relates to combining these disparate information sources in a coherent fashion to make business decisions.[0006] Multimedia collections may also be categorized based on data content, such as the amount of green or red in images or video and sound frequency 
Regarding claim 7, Bolle and Arth teach The method of claim 6, further comprising generating the neural network by an image classifier followed by an image encoder operable to generate embeddings in the vector space of words. (Bolle [0080] In the learning phase, off-line supervised or unsupervised learning, using a training set of labeled or unlabeled multimedia items as appropriate, is employed to induce a classifier based on patterns found in a unified representation as a single feature vector of disparate kinds of features, linguistic and visual, found in a media item under consideration. The unified representation of disparate features in a single feature vector will enable a classifier to make use of more complicated patterns for categorization, patterns that simultaneously involve linguistic and visual aspects of the media, resulting in superior performance as compared with other less sophisticated techniques. This allows for better designed and more precisily performing business processes. First, for each media item, the accompanying text is represented by a sparse textual feature vector. Secondly, for each media item, a set of key frames or key intervals (key intervals, for short) is determined, which can either be regularly sampled in time or based on the information content. From each key interval, a set of features is extracted from a number of regions in the key intervals. These regions can be different for each feature. The extracted features are coarsely quantized. Hence, each key interval is encoded by a sparse textual feature vector and a sparse visual feature vector. The sparse textual feature vectors and the sparse visual feature vectors may optionally need to be further transformed to assure their compatibility in various ways... [0160] FIG. 13D shows a preferred specific methods for determining a visual feature vectors F.sub.v from the visual part of the media item 1310. The key frames or key intervals 1356, 1353, 1360, . . . 
, 1359 are selected from a media item. However, by 
Regarding claim 8, Bolle and Arth teach The method of claim 7, wherein the classifier is operable to be trained to identify image labels. (Bolle [0058] In sum, we can (roughly) distinguish the following approaches to media item categorization and media item subject detection; or, more generally, media item classification. The approaches are classified based on the features that are used. The features are derived from the raw analog signal, visual features computed from digitized media items frames ( images), textual features directly decoded from the closed-caption, and textual features obtained from automatically computed speech transcripts. Here is a list of common kinds of features used to classify multimedia items:... [0122] This system categorizing media items has two distinct aspects. The first aspect is called the training phase which builds representations of the reference media items; the second phase is called the categorization phase, where instances media items are categorized. The training phase is an off-line process that involves processing of the reference media items to form a set of one or more categories. The categorization phase classifies a media item in a collection of such items by processing the media 
Regarding claim 9, Bolle and Arth teach The method of claim 7, further comprising converting the image classifier to an encoder operable to generate semantic-based descriptors. ( Bolle  [0006] Multimedia collections may also be categorized based on data content, such as the amount of green or red in images or video and sound frequency components of audio segments. The media item collections have to be then preprocessed and the results have to be somehow categorized based on the visual properties. Categorizing media items based on semantic content, the actual meaning (subjects and objects) of the media items, on the other hand, is a difficult issue. For video, speech may be categorized or recognized to some extent, but beyond that, the situation is much more complicated because of the rudimentary state of the art in machine-interpretation of visual data. [0126] FIG. 4, flowchart 400, shows, when supervised learning is employed, the complete integration of disparate media modules in the learning phase, i.e., the induction from labeled data of a classifier whose purpose is media item categorization. In the initial step 410, the system accepts as input a data set D 
Regarding claim 15, Bolle and Arth teach The method of claim 1, further comprising: calculating, in advance of a query being received and/or processed, one or more descriptors; (Bolle [0036] Herein, first fifteen labels defined based on these visual features (by text, the authors, mean superimposed text in the video) are defined, examples are "talking head" and "one text line." A technique using Hidden Markov models (HMM) is described to classify a given media item into predefined categories, namely, commercial, news, sitcom and soap. An HMM takes these labels as input and has observation symbols as output. The system consists of two phases, a training and a classification stage. Reference Dimitrova et al. is incorporated herein in its entirety. [0221] vi. The automatic generation of MPEG-7 descriptors, as defined by the International Organisation for Standardisation/Organisation Internationale de Normalisation, ISO/IEC JTC1/SC29/WG11 specification "Coding of Moving Pictures and Audio." These descriptors are metadata items (digitally encoded annotations) which would be embedded in the bitstreams of videos (television; movies), sometime between the time of content creation ("filming" or "capture") and the time of broadcast/release. These metadata items are then available to all downstream processes (post-...))
receiving a query regarding the image data and/or the tag data; and providing a unified representation in relation to the query. (Bolle [0039] Reference Smith et al. is incorporated herein in its entirety. A sophisticated video database browsing systems is described, the authors refer to browsing as "skimming." Much emphasis is placed on visual analysis for video interpretation and video summarization (the construction of two-dimensional depictions of the video to allow for nonlinear access). Visual analysis include scene break detection, camera motion analysis, and object detection (faces and superimposed text). The audio transcript is used to identify keywords in it. Term frequency inverse document frequency techniques are used to identify critical words. Words that appear frequently in a particular video segment but occur infrequently in standard corpuses receive the highest weight. In Smith et al. the speech recognition is not automated yet, and closed-captioning is used instead. Video search is accomplished through the use of the extracted words as search keys, browsing of video summaries then allows for quickly finding the video of interest [0221] The automatic generation of ...)
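For illustration only, the term frequency inverse document frequency weighting mentioned in the cited Bolle [0039] passage (words frequent in a segment but infrequent in a standard corpus receive the highest weight) can be sketched as below. The function name `tf_idf`, the smoothed form of the inverse document frequency, and the toy data are assumptions; neither Bolle nor Smith et al. is quoted as specifying a formula.

```python
import math
from collections import Counter


def tf_idf(segment_words, corpus_docs):
    """Weight each word of a segment by term frequency times inverse document frequency.

    Assumption: a smoothed IDF, log((1 + N) / (1 + df)) + 1, is used so that
    words absent from the corpus do not cause a division by zero.
    """
    counts = Counter(segment_words)
    n_docs = len(corpus_docs)
    weights = {}
    for word, count in counts.items():
        # Number of corpus documents that contain the word.
        doc_freq = sum(1 for doc in corpus_docs if word in doc)
        idf = math.log((1 + n_docs) / (1 + doc_freq)) + 1.0
        weights[word] = (count / len(segment_words)) * idf
    return weights


# Toy transcript segment and a three-document reference corpus.
segment = ["goal", "goal", "the", "match"]
corpus = [{"the", "news"}, {"the", "weather"}, {"the", "match"}]
weights = tf_idf(segment, corpus)
top_word = max(weights, key=weights.get)
```

Here "goal" is repeated in the segment and absent from the corpus, so it receives the highest weight, while "the" appears in every corpus document and receives the lowest.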
Regarding claim 17, Bolle and Arth teach The method of claim 15, wherein: the image data comprises one or more embedded images; the one or more descriptors are calculated in relation to each of the one or more embedded images; (Bolle [0036] Herein, first fifteen labels defined based on these visual features (by text, the authors, mean superimposed text in the video) are defined, examples are "talking head" and "one text line." A technique using Hidden Markov models (HMM) is described to classify a given media item into predefined categories, namely, commercial, news, sitcom and soap. An HMM takes these labels as input and has observation symbols as output. The system consists of two phases, a training and a classification stage. Reference Dimitrova et al. is incorporated herein in its entirety. [0145] Categorization segments of a media item M, is aided by using some color quantization, for example, the following frame color codes. The color space of frames (images) has been extensively used for indexing and searching based on image content. Application of hue color code, a preferred embodiment of this invention, is comprised of a number of steps (see FIG. 11A). [0221] vi. The automatic generation of MPEG-7 descriptors, as defined by the International Organisation for Standardisation/Organisation Internationale de Normalisation, ISO/IEC JTC1/SC29/WG11 specification "Coding of Moving Pictures and Audio." These descriptors are metadata items (digitally encoded annotations) which would be embedded in the bitstreams of videos (television; movies), sometime between the time of content creation ("filming" or "capture") and the time of broadcast/release. These metadata items are then available to all downstream processes (post-production/editing stages of preparation of the complete video product, distribution channels for movie releases, or by receivers/viewers of the broadcast), for various purposes, in particular, retrieval from video archives by content-based querying (in other words, facilitating the finding of video clips of ...))
and a shape descriptor is calculated according to an average of the one or more descriptors in relation to each of the one or more embedded images, optionally wherein the average is biased according to one or more of the weights. (Bolle [0023] This can be a repetitive pattern of primitives (texels), or, can be more random, i.e., structural textures and statistical textures. Computational texture measures are either region-based or edge-based, trying to capture structural textures and statistical textures, respectively. In "VeggieVision" to Bolle et al., a texture representation of an image, image class, or image category, then, is a one-dimensional histogram of local texture feature values. Shape can also be represented in terms of frequency distribution. The information available to work with is the two-dimensional boundary of (say) a segmented image. Boundary shape is a feature of multiple boundary pixels and is expressed by a local computational feature, for example, curvature. Local curvature is estimated by fitting a circle at each point of the boundary. After smoothing, this boundary shape feature is quantized and a histogram is computed. Instead of over an area, such as for color histograms, these histograms are computed from a collection of image pixels that form the boundary of the object image. Finally, size of image segments is another feature of the images that is important in "VeggieVision" to Bolle et al. A method that ...)
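For illustration only, the claimed shape descriptor (an average of per-image descriptors, optionally biased by weights) can be pictured with the short Python sketch below. It is an example of the claim language rather than of any cited reference; the function name `shape_descriptor` and the toy descriptors are hypothetical.

```python
import numpy as np


def shape_descriptor(view_descriptors, weights=None):
    """Average per-image descriptors into a single shape descriptor.

    With weights=None this is a plain mean; otherwise each descriptor
    contributes in proportion to its weight.
    """
    views = np.asarray(view_descriptors, dtype=float)
    return np.average(views, axis=0, weights=weights)


# Two toy descriptors, one per embedded image.
views = [[0.0, 0.0], [2.0, 4.0]]
plain = shape_descriptor(views)
biased = shape_descriptor(views, weights=[1.0, 3.0])
```

The unweighted result sits midway between the two descriptors, while the weighted result is pulled toward the second descriptor because it carries three times the weight.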
Claims 10 and 12-13 are rejected under 35 U.S.C. 103 as being unpatentable over Bolle in view of Arth and US 20160350336 A1; Checka; Neal et al. (hereinafter Checka).
Regarding claim 10, Bolle and Arth teach The method of claim 6. Bolle does not explicitly teach wherein the neural network comprises one or more fully-connected layers. However, Checka teaches wherein the neural network comprises one or more fully-connected layers. (Checka [0031] To design a deep learning architecture, the present methods and systems may implement various transfer learning strategies. Examples of such strategies include, but are not limited to: [0032] Treating the convolutional neural network as a fixed feature extractor: Given a convolutional neural network pre-trained on ImageNet, the last fully connected layer may be removed, then the convolutional neural network may be treated as a fixed feature extractor for the new ...)
Regarding claim 12, the combination of Bolle, Arth and Checka teaches The method of claim 10, wherein one or more parameters of the one or more fully-connected layers are updated to minimize a total Euclidean loss. (Checka [0039] The features output by the convolution neural network may be tailored to new image ...)
Regarding claim 13, the combination of Checka, Arth and Bolle teaches The method of claim 12, further comprising calculating the total Euclidean loss through consideration of the smallest Euclidean difference between two points. (Checka [0039] The features output by the convolution neural network may be tailored to new image search tasks and domains using a visual similarity learning algorithm. Provided labeled similar and dis-similar image pairs, this is accomplished by adding a ...)
Claim 11 is rejected under 35 U.S.C. 103 as being unpatentable over Bolle in view of Checka, Arth and US 9792492 B2; Soldevila; Albert Gordo et al. (hereinafter Soldevila).
Regarding claim 11, the combination of Bolle and Soldevila teaches The method of claim 10, wherein the one or more fully-connected layers are operable to return a vector. The combination explicitly teaches return a vector of the same dimensionality of the image data (Soldevila [Col. 8 lines 57-64] At each fully-connected layer of the sequence 86, the input vector 106, 112, 114 is converted to an output vector 112, 114, 116, which may have the same or fewer dimensions (or in some cases, more dimensions). The output 116 of the final fully-connected layer 92 is used to generate the set of predictions 60. Each prediction is a class probability for a respective one of the classes in the set of classes.) Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the prior methods by adding the neural network methods of Soldevila in order to further improve the image classification abilities of the system (Soldevila [Col. 2, lines 30-35] The present system and method provide an efficient way to use ConvNets for generating representations that are particularly useful for computing similarity. [Col. 4, lines 10-15] The neural network-based gradient representation can lead to consistent improvements with respect to alternative methods that represent an image using only quantities computed during the forward pass.)
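For illustration only, the dimensionality behaviour described in the cited Soldevila passage (each fully-connected layer converts an input vector to an output vector that may have the same, fewer, or more dimensions) follows from the shape of the layer's weight matrix, as the Python sketch below shows. It is not Soldevila's implementation; the function name `fully_connected`, the ReLU activation, and the random toy weights are assumptions.

```python
import numpy as np


def fully_connected(x, weight, bias):
    """Apply one fully-connected layer: an affine map followed by ReLU.

    The output has weight.shape[0] dimensions, so a square weight matrix
    returns a vector of the same dimensionality as the input.
    """
    return np.maximum(weight @ x + bias, 0.0)


rng = np.random.default_rng(0)
x = rng.normal(size=4)

# Square weights keep the dimensionality; a 2x4 matrix reduces it.
same_dim = fully_connected(x, rng.normal(size=(4, 4)), np.zeros(4))
fewer_dim = fully_connected(x, rng.normal(size=(2, 4)), np.zeros(2))
```

Whether the output matches the input dimensionality is therefore a design choice made when the layer's weight matrix is sized.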
Claim 14 is rejected under 35 U.S.C. 103 as being unpatentable over Bolle in view of Arth and US 10685057 B1; Chavez; Alexander Kikuta et al. (hereinafter Chavez).
Regarding claim 14, Bolle and Arth teach The method of claim 6, wherein the neural network ... but do not explicitly teach is operable to minimize a softmax loss. However, Chavez teaches is operable to minimize a softmax loss (Chavez ...)
Claim 22 is rejected under 35 U.S.C. 103 as being unpatentable over Checka in view of US 20030004966 A1; Bolle, Rudolf M. et al. (hereinafter Bolle).
Regarding claim 22, Checka teaches A method for searching for an image or shape based on a query comprising tag and image data, comprising the steps of: creating a word space in which images, three dimensional objects, text and combinations of the same are embedded; determining vector representations for each of the images, three dimensional objects, text and combinations of the same; (Checka [0031] To design a deep learning architecture, the present methods and systems may implement various transfer learning strategies. Examples of such strategies include, but are not limited to: [0032] Treating the convolutional neural network as a fixed feature extractor: Given a convolutional neural network pre-trained on ImageNet, the last fully connected layer may be removed, then the convolutional ...)
determining a vector representation for the query; determining … which one or more of the images, three dimensional objects, text and combinations have a spatially close vector representation to the vector representation for the query. (Checka [0031] To design a deep learning architecture, the present methods and systems may implement various transfer learning strategies. Examples of such strategies include, but are not limited to: [0032] Treating the convolutional neural network as a fixed feature extractor: Given a convolutional neural network pre-trained on ImageNet, the last fully connected layer may be removed, then the convolutional neural network may be treated as a fixed feature extractor for the new dataset. ImageNet is a publicly available image dataset including 14,197,122 annotated images (disclosed in J. Deng, W. Dong, R. Socher, L. Li, and F-F. Li, "ImageNet: A Large Scale Hierarchical Image Database", IEEE Conference on Computer Vision and Pattern ...)
combining the vector representations by performing vector calculus…utilizing said combined vector representation (Bolle [0055] a set of word features is developed. These features are derived from multiple knowledge sources: prosodic features, cue phrase features, noun phrase features, combined features. [0080] The textual feature vector and the visual feature vector are combined by concatenation to produce a unified representation in a single vector of disparate kinds of features. Having created the unified representation of the training data, standard methods of classifier induction are then used, followed by appropriate business processes and business decisions. [0106] FIG. 15 shows the process of combining visual and textual feature vectors to obtain a vector representing the disparate sources of information in the media item. [125-126 & 172-176] further elaborates on the process of having vectors/representations of the image and textual ...)
Claim 23 is rejected under 35 U.S.C. 103 as being unpatentable over Checka in view of Bolle and US 20170221243 A1; Bedi; Ajay et al.
Regarding claim 23, Checka teaches The method of claim 22, but lacks explicitly teaching further comprising the step of replacing objects in an image with contextually similar objects. However, Bedi helps teach further comprising the step of replacing objects in an image with contextually similar objects (Bedi [0005] In addition to enabling removal of various objects from within digital photos, many conventional systems and methods replace the removed object with a replacement portion. For example, in response to removing an object from a digital photo, conventional systems and methods often replace the removed object with a replacement portion that includes similar features as the portion of the digital photo surrounding the removed object (e.g., using a digital photo fill process). In this way, a user can modify or otherwise edit a photo to remove an undesired object while maintaining continuity in the background of the digital photo. Nonetheless, removing and replacing objects within digital photos has various drawbacks and limitations with conventional systems and methods. [0006] In particular, where digital photos have lines, shapes, or other geometric features in the background of the digital photo, conventional systems and methods often fail to maintain a sense of continuity in the background after removing and replacing a removed object within the digital photo. For example, where a background of a digital photo includes lines that intersect a portion of an object to be removed, conventional systems and methods fail to 
Response to Arguments
Applicant's arguments filed 12/30/20201 have been fully considered.
Claim objections: These issues have been resolved and the objections have been withdrawn in light of the amendments and arguments.
35 USC § 102 & 35 USC § 103: 
Regarding Applicant’s Argument (page(s): 7): “Claims 1-9, 15 and 17 are rejected under pre-AIA 35 U.S.C. §102(a)(2) as being anticipated by Bolle. Claim 1 has been amended to recite "[a] method for combining image data, one or more 3D shapes, and tag data" and "determining a vector representation for the one or more 3D shapes in the vector shape of words." Bolle, as may be understood, discloses the processing of image data into vectors, but does not disclose determining vector representations for one or more 3D shapes, and combining the vector representation with vector representations for image data and tag data. Claim 1 as amended is accordingly not anticipated by Bolle. The remaining claims 2-9, 15, and 17 each variously depend from independent claim 1, and so are allowable for at least the same reasons as claim 1. Accordingly, the Applicants respectfully request reconsideration of claims 1-9, 15 and 17 in view of the amendments.”
Examiner’s response:
The examiner does not interpret the current scope of the claim in the manner the applicant argues. The examiner believes the applicant is reading limitations into the claim from the instant application’s specification and placing too much weight on them. The examiner believes these limitations, assumed from the specification, are not clear and must be brought into the claim language for the claim to gain the scope the applicant wishes it to have. The examiner believes that amendments detailing the process the system uses to analyze the input image and create the corresponding “representations” would help bring in details that could overcome the current prior art.

Conclusion
Applicant’s amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 




Any inquiry concerning this communication or earlier communications from the examiner should be directed to ARYAN D TOUGHIRY whose telephone number is (571)272-5212. The examiner can normally be reached Monday - Friday, 9 am - 5 pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Aleksandr Kerzhner can be reached on (571) 270-1760. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For 





/ARYAN D TOUGHIRY/Examiner, Art Unit 2165
/William B Partridge/Primary Examiner, Art Unit 2183