Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Rejections - 35 USC § 103

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 20 and 21 are rejected under 35 U.S.C. 103 as being unpatentable over US 20160350336 A1 (Checka, Neal et al.; hereinafter Checka) in view of "Multi-view Convolutional Neural Networks for 3D Shape Recognition," 9/27/2015 (Su, Hang et al.; hereinafter Su).
Regarding claim 20, Checka teaches A method of searching a collection of objects based on visual and semantic similarity of unified representations of the collection of objects comprising the steps of: determining a unified descriptor for the search query, where the search query comprises both image 3D shape or object data and word or tag data; (Checka [0031] To design a deep learning architecture, the present methods and systems may implement various transfer learning strategies. Examples of such strategies include, but are not limited to: [0032] Treating the convolutional neural network as a fixed feature extractor: Given a convolutional neural network pre-trained on ImageNet, the last fully connected layer may be removed, then the convolutional neural network may be treated as a fixed feature extractor for the new dataset. ImageNet is a publicly available image dataset including 14,197,122 annotated images (disclosed in J. Deng, W. Dong, R. Socher, L. Li, and F-F. Li, "ImageNet: A Large Scale Hierarchical Image Database", IEEE Conference on Computer Vision and Pattern Recognition, 2009), which publication is hereby incorporated by reference in its entirety. The result may be an N-D vector, known as a convolutional neural network code, which contains the activations of the hidden layer immediately before the classifier/output layer. The convolutional neural network code may then be applied to image classification or search tasks as described further below. [0033] Fine-tuning the convolutional neural network: Given an already learned model, the architecture may be adapted and backpropagation training may be resumed from the already learned model weights. 
One can fine-tune all the layers of the convolutional neural network, or keep some of the earlier layers fixed (due to overfitting concerns) and then fine-tune some higher-level portion of the convolutional neural network.[0036] Prior to analyzing images represented by an image dataset, each image may be resized (or sub-window cropped) to a canonical size (e.g., 224.times.224). Each cropped image may be fed through the trained network, and the output at the first fully connected layer is extracted. The extracted output may be a 4096 dimensional feature vector representing the image and may serve as a basis for the image analysis. To facilitate this, well-established open-source libraries such as, but not limited to, LIBSVM and FLANN (Fast Library for Approximate Nearest Neighbors) may be used. [0037] In order to handle geometric variations in images, a spatial transformer may be used. The spatial transformer module may result in models which learn translation, scale and rotation invariance. A spatial transformer is a module that learns to transformer feature maps within a network that correct spatially manipulated data without supervision. A description of spatial transformer networks can be found in the following publication: M. Jaderberg K. Simonyan A. Zisserman K. Kavukcuoglu, "Spatial Transformer Networks", Advances in Neural Information Processing Systems 28 (NIPS), 2015, which publication is hereby incorporated herein by reference to its entirety. A spatial transformer may help localize objects, normalizing them spatially for better classification and representation for visual search. [53-58] further elaborate on the system’s ability to query using image/tag data in process that involves finding similar data/vector [FIG. 
1 & 14] show a visual of the system capable of querying using image/tag data in process that involves finding similar data/vector )								determining one or more objects in the collection of objects having a spatially close vector representation to the unified descriptor for the search query. (Checka [0031] To design a deep learning architecture, the present methods and systems may implement various transfer learning strategies. Examples of such strategies include, but are not limited to: [0032] Treating the convolutional neural network as a fixed feature extractor: Given a convolutional neural network pre-trained on ImageNet, the last fully connected layer may be removed, then the convolutional neural network may be treated as a fixed feature extractor for the new dataset. ImageNet is a publicly available image dataset including 14,197,122 annotated images (disclosed in J. Deng, W. Dong, R. Socher, L. Li, and F-F. Li, "ImageNet: A Large Scale Hierarchical Image Database", IEEE Conference on Computer Vision and Pattern Recognition, 2009), which publication is hereby incorporated by reference in its entirety. The result may be an N-D vector, known as a convolutional neural network code, which contains the activations of the hidden layer immediately before the classifier/output layer. The convolutional neural network code may then be applied to image classification or search tasks as described further below. [0033] Fine-tuning the convolutional neural network: Given an already learned model, the architecture may be adapted and backpropagation training may be resumed from the already learned model weights. 
One can fine-tune all the layers of the convolutional neural network, or keep some of the earlier layers fixed (due to overfitting concerns) and then fine-tune some higher-level portion of the convolutional neural network. [0036] Prior to analyzing images represented by an image dataset, each image may be resized (or sub-window cropped) to a canonical size (e.g., 224.times.224). Each cropped image may be fed through the trained network, and the output at the first fully connected layer is extracted. The extracted output may be a 4096 dimensional feature vector representing the image and may serve as a basis for the image analysis. To facilitate this, well-established open-source libraries such as, but not limited to, LIBSVM and FLANN (Fast Library for Approximate Nearest Neighbors) may be used. [0037] In order to handle geometric variations in images, a spatial transformer may be used. The spatial transformer module may result in models which learn translation, scale and rotation invariance. A spatial transformer is a module that learns to transformer feature maps within a network that correct spatially manipulated data without supervision. A description of spatial transformer networks can be found in the following publication: M. Jaderberg K. Simonyan A. Zisserman K. Kavukcuoglu, "Spatial Transformer Networks", Advances in Neural Information Processing Systems 28 (NIPS), 2015, which publication is hereby incorporated herein by reference to its entirety. A spatial transformer may help localize objects, normalizing them spatially for better classification and representation for visual search. [53-58] further elaborate on the system's ability to query using image/tag data in a process that involves finding similar data/vector [FIG. 
1 & 14] illustrate the system querying with image/tag data in a process that involves finding similar data/vectors.)
Checka does not explicitly teach, in the order recited, wherein the vector representations of the three dimensional objects are computed by rendering views for each of the three dimensional objects from multiple viewpoints, computing a descriptor for each view, and averaging the descriptors for each view.
However, Su teaches wherein the vector representations of the three dimensional objects are computed by rendering views for each of the three dimensional objects from multiple viewpoints (Su [Section 2.0 para. 4, Section 3.1 para. 2-3] describe the different viewpoints for 3D shapes; [FIG. 5 and 7] illustrate the viewpoints for 3D shapes),
computing a descriptor for each view (Su [Section 3.0 para. 1-3, Section 3.2 para. 1-6] describe the descriptors; [FIG. 6 and 7] illustrate them),
and averaging the descriptors for each view (Su [Section 3.0 para. 2, Section 3.2 para. 5-6] describe averaging the descriptors for the views; [Table 1-2 and eq. 1] show a chart and an equation for the averaging).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the method of Checka by adding Su's 3D shape recognition methods in order to create a system with higher computational efficiency (Su [Section 3.0 para. 3, Section 5 para. 2] describe improvements in computational efficiency due to the corresponding methods).
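For illustration only, the operations relied upon in the rejection of claim 20 (per-view descriptors averaged into a single vector, followed by a search for spatially close vectors) can be sketched as follows. This is a minimal sketch and is not code from Checka or Su; the function names, the two-dimensional descriptors, the object identifiers, and the choice of Euclidean distance are all assumptions made for brevity.

```python
import math


def average_descriptors(view_descriptors):
    """Element-wise mean of per-view descriptors (one list of floats per rendered view)."""
    n = len(view_descriptors)
    dim = len(view_descriptors[0])
    return [sum(d[i] for d in view_descriptors) / n for i in range(dim)]


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest(query, collection, k=1):
    """Return the ids of the k objects whose vectors lie closest to the query."""
    ranked = sorted(collection, key=lambda oid: euclidean(query, collection[oid]))
    return ranked[:k]


# Hypothetical per-view descriptors for two 3D objects (values are made up).
chair_views = [[1.0, 0.0], [0.8, 0.2], [0.9, 0.1]]
table_views = [[0.0, 1.0], [0.2, 0.8]]

collection = {
    "chair": average_descriptors(chair_views),
    "table": average_descriptors(table_views),
}

query = [0.85, 0.15]  # hypothetical descriptor for the search query
result = nearest(query, collection)
print(result)
```

A production system would likely replace the linear scan in `nearest` with an approximate nearest-neighbor index such as the FLANN library named in Checka [0036]; the scan is used here only to keep the sketch self-contained.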
Regarding claim 21, Checka teaches The method of claim 20, wherein the unified descriptor or vector representation of a shape is an average of one or more rendered views of the shape. (Checka [0016] FIGS. 7 and 8 are screenshots of re-ranking search results based on color and shape. [0053] The product discovery process enables a user (e.g., the customer) to visually browse a product inventory based on attributes computed directly from a specimen image. The process employs an algorithm that describes images with a multi-feature representation using visual qualities (e.g., image descriptors) such as color, shape and texture. Each visual quality (e.g., color, shape, texture, etc.) is weighted independently. For example, a color attribute can be defined as a set of histograms over the Hue, Saturation and Value (HSV) color values of the image. These histograms are concatenated into a single feature vector: Similarly, shape can be represented using shape descriptors such as a histogram of oriented gradients (HOG) or Shape Context.[0054] The shape and color feature vectors may then each be normalized to unit norm, and weighted and concatenated into a single feature vector: [0055] Feature comparison between the concatenated vectors may be accomplished via distance metrics such as, but not limited to, Chi Squared distance or Earth Mover's Distance to search for images having similar visual attributes: The weighting parameter (w) reflects the preference for a particular visual attribute. This parameter can be adjusted via a user-interface that allows the user to dynamically adjust the weighting of each feature vector and interactively adjust the search results based on their personal preference. FIGS. 7 and 8 illustrate screenshot examples of re-ranking search results based on color and shape. In FIG. 7, weighting preference is on shape over color. In FIG. 8, weighting preference is on color over shape. 
[0056] Visual Exemplars: On e-commerce websites, product images within a search category may be displayed in an ad-hoc or random fashion. For example, if a user executes a text query, the images displayed in the image carousel are driven by a keyword-based relevancy, resulting in many similar images. In contrast, the methods of the present disclosure may analyze the visual features/image descriptors (e.g., color, shape, texture, etc.) to determine "exemplar images" within a product category. An image carousel populated with "exemplar images" better represents the breadth of the product assortment. The term "exemplar image" may be defined as being at the "center of the cluster" of relevant image groups. For example, an exemplar image may be an image that generally exemplifies features of other images in a grouping; thus, the exemplar image is an exemplary one of the images in the grouping. [0078-80] further elaborate.)
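For illustration only, the feature handling quoted from Checka [0053]-[0055] (unit-normalizing the color and shape vectors, weighting and concatenating them, and comparing the result with a Chi Squared distance) can be sketched as follows. The formulas themselves are not reproduced in the quotation above, so the weighting scheme (w on color, 1 - w on shape), the particular chi-squared form, and all names and values below are assumptions rather than Checka's implementation.

```python
import math


def unit_norm(v):
    """Scale a vector to unit Euclidean norm; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def combine(color, shape, w):
    """Weighted concatenation: weight w on color, (1 - w) on shape (assumed scheme)."""
    return [w * x for x in unit_norm(color)] + [(1 - w) * x for x in unit_norm(shape)]


def chi_squared(a, b):
    """Chi-squared distance for non-negative vectors; bins that are empty in both are skipped."""
    return 0.5 * sum((x - y) ** 2 / (x + y) for x, y in zip(a, b) if x + y > 0)


# Hypothetical histograms: same color, different shape.
query_color, query_shape = [3.0, 4.0], [1.0, 0.0]
item_color, item_shape = [3.0, 4.0], [0.0, 1.0]

# Preference entirely on color: the two images look identical.
d_color = chi_squared(combine(query_color, query_shape, 1.0),
                      combine(item_color, item_shape, 1.0))

# Preference entirely on shape: the two images differ.
d_shape = chi_squared(combine(query_color, query_shape, 0.0),
                      combine(item_color, item_shape, 0.0))

print(d_color, d_shape)
```

Varying `w` between 0 and 1 changes which items rank closest, which corresponds to the re-ranking by user preference described for FIGS. 7 and 8 of Checka.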
Claims 1-9, 15, and 17 are rejected under 35 U.S.C. 103 as being unpatentable over US 20030004966 A1 (Bolle, Rudolf M. et al.; hereinafter Bolle) in view of US 20170337690 A1 (Arth, Clemens et al.; hereinafter Arth) and "Multi-view Convolutional Neural Networks for 3D Shape Recognition," 9/27/2015 (Su, Hang et al.; hereinafter Su).
Regarding claim 1, Bolle teaches A method for combining image data … and tag data into a unified representation, comprising the steps of: determining a vector representation for the image data in a vector space of words; (Bolle [0028] The representation of a video segment is a vector of representations of the constituent frames in the form of an ordinal measure of a reduced intensity image of each frame. Before matching, the database is prepared for video sequence by computing the ordinal measure for each frame in each video segment in the database. Finding a match between some given action video sequence and the databases then amounts to sequentially matching the input sequence against each sub-sequence in the database and detecting minimums. This method introduces the temporal aspects of the video media items.   [0080] In the learning phase, off-line supervised or unsupervised learning, using a training set of labeled or unlabeled multimedia items as appropriate, is employed to induce a classifier based on patterns found in a unified representation as a single feature vector of disparate kinds of features, linguistic and visual, found in a media item under consideration. The unified representation of disparate features in a single feature vector will enable a classifier to make use of more complicated patterns for categorization, patterns that simultaneously involve linguistic and visual aspects of the media, resulting in superior performance as compared with other less sophisticated techniques. This allows for better designed and more precisely performing business processes. First, for each media item, the accompanying text is represented by a sparse textual feature vector. Secondly, for each media item, a set of key frames or key intervals (key intervals, for short) is determined, which can either be regularly sampled in time or based on the information content. 
From each key interval, a set of features is extracted from a number of regions in the key intervals. These regions can be different for each feature. The extracted features are coarsely quantized. Hence, each key interval is encoded by a sparse textual feature vector and a sparse visual feature vector. The sparse textual feature vectors and the sparse visual feature vectors may optionally need to be further transformed to assure their compatibility in various ways, such as (1) with respect to range of the values appearing in the two kinds of vectors or (2) with respect to the competitive sizes of the two kinds of vectors with respect to some norm or measure. The textual feature vector and the visual feature vector are combined by concatenation to produce a unified representation in a single vector of disparate kinds of features. Having created the unified representation of the training data, standard methods of classifier induction are then used, followed by appropriate business processes and business decisions. [125-126 & 172-176] further elaborates on the process of having vectors/representations of the image and textual data [FIG.2-4] shows a visual of the system through the corresponding structures and their corresponding flows)		determining a vector representation for the tag data in the vector space of words; (Bolle [0014] After text feature extraction, a new vector representation of each text item associated with the training data is then extracted in terms of how frequently each selected feature occurs in that item. 
The vector representation may be binary, simply indicating the presence or absence of each feature, or it may be numeric in which each numeric value is derived from a count of the number of occurrences of each feature.[0080] In the learning phase, off-line supervised or unsupervised learning, using a training set of labeled or unlabeled multimedia items as appropriate, is employed to induce a classifier based on patterns found in a unified representation as a single feature vector of disparate kinds of features, linguistic and visual, found in a media item under consideration. The unified representation of disparate features in a single feature vector will enable a classifier to make use of more complicated patterns for categorization, patterns that simultaneously involve linguistic and visual aspects of the media, resulting in superior performance as compared with other less sophisticated techniques. This allows for better designed and more precisely performing business processes. First, for each media item, the accompanying text is represented by a sparse textual feature vector. Secondly, for each media item, a set of key frames or key intervals (key intervals, for short) is determined, which can either be regularly sampled in time or based on the information content. From each key interval, a set of features is extracted from a number of regions in the key intervals. These regions can be different for each feature. The extracted features are coarsely quantized. Hence, each key interval is encoded by a sparse textual feature vector and a sparse visual feature vector. The sparse textual feature vectors and the sparse visual feature vectors may optionally need to be further transformed to assure their compatibility in various ways, such as (1) with respect to range of the values appearing in the two kinds of vectors or (2) with respect to the competitive sizes of the two kinds of vectors with respect to some norm or measure. 
The textual feature vector and the visual feature vector are combined by concatenation to produce a unified representation in a single vector of disparate kinds of features. Having created the unified representation of the training data, standard methods of classifier induction are then used, followed by appropriate business processes and business decisions.  [125-126 & 172-176] further elaborates on the process of having vectors/representations of the image and textual data [FIG.2-4] shows a visual of the system through the corresponding structures and their corresponding flows)			and combining the vector representations by performing vector calculus. (Bolle [0055] There are feature-based approaches, too, that do not rely on word co-occurrence or correspondences, for example, Litman and Passoneau. Here a set of word features is developed. These features are derived from multiple knowledge sources: prosodic features, cue phrase features, noun phrase features, combined features. A decision tree, expressed in terms of these features, is then evaluated at each potential discourse segment boundary to decide if it is truly a discourse segmentation point or not. The decision expression can be hand-crafted or automatically produced by feeding training data to a learning system such as the well-known C4.5 decision tree classification scheme   [0080] In the learning phase, off-line supervised or unsupervised learning, using a training set of labeled or unlabeled multimedia items as appropriate, is employed to induce a classifier based on patterns found in a unified representation as a single feature vector of disparate kinds of features, linguistic and visual, found in a media item under consideration. 
The unified representation of disparate features in a single feature vector will enable a classifier to make use of more complicated patterns for categorization, patterns that simultaneously involve linguistic and visual aspects of the media, resulting in superior performance as compared with other less sophisticated techniques. This allows for better designed and more precisely performing business processes. First, for each media item, the accompanying text is represented by a sparse textual feature vector. Secondly, for each media item, a set of key frames or key intervals (key intervals, for short) is determined, which can either be regularly sampled in time or based on the information content. From each key interval, a set of features is extracted from a number of regions in the key intervals. These regions can be different for each feature. The extracted features are coarsely quantized. Hence, each key interval is encoded by a sparse textual feature vector and a sparse visual feature vector. The sparse textual feature vectors and the sparse visual feature vectors may optionally need to be further transformed to assure their compatibility in various ways, such as (1) with respect to range of the values appearing in the two kinds of vectors or (2) with respect to the competitive sizes of the two kinds of vectors with respect to some norm or measure. The textual feature vector and the visual feature vector are combined by concatenation to produce a unified representation in a single vector of disparate kinds of features. Having created the unified representation of the training data, standard methods of classifier induction are then used, followed by appropriate business processes and business decisions. [0106] FIG. 15 shows the process of combining visual and textual feature vectors to obtain a vector representing the disparate sources of information in the media item. 
[125-126 & 172-176] further elaborate on the process of having vectors/representations of the image and textual data; [FIG. 2-4] show the system through the corresponding structures and their corresponding flows)
Bolle does not explicitly teach determining a vector representation for the one or more 3D shapes in the vector space of words.
Arth teaches input image data with one or more 3D shapes, and determining a vector representation for the one or more 3D shapes in the vector space of words (Arth [0035] At block 160, the method determines a dynamic representation from the camera pose estimate of the input image from block 115 and the 2.5D or 3D map/model from block 125. In one embodiment, the dynamic representation is compatible with the selected one or more static representations of block 130. For example, if the static representation is a depth map (e.g., depth map 139) the dynamic representation may be created as a matrix of depth values representing the distance of the objects in the model of block 125 to the camera pose from block 115. In one embodiment, when correlating to a static representation depth map or normal vector map, the dynamic representation may also be a depth map or normal vector map such that depth is correlated with depth, or normal vectors with normal vectors. In other embodiments, the dynamic representation is a representation which may be correlated with image classes 131, planar structures 135, line features 137, or other static representations that may be determined in block 130. 
In some embodiments, the device creates a dynamic representation from the model and 6DOF pose for visualization purposes (e.g., to display on a device or output to an application or program). [2 & 45] further elaborate.)
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the method of Bolle by adding the teachings of Arth in order to create a dynamic representation and ultimately improve the image analysis and output result of the system. (Arth [0049] At block 325, the method updates the dynamic representation according to the adjusted 6DOF pose. In one embodiment a 6DOF pose may be determined with a minimum amount of globally available input information, such as a 2D map and some building height information (e.g., as provided by a 2.5D untextured map). The building height information may be estimated from the input image scene or determined from other sources. In some embodiments, method 300 may utilize more detailed and accurate models and semantic information for enhanced results. For example, within an AR system synergies can be exploited for annotated content to be visualized which may be used as feedback into the method 300 to improve localization performance. For example, using the AR annotations of windows or doors can be used in connection to a window detector to add another semantic class to a scoring function. Therefore, certain AR content might be used to improve localization performance within method 300's framework.)
The combination of Bolle and Arth does not explicitly teach, in the order recited, wherein the vector representations of the three dimensional objects are computed by rendering views for each of the three dimensional objects from multiple viewpoints, computing a descriptor for each view, and averaging the descriptors for each view.
However, Su teaches wherein the vector representations of the three dimensional objects are computed by rendering views for each of the three dimensional objects from multiple viewpoints (Su [Section 2.0 para. 4, Section 3.1 para. 2-3] describe the different viewpoints for 3D shapes; [FIG. 5 and 7] illustrate the viewpoints for 3D shapes),
computing a descriptor for each view (Su [Section 3.0 para. 1-3, Section 3.2 para. 1-6] describe the descriptors; [FIG. 6 and 7] illustrate them),
and averaging the descriptors for each view (Su [Section 3.0 para. 2, Section 3.2 para. 5-6] describe averaging the descriptors for the views; [Table 1-2 and eq. 1] show a chart and an equation for the averaging).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the combined method of Bolle and Arth by adding Su's 3D shape recognition methods in order to create a system with higher computational efficiency (Su [Section 3.0 para. 3, Section 5 para. 2] describe improvements in computational efficiency due to the corresponding methods).
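For illustration only, the unified representation quoted from Bolle [0080] (a textual feature vector and a visual feature vector, optionally transformed so that their value ranges are compatible, then combined by concatenation) can be sketched as follows. This is a minimal sketch under stated assumptions: the rescale-to-a-common-maximum step is only one possible reading of the "range of the values" transformation, and the function names and feature values are hypothetical rather than taken from Bolle.

```python
def rescale(v, target_max=1.0):
    """Scale a non-negative vector so its largest entry equals target_max (assumed transform)."""
    peak = max(v, default=0)
    return [x * target_max / peak for x in v] if peak else list(v)


def unify(text_vec, visual_vec):
    """Concatenate the range-matched textual and visual vectors into one feature vector."""
    return rescale(text_vec) + rescale(visual_vec)


text_vec = [2, 0, 1, 0]   # hypothetical word counts for a media item
visual_vec = [0, 8, 4]    # hypothetical quantized visual feature codes

unified = unify(text_vec, visual_vec)
print(unified)  # [1.0, 0.0, 0.5, 0.0, 0.0, 1.0, 0.5]
```

The concatenated vector keeps the textual entries and the visual entries in separate positions, so a classifier trained on it can use patterns that involve both kinds of features at once, as the quoted passage describes.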
Regarding claim 2, Bolle, Su and Arth teach The method of claim 1, wherein semantically close words are mapped to spatially close vectors in the vector space of words. (Bolle [0013] The resolution of this issue can depend on the details of the supervised learning technique employed, but, in applications related to text, local dictionaries generally give better performance. There are a variety of criteria for judging relevance during feature extraction. A simple one is to use absolute or normalized frequency to compile a list of a fixed number n of the most frequent features for each category, taking into account the fact that small categories may be so underpopulated that the total number of features in them may be less than n. More sophisticated techniques for judging relevance involve the use of information-theoretic measures such as entropy or the use of statistical methods such as principal component analysis.  [0054] Ponte and Croft, use a similar technique, except that they "expand" each word in a partition by looking it up in a "thesaurus" and taking all of the words in the same concept group that the seed word was in. (This is an attempt to overcome co-ocurrence, or correspondence, failures due to the use of synonyms or hypernyms, when really the same underlying concept is being referenced.) Ponte and Croft bootstrap the correspondences by developing a document-specific thesaurus, using "local context analysis" of labeled documents. Then, to find the best co-occurence sub-matrices, instead of exhaustively considering all possibilities, they use a dynamic programming technique, minimizing a cost function. Kozima et al. perform a similar word "expansion," by means of "spreading activation" in a linguistic semantic net. Two words are considered to be co-occurrences of, or corresponding to, each other if and only if each can be reached from the other by less than m steps in the semantic net, for some arbitrarily chosen value of m.  [0123] FIG. 
1 shows a prior art flowchart for a system 100 for categorizing text documents. In step 110, a set of text documents is input to the system. Each text document is labeled as belong to a class S=c.sub.1, i=1, . . . , C. The classes S can be hierarchical, in the sense that each class S, can be recursively divided up into a number of subclasses, S=S.sub.subclass1, S.sub.subclass2, . . . , S.sub.subclassN. In 120 a single vector is computing representing the text in each document in D. Such a vector V is a large-dimensional vector with entry n equal to 1 or 0, respectively, if word n is present, or not, in the document; or, such a vector V can be a large-dimensional vector with entry n equal to f where f is the number of times word n is present in the document. Examples of source of text vectors 120 include: close captions, open captions, captions, speech recognition applied to one or more audio input, semantic meanings derived from one or more audio streams, and global text information associated with the media item. In step 130 (FIG. 1) each vector V is labeled the same as the corresponding document. Step 140 induces machine-learned classification methods for classifying unseen vectors V representing new unclassified documents. Finally, Box 150, infers classification method to classify (categorize) unknown documents D, represented by feature vectors V.  [31,33, and 172] further elaborate on the use of semantics in recognizing text )
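For illustration only, the two text vector forms quoted from Bolle [0014] and [0123] (entry n equal to 1 or 0 depending on whether word n is present, or entry n equal to the number of times word n occurs) can be sketched as follows. The vocabulary, the sample sentence, and the whitespace tokenization are assumptions chosen for brevity and are not taken from Bolle.

```python
def text_vector(document, vocabulary, binary=False):
    """Build a presence (binary) or frequency (count) vector over a fixed vocabulary."""
    words = document.lower().split()  # naive whitespace tokenization (assumption)
    counts = [words.count(term) for term in vocabulary]
    return [1 if c else 0 for c in counts] if binary else counts


vocabulary = ["goal", "match", "vote"]  # hypothetical feature words
doc = "Late goal wins the match after an early goal"

count_vec = text_vector(doc, vocabulary)
binary_vec = text_vector(doc, vocabulary, binary=True)
print(count_vec, binary_vec)  # [2, 1, 0] [1, 1, 0]
```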
Regarding claim 3, Bolle, Su and Arth teach The method of claim 1, wherein the vector representations are determined by: determining a vector for each of the image data in the vector space of words; (Bolle [0028] Mohan defines that there is a match between a given video sequence and some segment of a database video sequence if each frame in the given sequence matches the corresponding frame in the database video segment. That is, the matching sequences are of the same temporal length; matching slow-motion sequences is performed by temporal sub-sampling of the database segments. The representation of a video segment is a vector of representations of the constituent frames in the form of an ordinal measure of a reduced intensity image of each frame. [0080] In the learning phase, off-line supervised or unsupervised learning, using a training set of labeled or unlabeled multimedia items as appropriate, is employed to induce a classifier based on patterns found in a unified representation as a single feature vector of disparate kinds of features, linguistic and visual, found in a media item under consideration. The unified representation of disparate features in a single feature vector will enable a classifier to make use of more complicated patterns for categorization, patterns that simultaneously involve linguistic and visual aspects of the media, resulting in superior performance as compared with other less sophisticated techniques...  [0125] FIG. 3 shows a more specific way of processing the multimedia media item for categorization purposes. The input data, media item (300), is processed separately in terms of the visual track, 305, and the audio track, 310. The visual track 305 and the audio track are processed independently and concurrently. From the visual track (305), characteristic key frames or key intervals are selected 320. 
These characteristic pieces of video are transformed into characteristic visual spaces 330 that in some way characterize the video clip in terms of visual features associated with the video categories. These visual space representations are transformed into sparse visual feature vectors (335). [126 and 172-174] further elaborate on the vectors for image data)								and embedding the image data and tag data in the vector space of words. (Bolle [0080]  Hence, each key interval is encoded by a sparse textual feature vector and a sparse visual feature vector. The sparse textual feature vectors and the sparse visual feature vectors may optionally need to be further transformed to assure their compatibility in various ways, such as (1) with respect to range of the values appearing in the two kinds of vectors or (2) with respect to the competitive sizes of the two kinds of vectors with respect to some norm or measure. The textual feature vector and the visual feature vector are combined by concatenation to produce a unified representation in a single vector of disparate kinds of features. Having created the unified representation of the training data, standard methods of classifier induction are then used, followed by appropriate business processes and business decisions.  [0106] FIG. 15 shows the process of combining visual and textual feature vectors to obtain a vector representing the disparate sources of information in the media item. [0126] In 430, the system constructs a single vector representation of the visual features extracted or associated with each media item in D. In 440, for each labeled media item in the data set D, the system constructs a training set T(D) by combining the two vector representations of that media segment (constructed in 420 and 430) into a single composite feature vector, with the resulting vector labeled by the same set of classes used to label the media item. 
Optionally, in 440, before combining the vector representations [0160] the visual feature of average optical flow in key intervals can be ordered from high to low in 1320. The effect and purpose of this ordering is that the temporal information in the video stream is discarded. (This is analogous to discarding word location information in using word frequency vectors, e.g., F.sub.t, in text document analysis.) The visual feature value codes (quantized feature values) in the regions of the key frames or intervals are then mapped 1365 into a first visual feature vector F.sub.v 1370. Assume that there are n key frames or intervals in media item 1310 with W regions per key frame this gives a feature vector...[172-174] further elaborate on the vectors for image data)
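For illustration only, and not as part of the cited disclosure, the combination step described in Bolle [0080] and [0126] (making the textual and visual feature vectors compatible, then concatenating them into a single composite vector) can be sketched as follows. The function names and the choice of unit L2 normalization as the compatibility transform are assumptions; Bolle states only that the vectors may be transformed with respect to range or "some norm or measure".

```python
import math


def l2_norm(vec):
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in vec))


def rescale_to_unit_norm(vec):
    """Assumed compatibility transform: scale a vector to unit L2 norm.

    An all-zero vector is returned unchanged to avoid division by zero.
    """
    n = l2_norm(vec)
    return list(vec) if n == 0 else [x / n for x in vec]


def unified_vector(text_vec, visual_vec):
    """Concatenate the (rescaled) textual and visual feature vectors."""
    return rescale_to_unit_norm(text_vec) + rescale_to_unit_norm(visual_vec)


# Hypothetical feature vectors with very different magnitudes.
f_t = [3.0, 0.0, 4.0]   # textual features
f_v = [0.0, 10.0]       # visual features
f = unified_vector(f_t, f_v)
```

After rescaling, neither modality dominates the composite vector purely because of its original scale, which is the concern Bolle raises about differing norms.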
Regarding claim 4, Bolle, Su and Arth teach The method of claim 1, wherein the step of combining the vector representations by performing vector calculus comprises determining linear combinations of the vector representations in the vector space of words. (Bolle [0008] To develop a procedure for identifying media items as belonging to particular classes or categories, (or for any classification or pattern recognition task, for that matter) supervised learning technology can be based on decision trees, on logical rules, or on other mathematical techniques such as linear discriminant methods (including perceptron, support vector machines, and related variants), nearest neighbor methods, Bayesian inference, etc. We can generically refer to the output of such supervised learning systems as classifiers.   [0187] Given the combined feature vectors F(t), i.e., the vector representing the visual information F.sub.v(t) combined with the vector representing the textual information F.sub.t(t), each block can be classified into a category. One way to achieve this is to use a classifier to categorize every block independently using the combined feature vector of the block. A series of heuristic rules such as described in FIGS. 20A and 20B can then be used to aggregate the categorization and more accurately determine the category boundaries. [FIG. 13, 21] shows a visual of the combination vectors )
Regarding claim 5, Bolle, Su and Arth teach The method of claim 1, wherein the image data or tag data comprises one or more weights. (Bolle [0001] This invention relates to the business of handling multimedia information (media items), such as video and images that have audio associated with it or possibly have text associated with it in the form of captions. More specifically, the invention relates to the business of handling video and audio by processing the video and audio for supervised and unsupervised machine learning of categorization techniques based on disparate information sources such as visual information and speech transcript. The invention also relates to combining these disparate information sources in a coherent fashion to make business decisions. [0011] From these feature vectors, the computer induces classifiers based on patterns or properties that characterize when a media segment belongs to a particular category. The term "pattern" is meant to be very general. These patterns or properties may be presented as rules, which may sometimes be easily understood by a human being, or in other, less accessible formats, such as a weight vector and threshold used to partition a vector space with a hyperplane. Exactly what constitutes a pattern or property in a classifier depends on the particular machine learning technology employed....[0039] Reference Smith et al. is incorporated herein in its entirety. A sophisticated video database browsing systems is described, the authors refer to browsing as "skimming." Much emphasis is placed on visual analysis for video interpretation and video summarization (the construction of two-dimensional depictions of the video to allow for nonlinear access). Visual analysis include scene break detection, camera motion analysis, and object detection (faces and superimposed text). The audio transcript is used to identify keywords in it. Term frequency inverse document frequency techniques are used to identify critical words. 
Words that appear frequently in a particular video segment but occur infrequently in standard corpuses receive the highest weight. In Smith et al. the speech recognition is not automated yet, and closed-captioning is used instead. Video search is accomplished through the use of the extracted words as search keys, browsing of video summaries then allows for quickly finding the video of interest )
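For illustration only, the term weighting referred to in Bolle [0039] (words frequent in a segment but infrequent in a standard corpus receive the highest weight) can be sketched with a term frequency inverse document frequency computation. The function name, the sample data, and the particular smoothed form of the inverse document frequency are assumptions; the cited passage does not specify a formula.

```python
import math
from collections import Counter


def tf_idf(segment_words, corpus_docs):
    """Weight each word in a segment by term frequency times a smoothed IDF.

    segment_words: list of words from one video segment's transcript.
    corpus_docs:   list of sets of words, one set per reference document.
    """
    tf = Counter(segment_words)
    n_docs = len(corpus_docs)
    weights = {}
    for word, count in tf.items():
        # Number of reference documents that contain the word.
        df = sum(1 for doc in corpus_docs if word in doc)
        # Smoothed IDF (one common variant, assumed here).
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0
        weights[word] = (count / len(segment_words)) * idf
    return weights


# Hypothetical transcript segment and reference corpus.
segment = ["goal", "goal", "the", "the", "match"]
corpus = [{"the", "news"}, {"the", "weather"}, {"the", "match"}]
weights = tf_idf(segment, corpus)
top_word = max(weights, key=weights.get)
```

In this sample, "goal" and "the" occur equally often in the segment, but "goal" is absent from the reference corpus and therefore receives the larger weight.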
Regarding claim 6, Bolle, Su and Arth teach The method of claim 1, further comprising performing the step of determining a vector representation for the image data using a neural network, optionally a convolutional neural network. (Bolle [0001] This invention relates to the business of handling multimedia information (media items), such as video and images that have audio associated with it or possibly have text associated with it in the form of captions. More specifically, the invention relates to the business of handling video and audio by processing the video and audio for supervised and unsupervised machine learning of categorization techniques based on disparate information sources such as visual information and speech transcript. The invention also relates to combining these disparate information sources in a coherent fashion to make business decisions.[0006] Multimedia collections may also be categorized based on data content, such as the amount of green or red in images or video and sound frequency components of audio segments. The media item collections have to be then preprocessed and the results have to be somehow categorized based on the visual properties. Categorizing media items based on semantic content, the actual meaning (subjects and objects) of the media items, on the other hand, is a difficult issue. For video, speech may be categorized or recognized to some extent, but beyond that, the situation is much more complicated because of the rudimentary state of the art in machine-interpretation of visual data.  
[0008] To develop a procedure for identifying media items as belonging to particular classes or categories, (or for any classification or pattern recognition task, for that matter) supervised learning technology can be based on decision trees, on logical rules, or on other mathematical techniques such as linear discriminant methods (including perceptrons, support vector machines, and related variants), nearest neighbor methods, Bayesian inference, etc. We can generically refer to the output of such supervised learning systems as classifiers.[0011] From these feature vectors, the computer induces classifiers based on patterns or properties that characterize when a media segment belongs to a particular category. The term "pattern" is meant to be very general. These patterns or properties may be presented as rules, which may sometimes be easily understood by a human being, or in other, less accessible formats, such as a weight vector and threshold used to partition a vector space with a hyperplane. Exactly what constitutes a pattern or property in a classifier depends on the particular machine learning technology employed.)
Regarding claim 7, Bolle, Su and Arth teach The method of claim 6, further comprising generating the neural network by an image classifier followed by an image encoder operable to generate embeddings in the vector space of words. (Bolle [0080] In the learning phase, off-line supervised or unsupervised learning, using a training set of labeled or unlabeled multimedia items as appropriate, is employed to induce a classifier based on patterns found in a unified representation as a single feature vector of disparate kinds of features, linguistic and visual, found in a media item under consideration. The unified representation of disparate features in a single feature vector will enable a classifier to make use of more complicated patterns for categorization, patterns that simultaneously involve linguistic and visual aspects of the media, resulting in superior performance as compared with other less sophisticated techniques. This allows for better designed and more precisily performing business processes. First, for each media item, the accompanying text is represented by a sparse textual feature vector. Secondly, for each media item, a set of key frames or key intervals (key intervals, for short) is determined, which can either be regularly sampled in time or based on the information content. From each key interval, a set of features is extracted from a number of regions in the key intervals. These regions can be different for each feature. The extracted features are coarsely quantized. Hence, each key interval is encoded by a sparse textual feature vector and a sparse visual feature vector. The sparse textual feature vectors and the sparse visual feature vectors may optionally need to be further transformed to assure their compatibility in various ways... [0160] FIG. 13D shows a preferred specific methods for determining a visual feature vectors F.sub.v from the visual part of the media item 1310. The key frames or key intervals 1356, 1353, 1360, . . . 
, 1359 are selected from a media item. However, by rearranging these key frames or intervals, 1383, 1386, etc., these key frames or intervals are ordered into a new sequence 1320 of key intervals 1353, 1356, 1359, . . . , 1360 according to the value of visual feature that is to be encoded in the visual feature vector F.sub.v. For example, the visual feature of average frame brightness can be ordered from high to low in the sequence 1320. Or, the visual feature of average optical flow in key intervals can be ordered from high to low in 1320.[FIG.2-4] shows a visual of the system through the corresponding structures and their corresponding flows)
Regarding claim 8, Bolle, Su and Arth teach The method of claim 7, wherein the classifier is operable to be trained to identify image labels. (Bolle [0058] In sum, we can (roughly) distinguish the following approaches to media item categorization and media item subject detection; or, more generally, media item classification. The approaches are classified based on the features that are used. The features are derived from the raw analog signal, visual features computed from digitized media items frames ( images), textual features directly decoded from the closed-caption, and textual features obtained from automatically computed speech transcripts. Here is a list of common kinds of features used to classify multimedia items:... [0122] This system categorizing media items has two distinct aspects. The first aspect is called the training phase which builds representations of the reference media items; the second phase is called the categorization phase, where instances media items are categorized. The training phase is an off-line process that involves processing of the reference media items to form a set of one or more categories. The categorization phase classifies a media item in a collection of such items by processing the media item to extract audio and visual features and using the media item class representations. [0126] In 430, the system constructs a single vector representation of the visual features extracted or associated with each media item in D. In 440, for each labeled media item in the data set D, the system constructs a training set T(D) by combining the two vector representations of that media segment (constructed in 420 and 430) into a single composite feature vector, with the resulting vector labeled by the same set of classes used to label the media item. [FIG.2-4] shows a visual of the system through the corresponding structures and their corresponding flows)
Regarding claim 9, Bolle, Su and Arth teach The method of claim 7, further comprising converting the image classifier to an encoder operable to generate semantic-based descriptors. ( Bolle  [0006] Multimedia collections may also be categorized based on data content, such as the amount of green or red in images or video and sound frequency components of audio segments. The media item collections have to be then preprocessed and the results have to be somehow categorized based on the visual properties. Categorizing media items based on semantic content, the actual meaning (subjects and objects) of the media items, on the other hand, is a difficult issue. For video, speech may be categorized or recognized to some extent, but beyond that, the situation is much more complicated because of the rudimentary state of the art in machine-interpretation of visual data. [0126] FIG. 4, flowchart 400, shows, when supervised learning is employed, the complete integration of disparate media modules in the learning phase, i.e., the induction from labeled data of a classifier whose purpose is media item categorization. In the initial step 410, the system accepts as input a data set D consisting of media items, each labeled as belonging to 0 or more classes from a set or hierarchy of classes S. Steps 420 and 430 may be permuted or carried out simultaneously. In 420, the system constructs a single vector representation of text features and/or audio features extracted or associated with each media item in D. These features may be present in a transcript produced by voice recognition software, in close-captioned text, or in open-captioned text. Some features may indicate the presence of or the character, appropriately quantized, of other audible characteristics of the media segment, such as music, silence, and loud noises. In 430, the system constructs a single vector representation of the visual features extracted or associated with each media item in D. 
In 440, for each labeled media item in the data set D, the system constructs a training set T(D) by combining the two vector representations of that media segment (constructed in 420 and 430) into a single composite feature vector, with the resulting vector labeled by the same set of classes used to label the media item. Optionally, in 440, before combining the vector representations, the system may uniformly transform one or both of those set of representations in order to assure compatibility. Among the ways that incompatibility may arise may be (1) a marked difference in the number of values that may appear as components of the vectors and (2) a marked difference in the norms or sizes of the vectors present in the two sets. The exact criterion for what constitutes a marked difference between to the sets of vectors will depend in practice on the particular technique of supervised learning being employed, and it may idiosyncratically depend on the data set D. Thus, in practice, the criterion may be experimentally determined by the evaluation of different classifiers induced under different assumptions. At any rate, in 440, the system ultimately produces, normally by concatenation of the (possibly transformed) feature vectors produced in 420 and 430, a composite labeled feature vector is ultimately produced for each media item in D. In 450, the system uses a supervised learning technique--a wide variety of them exist--with T(D) as training data to induce a classifier... [FIG.16 & 22] further elaborate using visual structures with corresponding flow charts)
Regarding claim 15, Bolle, Su and Arth teach The method of claim 1, further comprising: calculating, in advance of a query being received and/or processed, one or more descriptors; (Bolle [0036] Herein, first fifteen labels defined based on these visual features (by text, the authors, mean superimposed text in the video) are defined, examples are "talking head" and "one text line." A technique using Hidden Markov models (HMM) is described to classify a given media item into predefined categories, namely, commercial, news, sitcom and soap. An HMM takes these labels as input and has observation symbols as output. The system consists of two phases, a training and a classification stage. Reference Dimitrova et al. is incorporated herein in its entirety. [0221] vi. The automatic generation of MPEG-7 descriptors, as defined by the International Organisation for Standardisation/Organisation Internationale de Normalisation, ISO/IEC JTC1/SC29/WG11 specification "Coding of Moving Pictures and Audio." These descriptors are metadata items (digitally encoded annotations) which would be embedded in the bitstreams of videos (television; movies), sometime between the time of content creation ("filming" or "capture") and the time of broadcast/release. These metadata items are then available to all downstream processes (post-production/editing stages of preparation of the complete video product, distribution channels for movie releases, or by receivers/viewers of the broadcast), for various purposes, in particular, retrieval from video archives by content-based querying (in other words, facilitating the finding of video clips of interest, or a specific video clip, from within large collections of video). The descriptors can be used to explicitly label events of interest in a video when they happen, such as the scoring of goals in soccer matches. Manually-controlled processes for creation of such annotations are available now, but the work is tedious and expensive.)						
Receiving a query regarding the image data and/or the tag data; and providing a unified representation in relation to the query. ( Bolle [0039] Reference Smith et al. is incorporated herein in its entirety. A sophisticated video database browsing systems is described, the authors refer to browsing as "skimming." Much emphasis is placed on visual analysis for video interpretation and video summarization (the construction of two-dimensional depictions of the video to allow for nonlinear access). Visual analysis include scene break detection, camera motion analysis, and object detection (faces and superimposed text). The audio transcript is used to identify keywords in it. Term frequency inverse document frequency techniques are used to identify critical words. Words that appear frequently in a particular video segment but occur infrequently in standard corpuses receive the highest weight. In Smith et al. the speech recognition is not automated yet, and closed-captioning is used instead. Video search is accomplished through the use of the extracted words as search keys, browsing of video summaries then allows for quickly finding the video of interest [0221] The automatic generation of MPEG-7 descriptors, as defined by the International Organisation for Standardisation/Organisation Internationale de Normalisation, ISO/IEC JTC1/SC29/WG11 specification "Coding of Moving Pictures and Audio." These descriptors are metadata items (digitally encoded annotations) which would be embedded in the bitstreams of videos (television; movies), sometime between the time of content creation ("filming" or "capture") and the time of broadcast/release. 
These metadata items are then available to all downstream processes (post-production/editing stages of preparation of the complete video product, distribution channels for movie releases, or by receivers/viewers of the broadcast), for various purposes, in particular, retrieval from video archives by content-based querying (in other words, facilitating the finding of video clips of interest, or a specific video clip, from within large collections of video). The descriptors can be used to explicitly label events of interest in a video when they happen, such as the scoring of goals in soccer matches. Manually-controlled processes for creation of such annotations are available now, but the work is tedious and expensive. [0224] A first application is locating (illegal) copies of media items on the Internet or other (public) databases. This application involves searching for digital copies of media elements on the Internet or other (public) databases. With the wide spread use of digital media (audio and video), the illegal copying and distribution of media are becoming a significant problem for the media industry.)
 Regarding claim 17, Bolle, Su and Arth teach The method of claim 15, wherein: the image data comprises one or more embedded images; the one or more descriptors are calculated in relation to each of the one or more embedded images; (Bolle [0036] Herein, first fifteen labels defined based on these visual features (by text, the authors, mean superimposed text in the video) are defined, examples are "talking head" and "one text line." A technique using Hidden Markov models (HMM) is described to classify a given media item into predefined categories, namely, commercial, news, sitcom and soap. An HMM takes these labels as input and has observation symbols as output. The system consists of two phases, a training and a classification stage. Reference Dimitrova et al. is incorporated herein in its entirety.[0145] Categorization segments of a media item M, is aided by using some color quantization, for example, the following frame color codes. The color space of frames ( images) has been extensively used for indexing and searching based on image content. Application of hue color code, a preferred embodiment of this invention, is comprised of a number of steps (see FIG. 11A).  [0221] vi. The automatic generation of MPEG-7 descriptors, as defined by the International Organisation for Standardisation/Organisation Internationale de Normalisation, ISO/IEC JTC1/SC29/WG11 specification "Coding of Moving Pictures and Audio." These descriptors are metadata items (digitally encoded annotations) which would be embedded in the bitstreams of videos (television; movies), sometime between the time of content creation ("filming" or "capture") and the time of broadcast/release. 
These metadata items are then available to all downstream processes (post-production/editing stages of preparation of the complete video product, distribution channels for movie releases, or by receivers/viewers of the broadcast), for various purposes, in particular, retrieval from video archives by content-based querying (in other words, facilitating the finding of video clips of interest, or a specific video clip, from within large collections of video). The descriptors can be used to explicitly label events of interest in a video when they happen, such as the scoring of goals in soccer matches. Manually-controlled processes for creation of such annotations are available now, but the work is tedious and expensive.)											and a shape descriptor is calculated according to an average of the one or more descriptors in relation to each of the one or more embedded images, optionally wherein the average is biased according to one or more of the weights. (Bolle [0023]This can be a repetitive pattern of primitives (texels), or, can be more random, i.e., structural textures and statistical textures. Computational texture measures are either region-based or edge-based, trying to capture structural textures and statistical textures, respectively. In "VeggieVision" to Bolle et al., a texture representation of an image, image class, or image category, then, is a one-dimensional histogram of local texture feature values. Shape can also be represented in terms of frequency distribution. The information available to work with is the two-dimensional boundary of (say) a segmented image. Boundary shape is a feature of multiple boundary pixels and is expressed by a local computational feature, for example, curvature. Local curvature is estimated by fitting a circle at each point of the boundary. After smoothing, this boundary shape feature is quantized and a histogram is computed. 
Instead of over an area, such as for color histograms, these histograms are computed from a collection of image pixels that form the boundary of the object image. Finally, size of image segments is another feature of the images that is important in "VeggieVision" to Bolle et al. A method that computes area from many collections of three boundary points is proposed. Three points determine a circle and, hence, a diameter D. A histogram of these diameter estimates is then used as a representation for objects (in the image) size.  [0247] The communication links 2353, 2356, . . . , and 2359 can be either overtly or covertly monitored, by processes which are indicated with the circular shapes, 2363, 2366 through 2369. As part of these processes, the intercepted multimedia items can be categorized in terms of category, subject, topic, object, etc. For each individual user, a profile, profiles 2372, 2376, . . . , 2379, i.e., Profiles j, j=1, . . . , m, are generated by building an textual-visual feature vector F from the intercepted multimedia. The profiles can be simply a set of such vectors)
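For illustration only, the averaging recited in claim 17 (a shape descriptor calculated as an average of per-image descriptors, optionally biased by weights) can be sketched as below. This illustrates the claim language rather than any passage of Bolle; the function name and sample values are hypothetical.

```python
def weighted_average_descriptor(descriptors, weights=None):
    """Average equal-length descriptors, optionally biased by per-image weights."""
    if not descriptors:
        raise ValueError("at least one descriptor is required")
    if weights is None:
        # Unweighted case: every image contributes equally.
        weights = [1.0] * len(descriptors)
    if len(weights) != len(descriptors):
        raise ValueError("one weight per descriptor is required")
    total = sum(weights)
    if total == 0:
        raise ValueError("weights must not sum to zero")
    dim = len(descriptors[0])
    return [
        sum(w * d[i] for w, d in zip(weights, descriptors)) / total
        for i in range(dim)
    ]


# Hypothetical descriptors for three embedded images of one object.
views = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
plain = weighted_average_descriptor(views)
biased = weighted_average_descriptor(views, weights=[2.0, 1.0, 1.0])
```

With the weights shown, the first image contributes twice as much as each of the others, so the biased result moves toward that image's descriptor.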
 Claims 10 and 12-13 are rejected under 35 U.S.C. 103 as being unpatentable over Bolle in view of Arth, Su and US 20160350336 A1; Checka; Neal et al. (hereinafter Checka).
 Regarding claim 10, Bolle, Su and Arth teach The method of claim 6			Bolle lacks explicitly teaching wherein the neural network comprises one or more fully-connected layers.											However Checka teaches wherein the neural network comprises one or more fully-connected layers. (Checka [0031] To design a deep learning architecture, the present methods and systems may implement various transfer learning strategies. Examples of such strategies include, but are not limited to: [0032] Treating the convolutional neural network as a fixed feature extractor: Given a convolutional neural network pre-trained on ImageNet, the last fully connected layer may be removed, then the convolutional neural network may be treated as a fixed feature extractor for the new dataset. ImageNet is a publicly available image dataset including 14,197,122 annotated images (disclosed in J. Deng, W. Dong, R. Socher, L. Li, and F-F. Li, "ImageNet: A Large Scale Hierarchical Image Database", IEEE Conference on Computer Vision and Pattern Recognition, 2009), which publication is hereby incorporated by reference in its entirety. The result may be an N-D vector, known as a convolutional neural network code, which contains the activations of the hidden layer immediately before the classifier/output layer. The convolutional neural network code may then be applied to image classification or search tasks as described further below. [0033] Fine-tuning the convolutional neural network: Given an already learned model, the architecture may be adapted and backpropagation training may be resumed from the already learned model weights. One can fine-tune all the layers of the convolutional neural network, or keep some of the earlier layers fixed (due to overfitting concerns) and then fine-tune some higher-level portion of the convolutional neural network. 
This is motivated by the observation that the earlier features of a convolutional neural network include more generic features (e.g., edge detectors or color blob detectors) that may be useful to many tasks, but later layers of the convolutional neural network becomes progressively more specific to the details of the classes contained in the original dataset. [0034] Combining multiple convolutional neural networks and editing models: Given multiple individually trained models for different stages of the system, the different models may be combined into one single architecture by performing "net surgery". Using net surgery techniques, layers and their parameters from one model may be copied and merged into another model, allowing results to be obtained with one forward pass, instead of loading and processing multiple models sequentially. Net surgery also allows editing model parameters. This may be useful in refining filters by hand, if required. It is also helpful in casting fully connected layers to fully convolutional layers to facilitate generation of a classification map for larger inputs instead of one classification result for the whole image.)											Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to take all Bolle's methods and make the addition of Checka in order to improve the system ability to process image data via increased capabilities (Checka [0005] Various image processing methods are known in the art. Typically, such image processing methods require human intervention. For example, a human may need to assign descriptors and/or labels to the images being processed. This can be time consuming and expensive. There is a need in the art for improved systems and methods for processing image data. 
[0034]Using net surgery techniques, layers and their parameters from one model may be copied and merged into another model, allowing results to be obtained with one forward pass, instead of loading and processing multiple models sequentially. Net surgery also allows editing model parameters. This may be useful in refining filters by hand, if required. It is also helpful in casting fully connected layers to fully convolutional layers to facilitate generation of a classification map for larger inputs instead of one classification result for the whole image. )
 Regarding claim 12, the combination of Bolle, Arth, Su and Checka teach The method of claim 10, wherein one or more parameters of the one or more fully-connected layers are updated to minimize a total Euclidean loss. (Checka [0039] The features output by the convolution neural network may be tailored to new image search tasks and domains using a visual similarity learning algorithm. Provided labeled similar and dis-similar image pairs, this is accomplished by adding a layer to the deep learning architecture that applies a non-linear transformation of the features such that the distance between similar examples is minimized and that of dis-similar ones is maximized as illustrated in FIG. 4. The Siamese network learning algorithm may be used (Disclosed in S. Chopra, R. Hasdell, and Y. LeCun, "Learning a Similarity Metric Discriminatively, with Application to Face Verification", In the Proceedings of CVPR, 2005, and R. Hadsell, S. Chopra and Y. LeCun, "Dimensionality Reduction by Learning an Invariant Mapping". In the Proceedings of CVPR, 2006), each of which publication is hereby incorporated herein by reference in its entirety. This optimizes a contrastive loss function: ...[0059] In an exemplary embodiment, the algorithm is as follows: [0060] 1. Select K points as the initial centroids. This selection is accomplished by randomly sampling dense regions of the feature space. [0061] 2. Loop [0062] a. Form K clusters by assigning all points to the closest centroid. The centroid is typically the mean of the points in the cluster. The "closeness" is measured according to a similarity metric such as, but not limited to, Euclidean distance, cosine similarity, etc. The Euclidean distance is defined as:... [62-70] further elaborate on the use of the function(s) )
 Regarding claim 13, the combination of Checka, Arth, Su and Bolle teach The method of claim 12, further comprising calculating the total Euclidean loss through consideration of the smallest Euclidean difference between two points. (Checka [0039] The features output by the convolution neural network may be tailored to new image search tasks and domains using a visual similarity learning algorithm. Provided labeled similar and dis-similar image pairs, this is accomplished by adding a layer to the deep learning architecture that applies a non-linear transformation of the features such that the distance between similar examples is minimized and that of dis-similar ones is maximized as illustrated in FIG. 4. The Siamese network learning algorithm may be used (Disclosed in S. Chopra, R. Hasdell, and Y. LeCun, "Learning a Similarity Metric Discriminatively, with Application to Face Verification", In the Proceedings of CVPR, 2005, and R. Hadsell, S. Chopra and Y. LeCun, "Dimensionality Reduction by Learning an Invariant Mapping". In the Proceedings of CVPR, 2006), each of which publication is hereby incorporated herein by reference in its entirety. This optimizes a contrastive loss function: ...[0059] In an exemplary embodiment, the algorithm is as follows: [0060] 1. Select K points as the initial centroids. This selection is accomplished by randomly sampling dense regions of the feature space. [0061] 2. Loop [0062] a. Form K clusters by assigning all points to the closest centroid. The centroid is typically the mean of the points in the cluster. The "closeness" is measured according to a similarity metric such as, but not limited to, Euclidean distance, cosine similarity, etc. The Euclidean distance is defined as:... [62-70] further elaborate on the use of the function(s) )
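For illustration only, the closest-centroid assignment quoted from Checka [0062] can be sketched using the standard definition of Euclidean distance. The exact expressions in Checka are elided in the quotations above and are not reproduced here. Reading a "total" loss as the sum of each point's smallest distance to any centroid is an assumption made for this sketch, and all names and sample values are hypothetical.

```python
import math


def euclidean(p, q):
    """Standard Euclidean distance between two equal-length points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def assign_to_closest(points, centroids):
    """Return, for each point, the index of its closest centroid."""
    return [
        min(range(len(centroids)), key=lambda k: euclidean(p, centroids[k]))
        for p in points
    ]


def total_min_distance(points, centroids):
    """Assumed total loss: sum of each point's smallest distance to a centroid."""
    return sum(min(euclidean(p, c) for c in centroids) for p in points)


# Hypothetical feature-space points and two centroids.
points = [(0.0, 0.0), (0.0, 3.0), (10.0, 0.0)]
centroids = [(0.0, 0.0), (10.0, 4.0)]
labels = assign_to_closest(points, centroids)
loss = total_min_distance(points, centroids)
```

Here the first two points fall to the first centroid and the third point to the second, so the summed minimum distance is 0 + 3 + 4.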
 Claim 11 is rejected under 35 U.S.C. 103 as being unpatentable over Bolle in view of Checka, Arth, Su and US 9792492 B2; Soldevila; Albert Gordo et al. (hereinafter Soldevila).
 Regarding claim 11, the combination of Bolle and Soldevila teaches The method of claim 10, wherein the one or more fully-connected layers are operable to return a vector
The combination lacks explicitly teaching return a vector of the same dimensionality of the image data
However Soldevila teaches return a vector of the same dimensionality of the image data (Soldevila [Col. 8 lines 57-64] At each fully-connected layer of the sequence 86, the input vector 106, 112, 114 is converted to an output vector 112, 114, 116, which may have the same or fewer dimensions (or in some cases, more dimensions). The output 116 of the final fully-connected layer 92 is used to generate the set of predictions 60. Each prediction is a class probability for a respective one of the classes in the set of classes.)
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to take all prior methods and make the addition of Soldevila's neural network methods in order to further improve the image classification abilities of the system (Soldevila [Col. 2, lines 30-35] The present system and method provide an efficient way to use ConvNets for generating representations that are particularly useful for computing similarity. [Col. 4, lines 10-15] The neural network-based gradient representation can lead to consistent improvements with respect to alternative methods that represent an image using only quantities computed during the forward pass.)
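As a minimal sketch of the point quoted from Soldevila (a fully-connected layer converts an input vector to an output vector that may have the same or fewer dimensions), the example below assumes the conventional affine form y = Wx + b, in which the output length equals the number of rows of W. The weights, biases and function name are hypothetical and are not taken from Soldevila.

```python
def fully_connected(x, weights, bias):
    """Apply y = W x + b; the output has one entry per row of W."""
    return [
        sum(w * xi for w, xi in zip(row, x)) + b
        for row, b in zip(weights, bias)
    ]


x = [1.0, 2.0, 3.0]

# Square weight matrix: the output keeps the input's dimensionality.
w_same = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
y_same = fully_connected(x, w_same, [0.5, 0.5, 0.5])

# Single-row weight matrix: the output has fewer dimensions than the input.
w_fewer = [[1.0, 1.0, 1.0]]
y_fewer = fully_connected(x, w_fewer, [0.0])
```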
 Claim 14 is rejected under 35 U.S.C. 103 as being unpatentable over Bolle in view of Arth, Su and US 10685057 B1; Chavez; Alexander Kikuta et al. (hereinafter Chavez).
 Regarding claim 14, Bolle and Arth teach The method of claim 6, wherein the neural network 												but lacks explicitly teaching is operable to minimize a softmax loss				However Chavez teaches is operable to minimize a softmax loss (Chavez [col. 9, lines 30-65]  The server 130 includes a memory 232, a processor 236, and a communications module 238. The memory 232 of the server 130 includes a convolutional neural network 240. In one or more implementations, the convolutional neural network 240 may be a series of neural networks, one neural network for each style classification. As discussed herein, a convolutional neural network 240 is a type of feed-forward artificial neural network using a supervised learning algorithm, where individual neurons are tiled in such a way that the individual neurons respond to overlapping regions in a visual field. The architecture of the convolutional neural network 240 may be in the style of existing well-known image classification architectures such as AlexNet, GoogLeNet, or Visual Geometry Group models. In certain aspects, the convolutional neural network 240 consists of a stack of convolutional layers followed by several fully connected layers. The convolutional neural network 240 can include a loss layer (e.g., softmax or hinge loss layer) to back-propagate errors so that the convolutional neural network 240 learns and adjusts its weights to better fit provided image data.The memory 232 also includes a collection of images 252 and an image search engine 242 for searching the collection of images 252. In one or more implementations, the collection of images 252 represents a database that contains, for each image, a mapping from an image identifier (e.g., a tag) to a data file containing pixel data for the image (e.g., in jpeg format). The collection of images 252 can be, for example, a dataset of images used for training corresponding to a number of style classes (e.g., about 25). 
Each of the images may include an indication of its respective style classes applicable to the image. The images may be paired with image vector information and image cluster information. The image vector information may identify vectors representing a large sample of images (e.g., about 50 million) and the image cluster information may identify the vectors in one or more clusters such that each cluster of images represents a semantic concept (e.g., "weather," "time-of-day," "season," etc.). [FIG. 4] shows a flow of the corresponding methods of classification)
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to take all of Bolle's methods and make the addition of Chavez in order to further enhance the classification of the system via increased capabilities (Chavez [col. 9, lines 30-65]  The server 130 includes a memory 232, a processor 236, and a communications module 238. The memory 232 of the server 130 includes a convolutional neural network 240. In one or more implementations, the convolutional neural network 240 may be a series of neural networks, one neural network for each style classification. As discussed herein, a convolutional neural network 240 is a type of feed-forward artificial neural network using a supervised learning algorithm, where individual neurons are tiled in such a way that the individual neurons respond to overlapping regions in a visual field. The architecture of the convolutional neural network 240 may be in the style of existing well-known image classification architectures such as AlexNet, GoogLeNet, or Visual Geometry Group models. In certain aspects, the convolutional neural network 240 consists of a stack of convolutional layers followed by several fully connected layers. 
The convolutional neural network 240 can include a loss layer (e.g., softmax or hinge loss layer) to back-propagate errors so that the convolutional neural network 240 learns and adjusts its weights to better fit provided image data.The memory 232 also includes a collection of images 252 and an image search engine 242 for searching the collection of images 252. In one or more implementations, the collection of images 252 represents a database that contains, for each image, a mapping from an image identifier (e.g., a tag) to a data file containing pixel data for the image (e.g., in jpeg format). The collection of images 252 can be, for example, a dataset of images used for training corresponding to a number of style classes (e.g., about 25). Each of the images may include an indication of its respective style classes applicable to the image. The images may be paired with image vector information and image cluster information. The image vector information may identify vectors representing a large sample of images (e.g., about 50 million) and the image cluster information may identify the vectors in one or more clusters such that each cluster of images represents a semantic concept (e.g., "weather," "time-of-day," "season," etc.).[FIG.4] show a flow of the corresponding methods of classification)
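For reference, the "softmax loss" named in claim 14 and in the Chavez passage is conventionally the cross-entropy of a softmax over the network's output scores. The sketch below assumes that standard definition; Chavez does not give a formula, and the function names and sample scores are hypothetical.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_loss(logits, target_index):
    """Negative log-probability assigned to the correct class."""
    return -math.log(softmax(logits)[target_index])


# Equal scores over three classes give a loss of log(3).
uniform_loss = softmax_loss([0.0, 0.0, 0.0], 0)

# A higher score for the correct class gives a smaller loss.
confident_loss = softmax_loss([5.0, 0.0, 0.0], 0)
```

Training that minimizes this quantity pushes the score of the correct class up relative to the others, which is the back-propagated error signal the quoted passage refers to.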
 Claim 22 is rejected under 35 U.S.C. 103 as being unpatentable over Checka in view of Su and US 20030004966 A1; Bolle, Rudolf M. et al. (hereinafter Bolle).
Regarding claim 22, Checka teaches A method for searching for an image or shape based on a query comprising tag and image data, comprising the steps of: creating a word space in which images, three dimensional objects, text and combinations of the same are embedded; determining vector representations for each of the images, three dimensional objects, text and combinations of the same; (Checka [0031] To design a deep learning architecture, the present methods and systems may implement various transfer learning strategies. Examples of such strategies include, but are not limited to: [0032] Treating the convolutional neural network as a fixed feature extractor: Given a convolutional neural network pre-trained on ImageNet, the last fully connected layer may be removed, then the convolutional neural network may be treated as a fixed feature extractor for the new dataset. ImageNet is a publicly available image dataset including 14,197,122 annotated images (disclosed in J. Deng, W. Dong, R. Socher, L. Li, and F-F. Li, "ImageNet: A Large Scale Hierarchical Image Database", IEEE Conference on Computer Vision and Pattern Recognition, 2009), which publication is hereby incorporated by reference in its entirety. The result may be an N-D vector, known as a convolutional neural network code, which contains the activations of the hidden layer immediately before the classifier/output layer. The convolutional neural network code may then be applied to image classification or search tasks as described further below. [0033] Fine-tuning the convolutional neural network: Given an already learned model, the architecture may be adapted and backpropagation training may be resumed from the already learned model weights. 
One can fine-tune all the layers of the convolutional neural network, or keep some of the earlier layers fixed (due to overfitting concerns) and then fine-tune some higher-level portion of the convolutional neural network.[0036] Prior to analyzing images represented by an image dataset, each image may be resized (or sub-window cropped) to a canonical size (e.g., 224.times.224). Each cropped image may be fed through the trained network, and the output at the first fully connected layer is extracted. The extracted output may be a 4096 dimensional feature vector representing the image and may serve as a basis for the image analysis. To facilitate this, well-established open-source libraries such as, but not limited to, LIBSVM and FLANN (Fast Library for Approximate Nearest Neighbors) may be used. [0037] In order to handle geometric variations in images, a spatial transformer may be used. The spatial transformer module may result in models which learn translation, scale and rotation invariance. A spatial transformer is a module that learns to transformer feature maps within a network that correct spatially manipulated data without supervision. A description of spatial transformer networks can be found in the following publication: M. Jaderberg K. Simonyan A. Zisserman K. Kavukcuoglu, "Spatial Transformer Networks", Advances in Neural Information Processing Systems 28 (NIPS), 2015, which publication is hereby incorporated herein by reference to its entirety. A spatial transformer may help localize objects, normalizing them spatially for better classification and representation for visual search. [53-58] further elaborate on the systems ability to query using image/tag data in process that involves finding similar data/vector [FIG. 
1 & 14] show a visual of the system capable of querying using image/tag data in process that involves finding similar data/vector )				determining a vector representation for the query; determining … which one or more of the images, three dimensional objects, text and combinations have a spatially close vector representation to the vector representation for the query. (Checka [0031] To design a deep learning architecture, the present methods and systems may implement various transfer learning strategies. Examples of such strategies include, but are not limited to: [0032] Treating the convolutional neural network as a fixed feature extractor: Given a convolutional neural network pre-trained on ImageNet, the last fully connected layer may be removed, then the convolutional neural network may be treated as a fixed feature extractor for the new dataset. ImageNet is a publicly available image dataset including 14,197,122 annotated images (disclosed in J. Deng, W. Dong, R. Socher, L. Li, and F-F. Li, "ImageNet: A Large Scale Hierarchical Image Database", IEEE Conference on Computer Vision and Pattern Recognition, 2009), which publication is hereby incorporated by reference in its entirety. The result may be an N-D vector, known as a convolutional neural network code, which contains the activations of the hidden layer immediately before the classifier/output layer. The convolutional neural network code may then be applied to image classification or search tasks as described further below. [0033] Fine-tuning the convolutional neural network: Given an already learned model, the architecture may be adapted and backpropagation training may be resumed from the already learned model weights. 
One can fine-tune all the layers of the convolutional neural network, or keep some of the earlier layers fixed (due to overfitting concerns) and then fine-tune some higher-level portion of the convolutional neural network.[0036] Prior to analyzing images represented by an image dataset, each image may be resized (or sub-window cropped) to a canonical size (e.g., 224.times.224). Each cropped image may be fed through the trained network, and the output at the first fully connected layer is extracted. The extracted output may be a 4096 dimensional feature vector representing the image and may serve as a basis for the image analysis. To facilitate this, well-established open-source libraries such as, but not limited to, LIBSVM and FLANN (Fast Library for Approximate Nearest Neighbors) may be used. [0037] In order to handle geometric variations in images, a spatial transformer may be used. The spatial transformer module may result in models which learn translation, scale and rotation invariance. A spatial transformer is a module that learns to transformer feature maps within a network that correct spatially manipulated data without supervision. A description of spatial transformer networks can be found in the following publication: M. Jaderberg K. Simonyan A. Zisserman K. Kavukcuoglu, "Spatial Transformer Networks", Advances in Neural Information Processing Systems 28 (NIPS), 2015, which publication is hereby incorporated herein by reference to its entirety. A spatial transformer may help localize objects, normalizing them spatially for better classification and representation for visual search. [53-58] further elaborate on the systems ability to query using image/tag data in process that involves finding similar data/vector [FIG. 
1 & 14] show a visual of the system capable of querying using image/tag data in process that involves finding similar data/vector)
Checka lacks explicitly and orderly teaching combining the vector representations by performing vector calculus…utilizing said combined vector representation
Bolle helps teach combining the vector representations by performing vector calculus…utilizing said combined vector representation (Bolle [0055] a set of word features is developed. These features are derived from multiple knowledge sources: prosodic features, cue phrase features, noun phrase features, combined features.  [0080] The textual feature vector and the visual feature vector are combined by concatenation to produce a unified representation in a single vector of disparate kinds of features. Having created the unified representation of the training data, standard methods of classifier induction are then used, followed by appropriate business processes and business decisions. [0106] FIG. 15 shows the process of combining visual and textual feature vectors to obtain a vector representing the disparate sources of information in the media item. [125-126 & 172-176] further elaborate on the process of having vectors/representations of the image and textual data. [FIG. 2-4] show a visual of the system through the corresponding structures and their corresponding flows)
The combination lacks explicitly and orderly teaching wherein the vector representations of the three dimensional objects are computed by rendering views for each of the three dimensional objects from multiple viewpoints, computing a descriptor for each view, and averaging the descriptors for each view.
However Su teaches wherein the vector representations of the three dimensional objects are computed by rendering views for each of the three dimensional objects from multiple viewpoints, (Su [Section 2.0 para. 4, Section 3.1 para. 2-3] show the different viewpoints for 3-D shapes; [FIG. 5 and 7] show a visual of the viewpoints for 3D shapes)
computing a descriptor for each view (Su [section 3.0 para. 1-3, section 3.2 para. 1-6] show the descriptors; [FIG. 6 and 7] show a visual)
and averaging the descriptors for each view (Su [section 3.0 para. 2, section 3.2 para. 5-6] show averaging the descriptors for the view; [Table 1-2 and eq. 1] show a visual chart and equation for averaging)
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to take all prior methods and make the addition of Su's 3D shape recognition methods in order to create a system with higher computational efficiency (Su [section 3.0, para. 3, section 5, para. 2] improvements in computational efficiency due to corresponding methods)
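To make the claim 22 limitations concrete, the following is a minimal sketch, under stated assumptions, of the three operations discussed above: averaging per-view descriptors into one shape descriptor (the limitation for which Su is cited), concatenating a visual vector with a textual vector into a unified vector (Bolle [0080]), and selecting the collection item whose vector is spatially closest to the query. Euclidean distance is assumed as the closeness measure, and every name and value is hypothetical; none of it is taken from the cited references.

```python
import math


def average_descriptors(view_descriptors):
    """Element-wise mean of the descriptors computed for each rendered view."""
    n = len(view_descriptors)
    return [sum(column) / n for column in zip(*view_descriptors)]


def unify(visual_vec, text_vec):
    """Combine visual and textual vectors by concatenation."""
    return list(visual_vec) + list(text_vec)


def nearest(query, collection):
    """Return the key whose vector is closest to the query (Euclidean)."""
    return min(collection, key=lambda name: math.dist(query, collection[name]))


# Hypothetical 2-D descriptors for three rendered views of one 3D object.
chair_views = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
chair_shape = average_descriptors(chair_views)

# Hypothetical collection: visual descriptor followed by a 2-D tag vector.
collection = {
    "chair": unify(chair_shape, [1.0, 0.0]),
    "lamp": unify([0.9, 0.1], [0.0, 1.0]),
}

# A query with both image-derived and tag-derived parts.
query = unify([0.4, 0.6], [1.0, 0.0])
best = nearest(query, collection)
```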
 Claim 23 is rejected under 35 U.S.C. 103 as being unpatentable over Checka in view of Bolle, Su and US 20170221243 A1; Bedi; Ajay et al. (hereinafter Bedi).
 Regarding claim 23, Checka, Su, and Bolle teach The method of claim 22,
but lack explicitly teaching further comprising the step of replacing objects in an image with contextually similar objects.
However Bedi helps teach further comprising the step of replacing objects in an image with contextually similar objects (Bedi [0005] In addition to enabling removal of various objects from within digital photos, many conventional systems and methods replace the removed object with a replacement portion. For example, in response to removing an object from a digital photo, conventional systems and methods often replace the removed object with a replacement portion that includes similar features as the portion of the digital photo surrounding the removed object (e.g., using a digital photo fill process). In this way, a user can modify or otherwise edit a photo to remove an undesired object while maintaining continuity in the background of the digital photo. Nonetheless, removing and replacing objects within digital photos has various drawbacks and limitations with conventional systems and methods. [0006] In particular, where digital photos have lines, shapes, or other geometric features in the background of the digital photo, conventional systems and methods often fail to maintain a sense of continuity in the background after removing and replacing a removed object within the digital photo. For example, where a background of a digital photo includes lines that intersect a portion of an object to be removed, conventional systems and methods fail to generate or provide a replacement background that includes lines that align with or otherwise match surrounding portions of the digital photo around the removed object. As a result, replacing a removed object within a digital photo often results in a discontinuous or otherwise erroneous representation of the background. 
The discontinuous and/or erroneous representation of modified photos that result from conventional systems and methods diminishes the visual quality of the resulting modified digital photo and is frustrating to users. [0007] In addition to failing to correctly account for lines, shapes, or other irregularities in the background of the digital photo, conventional devices also fail to adequately compensate for different surface plans within the digital photo. In particular, where the background of a digital photo includes different planes having different perspectives, conventional systems and methods often fail to provide a replacement background that compensates for the perspective difference of the different planes within the digital photo. As such, replacing an object within a digital photo that includes multiple background planes often includes skewed or distorted portions of the replacement portion and similarly diminishes the quality of the resulting modified digital photo.)									Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to take all Checka's methods and make the addition of Bedi in order to help enhance the system via manipulation of images (Bedi  [0004] Computing devices (e.g., computers, tablets, smart phones) provide numerous ways for people to capture, create, share, view, and otherwise interact with numerous types of digital content. As such, users are increasingly using computing devices to interact with and modify digital photographs (or simply "digital photos"). For example, many computing devices enable users to enhance a digital photo by cropping or otherwise removing a portion of the digital photo. For instance, a computing device can edit a digital photo by removing an object, a person, or other feature in the digital photo. 
[0005] In addition to enabling removal of various objects from within digital photos, many conventional systems and methods replace the removed object with a replacement portion. For example, in response to removing an object from a digital photo, conventional systems and methods often replace the removed object with a replacement portion that includes similar features as the portion of the digital photo surrounding the removed object (e.g., using a digital photo fill process). In this way, a user can modify or otherwise edit a photo to remove an undesired object while maintaining continuity in the background of the digital photo. Nonetheless, removing and replacing objects within digital photos has various drawbacks and limitations with conventional systems and methods. [0010] As such, the systems and methods provide a modified output image that includes a continuous and clean background portion using a geometrically adjusted and aligned source portion to replace a removed object (or other target portion) from within the input image.[96] In this way, the feature identifier 512 can analyze pixels identified as edges within the digital image while ignoring other non-edge pixels and more efficiently computing or otherwise identifying lines and other geometric features within the digital image)
Response to Arguments
Applicant's arguments filed 9/14/2022 have been fully considered.
35 USC § 102 & 35 USC § 103: 
Regarding Applicant’s Argument (page(s): 6-7): Examiner’s response: Applicant’s arguments, filed 9/14/2022, with respect to the rejection(s) under 35 USC § 102/103 have been fully considered and are persuasive. Therefore, the rejection has been withdrawn. However, upon further consideration, a new ground(s) of rejection is made in view of NPL Multi-view Convolutional Neural Networks for 3D Shape Recognition; 9/27/2015; Hang; Su et al. (Su). The examiner recommends adding steps further elaborating on how the "descriptors" are computed and what conditions or parameters are considered. Another area to specify further, to help overcome the current art, is the conditions or parameters that are considered when creating a "unified representation".
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to ARYAN D TOUGHIRY whose telephone number is (571)272-5212. The examiner can normally be reached Monday - Friday, 9 am - 5 pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Aleksandr Kerzhner, can be reached at (571) 270-1760. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/ARYAN D TOUGHIRY/Examiner, Art Unit 2165
/ALEKSANDR KERZHNER/Supervisory Patent Examiner, Art Unit 2165