DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Request for Reconsideration
The request for continued examination filed 2022-04-28 has been received.  No amendments have been made to the claims.  Claims 1-20 remain pending in the application.
Response to Arguments
Applicant’s arguments with respect to rejections under 35 U.S.C. 103 have been fully considered but they are not persuasive.  
Applicant argues on Remarks Pages 9-11 that Yu does not teach “creating a respective user feature vector representative of an identity of a user source of respective multimodal content”, because “In contrast, in Yu, what is embedded is a vector representative of an average of all of the microblog with which a particular user interacted and not a vector representative of content for which a user is a source as taught and claimed by at least the Applicant's independent claim 1.”  Examiner respectfully disagrees, as even if the vector comprises an average of all the microblogs with which a particular user interacted, that information is still “representative of an identity of a user”, as an average of one’s preferences is still some indicator of one’s identity (one of ordinary skill in the art will appreciate that this is apparent with the targeted marketing that follows online tracking of users, for example).  Furthermore, Yu discloses that the user is a “source”, as the “interactions” of Yu include “tweeting”, as Yu discloses on Page 450 Section 3.1: “We assume that a user tweeting, retweeting or commenting on a microblog text reflects that the user is interested in that microblog”.  If the user is the person that initially “tweets”, then the user is the source of that content.  Examiner finally reiterates Yu’s statement on Page 450 Section 3.1:  “The baseline averages the vector representation of microblog texts into a user vector representation.”  Yu here clearly discloses a “user vector representation”.  It is unclear to Examiner, from Applicant’s arguments, how a “user vector” could not be representative in any way of a user’s identity.
Applicant argues on Remarks Page 11 that “There is absolutely no teaching or suggestion in Yu for creating a respective user feature vector representative of an identity of a user that is the source of respective multimodal content for each of the plurality of content of the multimodal content having the first modality and the second modality as claimed by at least the Applicant's independent claim 1.”  Examiner respectfully points out that multimodal content having a first and second modality was established by Gao in Examiner’s mapping, before combining with Yu to teach the vector representing the identity of a user source of content, as was explained in the 35 USC 103 rejections in the previous (and current) office action.  In response to applicant's arguments against the references individually, one cannot show nonobviousness by attacking references individually where the rejections are based on combinations of references.  See In re Keller, 642 F.2d 413, 208 USPQ 871 (CCPA 1981); In re Merck & Co., 800 F.2d 1091, 231 USPQ 375 (Fed. Cir. 1986).

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1, 4, 5, 8-13, 16, 17, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Gao et al. (US 2017/0061250 A1; hereinafter Gao) in view of Yu et al. (“User Embedding for Scholarly Microblog Recommendation”; hereinafter Yu) and Nickel et al. (“Poincaré Embeddings for Learning Hierarchical Representations”; hereinafter Nickel).
As per Claim 1, Gao teaches A method of creating a semantic embedding space for multimodal content for improved recognition of at least one of content, content-related information and events, the method comprising (Gao, Abstract, discloses:  “Disclosed herein are technologies directed to discovering semantic similarities between images and text, which can include performing image search using a textual query, performing text search using an image as a query, and/or generating captions for images using a caption generator. A semantic similarity framework can include a caption generator and can be based on a deep multimodal similar model. The deep multimodal similarity model can receive sentences and determine the relevancy of the sentences based on similarity of text vectors generated for one or more sentences to an image vector generated for an image. The text vectors and the image vector can be mapped in a semantic space, and their relevance can be determined based at least in part on the mapping. The sentence associated with the text vector determined to be the most relevant can be output as a caption for the image.”  Here, Gao discloses a semantic embedding space (“semantic space”) for multimodal content (text and image) for improved recognition of content (“The sentence associated with the text vector determined to be the most relevant can be output as a caption for the image”)).
creating a first training set by, for each of a plurality of content of the multimodal content, creating a respective, first modality feature vector representative of content of the multimodal content having a first modality (Gao, Para [0056], discloses:  “The image model 302 can map an image representation to an image vector 306 in a hidden space”.  Here, Gao discloses a first modality feature vector (“image vector”) representative of content of the multimodal content having a first modality (“image representation”).  Gao, Para [0055], discloses that the “image model” is a first machine learning model: “FIG. 3 is an overview of a DMSM 300, which in some cases can represent the DMSM 230. The DMSM 300 can be used to estimate similarity between an image and a sentence. In examples, the DMSM 300 can uses a pair of neural network models, image model 302, such as image model 240, and text model 304, such as text model 242. As illustrated in FIG. 3, image model 302 and text model 304 are included in DMSM 300, but in some examples they may be separate as shown in FIG. 2.”  Here, Gao discloses that the image model is one of a “pair of neural network models”.  Thus, by disclosing a first machine learning model, which requires training, Gao discloses a first training set.  Gao, Para [0072], discloses “FIGS. 7 and 8 are flow diagrams depicting aspects of discovering semantic similarities between images and text, which can include performing image search using a textual query, performing text search using an image as a query, and/or generating captions for images”.  Here, Gao discloses creating the vectors for each of a plurality of content of the multimodal content.  Gao discloses that this is used for “performing image search using a textual query, performing text search using an image as a query”.  This implies that the embedding has been done for each of a plurality of multimodal content, because completing a “search” for a “query” requires more than one entity (a plurality) in the search space.)
creating a second training set by, for each of a plurality of content of the multimodal content, creating a respective, second modality feature vector representative of content of the multimodal content having a second modality (Gao, Para [0056], discloses:  “The image model 302 can map an image representation to an image vector 306 in a hidden space. The text model 304 can map a text vector 308 in the same hidden space.”  Here, Gao discloses a second modality feature vector (“text vector”) representative of content of the multimodal content having a second modality (“text”).  Gao, Para [0055], discloses that the “text model” is a second machine learning model: “FIG. 3 is an overview of a DMSM 300, which in some cases can represent the DMSM 230. The DMSM 300 can be used to estimate similarity between an image and a sentence. In examples, the DMSM 300 can uses a pair of neural network models, image model 302, such as image model 240, and text model 304, such as text model 242. As illustrated in FIG. 3, image model 302 and text model 304 are included in DMSM 300, but in some examples they may be separate as shown in FIG. 2.”  Here, Gao discloses that the text model is one of a “pair of neural network models”.  Thus, by disclosing a second machine learning model, which requires training, Gao discloses a second training set.  Gao, Para [0072], discloses “FIGS. 7 and 8 are flow diagrams depicting aspects of discovering semantic similarities between images and text, which can include performing image search using a textual query, performing text search using an image as a query, and/or generating captions for images”.  Here, Gao discloses creating the vectors for each of a plurality of content of the multimodal content.  Gao discloses that this is used for “performing image search using a textual query, performing text search using an image as a query”.  This implies that the embedding has been done for each of a plurality of multimodal content, because completing a “search” for a “query” requires more than one entity (a plurality) in the search space.)
However, Gao does not teach creating a third training set by, for each of the plurality of content of the multimodal content having the first modality and the second modality, creating a respective user feature vector representative of an identity of a user source of respective multimodal content; a common geometric space that provides logarithm-like warping of distance space in the common geometric space to capture hierarchical relationships between embedded modality feature vectors of content in the common geometric space.
Yu teaches creating a third training set by, for each of the plurality of content of the [multimodal] content having the first modality [and the second modality], creating a respective user feature vector representative of an identity of a user source of respective [multimodal] content (Recall above that Gao teaches multimodal content.  Yu, Page 450 Section 3.1, discloses:  “We denote a set of users by u = {u1, u2, …, um}, and a set of microblog texts by d = {d1, d2, …, dn}. We assume that a user tweeting, retweeting or commenting on a microblog text reflects that the user is interested in that microblog. Given ut ∈ u, we denote the set of microblogs that ut is interested in by d(ut). In our task, the entire sets of d and u are given, while given a user ut ∈ u, only a subset of d(ut) is known. This subset is used as the training set, denoted as d~(ut). Our task aims to retrieve a subset d' of d, that d' is as similar to d(ut) - d~(ut) as possible. In this section, we introduce one baseline method and then propose two different neural network methods for user and microblog embedding. The baseline averages the vector representation of microblog texts into a user vector representation. Our proposed two methods learn user vector representations jointly with word and text vectors, either indirectly or directly from word vectors.”  Here, Yu discloses creating a third training set (“This subset is used as the training set”) by creating a user feature vector representative of an identity (“user vector representations”) of a user source of respective content (“user tweeting… microblog text”).  A “microblog” is a “tweet”, and thus the user that tweeted the microblog text is the source of the microblog content.)
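For illustration of the baseline quoted above (“averages the vector representation of microblog texts into a user vector representation”), the following is a minimal Python sketch and not a reproduction of Yu's implementation. It assumes NumPy; the function name user_vector and the three-dimensional toy vectors are hypothetical and are introduced here only to show the averaging step.

```python
import numpy as np

def user_vector(text_vectors):
    """Average the vectors of the microblog texts in a user's training subset.

    Hypothetical helper: mirrors the averaging baseline described in the
    quoted passage, with no claim to match Yu's actual code.
    """
    return np.mean(np.asarray(text_vectors, dtype=float), axis=0)

# Illustrative vectors for the known subset d~(ut); the values are assumptions.
known_texts = [[1.0, 0.0, 2.0],
               [3.0, 2.0, 0.0]]
u_t = user_vector(known_texts)
print(u_t)
```

The resulting vector has the same dimensionality as the text vectors, which is why a distance can later be computed between a user vector and a text vector.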
Gao and Yu are analogous art because they are both in the field of endeavor of machine learning.
It would have been obvious before the effective filing date of the claimed invention to combine the multimodal content embedding of Gao with the content and user embedding of Yu. One of ordinary skill in the art would be motivated to do so in order to take advantage of a more efficient way of comparing users with content to make useful recommendations to users, saving time instead of searching manually (Yu, Page 449 Intro Para 1-2: “The volume of scholarly microblog texts is huge, which makes it time-consuming for a researcher to browse and find the ones that he or she is interested in. In this study, we aim to build a personalized recommendation system for recommending scholarly microblogs. With such a system a researcher can easily obtain the scholarly microblogs he or she has interests in.”)
The combination of Gao and Yu further teaches training the semantic embedding space by using a machine learning process to semantically embed the respective, first modality feature vectors of the first training set and the respective, second modality feature vectors of the second training set, and the user feature vectors of the third training set in a common geometric space (Gao, Para [0054], explicitly discloses the training with two training sets:  “In some examples, the DMSM 230 uses a pair of neural networks, an image model 240 and a text model 242, one for mapping each input modality to a common semantic space, which are trained jointly.”  Gao, Para [0057], discloses:  “FIG. 4 is an example illustration showing the mapping of an image vector such as image vector 306 and a text vector such as text vector 308 into a semantic space 402. Using a DMSM such as DMSM 230 and/or 300, the image vector 306, represented as β image, is mapped into the semantic space 402. In some examples, the image vector 306 and the text vector 308 are low dimensional vectors. Using a DMSM such as DMSM 230 and/or 300, the text vector 308, represented as β text+, is mapped into the semantic space 402. If another text vector 308 is available (e.g., another sentence was analyzed using the DMSM), the additional text vector 308, represented as β text−, is mapped into the semantic space 402.”  Here, Gao discloses, for the first modality feature vectors and the respective, second modality feature vectors (“an image vector such as image vector 306 and a text vector such as text vector 308”), semantically embedding in a common geometric space (“mapping…vector…into a semantic space 402”).
Yu also discloses semantically embedding in a common geometric space, wherein Yu discloses embedding content and the user source of the content, in a common geometric space, as shown in Page 451 Section 3.6:  “When recommending microblogs, given a microblog dj and a user uk, we compute the cosine distance between their vector representations, and use the cosine distance to determine whether dj should be recommended to uk or not.”  Here, a cosine distance is calculated between vector representations, indicating they are in a common geometric space.
Thus, the combination of Gao’s two modalities in the common geometric space, and Yu’s modality and user in a common geometric space, results in the claimed combination of two modalities and a user in a common geometric space.)
wherein embedded feature vectors that are related, across modalities, are closer together in the common geometric space than unrelated feature vectors. (Gao, Para [0056], discloses:  “The image model 302 can map an image representation to an image vector 306 in a hidden space. The text model 304 can map a text vector 308 in the same hidden space. The similarity between the image and the sentence can be computed as the cosine distance between the image vector 306 and the text vector 308. The computed cosine distance can be outputted as a cosine semantic similarity 310. The cosine semantic similarity 310 of each of the sentences inputted into the DMSM 300 can be compared to determine a sentence having the highest similarity (i.e. the image vector 306 and the text vector 308 are more similar than other sentences for the same image). In some examples, the cosine semantic similarity 310 can be defined to be the relevance of the image vector 306 to the text vector 308. As used herein, relevance means that the text and image are semantically similar. It is noted that relevance can be defined using other technologies, such as, but not limited to, Euclidean distance between the image vector 306 and the text vector 308.” Here, Gao discloses embedded modality feature vectors (“image vector” and “text vector”) in the common geometric space (“hidden space…same space”) and that, across modalities (image and text), feature vectors that are related are closer together than unrelated (“The similarity between the image and the sentence can be computed as the cosine distance between the image vector 306 and the text vector 308…The computed cosine distance can be outputted as a cosine semantic similarity 310… In some examples, the cosine semantic similarity 310 can be defined to be the relevance of the image vector 306 to the text vector 308”).  Here, Gao discloses those that are related (“relevance”) are closer together (“cosine distance between the image vector 306 and the text vector 308”)).
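As a hedged illustration of the cosine comparison quoted from Gao, the sketch below scores two candidate text vectors against one image vector in the same space. It is not Gao's DMSM; the function name cosine_similarity, the labels, and the two-dimensional vectors are assumptions chosen only so that a related and an unrelated sentence can be told apart.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (1.0 means same direction)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical vectors assumed to lie in one shared space.
image_vec = [1.0, 0.0]
text_vecs = {
    "related sentence": [2.0, 0.1],    # nearly the same direction as the image
    "unrelated sentence": [0.0, 1.0],  # orthogonal to the image
}

scores = {label: cosine_similarity(image_vec, vec) for label, vec in text_vecs.items()}
best = max(scores, key=scores.get)
print(scores, best)
```

Under these assumed values the related sentence scores higher, which is the sense in which related vectors are "closer" across modalities.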
However, the combination of Gao and Yu does not teach a common geometric space that provides logarithm-like warping of distance space in the common geometric space to capture hierarchical relationships between embedded feature vectors of content in the common geometric space.
Nickel teaches a common geometric space that provides logarithm-like warping of distance space in the common geometric space to capture hierarchical relationships between embedded modality feature vectors of content in the common geometric space (Nickel, Intro Para 4, discloses:  “To exploit this structural property for learning more efficient representations, we propose to compute embeddings not in Euclidean but in hyperbolic space, i.e., space with constant negative curvature. Informally, hyperbolic space can be thought of as a continuous version of trees and as such it is naturally equipped to model hierarchical structures. For instance, it has been shown that any finite tree can be embedded into a finite hyperbolic space such that distances are preserved approximately [10]. We base our approach on a particular model of hyperbolic space, i.e., the Poincaré ball model, as it is well-suited for gradient-based optimization. This allows us to develop an efficient algorithm for computing the embeddings based on Riemannian optimization, which is easily parallelizable and scales to large datasets. Experimentally, we show that our approach can provide high quality embeddings of large taxonomies – both with and without missing data. Moreover, we show that embeddings trained on WORDNET provide state-of-the-art performance for lexical entailment. On collaboration networks, we also show that Poincaré embeddings are successful in predicting links in graphs where they outperform Euclidean embeddings, especially in low dimensions.”  Here, Nickel discloses modality feature vectors (“embeddings trained on WORDNET”, wherein “embeddings” are understood in the art to be vectors, and the modality is text, and the features are words).  Nickel also discloses a common geometric space (“hyperbolic space”) to capture hierarchical relationships (“naturally equipped to model hierarchical structures”).  
Nickel also discloses space that provides logarithm-like warping of distance space by disclosing “hyperbolic space”, as hyperbolic space is warped (“space with constant negative curvature”).  One of ordinary skill in the art will appreciate that the distance between two points in hyperbolic space p and q is calculated as:

[Equation image media_image1.png, not reproduced in this text: distance between points p and q in hyperbolic space]

where o and r denote the points where the geodesic meets the real axis.  The distance calculation comprises the operation “ln”, which is the “natural logarithm”.  Thus, hyperbolic space provides logarithm-like warping of distance space.)
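The logarithmic behavior of hyperbolic distance can also be checked numerically. The sketch below is an illustration only: it uses the standard distance function of the Poincaré ball model, the model that Nickel adopts, rather than the cross-ratio form described in the preceding paragraph, and the function name poincare_distance and the sample radii are assumptions. For a point at Euclidean radius r from the origin, this distance equals ln((1 + r) / (1 - r)).

```python
import math

def poincare_distance(u, v):
    """Distance between two points strictly inside the unit ball (Poincaré ball model)."""
    sq_norm_u = sum(x * x for x in u)
    sq_norm_v = sum(x * x for x in v)
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    arg = 1.0 + 2.0 * sq_diff / ((1.0 - sq_norm_u) * (1.0 - sq_norm_v))
    return math.acosh(arg)

origin = [0.0, 0.0]
for r in (0.5, 0.9, 0.99):
    d = poincare_distance(origin, [r, 0.0])
    # Compare against the closed form ln((1 + r) / (1 - r)) for a point at radius r.
    print(r, d, math.log((1 + r) / (1 - r)))
```

The distance grows without bound as r approaches 1, so points near the boundary of the ball are far apart even when their Euclidean separation is small.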
	Nickel also teaches wherein embedded modality feature vectors that are related are closer together in the common geometric space than unrelated modality feature vectors (Nickel, Section 3, discloses:  “In the following, we are interested in finding embeddings of symbolic data such that their distance in the embedding space reflects their semantic similarity”.  Here, Nickel discloses modality feature vectors (“embeddings”) that are related are closer together in the common geometric space than unrelated modality feature vectors (“their distance in the embedding space reflects their semantic similarity”)).
	Nickel and the combination of Gao and Yu are analogous art because they are in the field of endeavor of machine learning.
	It would have been obvious to one of ordinary skill in the art before the effective filing date to combine the multimodal and user feature semantic embedding of the combination of Gao and Yu, with the embedding in hyperbolic space of Nickel. The modification would have been obvious because one of ordinary skill in the art would be motivated to capture hierarchical properties and outperform Euclidean embeddings (Nickel, Abstract:  “Representation learning has become an invaluable approach for learning from symbolic data such as text and graphs. However, while complex symbolic datasets often exhibit a latent hierarchical structure, state-of-the-art methods typically learn embeddings in Euclidean vector spaces, which do not account for this property. For this purpose, we introduce a new approach for learning hierarchical representations of symbolic data by embedding them into hyperbolic space – or more precisely into an n-dimensional Poincaré ball. Due to the underlying hyperbolic geometry, this allows us to learn parsimonious representations of symbolic data by simultaneously capturing hierarchy and similarity. We introduce an efficient algorithm to learn the embeddings based on Riemannian optimization and show experimentally that Poincaré embeddings outperform Euclidean embeddings significantly on data with latent hierarchies, both in terms of representation capacity and in terms of generalization ability.”)

	As per Claim 4, the combination of Gao, Yu, and Nickel teaches the method of claim 1 as shown above, as well as further comprising: projecting at least one of content, content-related information, and an event into the common geometric space; (Gao, Para [0057], discloses “FIG. 4 is an example illustration showing the mapping of an image vector such as image vector 306 and a text vector such as text vector 308 into a semantic space 402”.  Here, Gao discloses projecting into the common geometric space (“mapping…vector…into a semantic space 402”).  This is done for “image” and “text”, which are forms of content.)
and determining at least one embedded feature vector in the common geometric space close to the projection as being related to the projected at least one of the content, the content-related information, and the event. (Gao, Para [0056], discloses:  “The image model 302 can map an image representation to an image vector 306 in a hidden space. The text model 304 can map a text vector 308 in the same hidden space. The similarity between the image and the sentence can be computed as the cosine distance between the image vector 306 and the text vector 308. The computed cosine distance can be outputted as a cosine semantic similarity 310. The cosine semantic similarity 310 of each of the sentences inputted into the DMSM 300 can be compared to determine a sentence having the highest similarity (i.e. the image vector 306 and the text vector 308 are more similar than other sentences for the same image). In some examples, the cosine semantic similarity 310 can be defined to be the relevance of the image vector 306 to the text vector 308. As used herein, relevance means that the text and image are semantically similar. It is noted that relevance can be defined using other technologies, such as, but not limited to, Euclidean distance between the image vector 306 and the text vector 308.” Here, Gao discloses an embedded feature vector (“text vector”) in the common geometric space (“hidden space…same space”) and from that, determining close to the projection as being related to the content (“The similarity between the image and the sentence can be computed as the cosine distance between the image vector 306 and the text vector 308…The computed cosine distance can be outputted as a cosine semantic similarity 310… In some examples, the cosine semantic similarity 310 can be defined to be the relevance of the image vector 306 to the text vector 308”).  Here, Gao discloses those that are related (“relevance”) are close (“cosine distance between the image vector 306 and the text vector 308”) to the projection (“image vector 306”)).
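The two steps of Claim 4, projecting an item into the space and then determining which embedded vectors lie close to the projection, can be sketched as a small retrieval routine. This is a minimal illustration under assumptions, not Gao's method: the function name nearest_embedded, the parameter k, and the toy vectors are hypothetical, and cosine similarity is used because it is the measure named in the quoted paragraph.

```python
import numpy as np

def nearest_embedded(query, embedded, k=1):
    """Return indices of the k embedded vectors most similar (by cosine) to query."""
    q = np.asarray(query, dtype=float)
    E = np.asarray(embedded, dtype=float)
    # Cosine similarity of the query against every row of E at once.
    sims = (E @ q) / (np.linalg.norm(E, axis=1) * np.linalg.norm(q))
    # Sort by descending similarity and keep the first k indices.
    return np.argsort(-sims)[:k].tolist()

# Hypothetical embedded text vectors and a projected image vector.
embedded_texts = [[0.0, 1.0], [1.0, 0.1], [-1.0, 0.0]]
projected_image = [1.0, 0.0]
closest = nearest_embedded(projected_image, embedded_texts, k=2)
print(closest)
```

The returned indices identify the embedded vectors treated as related to the projected item; any other proximity measure could be substituted in the same structure.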

As per Claim 5, the combination of Gao, Yu, and Nickel teaches the method of claim 1 as shown above, as well as wherein a second modality feature vector representative of content of the multimodal content having a second modality is created using information relating to respective content having a first modality. (As shown in Claim 1, Gao Para [0056] discloses vectors for the first and second modalities (“In some examples, the cosine semantic similarity 310 can be defined to be the relevance of the image vector 306 to the text vector 308”).  Gao, Para [0040], discloses:  “In some examples, the data store 212 can act as a repository for a training set of data 238. The training set of data 238 is the corpus of data used by a caption generator 224, explained in more detail below. In some examples, the training set of data 238 can be generated by human and/or computer input, whereby the human or computer act as “teachers.” For example, one or more images can be presented and one or more words can be selected as being associated with each of the one or more images. The training set of data 238 can be used by the caption generator 224 for relativistic calculations. In some examples, the training set of data 238 can include images with more than one word (e.g. a phrase) associated with the image that act as a caption to the image. The words of the training set of data 238 can include different word types, including, but not limited to, nouns, verbs, and adjectives.”  Here, Gao discloses that the second modality feature (“image”) is created using information relating to respective content having a first modality (“In some examples, the training set of data 238 can include images with more than one word (e.g. a phrase) associated with the image that act as a caption to the image”).  Here, the image vector is trained using “caption to the image”, which is content having a first modality (text)).

As per Claim 8, the combination of Gao, Yu, and Nickel teaches the method of claim 1 as shown above, as well as wherein the common geometric space comprises a non-Euclidean space. (Nickel, Abstract, discloses:  “Representation learning has become an invaluable approach for learning from symbolic data such as text and graphs. However, while complex symbolic datasets often exhibit a latent hierarchical structure, state-of-the-art methods typically learn embeddings in Euclidean vector spaces, which do not account for this property. For this purpose, we introduce a new approach for learning hierarchical representations of symbolic data by embedding them into hyperbolic space – or more precisely into an n-dimensional Poincaré ball.”  Here, Nickel discloses an alternative to Euclidean space, “hyperbolic space”, which is a non-Euclidean space.)
It would have been obvious to one of ordinary skill in the art before the effective filing date to combine the multimodal and user feature semantic embedding of the combination of Gao and Yu, with the embedding in hyperbolic space of Nickel, for at least the reasons recited in Claim 1.

As per Claim 9, the combination of Gao, Yu, and Nickel teaches the method of claim 1 as shown above, as well as wherein the non-Euclidean space comprises at least one of a hyperbolic, a Lorentzian, and a Poincare ball. (Nickel, Abstract, discloses:  “Representation learning has become an invaluable approach for learning from symbolic data such as text and graphs. However, while complex symbolic datasets often exhibit a latent hierarchical structure, state-of-the-art methods typically learn embeddings in Euclidean vector spaces, which do not account for this property. For this purpose, we introduce a new approach for learning hierarchical representations of symbolic data by embedding them into hyperbolic space – or more precisely into an n-dimensional Poincaré ball.”  Here, Nickel discloses an alternative to Euclidean space, “hyperbolic space – or more precisely into an n-dimensional Poincaré ball”).
It would have been obvious to one of ordinary skill in the art before the effective filing date to combine the multimodal and user feature semantic embedding of the combination of Gao and Yu, with the embedding in hyperbolic space of Nickel, for at least the reasons recited in Claim 1.

As per Claim 10, the combination of Gao, Yu, and Nickel teaches the method of claim 1 as shown above.  Yu teaches wherein the multimodal content comprises multimodal content posted by an agent on a social media network. (Yu, Page 450 Section 3.1 discloses:  “We assume that a user tweeting, retweeting or commenting on a microblog text reflects that the user is interested in that microblog.”  Here, a user who tweets content is an agent on a social media network.)

As per Claim 11, the combination of Gao, Yu, and Nickel teaches the method of claim 10 as shown above, as well as wherein the agent comprises at least one of a computer, robot, a person with a social media account, and a participant in a social media network. (Yu, Page 450 Section 3.1 discloses:  “We assume that a user tweeting, retweeting or commenting on a microblog text reflects that the user is interested in that microblog.”  Here, a user who tweets content is a participant in a tweeting social media network.)

As per Claim 12, the combination of Gao, Yu, and Nickel teaches the method of claim 1 as shown above, as well as further comprising: inferring information for feature vectors embedded in the common geometric space based on a proximity of the feature vectors to at least one other feature vector embedded in the common geometric space. (Gao, Para [0002], discloses:  “Disclosed herein are technologies for discovering semantic similarities between images and text. Such techniques can be useful for performing image search using a textual query or text search using an image as a query or for generating captions for images. Examples of the technologies disclosed herein use a deep multimodal similarity model (“DMSM”). The DMSM learns two neural networks that map images and text fragments to vector representations, respectively. A caption generator uses the vector representations to measure the similarity between the images and associated texts. The caption generator uses the similarity to output a caption that has the highest probability of being associated with a particular image based on data associated with a training set and as used in the DMSM. In some examples, the use of the DMSM for generating captions for images can increase the accuracy of automatic caption generators, while also reducing the amount of human effort required to generate or correct captions.”  Here, Gao discloses inferring information (“output a caption that has the highest probability of being associated with a particular image”) for feature vectors (“vector representations to measure the similarity between the images and associated texts”).  Gao, Para [0056] discloses “The image model 302 can map an image representation to an image vector 306 in a hidden space. The text model 304 can map a text vector 308 in the same hidden space. The similarity between the image and the sentence can be computed as the cosine distance between the image vector 306 and the text vector 308. 
The computed cosine distance can be outputted as a cosine semantic similarity 310.”  Here, Gao discloses that the inferring is based on a proximity of the feature vectors (“similarity between the image and the sentence can be computed as the cosine distance”) embedded in the common geometric space (“a hidden space… the same hidden space”)).
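For illustration only, the proximity-based inference mapped above can be sketched as follows.  This is a minimal sketch and not Gao's implementation: the function names, the three-dimensional vectors, and the captions are hypothetical, and the sketch computes cosine similarity (a higher value means closer), which is the quantity Gao outputs as the "cosine semantic similarity".

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of the same dimension."""
    if len(a) != len(b):
        raise ValueError("vectors must lie in the same space (same dimension)")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def nearest_caption(image_vec, caption_vecs):
    """Infer a caption for an image from the closest text embedding."""
    return max(
        caption_vecs,
        key=lambda caption: cosine_similarity(image_vec, caption_vecs[caption]),
    )


# Hypothetical embeddings, invented for this illustration.
image_vec = [0.9, 0.1, 0.0]
caption_vecs = {
    "a dog on a beach": [0.8, 0.2, 0.1],
    "a bowl of soup": [0.0, 0.1, 0.9],
}
best = nearest_caption(image_vec, caption_vecs)
```

In this sketch the caption is inferred for the image vector solely from its proximity to another vector embedded in the same space, which is the relationship the claim language recites.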

As per Claim 13, Claim 13 is an apparatus claim corresponding to method Claim 1.  The difference is it recites a processor and a memory. (Gao, Para [0036], discloses “Alternately, some or all of the above-referenced data and/or instructions can be stored on separate memories 214 on board one or more processing unit(s) 202 such as a memory on board a CPU-type processor, a GPU-type processor, an FPGA-type accelerator, a DSP-type accelerator, and/or another accelerator.”)  Claim 13 is rejected for the same reasons as Claim 1.

As per Claim 16, Claim 16 is an apparatus claim corresponding to method Claim 4.  The difference is it recites a processor and a memory. Claim 16 is rejected for the same reasons as Claim 4.

As per Claim 17, Claim 17 is a non-transitory computer-readable medium claim corresponding to method Claim 1.  The difference is it recites a processor and a non-transitory computer-readable medium. (Gao, Para [0097], discloses “A device comprising: a processor; and a computer-readable medium”).  Claim 17 is rejected for the same reasons as Claim 1.

As per Claim 20, Claim 20 is a non-transitory computer-readable medium claim corresponding to method Claim 4.  The difference is it recites a processor and a non-transitory computer-readable medium.  Claim 20 is rejected for the same reasons as Claim 4.

Claims 2, 3, 7, 14, 15, and 18-19 are rejected under 35 U.S.C. 103 as being unpatentable over the combination of Gao, Yu, and Nickel in view of Onoro et al. (US 2019/0205964 A1; hereinafter Onoro).
As per Claim 2, the combination of Gao, Yu, and Nickel teaches the method of claim 1 as shown above, as well as to capture relationships between at least two of the embedded, first modality feature vectors, the embedded, second modality feature vectors and the embedded combined multimodal feature vectors.  (Gao teaches capturing relationships between first modality feature vectors and second modality feature vectors in Para [0053]: “The similarity between the image and the sentence can be computed as the cosine distance between the image vector 306 and the text vector 308… The computed cosine distance can be outputted as a cosine semantic similarity 310… In some examples, the cosine semantic similarity 310 can be defined to be the relevance of the image vector 306 to the text vector 308”.  Here, Gao discloses capturing relationships (“relevance”) between first modality feature vectors and second modality feature vectors (“cosine distance between the image vector 306 and the text vector 308”)).
However, the combination of Gao, Yu, and Nickel does not teach further comprising: for each of a plurality of first modality feature vector and second modality feature vector multimodal content pairs, forming a combined multimodal feature vector from the first modality feature vector and the second modality feature vector; and semantically embedding the respective, combined multimodal feature vectors in the common geometric space.
Onoro teaches further comprising: for each of a plurality of first modality feature vector and second modality feature vector multimodal content pairs, forming a combined multimodal feature vector from the first modality feature vector and the second modality feature vector. (Onoro, Para [0031-0033], discloses: “A merge block generates a vector representation of certain products with multiple data modalities. In FIG. 2 a merge block 202 for an example “product x” is illustrated. The inputs to the merge block 202 include an image 204, text 206, and audio clip 208. Each of the three illustrated modalities are processed by a neural network that casts them into an intermediate embedding space. For example, image 204 is processed by a neural network 210. Neural network 210 can include VGG, ResNet, LSTM, CNN, and other neural networks. Similarly, text 206 is processed be a neural network 212. The remaining inputs may also be processed by neural networks. Each of the neural networks produces a vector representation of the input. Neural network 210 creates vector 214 and neural network 212 creates vector 216. Similarly, any additional input data modalities are processed through neural networks to create vector representations of the inputs. The process of creating a vector representation is referred to as embedding. Next, the vectors of the input sources are combined by an operation (OP′) which can be the concatenation, point-wise multiplication, average, difference or other operation. In the illustrated embodiment, vector 214 and vector 216 are combined by OP′ 218 into vector (Vec_X) 220. 
The operation process merges the information of various source modalities and produces an embedding of the common space.”  Here, Onoro discloses, for multimodal content pairs (“certain products with multiple data modalities… image 204, text 206”), a first modality feature vector (“For example, image 204 is processed by a neural network 210… Neural network 210 creates vector 214”) and a second modality feature vector (“Similarly, text 206 is processed be a neural network 212…neural network 212 creates vector 216”).   Onoro discloses, for these vectors, forming a combined multimodal feature vector (“Next, the vectors of the input sources are combined by an operation (OP′) which can be the concatenation, point-wise multiplication, average, difference or other operation. In the illustrated embodiment, vector 214 and vector 216 are combined by OP′ 218 into vector (Vec_X) 220”)).
Onoro further teaches semantically embedding the respective, combined multimodal feature vectors in the common geometric space to capture relationships between at least two of the embedded, first modality feature vectors, the embedded, second modality feature vectors and the embedded combined multimodal feature vectors.  (Onoro, Para [0033], discloses:  “Next, the vectors of the input sources are combined by an operation (OP′) which can be the concatenation, point-wise multiplication, average, difference or other operation. In the illustrated embodiment, vector 214 and vector 216 are combined by OP′ 218 into vector (Vec_X) 220. The operation process merges the information of various source modalities and produces an embedding of the common space.”  Here, Onoro discloses semantically embedding the respective, combined multimodal feature vectors in a space (“produces an embedding of the common space”).  One of ordinary skill in the art will appreciate that, with the exception of concatenation, the vector operations disclosed by Onoro (“point-wise multiplication, average, difference”) require that the vector operands be of the same dimension, and that they produce a vector result of that same dimension.  Thus, the combined embedding is in the common geometric space, that is, the same space in which the first and second modality feature vectors are embedded.  Thus, when Onoro is combined with Gao’s individual modality feature vectors in a common geometric space, one can capture relationships between at least two of the embedded, first modality feature vectors, the embedded, second modality feature vectors and the embedded combined multimodal feature vectors.  Because Gao has established a semantic similarity measure between individual modality vectors, a measure of semantic similarity such as the cosine distance suggested by Gao can be applied between individual modality feature vectors, or between one of the individual modality feature vectors and a multimodal feature vector as suggested by Onoro.)
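For illustration only, the dimensionality observation above can be checked with a short sketch.  This is not Onoro's implementation: the function name `combine`, the operation labels, and the sample vectors are hypothetical and chosen only to mirror the operations named in Onoro's Para [0033].

```python
def combine(u, v, op):
    """Combine two modality vectors with one of the named operations."""
    if op == "concatenate":
        # Concatenation accepts any lengths; the result has len(u) + len(v).
        return list(u) + list(v)
    if len(u) != len(v):
        raise ValueError("element-wise operations need equal dimensions")
    if op == "multiply":
        return [x * y for x, y in zip(u, v)]
    if op == "average":
        return [(x + y) / 2 for x, y in zip(u, v)]
    if op == "difference":
        return [x - y for x, y in zip(u, v)]
    raise ValueError(f"unknown operation: {op}")


# Hypothetical image and text embeddings of the same dimension.
image_vec = [1.0, 2.0, 3.0]
text_vec = [3.0, 2.0, 1.0]

# Dimension of the combined vector under each operation.
dims = {
    op: len(combine(image_vec, text_vec, op))
    for op in ("concatenate", "multiply", "average", "difference")
}
```

Under these assumptions, only concatenation changes the dimension; the point-wise multiplication, average, and difference each return a vector of the operands' dimension, so the same similarity measure can be applied to the combined vector and to either operand.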
Onoro and the combination of Gao, Yu, and Nickel are analogous art because they are in the field of endeavor of machine learning.
	It would have been obvious to one of ordinary skill in the art before the effective filing date to combine the multimodal semantic and user embedding in hyperbolic space of the combination of Gao, Yu, and Nickel, with the multimodal feature vector of Onoro. The modification would have been obvious because one of ordinary skill in the art would be motivated to more accurately classify new multimodal content (Onoro, [0036]:  “Embodiments allow the system to perform zero-shot learning based on multi-modal data. In this way, explicit relationships between products and other entities do not need to be made. The simultaneous use of multiple data modalities (images, audio… ) to represent a single instance of an entity aids in zero-shot learning.”)

As per Claim 3, the combination of Gao, Yu, Nickel, and Onoro teaches the method of claim 2. Gao teaches further comprising: semantically embedding content-related information in the common geometric space based upon a relationship between the content-related information and at least one embedded, first modality feature vector, one embedded, second modality feature vector.  (As shown in Claim 1, Gao, Para [0056], discloses:  “The image model 302 can map an image representation to an image vector 306 in a hidden space. The text model 304 can map a text vector 308 in the same hidden space. The similarity between the image and the sentence can be computed as the cosine distance between the image vector 306 and the text vector 308. The computed cosine distance can be outputted as a cosine semantic similarity 310”.  Here, Gao discloses semantically embedding content-related information in the common geometric space (“The image model 302 can map an image representation to an image vector 306 in a hidden space. The text model 304 can map a text vector 308 in the same hidden space”) based upon a relationship between the content-related information and at least one embedded, first modality feature vector, one embedded, second modality feature vector (“The similarity between the image and the sentence can be computed as the cosine distance between the image vector 306 and the text vector 308. The computed cosine distance can be outputted as a cosine semantic similarity 310”)).
However, Gao does not teach embedded combined multimodal feature vector; including at least one of user information and user grouping information.
Onoro teaches embedded combined multimodal feature vector  (As shown in Claim 2, Onoro, Para [0033], discloses:  “Next, the vectors of the input sources are combined by an operation (OP′) which can be the concatenation, point-wise multiplication, average, difference or other operation. In the illustrated embodiment, vector 214 and vector 216 are combined by OP′ 218 into vector (Vec_X) 220. The operation process merges the information of various source modalities and produces an embedding of the common space.”  When combined, Gao and Onoro together teach the embedding being based upon a relationship between the content-related information and at least one embedded, first modality feature vector, one embedded, second modality feature vector and one embedded combined multimodal feature vector.  Because the embedded feature vectors are all arranged by similarity (via distance) in the geometric space, a relationship with one embedded feature vector is also a relationship with every other embedded feature vector, as they are all related to each other by distance in the geometric space.)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine Onoro with the combination of Gao, Yu, and Nickel for at least the reasons recited in Claim 2.
However, the combination of Gao and Onoro thus far fails to teach including at least one of user information and user grouping information.
Yu teaches including user grouping information (Yu discloses embedding user information in a common geometric space with content in Page 451 Section 3.6:  “When recommending microblogs, given a microblog dj and a user uk, we compute the cosine distance between their vector representations, and use the cosine distance to determine whether dj should be recommended to uk or not.”  Here, a cosine distance is calculated between vector representations, indicating they are in a common geometric space.  One of ordinary skill in the art will appreciate that mapping users with similar interests in a common geometric space amounts to grouping the users by interest, as users that like similar content will be mapped closer together in the common geometric space.  Thus, user grouping information is embedded in the common geometric space.)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine Yu with the combination of Gao and Nickel for at least the reasons recited in Claim 1.

As per Claim 7, the combination of Gao, Yu, and Nickel teaches the method of claim 1 as shown above, as well as embedded, first modality feature vector and embedded, second modality feature vector.  Yu teaches wherein content-related information comprises at least one of agent information or agent grouping information for at least one embedded, first modality feature vector, one embedded, second modality feature vector (Recall that Gao teaches feature vectors.  Yu, Page 450 Section 3.1 discloses:  “We assume that a user tweeting, retweeting or commenting on a microblog text reflects that the user is interested in that microblog.”  Here, a user who tweets content is an agent.  Yu also teaches embeddings for users in Page 451 Section 3.6:  “When recommending microblogs, given a microblog dj and a user uk, we compute the cosine distance between their vector representations, and use the cosine distance to determine whether dj should be recommended to uk or not.”  Here, a cosine distance is calculated between vector representations, indicating they are in a common geometric space.)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine Yu with the combination of Gao and Nickel for at least the reasons recited in Claim 1.
However, the combination of Gao, Yu, and Nickel does not teach embedded combined multimodal feature vector.  
Onoro teaches embedded combined multimodal feature vector  (As shown in Claim 2, Onoro, Para [0033], discloses:  “Next, the vectors of the input sources are combined by an operation (OP′) which can be the concatenation, point-wise multiplication, average, difference or other operation. In the illustrated embodiment, vector 214 and vector 216 are combined by OP′ 218 into vector (Vec_X) 220. The operation process merges the information of various source modalities and produces an embedding of the common space.”)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine Onoro with the combination of Gao, Yu, and Nickel for at least the reasons recited in Claim 2.

As per Claim 14, Claim 14 is an apparatus claim corresponding to method Claim 2.  The difference is it recites a processor and a memory. (Gao, Para [0036], discloses “Alternately, some or all of the above-referenced data and/or instructions can be stored on separate memories 214 on board one or more processing unit(s) 202 such as a memory on board a CPU-type processor, a GPU-type processor, an FPGA-type accelerator, a DSP-type accelerator, and/or another accelerator.”)  Claim 14 is rejected for the same reasons as Claim 2.

As per Claim 15, the combination of Gao, Yu, and Nickel teaches the apparatus of claim 13. Gao teaches wherein the apparatus is further configured to:  semantically embed content-related information in the common geometric space based upon a relationship between the content-related information and at least one embedded, first modality feature vector, one embedded, second modality feature vector.  (As shown in Claim 1, Gao, Para [0056], discloses:  “The image model 302 can map an image representation to an image vector 306 in a hidden space. The text model 304 can map a text vector 308 in the same hidden space. The similarity between the image and the sentence can be computed as the cosine distance between the image vector 306 and the text vector 308. The computed cosine distance can be outputted as a cosine semantic similarity 310”.  Here, Gao discloses semantically embedding content-related information in the common geometric space (“The image model 302 can map an image representation to an image vector 306 in a hidden space. The text model 304 can map a text vector 308 in the same hidden space”) based upon a relationship between the content-related information and at least one embedded, first modality feature vector, one embedded, second modality feature vector (“The similarity between the image and the sentence can be computed as the cosine distance between the image vector 306 and the text vector 308. The computed cosine distance can be outputted as a cosine semantic similarity 310”)).
However, Gao does not teach embedded combined multimodal feature vector; including at least one of user information and user grouping information.
Onoro teaches embedded combined multimodal feature vector  (As shown in Claim 2, Onoro, Para [0033], discloses:  “Next, the vectors of the input sources are combined by an operation (OP′) which can be the concatenation, point-wise multiplication, average, difference or other operation. In the illustrated embodiment, vector 214 and vector 216 are combined by OP′ 218 into vector (Vec_X) 220. The operation process merges the information of various source modalities and produces an embedding of the common space.”  When combined, Gao and Onoro together teach the embedding being based upon a relationship between the content-related information and at least one embedded, first modality feature vector, one embedded, second modality feature vector and one embedded combined multimodal feature vector.  Because the embedded feature vectors are all arranged by similarity (via distance) in the geometric space, a relationship with one embedded feature vector is also a relationship with every other embedded feature vector, as they are all related to each other by distance in the geometric space.)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine Onoro with the combination of Gao, Yu, and Nickel for at least the reasons recited in Claim 2.
However, the combination of Gao and Onoro thus far fails to teach including at least one of user information and user grouping information.
Yu teaches including user grouping information (Yu discloses embedding user information in a common geometric space with content in Page 451 Section 3.6:  “When recommending microblogs, given a microblog dj and a user uk, we compute the cosine distance between their vector representations, and use the cosine distance to determine whether dj should be recommended to uk or not.”  Here, a cosine distance is calculated between vector representations, indicating they are in a common geometric space.  One of ordinary skill in the art will appreciate that mapping users with similar interests in a common geometric space amounts to grouping the users by interest, as users that like similar content will be mapped closer together in the common geometric space.  Thus, user grouping information is embedded in the common geometric space.)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine Yu with the combination of Gao and Nickel for at least the reasons recited in Claim 1.

As per Claim 18, Claim 18 is a non-transitory computer-readable medium claim corresponding to method Claim 2.  The difference is it recites a processor and a non-transitory computer-readable medium. (Gao, Para [0097], discloses “A device comprising: a processor; and a computer-readable medium”).  Claim 18 is rejected for the same reasons as Claim 2.

As per Claim 19, Claim 19 is a non-transitory computer-readable medium claim corresponding to apparatus Claim 15.  The difference is it recites a non-transitory computer-readable medium. (Gao, Para [0097], discloses “A device comprising: a processor; and a computer-readable medium”).  Claim 19 is rejected for the same reasons as Claim 15.

Claim 6 is rejected under 35 U.S.C. 103 as being unpatentable over the combination of Gao, Yu, and Nickel, in view of Onoro and further in view of Gao et al. (US 2018/0336490 A1; hereinafter GaoTianshi).
As per Claim 6, the combination of Gao, Yu, and Nickel teaches the method of claim 1 as shown above, as well as embedded first modality feature vector and embedded second modality feature vector, and content-related information, including at least one of user information and user grouping information (see Rejection to Claim 1).  However, the combination of Gao, Yu, and Nickel does not teach embedded combined multimodal feature vector; further comprising: appending content-related information, including at least one of user information and user grouping information, to at least one embedded, first modality feature vector, one embedded, second modality feature vector and one embedded, combined multimodal feature vector.
Onoro teaches embedded combined multimodal feature vector  (As shown in Claim 2, Onoro, Para [0033], discloses:  “Next, the vectors of the input sources are combined by an operation (OP′) which can be the concatenation, point-wise multiplication, average, difference or other operation. In the illustrated embodiment, vector 214 and vector 216 are combined by OP′ 218 into vector (Vec_X) 220. The operation process merges the information of various source modalities and produces an embedding of the common space.”)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine Onoro with the combination of Gao, Yu, and Nickel for at least the reasons recited in Claim 2.
The combination of Gao, Yu, Nickel, and Onoro thus far fails to teach further comprising: appending content-related information, including at least one of user information and user grouping information, to at least one embedded, first modality feature vector, one embedded, second modality feature vector and one embedded, combined multimodal feature vector.
GaoTianshi teaches further comprising: appending content-related information, including at least one of user information and user grouping information, to at least one embedded, first modality feature vector, [one embedded, second modality feature vector and one embedded, combined multimodal feature vector] (GaoTianshi, Para [0039], discloses:  “FIG. 3B illustrates a diagram for determining a score indicative of the likelihood of a user interacting with a content item, according to one embodiment. A user vector and a content vector are identified using the embedding representation 235. In the example of FIG. 3B, the user vector is generated by concatenating user sub-vectors {user.sub.page:emb_vec}, {user.sub.page:emb_vec}, {user.sub.app:emb_vec}, and {user.sub.word:emb_vec} from the embedding representation 235.”  Here, GaoTianshi discloses appending (“concatenating”) user information (“user sub-vectors”).  One of these user sub-vectors is {user.sub.word:emb_vec}, which is described by GaoTianshi in [0035] Lines 11-14:  “embedding sub-vector {user.sub.word:emb_vec} is identified from the word embedding representation 235D trained based on words included in available text documents”.  Thus, {user.sub.word:emb_vec} is an embedded, first modality feature vector where text is the modality, and GaoTianshi therefore discloses appending user information to an embedded, first modality feature vector.  When combining GaoTianshi’s concept of appending user information to a modality feature vector with the second and combined feature vectors of Gao and Onoro, this results in appending user information to at least one embedded, first modality feature vector, one embedded, second modality feature vector and one embedded, combined multimodal feature vector).
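For illustration only, the appending-by-concatenation mapped above can be sketched as follows.  This is not GaoTianshi's implementation: the helper name `concatenate`, the sub-vector keys, and all numeric values are hypothetical stand-ins for the user sub-vectors quoted from Para [0039].

```python
def concatenate(sub_vectors):
    """Join sub-vectors end to end into one longer vector."""
    joined = []
    for sub_vector in sub_vectors:
        joined.extend(sub_vector)
    return joined


# Hypothetical user sub-vectors, one per embedding source.
user_sub_vectors = {
    "page": [0.1, 0.2],
    "app": [0.3],
    "word": [0.4, 0.5],
}
user_vector = concatenate(user_sub_vectors[key] for key in ("page", "app", "word"))

# Appending the user information to a hypothetical embedded content vector.
content_vec = [0.7, 0.8]
content_with_user = concatenate([content_vec, user_vector])
```

In this sketch the appended vector retains the content features in its leading positions and carries the user information in the remaining positions.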
	GaoTianshi and the combination of Gao, Yu, Nickel, and Onoro are analogous art because they are both in the field of endeavor of machine learning.
	It would have been obvious before the effective filing date of the claimed invention to combine the multimodal semantic embedding in hyperbolic space of the combination of Gao, Yu, Nickel, and Onoro with the user embedding in the semantic space of GaoTianshi.  One of ordinary skill in the art would be motivated to do so to make better user recommendations when new content becomes available (GaoTianshi, [0002]:  “Some online systems, such as a social networking system, provides content items to users based on models that attempt to score or rank the content available in the online system based on a likelihood that a user will be interested in the content item or based on a likelihood that the user will interact with the content. Those models are generated based on feedback signals. For instance, a user that has previously watched several videos related to soccer might be interested in a video that other soccer fans have previously watched. Such model may not be accurate when only a limited amount of feedback is available for a specific piece of content or for a specific user. That is, when a new content item is available for presentation to users, feedback for the content item to generate a model to predict the likelihood of a user being interested in the content item may not be available until a number of users have interacted with the content item.”)
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
Yu et al. (“Modeling User Intrinsic Characteristic on Social Media for Identity Linkage”) discloses “an embedding method to model a topic as a vector in a latent space so as to interpret its deep semantics. Then a user is modeled as a vector based on his or her interactions with topics”.
THIS ACTION IS MADE FINAL.  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action. 
/L.A.S./Examiner, Art Unit 2126 
/ANN J LO/Supervisory Patent Examiner, Art Unit 2126