DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Information Disclosure Statement
The information disclosure statement (IDS) submitted on 2020-02-12 is in compliance with the provisions of 37 CFR 1.97.  Accordingly, the information disclosure statement is being considered by the examiner.
Claim Status
Claims 1-20 are pending in the application.
Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.


The following is a quotation of 35 U.S.C. 112 (pre-AIA ), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.


Claims 2 and 20 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA ), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA  35 U.S.C. 112, the applicant), regards as the invention.
The term “improvement” in claim 2 is a relative term which renders the claim indefinite. The term “improvement” is not defined by the claim, the specification does not provide a standard for ascertaining the requisite degree, and one of ordinary skill in the art would not be reasonably apprised of the scope of the invention.  Therefore, the limitation:  “such that determined related intents between multimodal content result in an improvement in recognition of influential impact of the multimodal content” is indefinite, as the meaning of an “improvement in recognition” has not been defined.  Furthermore, Examiner also notes that the “such that” language in this limitation amounts to purely functional claiming, and does not carry patentable weight.  See MPEP 2173.05(g):  “General Elec. Co., 304 U.S. at 370-71, 375…the Court further found that the phrase "of such size and contour as to prevent substantial sagging and offsetting during a normal or commercially useful life for a lamp or other device" did not adequately define the structural characteristics of the grains (e.g., the size and contour) to distinguish the claimed invention from the prior art.”
Claim 20 recites the limitation "the method of claim 19".  There is insufficient antecedent basis for this limitation in the claim.  Examiner is interpreting as “the non-transitory computer-readable medium of claim 19.”

Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.


Claims 1-20 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea, specifically a mental process, without significantly more. 
Step 1:
Claims 1-16 are directed to a method and Claims 17-20 are directed to a non-transitory computer-readable medium.  Therefore, the claims are directed to one of the four statutory categories of patent eligible subject matter.
Step 2A Prong 1:
Claims 1, 13, and 17 recite:
“for each of a plurality of content of the multimodal content, creating a respective, first modality feature vector representative of content of the multimodal content having a first modality using a first machine learning model”; creating a vector can be performed by a human with pen and paper, and is thus a mental process
“for each of a plurality of content of the multimodal content, creating a respective, second modality feature vector representative of content of the multimodal content having a second modality using a second machine learning model”; creating a vector can be performed by a human with pen and paper, and is thus a mental process
“for each of a plurality of first modality feature vector and second modality feature vector multimodal content pairs, forming a combined multimodal feature vector from the first modality feature vector and the second modality feature vector”; creating a vector can be performed by a human with pen and paper, and is thus a mental process
“for at least one first modality feature vector and second modality feature vector multimodal content pair, assigning at least one taxonomy class of intent”; assigning a taxonomy class of intent can be performed in the human mind or with pen and paper, and is thus a mental process
(Claims 1 and 17):  “semantically embedding the respective, combined multimodal feature vectors in a common geometric space, wherein embedded combined multimodal feature vectors having related intent are closer together in the common geometric space than unrelated multimodal feature vectors”; semantically embedding vectors in a geometric space can be performed by a human with pen and paper, and is thus a mental process
(Claim 13): “projecting the combined multimodal feature vector into the common geometric space”; projecting vectors in a geometric space can be performed by a human with pen and paper, and is thus a mental process
(Claim 13):  “and inferring an intent of the multimodal content represented by the combined multimodal feature vector based on the projection of the multimodal feature vector in the common geometric space and a classifier”; inferring an intent can be performed in the human mind, and is thus a mental process
Step 2A Prong 2:
This judicial exception is not integrated into a practical application because there are no additional elements, outside of the mental process, recited in the claims.  The claims are directed to a mental process.
Step 2B:
The claim(s) do not include additional elements that are sufficient to amount to significantly more than the judicial exception because, as discussed above, there are no additional elements, outside of the mental process, recited in the claims.  The claims are directed to a mental process.
Dependent claims 2-12, 14-16, and 18-20 are also directed to a mental process for the following reasons:
Claim 2 recites “projecting the combined multimodal feature vector into the common geometric space; and inferring an intent of the multimodal content mapped into the common geometric space based on a proximity of the mapped multimodal content to at least one other mapped multimodal content in the common geometric space having a predetermined intent such that determined related intents between multimodal content result in an improvement in recognition of influential impact of the multimodal content”; projecting vectors in a geometric space and inferring an intent can be performed by a human with pen and paper, and is thus a mental process
Claim 3 recites “where the multimodal content is a social media posting”; this does not integrate the judicial exception into a practical application, nor does it amount to significantly more than the judicial exception, because it amounts to merely linking the use of a judicial exception to a particular technological environment or field of use (see MPEP 2106.05(h)).
Claim 4 recites “determining if a first multimodal content is in proximity to a desired intent”; determining proximity can be performed by a human with pen and paper, and is thus a mental process.
Claim 5 recites “suggesting alterations of the first multimodal content such that the altered first multimodal content, if mapped to the common geometric space, would be closer to the desired intent”; suggesting alterations can be performed in the human mind, and is thus a mental process
Claims 6 and 16 recite “wherein intent is classified by a taxonomy comprising advocative, information, expressive, provocative, entertainment, and exhibitionist classes”; the claims are still directed to a mental process.
Claim 7 recites “determining a contextual relationship between a first modality feature represented by the first modality feature vector of the multimodal content and a second modality feature represented by the second modality feature vector of the multimodal content”; determining a contextual relationship can be performed by a human with pen and paper, and is thus a mental process.
Claim 8 recites “wherein the contextual relationship is classified by a taxonomy comprising minimal, close, and transcendent classes”; the claim is still directed to a mental process.
Claims 9, 15, and 19 recite “inferring a semiotic relationship between a first modality represented by the first modality feature vector of the multimodal content and a second modality represented by the second modality feature vector of the multimodal content”; inferring a relationship can be performed by a human with pen and paper, and is thus a mental process.
Claims 10 and 20 recite “wherein the semiotic relationship is classified by a taxonomy comprising divergent, parallel, and additive classes”; the claims are still directed to a mental process.
Claim 11 recites “wherein the common geometric space is a non-Euclidean common geometric space”; the claim is still directed to a mental process.
Claim 12 recites “semantically embedding the respective, combined multimodal feature vectors including the respective at least one taxonomy class of intent in a common geometric space”; semantically embedding vectors can be performed by a human with pen and paper, and is thus a mental process.
Claims 14 and 18 recite “determining if a first multimodal content associated with a first agent is in proximity to a desired intent; and suggesting alterations of the first multimodal content to the first agent such that the first multimodal content will be mapped into the common geometric space closer to the desired intent”; determining proximity and suggesting alterations can be performed by a human with pen and paper, and is thus a mental process.

Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –
(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.


Claims 1-4, 7, 9, 12-13, 15, 17, and 19 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Schifanella et al. (“Detecting Sarcasm in Multimodal Social Platforms”; hereinafter “Schifanella”).
As per Claim 1, Schifanella teaches a method of creating a semantic embedding space for multimodal content for determining intent of content, the method comprising: 
for each of a plurality of content of the multimodal content, creating a respective, first modality feature vector representative of content of the multimodal content having a first modality using a first machine learning model (Schifanella, Page 1137, Left Column First Bullet, discloses a plurality of multimodal content:  “We study the interplay between textual and visual content in sarcastic multimodal posts for three main social media platforms, i.e., Instagram, Tumblr and Twitter, and discuss a categorization of the role of images in sarcastic posts.”  Schifanella, Page 1142 Section 5.2 “Adapted Visual Representation (AVR)” discloses:  “We borrow a model trained on ImageNet exactly from [5], which is based on roughly one million images annotated with 1,000 object classes.” Here, Schifanella discloses a first modality (image) feature vector using a first machine learning model (output vector from a “model trained on ImageNet”)). 
for each of a plurality of content of the multimodal content, creating a respective, second modality feature vector representative of content of the multimodal content having a second modality using a second machine learning model (Schifanella, Page 1142 Section 5.2 “Textual Features”, discloses:  “The NLP network is a two layer perceptron based on unigrams only.”  Here, Schifanella discloses a second modality (text) feature vector using a second machine learning model (output vector of an “NLP network”)).
for each of a plurality of first modality feature vector and second modality feature vector multimodal content pairs, forming a combined multimodal feature vector from the first modality feature vector and the second modality feature vector (Schifanella, Page 1142 Section 5.2 “Multimodal Fusion via Deep Network Adaptation”, discloses “The concatenation layer has 4,608 neurons.”  Schifanella, Figure 3, illustrates the “Concatenation Layer” as a combined image (Visual) and text (NLP) vector:

    [Image: Schifanella, Figure 3 (media_image1.png)])

for at least one first modality feature vector and second modality feature vector multimodal content pair, assigning at least one taxonomy class of intent (Schifanella, Figure 3 above, discloses “Sarcasm Detection”, thus disclosing a taxonomy comprising 2 classes of intent:  that the user intends to be sarcastic or the user intends to be non-sarcastic.  Schifanella Page 1142 Section 5.2 discloses:  “We use the rectify function as the activation function on all the nonlinear layers except for the last layer, which uses softmax over the two classes (sarcastic vs. non-sarcastic)”.)
semantically embedding the respective, combined multimodal feature vectors in a common geometric space, wherein embedded combined multimodal feature vectors having related intent are closer together in the common geometric space than unrelated multimodal feature vectors (Schifanella, Figure 3 and Page 1142 shown above, discloses a “Concatenation Layer”, for which “The concatenation layer has 4,608 neurons”.  Therefore, the concatenation vector is embedded in 4,608-dimensional space.  Schifanella, Page 1142 as shown above discloses a binary classifier (“We use the rectify function as the activation function on all the nonlinear layers except for the last layer, which uses softmax over the two classes (sarcastic vs. non-sarcastic)”).  Here, Schifanella is embedding the vectors in a space to classify the semantic meaning (sarcasm) of the multimodal content. One of ordinary skill in the art will appreciate that a binary classifier draws a boundary that divides geometric space between two classes.  Thus, cumulatively, pieces of sarcastic content are closer to one another than to pieces of non-sarcastic content, and vice versa.)
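For illustration only, the concatenation-and-softmax arrangement discussed above can be sketched as follows. This is a minimal sketch under stated assumptions, not Schifanella's implementation: the function names, the toy 3- and 2-dimensional feature vectors, and the weight values are all hypothetical, and a single linear layer stands in for the reference's multi-layer network.

```python
import math

def concatenate(visual_vec, text_vec):
    """Form a combined multimodal vector by joining the two modality vectors."""
    return list(visual_vec) + list(text_vec)

def softmax(scores):
    """Numerically stable softmax over a list of class scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def classify(combined, weights, biases):
    """One linear layer followed by softmax over two classes (assumed stand-in)."""
    scores = [sum(w * x for w, x in zip(row, combined)) + b
              for row, b in zip(weights, biases)]
    return softmax(scores)

# Hypothetical toy inputs; the reference reports 4,608 neurons in its concatenation layer.
visual = [0.2, -0.1, 0.4]   # made-up image feature vector
text = [0.7, 0.3]           # made-up text feature vector
combined = concatenate(visual, text)

# Made-up weights: row 0 scores "sarcastic", row 1 scores "non-sarcastic".
weights = [[0.5, -0.2, 0.1, 0.9, 0.3],
           [-0.5, 0.2, -0.1, -0.9, -0.3]]
biases = [0.0, 0.0]
probs = classify(combined, weights, biases)
```

The combined vector lives in a space whose dimension is the sum of the two modality dimensions, and the two softmax outputs always sum to one, so the larger output identifies the side of the decision boundary on which the combined vector falls.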

As per Claim 2, Schifanella teaches the method of claim 1. Schifanella teaches wherein semantically embedding multimodal content into the common geometric space comprises: projecting a multimodal feature vector representing a first modality feature of the multimodal content and a second modality feature of the multimodal content into the common geometric space (Schifanella, Figure 3 and Page 1142 shown above, discloses a “Concatenation Layer”, of a Visual and NLP network (image and text modalities), for which “The concatenation layer has 4,608 neurons”.  Therefore, the concatenation vector is projected into 4,608-dimensional space.)
inferring an intent of the multimodal content mapped into the common geometric space based on a proximity of the mapped multimodal content to at least one other mapped multimodal content in the common geometric space having a predetermined intent such that determined related intents between multimodal content result in an improvement in recognition of influential impact of the multimodal content. (Schifanella Page 1142 Section 5.2 discloses:  “We use the rectify function as the activation function on all the nonlinear layers except for the last layer, which uses softmax over the two classes (sarcastic vs. non-sarcastic)”.  Here, Schifanella is embedding the multimodal vectors in a common geometric (4,608-dimensional) space and using a binary classifier to classify the predetermined intent (sarcasm or non-sarcasm) of the multimodal content. One of ordinary skill in the art will appreciate that a binary classifier draws a boundary that divides geometric space between two classes.  Thus, cumulatively, pieces of sarcastic content are closer to one another than to pieces of non-sarcastic content, and vice versa.  Thus, the intent is inferred based on a proximity of other mapped content.  Schifanella, Page 1142 Section 5.2 discloses:  “Since in practice it is hard to find the global minimum in a deep neural network, we use Nesterov Stochastic Gradient Decent with a small random batch (size = 128). We finish training after 30 epochs.”  Thus, in Schifanella’s “training”, each training example is mapped into the geometric space, and the intent (sarcasm or non-sarcasm) is determined by proximity to the training examples, which established the classification boundary.
Examiner notes that the limitation “such that determined related intents between multimodal content result in an improvement in recognition of influential impact of the multimodal content” is indefinite, and also amounts to purely functional claiming that does not carry patentable weight (see 112 Rejections above).  Nevertheless, Examiner notes that Schifanella’s classification would result in an improvement in recognition of the influential impact of the content, in that it would result in an improvement in recognizing if the poster is trying to be influential by utilizing sarcasm.)
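For illustration only, the quoted training procedure (Nesterov momentum, random mini-batches of 128, 30 epochs) can be sketched as follows. This is a sketch under stated assumptions and not Schifanella's network: a two-feature logistic model stands in for the deep network, and the learning rate, momentum value, synthetic data, and all function names are hypothetical. Only the batch size and epoch count come from the quoted passage.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def batch_gradient(w, b, batch):
    """Average logistic-loss gradient over one mini-batch of (x, y) pairs."""
    gw = [0.0] * len(w)
    gb = 0.0
    for x, y in batch:
        err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
        for i, xi in enumerate(x):
            gw[i] += err * xi
        gb += err
    n = len(batch)
    return [g / n for g in gw], gb / n

def train_nesterov(data, epochs=30, batch_size=128, lr=0.1, momentum=0.9, seed=0):
    """Mini-batch SGD with Nesterov momentum (lr and momentum are assumed values)."""
    rng = random.Random(seed)
    dim = len(data[0][0])
    w, b = [0.0] * dim, 0.0
    vw, vb = [0.0] * dim, 0.0
    for _ in range(epochs):
        shuffled = data[:]
        rng.shuffle(shuffled)
        for start in range(0, len(shuffled), batch_size):
            batch = shuffled[start:start + batch_size]
            # Nesterov step: evaluate the gradient at the look-ahead point.
            look_w = [wi + momentum * vi for wi, vi in zip(w, vw)]
            look_b = b + momentum * vb
            gw, gb = batch_gradient(look_w, look_b, batch)
            vw = [momentum * vi - lr * gi for vi, gi in zip(vw, gw)]
            vb = momentum * vb - lr * gb
            w = [wi + vi for wi, vi in zip(w, vw)]
            b += vb
    return w, b

def predict(w, b, x):
    """Return 1 if the point falls on the positive side of the learned boundary."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0.0 else 0

# Synthetic two-class data (hypothetical): class 1 near (1.5, 1.5), class 0 near (-1.5, -1.5).
data_rng = random.Random(1)
data = []
for _ in range(512):
    label = data_rng.randint(0, 1)
    center = 1.5 if label == 1 else -1.5
    data.append(([data_rng.gauss(center, 1.0), data_rng.gauss(center, 1.0)], label))

w, b = train_nesterov(data)
accuracy = sum(predict(w, b, x) == y for x, y in data) / len(data)
```

In this sketch the training examples fix the position of the boundary, and a later input is labeled according to the side of that boundary on which it falls.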

As per Claim 3, Schifanella teaches the method of claim 2. Schifanella teaches wherein the multimodal content is a social media posting (Schifanella, Page 1137 First Bullet, discloses:  “We study the interplay between textual and visual content in sarcastic multimodal posts for three main social media platforms, i.e., Instagram, Tumblr and Twitter, and discuss a categorization of the role of images in sarcastic posts.”)

As per Claim 4, Schifanella teaches the method of claim 2. Schifanella teaches determining if a first multimodal content is in proximity to a desired intent (Schifanella Page 1142 Section 5.2 discloses:  “We use the rectify function as the activation function on all the nonlinear layers except for the last layer, which uses softmax over the two classes (sarcastic vs. non-sarcastic)”.  Here, Schifanella is embedding the multimodal vectors in a common geometric (4,608-dimensional) space and using a binary classifier to classify the predetermined intent (sarcasm or non-sarcasm) of the multimodal content. One of ordinary skill in the art will appreciate that a binary classifier draws a boundary that divides geometric space between two classes.  Thus, cumulatively, pieces of sarcastic content are closer to one another than to pieces of non-sarcastic content, and vice versa.  Thus, the intent is inferred based on a proximity of other mapped content, and the multimodal content will be in proximity to either other sarcastic or non-sarcastic content, depending on which side of the classification boundary it is on.)
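For illustration only, the notion of content lying on one side or the other of a classification boundary can be sketched with a signed distance to a linear boundary. The linear boundary is an assumption made for simplicity (the network in the reference is nonlinear), and the function names, class labels, weights, and points below are hypothetical.

```python
import math

def signed_distance(x, w, b):
    """Signed Euclidean distance from point x to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm

def side(x, w, b):
    """Label a point by the side of the boundary on which it falls."""
    return "sarcastic" if signed_distance(x, w, b) >= 0 else "non-sarcastic"

# Hypothetical boundary and points in a 2-dimensional space.
w, b = [3.0, 4.0], -5.0
d_far = signed_distance([3.0, 4.0], w, b)    # (9 + 16 - 5) / 5
d_origin = signed_distance([0.0, 0.0], w, b)  # -5 / 5
```

The sign of the distance gives the class, and its magnitude indicates how far the mapped content lies from the boundary.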

As per Claim 7, Schifanella teaches the method of claim 1. Schifanella teaches further comprising: determining a contextual relationship between a first modality feature represented by the first modality feature vector of the multimodal content and a second modality feature represented by the second modality feature vector of the multimodal content. (Schifanella, Page 1137 First Bullet, discloses:  “We study the interplay between textual and visual content in sarcastic multimodal posts for three main social media platforms, i.e., Instagram, Tumblr and Twitter, and discuss a categorization of the role of images in sarcastic posts.”  Here, Schifanella discloses, between two modalities (text and image), a contextual relationship (sarcasm or non-sarcasm).  Schifanella, Figure 3 and Page 1142 shown above, discloses a “Concatenation Layer”, for which “The concatenation layer has 4,608 neurons”, and thus the concatenation layer comprises a vector for each modality.)

As per Claim 9, Schifanella teaches the method of claim 1. Schifanella teaches further comprising: inferring a semiotic relationship between a first modality represented by the first modality feature vector of the multimodal content and a second modality represented by the second modality feature vector of the multimodal content. (Schifanella, Page 1137 First Bullet, discloses:  “We study the interplay between textual and visual content in sarcastic multimodal posts for three main social media platforms, i.e., Instagram, Tumblr and Twitter, and discuss a categorization of the role of images in sarcastic posts.”  Semiotics is the study of signs.  Here, Schifanella discloses signs (“visual content”, “images”), and semiotics thereof (“study the interplay”).  Thus, Schifanella is inferring a semiotic relationship between the image and text.  Schifanella, Figure 3 and Page 1142 shown above, discloses a “Concatenation Layer”, for which “The concatenation layer has 4,608 neurons”, and thus the concatenation layer comprises a vector for each modality.)

As per Claim 12, Schifanella teaches the method of claim 1. Schifanella teaches further comprising: semantically embedding the respective, combined multimodal feature vectors including the respective at least one taxonomy class of intent in a common geometric space. (Schifanella Page 1142 Section 5.2 discloses:  “We use the rectify function as the activation function on all the nonlinear layers except for the last layer, which uses softmax over the two classes (sarcastic vs. non-sarcastic)”.  Here, Schifanella is embedding the multimodal vectors in a common geometric (4,608-dimensional) space and using a binary classifier to classify the predetermined intent (sarcasm or non-sarcasm) of the multimodal content. One of ordinary skill in the art will appreciate that a binary classifier draws a boundary that divides geometric space between two classes.  Thus, cumulatively, pieces of sarcastic content are closer to one another than to pieces of non-sarcastic content, and vice versa.  Thus, the intent is inferred based on a proximity of other mapped content, and the multimodal content will be in proximity to either other sarcastic or non-sarcastic content (the taxonomy classes of intent), depending on which side of the classification boundary it is on.)

As per Claim 13, Schifanella teaches a method of creating a semantic embedding space for multimodal content for determining intent of content, the method comprising: 
for each of a plurality of content of the multimodal content, creating a respective, first modality feature vector representative of content of the multimodal content having a first modality using a first machine learning model (Schifanella, Page 1137, Left Column First Bullet, discloses a plurality of multimodal content:  “We study the interplay between textual and visual content in sarcastic multimodal posts for three main social media platforms, i.e., Instagram, Tumblr and Twitter, and discuss a categorization of the role of images in sarcastic posts.”  Schifanella, Page 1142 Section 5.2 “Adapted Visual Representation (AVR)” discloses:  “We borrow a model trained on ImageNet exactly from [5], which is based on roughly one million images annotated with 1,000 object classes.” Here, Schifanella discloses a first modality (image) feature vector using a first machine learning model (output vector from a “model trained on ImageNet”)). 
for each of a plurality of content of the multimodal content, creating a respective, second modality feature vector representative of content of the multimodal content having a second modality using a second machine learning model (Schifanella, Page 1142 Section 5.2 “Textual Features”, discloses:  “The NLP network is a two layer perceptron based on unigrams only.”  Here, Schifanella discloses a second modality (text) feature vector using a second machine learning model (output vector of an “NLP network”)).
for each of a plurality of first modality feature vector and second modality feature vector multimodal content pairs, forming a combined multimodal feature vector from the first modality feature vector and the second modality feature vector (Schifanella, Page 1142 Section 5.2 “Multimodal Fusion via Deep Network Adaptation”, discloses “The concatenation layer has 4,608 neurons.”  Schifanella, Figure 3, illustrates the “Concatenation Layer” as a combined image (Visual) and text (NLP) vector:

    [Image: Schifanella, Figure 3 (media_image1.png)])

for at least one first modality feature vector and second modality feature vector multimodal content pair, assigning at least one taxonomy class of intent (Schifanella, Figure 3 above, discloses “Sarcasm Detection”, thus disclosing a taxonomy comprising 2 classes of intent:  that the user intends to be sarcastic or the user intends to be non-sarcastic.  Schifanella Page 1142 Section 5.2 discloses:  “We use the rectify function as the activation function on all the nonlinear layers except for the last layer, which uses softmax over the two classes (sarcastic vs. non-sarcastic)”.)
projecting a multimodal feature vector representing a first modality feature of the multimodal content and a second modality feature of the multimodal content into the common geometric space (Schifanella, Figure 3 and Page 1142 shown above, discloses a “Concatenation Layer”, of a Visual and NLP network (image and text modalities), for which “The concatenation layer has 4,608 neurons”.  Therefore, the concatenation vector is projected into 4,608-dimensional space.)
inferring an intent of the multimodal content represented by the combined multimodal feature vector based on the projection of the multimodal feature vector in the common geometric space and a classifier (Schifanella Page 1142 Section 5.2 discloses:  “We use the rectify function as the activation function on all the nonlinear layers except for the last layer, which uses softmax over the two classes (sarcastic vs. non-sarcastic)”.  Here, Schifanella is embedding the multimodal vectors in a common geometric (4,608-dimensional) space and using a binary classifier to classify the predetermined intent (sarcasm or non-sarcasm) of the multimodal content.)

As per Claim 15, Schifanella teaches the method of claim 13. Schifanella teaches further comprising: inferring a semiotic relationship between a first modality represented by the first modality feature vector of the multimodal content and a second modality represented by the second modality feature vector of the multimodal content. (Schifanella, Page 1137 First Bullet, discloses:  “We study the interplay between textual and visual content in sarcastic multimodal posts for three main social media platforms, i.e., Instagram, Tumblr and Twitter, and discuss a categorization of the role of images in sarcastic posts.”  Semiotics is the study of signs.  Here, Schifanella discloses signs (“visual content”, “images”), and semiotics thereof (“study the interplay”).  Thus, Schifanella is inferring a semiotic relationship between the image and text.  Schifanella, Figure 3 and Page 1142 shown above, discloses a “Concatenation Layer”, for which “The concatenation layer has 4,608 neurons”, and thus the concatenation layer comprises a vector for each modality.)
As per Claim 17, this is a non-transitory computer-readable medium claim corresponding to Claim 1.  The difference is that it recites a non-transitory computer-readable medium and a processor.  Schifanella, Page 1144 Bottom Left, discloses a memory and a processor:  “The challenges that prevent us from using more advanced textual features…a higher dimensionality brings difficulties for a fast neural network training due to the limitations of the GPU memory.” Claim 17 is rejected for the same reasons as Claim 1.

As per Claim 19, this is a non-transitory computer-readable medium claim corresponding to Claim 9.  The difference is that it recites a non-transitory computer-readable medium and a processor.  Claim 19 is rejected for the same reasons as Claim 9.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 5, 14, and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Schifanella in view of Arras et al. (“Explaining Recurrent Neural Network Predictions in Sentiment Analysis”; hereinafter “Arras”).
As per Claim 5, Schifanella teaches the method of Claim 4 as well as multimodal content mapped to a common geometric space (see Rejection to Claim 1).  However, Schifanella does not teach further comprising: suggesting alterations of the first multimodal content such that the altered first multimodal content, if mapped to the common geometric space, would be closer to the desired intent.
Arras teaches further comprising: suggesting alterations of the first [multimodal] content such that the altered first [multimodal] content[, if mapped to the common geometric space,] would be closer to the desired intent. (Arras, Page 4 Top Right, discloses:  “Furthermore, LRP explains well the two sentences that are mistakenly classified as “very positive” and “positive” (examples 11 and 17), by accentuating the negative relevance (blue colored) of terms speaking against the target class, i.e. the class “very negative”, such as must-see list, remember and future, whereas such understanding is not provided by the SA heatmaps. The same holds for the misclassified “very positive” sentence (example 21), where the word fails gets attributed a deep negatively signed relevance (blue colored).”  Here, Arras discloses a classifier that reveals explanations of the source of overall sentiment analysis of a sentence, and these explanations comprise suggestions as to how to change the classification of the sentence.  For example, Arras discloses above that avoiding the use of a “positive” word like “must-see list” will help a sentiment classifier map the sentence to the desired intent, which is to be negative.   Arras also shows on Page 6 Top Right, another example of making alterations in order to change the classification of a sentence:  “On initially correctly classified sentences we delete words in decreasing order of their relevance value, and on initially falsely classified sentences we delete words in increasing order of their relevance. We additionally perform a random word deletion as an uninformative variant for comparison. Our results in terms of tracking the classification accuracy over the number of word deletions per sentence are reported in Fig. 3. 
These results show that, in both considered cases, deleting words in decreasing or increasing order of their LRP relevance has the most pertinent effect, suggesting that this relevance decomposition method is the most appropriate for detecting words speaking for or against a classifier’s decision.”)
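For illustration only, the word-deletion procedure quoted above (deleting words in decreasing or increasing order of relevance) can be sketched as follows. This is a minimal sketch, not Arras's code: the function name, the example sentence, and the relevance scores are hypothetical, and the scores are supplied by hand rather than computed by LRP.

```python
def delete_by_relevance(words, relevances, n_delete, decreasing=True):
    """Remove n_delete words, highest relevance first when decreasing=True,
    lowest relevance first otherwise. Remaining words keep their order."""
    order = sorted(range(len(words)), key=lambda i: relevances[i], reverse=decreasing)
    drop = set(order[:n_delete])
    return [word for i, word in enumerate(words) if i not in drop]

# Hypothetical sentence with made-up relevance scores toward a "positive" class.
words = ["the", "film", "fails", "badly"]
relevances = [0.0, 0.1, -0.8, -0.3]

without_most_relevant = delete_by_relevance(words, relevances, 1, decreasing=True)
without_least_relevant = delete_by_relevance(words, relevances, 1, decreasing=False)
```

Under these made-up scores, deleting in increasing order removes the word that speaks most strongly against the target class, which corresponds to the kind of alteration discussed above.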
Arras and Schifanella are analogous art because they are in the field of endeavor of machine learning.
It would have been obvious to one of ordinary skill in the art before the effective filing date to combine the sentiment classifier of Schifanella, with the explainable classifier of Arras.  One of ordinary skill in the art would be motivated to do so in order to gain transparency, and better understand the reasons behind the intent classification, and therefore allow a user to more effectively get their point across (Arras, Page 1 Intro: “As these models become increasingly predictive, one also needs to make sure that they work as intended, in particular, their decisions should be made as transparent as possible”, and Arras Page 7 Conclusion:  “We applied the extended LRP version to a bi-directional LSTM model for the sentiment prediction of sentences, demonstrating that the resulting word relevances trustworthy reveal words supporting the classifier’s decision for or against a specific class, and perform better than those obtained by a gradient-based decomposition. Our technique helps understanding and verifying the correct behavior of recurrent classifiers, and can detect important patterns in text datasets.”)

As per Claim 14, Schifanella teaches the method of Claim 13 as well as multimodal content and mapped to a common geometric space (see Rejection to Claim 1).  Schifanella teaches determining if a first multimodal content associated with a first agent is in proximity to a desired intent (Schifanella Page 1142 Section 5.2 discloses:  “We use the rectify function as the activation function on all the nonlinear layers except for the last layer, which uses softmax over the two classes (sarcastic vs. non-sarcastic)”.  Here, Schifanella is embedding the multimodal vectors in a common geometric (4,608-dimensional) space and using a binary classifier to classify the predetermined intent (sarcasm or non-sarcasm) of the multimodal content. One of ordinary skill in the art will appreciate that a binary classifier draws a boundary that divides geometric space between two classes.  Thus, cumulatively, pieces of sarcastic content are closer to one another than to pieces of non-sarcastic content, and vice versa.  Thus, the intent is inferred based on a proximity of other mapped content, and the multimodal content will be in proximity to either other sarcastic or non-sarcastic content, depending on which side of the classification boundary it is on.  Schifanella, Page 1137 First Bullet, discloses:  “We study the interplay between textual and visual content in sarcastic multimodal posts for three main social media platforms, i.e., Instagram, Tumblr and Twitter, and discuss a categorization of the role of images in sarcastic posts.”  Here, Schifanella discloses that this is applied to social media posts, which are posted by a user, and thus a first agent.)
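For illustration only, the point that a two-class softmax layer divides its input space with a boundary can be sketched as follows. This is not code from Schifanella. The weights, biases, feature values, and function names are hypothetical, and the label convention (0 = non-sarcastic, 1 = sarcastic) is an assumption. In a deeper network, the boundary is linear only in the representation fed to the final layer, not necessarily in the raw fused features.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify(features, weights, biases):
    """Apply a final linear layer plus softmax; return (label, probabilities)."""
    logits = [sum(w * x for w, x in zip(row, features)) + b
              for row, b in zip(weights, biases)]
    probs = softmax(logits)
    label = max(range(len(probs)), key=probs.__getitem__)
    return label, probs

def boundary_side(features, weights, biases):
    """Signed score (w1 - w0) . x + (b1 - b0).

    Positive values fall on the class-1 side of the boundary,
    negative values on the class-0 side.
    """
    dw = [w1 - w0 for w0, w1 in zip(weights[0], weights[1])]
    return sum(d * x for d, x in zip(dw, features)) + (biases[1] - biases[0])

# Hypothetical 2-dimensional stand-in for the fused representation.
weights = [[1.0, 0.0], [0.0, 1.0]]
biases = [0.0, 0.0]
label_a, probs_a = classify([2.0, 0.5], weights, biases)
label_b, probs_b = classify([0.1, 3.0], weights, biases)
```

In this sketch the predicted label always agrees with the sign of `boundary_side`, which is the sense in which a binary classifier partitions the space into two regions.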
However, Schifanella does not teach further comprising: suggesting alterations of the first multimodal content such that the altered first multimodal content, if mapped to the common geometric space, would be closer to the desired intent.
Arras teaches further comprising: suggesting alterations of the first [multimodal] content such that the altered first [multimodal] content[, if mapped to the common geometric space,] would be closer to the desired intent. (Arras, Page 4 Top Right, discloses:  “Furthermore, LRP explains well the two sentences that are mistakenly classified as “very positive” and “positive” (examples 11 and 17), by accentuating the negative relevance (blue colored) of terms speaking against the target class, i.e. the class “very negative”, such as must-see list, remember and future, whereas such understanding is not provided by the SA heatmaps. The same holds for the misclassified “very positive” sentence (example 21), where the word fails gets attributed a deep negatively signed relevance (blue colored).”  Here, Arras discloses a classifier that reveals explanations of the source of overall sentiment analysis of a sentence, and these explanations comprise suggestions as to how to change the classification of the sentence.  For example, Arras discloses above that avoiding the use of a “positive” word like “must-see list” will help a sentiment classifier map the sentence to the desired intent, which is to be negative.   Arras also shows on Page 6 Top Right, another example of making alterations in order to change the classification of a sentence:  “On initially correctly classified sentences we delete words in decreasing order of their relevance value, and on initially falsely classified sentences we delete words in increasing order of their relevance. We additionally perform a random word deletion as an uninformative variant for comparison. Our results in terms of tracking the classification accuracy over the number of word deletions per sentence are reported in Fig. 3. 
These results show that, in both considered cases, deleting words in decreasing or increasing order of their LRP relevance has the most pertinent effect, suggesting that this relevance decomposition method is the most appropriate for detecting words speaking for or against a classifier’s decision.”)
Arras and Schifanella are analogous art because they are in the field of endeavor of machine learning.
It would have been obvious to one of ordinary skill in the art before the effective filing date to combine the sentiment classifier of Schifanella, with the explainable classifier of Arras.  One of ordinary skill in the art would be motivated to do so in order to gain transparency, and better understand the reasons behind the intent classification, and therefore allow a user to more effectively get their point across (Arras, Page 1 Intro: “As these models become increasingly predictive, one also needs to make sure that they work as intended, in particular, their decisions should be made as transparent as possible”, and Arras Page 7 Conclusion:  “We applied the extended LRP version to a bi-directional LSTM model for the sentiment prediction of sentences, demonstrating that the resulting word relevances trustworthy reveal words supporting the classifier’s decision for or against a specific class, and perform better than those obtained by a gradient-based decomposition. Our technique helps understanding and verifying the correct behavior of recurrent classifiers, and can detect important patterns in text datasets.”)
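The word-deletion procedure quoted from Arras, Page 6, can be sketched as below. This is a minimal illustration under stated assumptions, not Arras's implementation: the relevance scores are hypothetical placeholders rather than LRP outputs, and the function names are invented for this example.

```python
import random

def deletion_order(relevances, correctly_classified):
    """Return word indices in the order they would be deleted.

    Correctly classified sentence: decreasing relevance (most relevant first).
    Falsely classified sentence: increasing relevance (least relevant first).
    """
    return sorted(range(len(relevances)),
                  key=lambda i: relevances[i],
                  reverse=correctly_classified)

def random_order(n, seed=0):
    """Random deletion order, the uninformative baseline for comparison."""
    order = list(range(n))
    random.Random(seed).shuffle(order)
    return order

def delete_words(words, order, k):
    """Remove the first k words named in `order`, keeping sentence order."""
    drop = set(order[:k])
    return [w for i, w in enumerate(words) if i not in drop]

# Hypothetical sentence and per-word relevance scores.
words = ["a", "must-see", "film", "that", "fails"]
relevances = [0.0, 0.9, 0.1, 0.05, -0.8]

decreasing = deletion_order(relevances, correctly_classified=True)
increasing = deletion_order(relevances, correctly_classified=False)
after_two = delete_words(words, decreasing, 2)
```

Re-running a classifier on the shortened sentence after each deletion would give the accuracy-versus-deletions curve that the quoted passage describes; the classifier itself is outside the scope of this sketch.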

As per Claim 18, this is a non-transitory computer-readable medium claim corresponding to method claim 5. The difference is that it recites a non-transitory computer-readable medium, a processor, and associated with a first agent.  Schifanella, Page 1144 Bottom Left, discloses a memory and a processor:  “The challenges that prevent us from using more advanced textual features…a higher dimensionality brings difficulties for a fast neural network training due to the limitations of the GPU memory.” Schifanella, Page 1137 First Bullet, discloses:  “We study the interplay between textual and visual content in sarcastic multimodal posts for three main social media platforms, i.e., Instagram, Tumblr and Twitter, and discuss a categorization of the role of images in sarcastic posts.”  Here, Schifanella discloses that this is applied to social media posts, which are posted by a user, and thus a first agent.  Claim 18 is rejected for the same reasons as Claim 5.

Claims 6 and 16 are rejected under 35 U.S.C. 103 as being unpatentable over Schifanella in view of Cui et al. (“Recognize user intents in online interactions from massive social media data”; hereinafter “Cui”), Sharma et al. (“Degree based Classification of Harmful Speech using Twitter Data”; hereinafter “Sharma”), and Kofler et al. (“Uploader Intent for Online Video: Typology, Inference, and Applications”; hereinafter “Kofler”).
As per Claim 6, Schifanella teaches the method of Claim 1 as well as intent is classified (see Rejection to Claim 1).  However, Schifanella does not teach wherein intent is classified by a taxonomy comprising advocative, information, expressive, provocative, entertainment, and exhibitionist classes.
Cui teaches wherein intent is classified by a taxonomy comprising advocative, information, expressive[, provocative, entertainment, and exhibitionist] classes  (Cui, Page 12 Section III, discloses:  “We examine the taxonomy of speech acts manifested in online texts by reviewing a large number of microblogs, and identify 10 categories of user intents from online interactions.”  Cui includes in this taxonomy advocative (“Directive (D1): the user wants the listener (i.e. other users or organizations), (not) to do something, including subcategories such as advice, request and appeal.”), information (“Declarative (D3): the user announces objective information, like news posted by organizations”), and expressive (“Expressive (E1): the user expresses his/her attitudes or manners, including subcategories such as blessing, apology, comfort, appreciation and congratulation.”)).
Cui and Schifanella are analogous art because they are both in the field of endeavor of intent classification of social media.
It would have been obvious before the effective filing date of the claimed invention to combine the intent categories of Cui with the multimodal sarcasm classifier of Schifanella.  Doing so would enable one to use the fusion of multimodal properties to identify a broader range of intents.  One of ordinary skill in the art would be motivated to do so in order to improve one’s ability to make decisions based on social media posts (Cui, Page 11 Intro Para 1:  “Recognizing intents in users’ online interactive behavior from social media data can effectively identify users’ motives behind communication and provide important information to aid monitoring, analysis and decision-making for a variety of applications”).
However, the combination of Schifanella and Cui does not explicitly teach wherein intent is classified by a taxonomy comprising provocative, entertainment, and exhibitionist classes.
Sharma teaches wherein intent is classified by a taxonomy comprising provocative classes. (Sharma, Page 1 Abstract, Last sentence, discloses:  “We also propose supervised classification system for recognizing these respective harmful speech classes in the texts hence.”  Sharma, Page 3 under Class II discloses provocation under the third bullet:  “Correlates between linguistic violence and nonlinguistic/demographic intimidating and trespassing someone in an online space. Can be highly provocative when addressing an individual rather than some ideology or community/group.”)
Sharma and the combination of Schifanella and Cui are analogous art because they are both in the field of endeavor of intent classification of social media.
It would have been obvious before the effective filing date of the claimed invention to combine the provocative category of Sharma with the multimodal intent classifier of the combination of Schifanella and Cui.  One of ordinary skill in the art would be motivated to do so in order to be able to crack down on and hold accountable those who post provocative or hateful content (Sharma, Page 1 Abstract:  “Harmful speech has various forms and it has been plaguing the social media in different ways. If we need to crackdown different degrees of hate speech and abusive behavior amongst it, the classification needs to be based on complex ramifications which needs to be defined and hold accountable for, other than racist, sexist or against some particular group and community.”)
However, the combination of Schifanella, Cui, and Sharma does not explicitly teach wherein intent is classified by a taxonomy comprising entertainment and exhibitionist classes.
Kofler teaches wherein intent is classified by a taxonomy comprising entertainment and exhibitionist classes. (Kofler, Page 1200 Abstract, begins:  “We investigate automatic inference of uploader intent for online video, i.e., prediction of the reason for which a user has uploaded a particular video to the Internet”.  Kofler, Page 1204 Table 1, discloses entertainment (“Entertaining (UIEN):  purely entertain its viewers”) and exhibitionist (“Sharing (UISH):  share a (real-life) experience or event to viewers of the video”)).
Kofler and the combination of Schifanella, Cui, and Sharma are analogous art because they are both in the field of endeavor of intent classification of social media.
It would have been obvious before the effective filing date of the claimed invention to combine the entertainment and exhibitionist categories of Kofler with the multimodal intent classifier of the combination of Schifanella, Cui, and Sharma.  One of ordinary skill in the art would be motivated to do so in order to improve searching for relevant posts and learn how to create more effective promotional social media posts (Kofler, Page 1200 Intro Para 2:  “Our investigation of uploader intent for online video is motivated by the wide variety of application areas that ultimately stand to benefit from information on the reasons which prompt users to upload videos. These areas cover a diverse spectrum including video production and video search. For example, in the area of video production, knowledge about uploader intent could improve video authoring tools, guiding the user in producing videos with a fitting ‘look and feel’, for instance, by automatically recommending editing templates or Instagram-like filters. Another example is that uploader intent could aid the automatic matching of advertisements to videos, by providing information concerning the intended target audience of a video”)

As per Claim 16, Schifanella teaches the method of Claim 13 as well as intent is classified (see Rejection to Claim 13).  However, Schifanella does not teach wherein intent is classified by a taxonomy comprising advocative, information, expressive, provocative, entertainment, and exhibitionist classes.
Cui teaches wherein intent is classified by a taxonomy comprising advocative, information, expressive[, provocative, entertainment, and exhibitionist] classes  (Cui, Page 12 Section III, discloses:  “We examine the taxonomy of speech acts manifested in online texts by reviewing a large number of microblogs, and identify 10 categories of user intents from online interactions.”  Cui includes in this taxonomy advocative (“Directive (D1): the user wants the listener (i.e. other users or organizations), (not) to do something, including subcategories such as advice, request and appeal.”), information (“Declarative (D3): the user announces objective information, like news posted by organizations”), and expressive (“Expressive (E1): the user expresses his/her attitudes or manners, including subcategories such as blessing, apology, comfort, appreciation and congratulation.”)).
Cui and Schifanella are analogous art because they are both in the field of endeavor of intent classification of social media.
It would have been obvious before the effective filing date of the claimed invention to combine the intent categories of Cui with the multimodal sarcasm classifier of Schifanella.  Doing so would enable one to use the fusion of multimodal properties to identify a broader range of intents.  One of ordinary skill in the art would be motivated to do so in order to improve one’s ability to make decisions based on social media posts (Cui, Page 11 Intro Para 1:  “Recognizing intents in users’ online interactive behavior from social media data can effectively identify users’ motives behind communication and provide important information to aid monitoring, analysis and decision-making for a variety of applications”).
However, the combination of Schifanella and Cui does not explicitly teach wherein intent is classified by a taxonomy comprising provocative, entertainment, and exhibitionist classes.
Sharma teaches wherein intent is classified by a taxonomy comprising provocative classes. (Sharma, Page 1 Abstract, Last sentence, discloses:  “We also propose supervised classification system for recognizing these respective harmful speech classes in the texts hence.”  Sharma, Page 3 under Class II discloses provocation under the third bullet:  “Correlates between linguistic violence and nonlinguistic/demographic intimidating and trespassing someone in an online space. Can be highly provocative when addressing an individual rather than some ideology or community/group.”)
Sharma and the combination of Schifanella and Cui are analogous art because they are both in the field of endeavor of intent classification of social media.
It would have been obvious before the effective filing date of the claimed invention to combine the provocative category of Sharma with the multimodal intent classifier of the combination of Schifanella and Cui.  One of ordinary skill in the art would be motivated to do so in order to be able to crack down on and hold accountable those who post provocative or hateful content (Sharma, Page 1 Abstract:  “Harmful speech has various forms and it has been plaguing the social media in different ways. If we need to crackdown different degrees of hate speech and abusive behavior amongst it, the classification needs to be based on complex ramifications which needs to be defined and hold accountable for, other than racist, sexist or against some particular group and community.”)
However, the combination of Schifanella, Cui, and Sharma does not explicitly teach wherein intent is classified by a taxonomy comprising entertainment and exhibitionist classes.
Kofler teaches wherein intent is classified by a taxonomy comprising entertainment and exhibitionist classes. (Kofler, Page 1200 Abstract, begins:  “We investigate automatic inference of uploader intent for online video, i.e., prediction of the reason for which a user has uploaded a particular video to the Internet”.  Kofler, Page 1204 Table 1, discloses entertainment (“Entertaining (UIEN):  purely entertain its viewers”) and exhibitionist (“Sharing (UISH):  share a (real-life) experience or event to viewers of the video”)).
Kofler and the combination of Schifanella, Cui, and Sharma are analogous art because they are both in the field of endeavor of intent classification of social media.
It would have been obvious before the effective filing date of the claimed invention to combine the entertainment and exhibitionist categories of Kofler with the multimodal intent classifier of the combination of Schifanella, Cui, and Sharma.  One of ordinary skill in the art would be motivated to do so in order to improve searching for relevant posts and learn how to create more effective promotional social media posts (Kofler, Page 1200 Intro Para 2:  “Our investigation of uploader intent for online video is motivated by the wide variety of application areas that ultimately stand to benefit from information on the reasons which prompt users to upload videos. These areas cover a diverse spectrum including video production and video search. For example, in the area of video production, knowledge about uploader intent could improve video authoring tools, guiding the user in producing videos with a fitting ‘look and feel’, for instance, by automatically recommending editing templates or Instagram-like filters. Another example is that uploader intent could aid the automatic matching of advertisements to videos, by providing information concerning the intended target audience of a video”)

Claims 8, 10, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Schifanella in view of Jaiswal et al. (“Multimedia Semantic Integrity Assessment Using Joint Embedding Of Images And Text”; hereinafter “Jaiswal”) and Zhang et al. (“Equal But Not The Same: Understanding the Implicit Relationship Between Persuasive Images and Text”; hereinafter “Zhang”)
As per Claim 8, Schifanella teaches the method of Claim 7 as well as contextual relationship (see Rejection to Claim 7).  However, Schifanella does not teach wherein the contextual relationship is classified by a taxonomy comprising minimal, close, and transcendent classes.
Jaiswal teaches wherein the contextual relationship is classified by a taxonomy comprising minimal, close[, and transcendent] classes.  (Recall above Schifanella teaches a contextual relationship.  Jaiswal, Page 1 Abstract, discloses:  “We construct a joint embedding of images and captions with deep multimodal representation learning on the reference dataset in a framework that also provides image-caption consistency scores (ICCSs). The integrity of query media packages is assessed as the inlierness of the query ICCSs with respect to the reference dataset.”  Here, Jaiswal discloses the “consistency” of the image with the caption.  This is used to classify as an inlier or outlier as shown in Jaiswal Page 6 Section 5:  “The inlier/outlier decisions of the ODMs in our system serve as the prediction of semantic information manipulation in query packages.”  Thus, Jaiswal discloses a classifier that determines a taxonomy of a minimal relationship (inconsistency of image and text, or “outlier”) or a close relationship (consistency of image and text, or “inlier”))
Jaiswal and Schifanella are analogous art because they are both in the field of endeavor of classification of social media.
It would have been obvious before the effective filing date of the claimed invention to combine the classification of the image-text relationship of Jaiswal with the multimodal intent classifier of Schifanella.  One of ordinary skill in the art would be motivated to do so in order to prevent an inaccurate interpretation of a multimodal post if one of the modalities has been made to be intentionally misleading (Jaiswal Page 1-2 Intro:  “Independent existence of each modality makes multimedia data packages vulnerable to tampering, where the data in a subset of modalities of a multimedia package can be modified to misrepresent or repurpose the multimedia package. Such tampering, with possible malicious intent, can be misleading, if not dangerous. The location information, for example, in the aforementioned caption could be modified without an easy way to detect such tampering.”)
However, the combination of Schifanella and Jaiswal does not explicitly teach wherein the contextual relationship is classified by a taxonomy comprising a transcendent class.
Zhang teaches wherein the contextual relationship is classified by a taxonomy comprising a transcendent class (Recall above Schifanella discloses a contextual relationship. Zhang, Page 6 Section 4.1 Subsection 3, discloses:  “If the viewer can deduce the correct message from either channel alone, then using either text or image would be sufficient for advertising, and the relationship might be parallel, but if the meaning is unclear with one channel disabled, then both channels are indispensable, and the relationship might be non-parallel.”  Here, Zhang discloses a taxonomy comprising a close relationship (either image or text alone conveys the same message, or “parallel”) and a transcendent relationship (the image and text are both necessary to convey more meaning than either could on its own, or “non-parallel”)).
Zhang and the combination of Schifanella and Jaiswal are analogous art because they are both in the field of endeavor of classification of social media.
It would have been obvious before the effective filing date of the claimed invention to combine the classification of the image-text relationship of Zhang with the multimodal intent classifier of the combination of Schifanella and Jaiswal.  One of ordinary skill in the art would be motivated to do so in order to improve the accuracy of an automatic media interpreter by leveraging the recognition of implicit relationships (Zhang Page 1 Intro:  “While recent work has made advances in making literal connections between image and text (e.g., image captioning, where the text describes what is seen in the image [1, 6, 9, 16, 18, 35, 38, 40]), recognizing implicit relationships between image and text (e.g., metaphorical, symbolic, explanatory, ironic, etc.) remains a research challenge”).

As per Claim 10, Schifanella teaches the method of Claim 9 as well as semiotic relationship (see Rejection to Claim 9).  However, Schifanella does not teach wherein the semiotic relationship is classified by a taxonomy comprising divergent, parallel, and additive classes.
Jaiswal teaches wherein the semiotic relationship is classified by a taxonomy comprising divergent, parallel[, and additive] classes.  (Recall above Schifanella teaches a semiotic relationship.  Jaiswal, Page 1 Abstract, discloses:  “We construct a joint embedding of images and captions with deep multimodal representation learning on the reference dataset in a framework that also provides image-caption consistency scores (ICCSs). The integrity of query media packages is assessed as the inlierness of the query ICCSs with respect to the reference dataset.”  Here, Jaiswal discloses the “consistency” of the image with the caption.  This is used to classify as an inlier or outlier as shown in Jaiswal Page 6 Section 5:  “The inlier/outlier decisions of the ODMs in our system serve as the prediction of semantic information manipulation in query packages.”  Thus, Jaiswal discloses a classifier that determines a taxonomy of a divergent relationship (inconsistency of image and text, or “outlier”) or a parallel relationship (consistency of image and text, or “inlier”))
Jaiswal and Schifanella are analogous art because they are both in the field of endeavor of classification of social media.
It would have been obvious before the effective filing date of the claimed invention to combine the classification of the image-text relationship of Jaiswal with the multimodal intent classifier of Schifanella.  One of ordinary skill in the art would be motivated to do so in order to prevent an inaccurate interpretation of a multimodal post if one of the modalities has been made to be intentionally misleading (Jaiswal Page 1-2 Intro:  “Independent existence of each modality makes multimedia data packages vulnerable to tampering, where the data in a subset of modalities of a multimedia package can be modified to misrepresent or repurpose the multimedia package. Such tampering, with possible malicious intent, can be misleading, if not dangerous. The location information, for example, in the aforementioned caption could be modified without an easy way to detect such tampering.”)
However, the combination of Schifanella and Jaiswal does not explicitly teach wherein the semiotic relationship is classified by a taxonomy comprising an additive class.
Zhang teaches wherein the semiotic relationship is classified by a taxonomy comprising an additive class (Recall above Schifanella teaches a semiotic relationship. Zhang, Page 6 Section 4.1 Subsection 3, discloses:  “If the viewer can deduce the correct message from either channel alone, then using either text or image would be sufficient for advertising, and the relationship might be parallel, but if the meaning is unclear with one channel disabled, then both channels are indispensable, and the relationship might be non-parallel.”  Here, Zhang discloses a taxonomy comprising a parallel relationship (either image or text alone conveys the same message, or “parallel”) and an additive relationship (the image and text are both necessary to convey more meaning than either could on its own, or “non-parallel”)).
Zhang and the combination of Schifanella and Jaiswal are analogous art because they are both in the field of endeavor of classification of social media.
It would have been obvious before the effective filing date of the claimed invention to combine the classification of the image-text relationship of Zhang with the multimodal intent classifier of the combination of Schifanella and Jaiswal.  One of ordinary skill in the art would be motivated to do so in order to improve the accuracy of an automatic media interpreter by leveraging the recognition of implicit relationships (Zhang Page 1 Intro:  “While recent work has made advances in making literal connections between image and text (e.g., image captioning, where the text describes what is seen in the image [1, 6, 9, 16, 18, 35, 38, 40]), recognizing implicit relationships between image and text (e.g., metaphorical, symbolic, explanatory, ironic, etc.) remains a research challenge”)

As per Claim 20, this is a non-transitory computer-readable medium claim corresponding to Claim 10.  The difference is that it recites a non-transitory computer-readable medium and a processor.  Schifanella, Page 1144 Bottom Left, discloses a memory and a processor:  “The challenges that prevent us from using more advanced textual features…a higher dimensionality brings difficulties for a fast neural network training due to the limitations of the GPU memory.” Claim 20 is rejected for the same reasons as Claim 10.

Claim 11 is rejected under 35 U.S.C. 103 as being unpatentable over Schifanella in view of Nickel et al. (“Poincaré Embeddings for Learning Hierarchical Representations”; hereinafter “Nickel”)
As per Claim 11, Schifanella teaches the method of Claim 1.  However, Schifanella does not teach wherein the common geometric space is a non-Euclidean common geometric space.
Nickel teaches wherein the common geometric space is a non-Euclidean common geometric space (Nickel, Page 2 Para 2, discloses:  “To exploit this structural property for learning more efficient representations, we propose to compute embeddings not in Euclidean but in hyperbolic space, i.e., space with constant negative curvature.”).
Nickel and Schifanella are analogous art because they are in the field of endeavor of machine learning.
It would have been obvious to one of ordinary skill in the art before the effective filing date to combine the multimodal embedding of Schifanella, with the embedding in hyperbolic space of Nickel. The modification would have been obvious because one of ordinary skill in the art would be motivated to capture hierarchical properties and outperform Euclidean embeddings (Nickel, Abstract: “Representation learning has become an invaluable approach for learning from symbolic data such as text and graphs. However, while complex symbolic datasets often exhibit a latent hierarchical structure, state-of-the-art methods typically learn embeddings in Euclidean vector spaces, which do not account for this property. For this purpose, we introduce a new approach for learning hierarchical representations of symbolic data by embedding them into hyperbolic space — or more precisely into an n-dimensional Poincaré ball. Due to the underlying hyperbolic geometry, this allows us to learn parsimonious representations of symbolic data by simultaneously capturing hierarchy and similarity. We introduce an efficient algorithm to learn the embeddings based on Riemannian optimization and show experimentally that Poincaré embeddings outperform Euclidean embeddings significantly on data with latent hierarchies, both in terms of representation capacity and in terms of generalization ability.”)
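For reference, distance in the Poincaré ball model of hyperbolic space is d(u, v) = arcosh(1 + 2‖u − v‖² / ((1 − ‖u‖²)(1 − ‖v‖²))) for points u and v inside the unit ball. The sketch below computes that distance. It is an illustration only: the function name and sample points are hypothetical, and the embedding training procedure described by Nickel is not shown.

```python
import math

def poincare_distance(u, v):
    """Hyperbolic distance between two points strictly inside the unit ball."""
    sq_norm_u = sum(c * c for c in u)
    sq_norm_v = sum(c * c for c in v)
    if sq_norm_u >= 1.0 or sq_norm_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_norm_u) * (1.0 - sq_norm_v)))

# Hypothetical points: the same Euclidean gap of 0.4 measured near the
# origin and near the edge of the ball.
near_origin = poincare_distance([0.0, 0.0], [0.4, 0.0])
near_edge = poincare_distance([0.5, 0.0], [0.9, 0.0])
```

The same Euclidean gap yields a larger hyperbolic distance near the edge of the ball than near the origin, which illustrates how this geometry differs from a Euclidean embedding space.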

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
Dragomir et al. (“Combining visual and textual attention in neural models for enhanced visual question answering”) discloses a fusion vector of text and image in Page 12 Figure 5:

[Image: media_image2.png]

Vielzeuf et al. (“CentralNet: a Multilayer Approach for Multimodal Fusion”) discloses a fusion vector of image and text, in order to determine the genre of a film, and is thus a multimodal multi-classifier as shown in Page 3 Figure 1:

[Image: media_image3.png]

Xu ("Analyzing multimodal public sentiment based on hierarchical semantic attentional network") discloses a fusion vector for text and image, for the purposes of sentiment classification (Positive, Neutral, Negative), in Page 153 Figure 2:

[Image: media_image4.png]

Anderson (US 2019/0042894 A1), Para [0045], discloses a multimodal fusion vector:  “Further, image content and textual description-based detection fusion may be used, such as relying on vector concatenation for both early and late fusion schemes to obtain a multimodal representation”
Li et al. (US 2019/0236450 A1) discloses in [0073]: “As mentioned, neural networks can perform multimodal deep learning in multiple domains such as visual, audio, and text. The domain-specific neural networks are used on different modalities to generate their representations, and the individual representations can be merged or aggregated. A prediction can be made from the aggregated representations, and in some cases an additional neural network is implemented to capture interactions between modalities and learn complex function mapping between input and output. In some example embodiments, addition (or average) and concatenation are two approaches for aggregation.”
Hori et al. (US 2018/0189572 A1) discloses in [0007]:  “The present disclosure is based on a multimodal fusion system that generates the content vectors from the input data that include multiple modalities. In some cases, the multimodal fusion system receives input signals including image (video) signals, motion signals and audio signals and generates a description narrating events relevant to the input signals.”
Lu et al. (US 2021/0256213 A1) discloses in [0025]:  “The attention network can generate a visual context vector from the image and the caption which can be integrated into a recurrent neural network”.
Habibian et al. (US 2017/0083623 A1) discloses in [0100]:  “In block, 910, the process computes a semantic embedding based on the feature projection and the textual projection.”
Any inquiry concerning this communication or earlier communications from the examiner should be directed to LEONARD A SIEGER whose telephone number is (571)272-9710. The examiner can normally be reached M-F 8:00 am - 5:00 pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Ann Lo, can be reached at (571) 272-9767. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/L.A.S./Examiner, Art Unit 2126      
/ANN J LO/Supervisory Patent Examiner, Art Unit 2126