DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Information Disclosure Statement
The information disclosure statement (IDS) submitted on 2020-02-12 is in compliance with the provisions of 37 CFR 1.97.  Accordingly, the information disclosure statement is being considered by the examiner.
Response to Amendment
The amendment filed 2022-10-13 has been entered.  The status of claims is as follows:
Claims 1-20 are pending in the application.
Claims 1, 2, 5, 11-14, 16-17, and 20 are amended.

Response to Arguments
Applicant’s arguments with respect to rejections under 35 USC 112(b) on Remarks Page 8 have been fully considered and are persuasive.  The rejections are withdrawn in light of the amendments.
Applicant’s arguments with respect to rejections under 35 USC 101 for Claims 1-12 and 17-20 on Remarks Pages 9-12 have been fully considered and are persuasive.  The rejections are withdrawn in light of the amendments, which now positively recite a method of training, which is understood to not be a mental process per PEG Example 39 and MPEP 2106.04(a)(1)(vii).
Applicant’s arguments with respect to rejections under 35 USC 101 for Claims 13-16  on Remarks Pages 12-15 have been fully considered but are not persuasive.  The amended matter recites “projecting the combined multimodal feature vector into the trained embedding space, the embedding space having been trained by…”  Thus, what is claimed is “projecting” into a space that has already been trained.  Unlike in Claims 1-12 and 17-20, the method of training is not positively recited.  Examiner recommends positively reciting the training process as in Claims 1 and 17, and then following that up with the “projecting…” limitation.  This will overcome the rejection under 35 USC 101.
Applicant argues on Remarks Pages 12-14 that the claimed invention cannot practically be performed in the human mind.  The bottom of Page 13 to the top of Page 14 recites:  “In particular, the claimed steps of creating modality feature vectors and forming a combined vector cannot be practically performed in the human mind.”  In the next paragraph, Applicant states:  “The Applicant submits that it is absolutely impossible for a human in his or her mind or using a pen and paper to project a vector into an embedding space and even further to infer an intent of content based on a proximity of the projected, combined multimodal feature vector to at least one other combined multimodal feature vector embedded in the embedding space having at least one assigned taxonomy class of intent.”  Examiner respectfully disagrees and points out that creating a vector based on an already trained model may possibly be performed by a human with pen and paper, as there is no indication in the claims that the level of complexity of the trained model excludes a sufficiently simple model to be able to do so.
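For illustration only, the following minimal Python sketch shows the kind of sufficiently simple model contemplated above.  All values are hypothetical assumptions: the 2x4 projection matrix W, the two embedded reference vectors, and the input vectors are invented for this example, and the class labels are borrowed from the taxonomy recited in Claim 16.  None of these values are taken from the claims or the Specification.  Each step is small integer arithmetic that can be followed by hand.

```python
# Hypothetical toy values; nothing here is taken from the claims or the Specification.

def concat(first, second):
    """Form a combined multimodal feature vector by concatenation."""
    return list(first) + list(second)

def project(matrix, vector):
    """Project a vector with a fixed matrix standing in for an already trained embedding."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def squared_distance(a, b):
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def infer_intent(point, labeled_points):
    """Return the class label of the embedded vector nearest to the given point."""
    nearest = min(labeled_points, key=lambda item: squared_distance(point, item[0]))
    return nearest[1]

# Assumed 2x4 projection matrix (hypothetical stand-in for a trained embedding).
W = [[1, 0, 1, 0],
     [0, 1, 0, 1]]

# Assumed embedded vectors, each with an assigned taxonomy class of intent.
EMBEDDED = [([4, 0], "advocative"), ([0, 4], "expressive")]

image_vec = [2, 0]   # hypothetical first modality feature vector
text_vec = [1, 1]    # hypothetical second modality feature vector

combined = concat(image_vec, text_vec)      # [2, 0, 1, 1]
projected = project(W, combined)            # [3, 1]
intent = infer_intent(projected, EMBEDDED)  # nearest reference vector is [4, 0]
```

In this sketch the projected vector [3, 1] lies at squared distance 2 from [4, 0] and squared distance 18 from [0, 4], so the inferred class is the one assigned to [4, 0].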
Applicant argues on Remarks Pages 14-15 that “In addition, the Courts have held that an element or combination of elements can qualify as significantly more if it results in improvements in another technology or technical field, improves the function of a computer itself, or adds a specific limitation not considered well-understood, routine, and/or conventional. The Applicant respectfully submits that the Applicant's claims at least add a specific limitation not considered well- understood, routine, and/or conventional…The Applicant asserts that these elements of the Applicant's independent claim 13 are not known, conventional, and/or routine and, as such, that the Applicant's claims recite significantly more than an abstract idea and recite patent eligible subject matter. The Applicant also asserts that the non-abstract elements of at least the Applicant's claims 13-16 are not general-purpose computers with a storage media or well-known, conventional and/or routine devices. As such, the Applicant submits that the Applicant's claims are tied to specific purpose novel machines and, as such, that the Applicant's claims are not directed to abstract ideas. In addition, the Applicant submits that the question of whether a claim is patentable under 35 U.S.C. §101 is not whether a generic computer was implemented to perform the novel features of the Applicant's claims but instead relies on whether the Applicant is claiming ordinary computer functionality. The Applicant is not claiming ordinary computer functionality and, moreover, the Applicant's claims 13-16 claim specific machines that rise far above the implementation of generic computers.”  Examiner respectfully disagrees, and points out that there is no language in the claims nor in the Specification that indicates that the claimed method is performed by anything other than a general purpose computer executing a program comprising instructions to perform the claimed steps.
Applicant’s arguments with respect to rejections under 35 USC 102 and 103 on Remarks Pages 15-26 have been fully considered and are persuasive.  As per the newly amended matter, Schifanella does not “train” the “semantic embedding space”, because Schifanella merely concatenates the two modality vectors into the joint semantic embedding space.  As such, a new piece of art has been combined with Schifanella, as necessitated by the newly amended language, thus rendering Applicant’s arguments moot.

Claim Objections
Claim 13 is objected to because of the following informalities:  it recites the limitation “projecting the combined multimodal feature vector into the the trained embedding space.”  Appropriate correction to remove one of the “the” words is required.

Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.


The following is a quotation of 35 U.S.C. 112 (pre-AIA ), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.


Claim 18 is rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA ), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA  35 U.S.C. 112, the applicant), regards as the invention.
Claim 18 recites the limitation "the common geometric space".  There is insufficient antecedent basis for this limitation in the claim.  Examiner is interpreting the limitation as “the embedding space”.

Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.


Claims 13-16 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea, specifically a mental process, without significantly more. 
Step 1:
Claims 13-16 are directed to a method.  Therefore, the claims are directed to one of the four statutory categories of patent eligible subject matter.
Step 2A Prong 1:
Claim 13 recites:
for each of a plurality of content of the multimodal content having the first modality, creating a respective, first modality feature vector representative of content of the multimodal content having the first modality; creating a vector can be performed by a human with pen and paper, and is thus a mental process
for each of a plurality of content of the multimodal content having the second modality, creating a respective, second modality feature vector representative of content of the multimodal content having the second modality; creating a vector can be performed by a human with pen and paper, and is thus a mental process
for each of a plurality of first modality feature vector and second modality feature vector multimodal content pairs, forming a combined multimodal feature vector from the first modality feature vector and the second modality feature vector; creating a vector can be performed by a human with pen and paper, and is thus a mental process
projecting the combined multimodal feature vector into the trained embedding space, the embedding space having been trained by jointly embedding a plurality of combined multimodal feature vectors of multimodal content having at least a first modality and a second modality with respective taxonomy classes of intent having been assigned for each of the combined multimodal feature vectors, such that embedded combined multimodal feature vectors having related intents are closer together in the embedding space than embedded combined multimodal feature vectors having unrelated intents; projecting a vector into an embedding space can be performed by a human with pen and paper; Examiner notes that the details of how the space was trained are not positively recited, and merely describe the nature of the embedding space
and inferring an intent of the multimodal content represented by the combined multimodal feature vector based on a proximity of the projected, combined multimodal feature vector to at least one other combined multimodal feature vector embedded in the embedding space having at least one assigned taxonomy class of intent; inferring an intent and calculating a proximity between vectors can be performed in the human mind with pen and paper, and is thus a mental process
Step 2A Prong 2:
This judicial exception is not integrated into a practical application because there are no additional elements, outside of the mental process, recited in the claims.  The claims are directed to a mental process.
Step 2B:
The claim(s) do not include additional elements that are sufficient to amount to significantly more than the judicial exception because, as discussed above, there are no additional elements, outside of the mental process, recited in the claims.  The claims are directed to a mental process.
Dependent claims 14-16 are also directed to a mental process for the following reasons:
Claim 14 recites “determining if a first multimodal content associated with a first agent is in proximity to a desired intent; and suggesting alterations of the first multimodal content to the first agent such that the first multimodal content will be mapped into the common geometric space closer to the desired intent”; determining proximity and suggesting alterations can be performed by a human with pen and paper, and is thus a mental process.
Claim 15 recites “inferring a semiotic relationship between a first modality represented by the first modality feature vector of the multimodal content and a second modality represented by the second modality feature vector of the multimodal content”; inferring a relationship can be performed by a human with pen and paper, and is thus a mental process.
Claim 16 recites “wherein intent is classified by a taxonomy comprising advocative, information, expressive, provocative, entertainment, and exhibitionist classes”; the claims are still directed to a mental process.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-4, 7, 9, 12-13, 15, 17, and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Schifanella et al. (“Detecting Sarcasm in Multimodal Social Platforms”; hereinafter “Schifanella”) in view of Vukotic et al. (“Multimodal and Crossmodal Representation Learning from Textual and Visual Features with Bidirectional Deep Neural Networks for Video Hyperlinking”; hereinafter “Vukotic”).
As per Claim 1, Schifanella teaches a method of creating a semantic embedding space for multimodal content for determining intent of content, the method comprising: 
for each of a plurality of content of the multimodal content, creating a respective, first modality feature vector representative of content of the multimodal content having a first modality using a first machine learning model (Schifanella, Page 1137, Left Column First Bullet, discloses a plurality of multimodal content:  “We study the interplay between textual and visual content in sarcastic multimodal posts for three main social media platforms, i.e., Instagram, Tumblr and Twitter, and discuss a categorization of the role of images in sarcastic posts.”  Schifanella, Page 1142 Section 5.2 “Adapted Visual Representation (AVR)” discloses:  “We borrow a model trained on ImageNet exactly from [5], which is based on roughly one million images annotated with 1,000 object classes.” Here, Schifanella discloses a first modality (image) feature vector using a first machine learning model (output vector from a “model trained on ImageNet”)). 
for each of a plurality of content of the multimodal content, creating a respective, second modality feature vector representative of content of the multimodal content having a second modality using a second machine learning model (Schifanella, Page 1142 Section 5.2 “Textual Features”, discloses:  “The NLP network is a two two layer perceptron based on unigrams only.”  Here, Schifanella discloses a second modality (text) feature vector using a second machine learning model (output vector of an “NLP network”)).
creating a first training set by, for each of a plurality of first modality feature vector and second modality feature vector multimodal content pairs, forming a combined multimodal feature vector from the first modality feature vector and the second modality feature vector (Schifanella, Page 1142 Section 5.2 “Multimodal Fusion via Deep Network Adaptation”, discloses “The concatenation layer has 4,608 neurons.”  Schifanella, Figure 3, illustrates the “Concatenation Layer” as a combined image (Visual) and text (NLP) vector:

[Figure: Schifanella, Figure 3 (media_image1.png)]

Schifanella uses the concatenated vectors to train the sarcasm detection model, and thus they are a first training set, as shown in Schifanella Page 1142 end of Section 5.2:  “We use the rectify function as the activation function on all the nonlinear layers except for the last layer, which uses softmax over the two classes (sarcastic vs. non-sarcastic). Since in practice it is hard to find the global minimum in a deep neural network, we use Nesterov Stochastic Gradient Decent with a small random batch (size = 128). We finish training after 30 epochs.”)
creating a second training set by, for at least one first modality feature vector and second modality feature vector multimodal content pair, assigning at least one respective taxonomy class of intent (Schifanella, Figure 3 above, discloses “Sarcasm Detection”, thus disclosing a taxonomy comprising 2 classes of intent:  that the user intends to be sarcastic or the user intends to be non-sarcastic.  Schifanella Page 1142 Section 5.2 discloses:  “We use the rectify function as the activation function on all the nonlinear layers except for the last layer, which uses softmax over the two classes (sarcastic vs. non-sarcastic).”  Schifanella uses the taxonomy classes to train the sarcasm detection model (along with the concatenated vectors), and thus they are a second training set, as shown in Schifanella Page 1142 end of Section 5.2:  “We finish training after 30 epochs.”)
However, Schifanella does not explicitly teach training the semantic embedding space using a machine learning process by jointly embedding combined multimodal feature vectors of the first training set and respective, assigned taxonomy classes of intent of the second training set in the embedding space, wherein embedded combined multimodal feature vectors having related intents are closer together in the embedding space than embedded combined multimodal feature vectors having unrelated intents.
Vukotic teaches training the semantic embedding space using a machine learning process by jointly embedding combined multimodal feature vectors of the first training set and respective, assigned taxonomy classes of intent of the second training set in the embedding space, wherein embedded combined multimodal feature vectors having related intents are closer together in the embedding space than embedded combined multimodal feature vectors having unrelated intents. (Vukotic, Page 40 Figure 2, discloses:  

[Figure: Vukotic, Figure 2 (media_image2.png)]

Vukotic, Bottom Left Page 40, discloses:  “Learning of the two cross-modal mappings is then performed simultaneously and they are forced to be as close as possible to each other's inverses by the symmetric architecture in the middle. A joint representation in the middle of the two cross-modal mappings is also formed while learning.”  Thus, Vukotic teaches training the embedding space by embedding combined multimodal feature vectors.  Vukotic also teaches on Page 40 bottom right:  “Finally, segments are then compared as illustrated in Figure 3: for each video segment, the two modalities are taken (embedded automatic transcripts with either embedded visual concepts or embedded CNN representations) and a multimodal embedding is created with a bidirectional deep neural network. The two multimodal embeddings are then simply compared with a cosine distance to obtain a similarity measure.”  Vukotic, Figure 3 shows:  
[Figure: Vukotic, Figure 3 (media_image3.png)]

Thus, by measuring the cosine distance to determine similarity, Vukotic discloses that embedded combined multimodal feature vectors having related intents are closer together in the embedding space than embedded combined multimodal feature vectors having unrelated intents.  
Recall above that Schifanella discloses taxonomy classes of intent.  Vukotic discloses jointly embedding combined multimodal feature vectors of the first training set and respective multimodal targets.  Vukotic, Page 41 Section “Dataset”, discloses:  “In this task, there are two main concepts: anchors and targets. Anchors represent segments of interest within videos that a user would like to know more about. Targets represent potential segments of interests that might or might not be related with a specific anchor. The goal is to hyperlink relevant targets for each anchor by using multimodal approaches.”)
Schifanella and Vukotic are analogous art because they are both in the field of endeavor of machine learning to analyze multimodal media.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine Schifanella’s assignment of taxonomy classes of intent (sarcasm and non-sarcasm) with Vukotic’s bidirectional deep neural networks for multimodal embedding.  In the combination, Schifanella’s taxonomy classes of sarcasm and non-sarcasm would be embedded as sarcastic and non-sarcastic “target” media, like Vukotic’s “targets”, using Vukotic’s bidirectional deep neural networks, and an intent would be inferred by determining whether the embedding of a piece of multimodal media is closer in proximity to the embedded sarcastic target or to the embedded non-sarcastic target.  One of ordinary skill in the art would have been motivated to do so in order to gain improved accuracy over Schifanella’s simple concatenation (Vukotic, Page 39 Section 2.2.1:  “A simple way to perform multimodal early fusion is by simply concatenating single-modal representations. This does not provide the best results, as each representation still belongs to its own representation space” and Vukotic Page 42 Final Paragraph:  “Multimodal embedding with bidirectional deep neural networks creates a common joint representation space where both modalities are projected from their initial representation spaces. This provides superior multimodal embeddings that bring significant improvement.”) and also to be able to perform future classifications by using a simple cosine distance rather than a more complex nonlinear classifier each time as in Schifanella (Vukotic Page 41 Top:  “The two multimodal embeddings are then simply compared with a cosine distance to obtain a similarity measure.”).
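The cosine-distance comparison relied on in the combination can be sketched as follows.  This is a minimal illustration under stated assumptions, not an implementation from either reference:  the two-dimensional target embeddings, the post embedding, and the helper names are hypothetical, and the sketch assumes the embeddings have already been produced by some trained network and are non-zero.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def classify_by_target(embedding, targets):
    """Return the label of the target embedding most similar to the given embedding."""
    return max(targets, key=lambda label: cosine_similarity(embedding, targets[label]))

# Hypothetical embedded targets, one per taxonomy class of intent.
TARGETS = {
    "sarcastic": [1.0, 0.0],
    "non-sarcastic": [0.0, 1.0],
}

# Hypothetical multimodal embedding of a new post.
post_embedding = [0.9, 0.2]

label = classify_by_target(post_embedding, TARGETS)
```

Once the targets are embedded, each new classification in this sketch requires only one similarity computation per target, with no further pass through a nonlinear classifier.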

As per Claim 2, the combination of Schifanella and Vukotic teaches the method of claim 1. Schifanella teaches wherein an intent of multimodal content having a first modality and a second modality can be inferred using the trained embedding space by: 
creating a modality feature vector representative of the content of the multimodal content having the first modality;  (Schifanella, Page 1137, Left Column First Bullet, discloses a plurality of multimodal content:  “We study the interplay between textual and visual content in sarcastic multimodal posts for three main social media platforms, i.e., Instagram, Tumblr and Twitter, and discuss a categorization of the role of images in sarcastic posts.”  Schifanella, Page 1142 Section 5.2 “Adapted Visual Representation (AVR)” discloses:  “We borrow a model trained on ImageNet exactly from [5], which is based on roughly one million images annotated with 1,000 object classes.” Here, Schifanella discloses a first modality (image) feature vector using a first machine learning model (output vector from a “model trained on ImageNet”)). 
creating a modality feature vector representative of the content of the multimodal content having the second modality; (Schifanella, Page 1142 Section 5.2 “Textual Features”, discloses:  “The NLP network is a two two layer perceptron based on unigrams only.”  Here, Schifanella discloses a second modality (text) feature vector using a second machine learning model (output vector of an “NLP network”)).
forming a combined multimodal feature vector of the modality feature vector representative of the content having the first modality and the modality feature vector representative of the content having the second modality; (Schifanella, Page 1142 Section 5.2 “Multimodal Fusion via Deep Network Adaptation”, discloses “The concatenation layer has 4,608 neurons.”  Schifanella, Figure 3, illustrates the “Concatenation Layer” as a combined image (Visual) and text (NLP) vector:

[Figure: Schifanella, Figure 3 (media_image1.png)]

projecting the combined multimodal feature vector into the embedding space; (Schifanella, Figure 3 and Page 1142 shown above, discloses a “Concatenation Layer”, of a Visual and NLP network (image and text modalities), for which “The concatenation layer has 4,608 neurons”.  Therefore, the concatenation vector is projected into 4,608-dimensional space.)
and inferring an intent of the multimodal content (Schifanella Page 1142 Section 5.2 discloses:  “We use the rectify function as the activation function on all the nonlinear layers except for the last layer, which uses softmax over the two classes (sarcastic vs. non-sarcastic).”)
However, Schifanella does not explicitly teach and inferring an intent of the multimodal content based on a proximity of the projected, combined multimodal feature vector to at least one other combined multimodal feature vector embedded in the embedding space having at least one assigned taxonomy class of intent.
Vukotic teaches projecting the combined multimodal feature vector into the embedding space;  (Vukotic, Page 40 Figure 2, discloses:  

[Figure: Vukotic, Figure 2 (media_image2.png)]

Vukotic, Bottom Left Page 40, discloses:  “Learning of the two cross-modal mappings is then performed simultaneously and they are forced to be as close as possible to each other's inverses by the symmetric architecture in the middle. A joint representation in the middle of the two cross-modal mappings is also formed while learning.”)
and inferring an intent of the multimodal content based on a proximity of the projected, combined multimodal feature vector to at least one other combined multimodal feature vector embedded in the embedding space having at least one assigned taxonomy class of intent.  (Vukotic teaches on Page 40 bottom right:  “Finally, segments are then compared as illustrated in Figure 3: for each video segment, the two modalities are taken (embedded automatic transcripts with either embedded visual concepts or embedded CNN representations) and a multimodal embedding is created with a bidirectional deep neural network. The two multimodal embeddings are then simply compared with a cosine distance to obtain a similarity measure.”  Vukotic, Figure 3 shows:  
[Figure: Vukotic, Figure 3 (media_image3.png)]

Thus, by measuring the cosine distance to determine similarity, Vukotic discloses that embedded combined multimodal feature vectors having related intents are closer together in the embedding space than embedded combined multimodal feature vectors having unrelated intents.  
Recall above that Schifanella discloses taxonomy classes of intent.  Vukotic discloses jointly embedding combined multimodal feature vectors of the first training set and respective multimodal targets.  Vukotic, Page 41 Section “Dataset”, discloses:  “In this task, there are two main concepts: anchors and targets. Anchors represent segments of interest within videos that a user would like to know more about. Targets represent potential segments of interests that might or might not be related with a specific anchor. The goal is to hyperlink relevant targets for each anchor by using multimodal approaches.”)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Schifanella and Vukotic for at least the reasons recited in Claim 1.

As per Claim 3, the combination of Schifanella and Vukotic teaches the method of claim 2. Schifanella teaches wherein the multimodal content is a social media posting (Schifanella, Page 1137 First Bullet, discloses:  “We study the interplay between textual and visual content in sarcastic multimodal posts for three main social media platforms, i.e., Instagram, Tumblr and Twitter, and discuss a categorization of the role of images in sarcastic posts.”)

As per Claim 4, the combination of Schifanella and Vukotic teaches the method of claim 2. Schifanella teaches determining if a first multimodal content is [in proximity] to a desired intent (Schifanella, Figure 3 above, discloses “Sarcasm Detection”, thus disclosing a taxonomy comprising 2 classes of intent:  that the user intends to be sarcastic or the user intends to be non-sarcastic.  Schifanella Page 1142 Section 5.2 discloses:  “We use the rectify function as the activation function on all the nonlinear layers except for the last layer, which uses softmax over the two classes (sarcastic vs. non-sarcastic).”)
However, Schifanella does not explicitly teach determining if a first multimodal content is in proximity to a desired intent.
Vukotic teaches determining if a first multimodal content is in proximity to a desired intent (Recall above Schifanella discloses a desired intent.  Vukotic teaches on Page 40 bottom right:  “Finally, segments are then compared as illustrated in Figure 3: for each video segment, the two modalities are taken (embedded automatic transcripts with either embedded visual concepts or embedded CNN representations) and a multimodal embedding is created with a bidirectional deep neural network. The two multimodal embeddings are then simply compared with a cosine distance to obtain a similarity measure.”)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Schifanella and Vukotic for at least the reasons recited in Claim 1.

As per Claim 7, the combination of Schifanella and Vukotic teaches the method of claim 1. Schifanella teaches further comprising: determining a contextual relationship between a first modality feature represented by the first modality feature vector of the multimodal content and a second modality feature represented by the second modality feature vector of the multimodal content. (Schifanella, Page 1137 First Bullet, discloses:  “We study the interplay between textual and visual content in sarcastic multimodal posts for three main social media platforms, i.e., Instagram, Tumblr and Twitter, and discuss a categorization of the role of images in sarcastic posts.”  Here, Schifanella discloses, between two modalities (text and image), a contextual relationship (sarcasm or non-sarcasm).  Schifanella, Figure 3 and Page 1142 shown above, discloses a “Concatenation Layer”, for which “The concatenation layer has 4,608 neurons”, and thus the concatenation layer comprises a vector for each modality.)

As per Claim 9, the combination of Schifanella and Vukotic teaches the method of claim 1. Schifanella teaches further comprising: inferring a semiotic relationship between a first modality represented by the first modality feature vector of the multimodal content and a second modality represented by the second modality feature vector of the multimodal content. (Schifanella, Page 1137 First Bullet, discloses:  “We study the interplay between textual and visual content in sarcastic multimodal posts for three main social media platforms, i.e., Instagram, Tumblr and Twitter, and discuss a categorization of the role of images in sarcastic posts.”  Semiotics is the study of signs.  Here, Schifanella discloses signs (“visual content”, “images”), and semiotics thereof (“study the interplay”).  Thus, Schifanella is inferring a semiotic relationship between the image and text.  Schifanella, Figure 3 and Page 1142 shown above, discloses a “Concatenation Layer”, for which “The concatenation layer has 4,608 neurons”, and thus the concatenation layer comprises a vector for each modality.)

As per Claim 12, the combination of Schifanella and Vukotic teaches the method of claim 1. Schifanella teaches further comprising: semantically embedding the respective, combined multimodal feature vectors including the respective at least one taxonomy class of intent in the embedding space. (Schifanella Page 1142 Section 5.2 discloses:  “We use the rectify function as the activation function on all the nonlinear layers except for the last layer, which uses softmax over the two classes (sarcastic vs. non-sarcastic)”.  Here, Schifanella is embedding the multimodal vectors in a common geometric (4,608-dimensional) space and using a binary classifier to classify the predetermined intent (sarcasm or non-sarcasm) of the multimodal content. One of ordinary skill in the art will appreciate that a binary classifier draws a boundary that divides geometric space between two classes.  Thus, cumulatively, pieces of sarcastic content are closer to one another than to pieces of non-sarcastic content, and vice versa.  Thus, the intent is inferred based on a proximity of other mapped content, and the multimodal content will be in proximity to either other sarcastic or non-sarcastic content (the taxonomy classes of intent), depending on which side of the classification boundary it is on.)
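The boundary drawn by a two-class softmax classifier can be sketched as follows.  This is a minimal illustration only:  the two-dimensional feature space, the weights, and the biases are hypothetical and far smaller than the 4,608-dimensional concatenation layer described in Schifanella, and a single linear layer is assumed in place of Schifanella's deeper network.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def two_class_scores(weights, biases, features):
    """Linear layer followed by softmax, giving one probability per class."""
    logits = [sum(w * x for w, x in zip(row, features)) + b
              for row, b in zip(weights, biases)]
    return softmax(logits)

# Hypothetical parameters; the boundary between the classes is the line x1 == x2.
WEIGHTS = [[1.0, -1.0],
           [-1.0, 1.0]]
BIASES = [0.0, 0.0]
CLASSES = ["sarcastic", "non-sarcastic"]

def predict(features):
    """Return the class whose softmax probability is highest."""
    probs = two_class_scores(WEIGHTS, BIASES, features)
    return CLASSES[probs.index(max(probs))]

side_a = predict([2.0, 1.0])   # falls on the "sarcastic" side of the boundary
side_b = predict([1.0, 2.0])   # falls on the "non-sarcastic" side of the boundary
```

In this sketch a point exactly on the line x1 == x2 receives probability 0.5 for each class, and every other point is assigned to the class on whose side of the line it falls.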

As per Claim 13, Schifanella teaches A method of determining an intent of multimodal content, having at least a first modality and a second modality, using a trained embedding space, the method comprising: 
for each of a plurality of content of the multimodal content having the first modality, creating a respective, first modality feature vector representative of content of the multimodal content having the first modality; (Schifanella, Page 1137, Left Column First Bullet, discloses a plurality of multimodal content:  “We study the interplay between textual and visual content in sarcastic multimodal posts for three main social media platforms, i.e., Instagram, Tumblr and Twitter, and discuss a categorization of the role of images in sarcastic posts.”  Schifanella, Page 1142 Section 5.2 “Adapted Visual Representation (AVR)” discloses:  “We borrow a model trained on ImageNet exactly from [5], which is based on roughly one million images annotated with 1,000 object classes.” Here, Schifanella discloses a first modality (image) feature vector using a first machine learning model (output vector from a “model trained on ImageNet”)). 
for each of a plurality of content of the multimodal content having the second modality, creating a respective, second modality feature vector representative of content of the multimodal content having the second modality; (Schifanella, Page 1142 Section 5.2 “Textual Features”, discloses:  “The NLP network is a two layer perceptron based on unigrams only.”  Here, Schifanella discloses a second modality (text) feature vector using a second machine learning model (output vector of an “NLP network”)).
for each of a plurality of first modality feature vector and second modality feature vector multimodal content pairs, forming a combined multimodal feature vector from the first modality feature vector and the second modality feature vector; (Schifanella, Page 1142 Section 5.2 “Multimodal Fusion via Deep Network Adaptation”, discloses “The concatenation layer has 4,608 neurons.”  Schifanella, Figure 3, illustrates the “Concatenation Layer” as a combined image (Visual) and text (NLP) vector:

[Figure: Schifanella, Figure 3 (media_image1.png)]

projecting the combined multimodal feature vector into the [trained] embedding space; (Schifanella, Figure 3 and Page 1142 shown above, discloses a “Concatenation Layer”, of a Visual and NLP network (image and text modalities), for which “The concatenation layer has 4,608 neurons”.  Therefore, the concatenation vector is projected into 4,608-dimensional space.)
and inferring an intent of the multimodal content (Schifanella Page 1142 Section 5.2 discloses:  “We use the rectify function as the activation function on all the nonlinear layers except for the last layer, which uses softmax over the two classes (sarcastic vs. non-sarcastic)”).
However, Schifanella does not explicitly teach projecting the combined multimodal feature vector into the trained embedding space, the embedding space having been trained by jointly embedding a plurality of combined multimodal feature vectors of multimodal content having at least a first modality and a second modality with respective taxonomy classes of intent having been assigned for each of the combined multimodal feature vectors, such that embedded combined multimodal feature vectors having related intents are closer together in the embedding space than embedded combined multimodal feature vectors having unrelated intents; and inferring an intent of the multimodal content represented by the combined multimodal feature vector based on a proximity of the projected, combined multimodal feature vector to at least one other combined multimodal feature vector embedded in the embedding space having at least one assigned taxonomy class of intent.
Vukotic teaches projecting the combined multimodal feature vector into the [trained] embedding space, the embedding space having been trained by jointly embedding a plurality of combined multimodal feature vectors of multimodal content having at least a first modality and a second modality with respective taxonomy classes of intent having been assigned for each of the combined multimodal feature vectors, such that embedded combined multimodal feature vectors having related intents are closer together in the embedding space than embedded combined multimodal feature vectors having unrelated intents; (Vukotic, Page 40 Figure 2, discloses:  

[Figure: Vukotic, Figure 2 (media_image2.png)]

Vukotic, Bottom Left Page 40, discloses:  “Learning of the two cross-modal mappings is then performed simultaneously and they are forced to be as close as possible to each other's inverses by the symmetric architecture in the middle. A joint representation in the middle of the two cross-modal mappings is also formed while learning.”  Thus, Vukotic teaches training the embedding space by embedding combined multimodal feature vectors.  Vukotic also teaches on Page 40 bottom right:  “Finally, segments are then compared as illustrated in Figure 3: for each video segment, the two modalities are taken (embedded automatic transcripts with either embedded visual concepts or embedded CNN representations) and a multimodal embedding is created with a bidirectional deep neural network. The two multimodal embeddings are then simply compared with a cosine distance to obtain a similarity measure.”  Vukotic, Figure 3 shows:  
[Figure: Vukotic, Figure 3 (media_image3.png)]

Thus, by measuring the cosine distance to determine similarity, Vukotic discloses that embedded combined multimodal feature vectors having related intents are closer together in the embedding space than embedded combined multimodal feature vectors having unrelated intents.  
Recall above that Schifanella discloses taxonomy classes of intent.  Vukotic discloses jointly embedding combined multimodal feature vectors of the first training set and respective multimodal targets.  Vukotic, Page 41 Section “Dataset”, discloses:  “In this task, there are two main concepts: anchors and targets. Anchors represent segments of interest within videos that a user would like to know more about. Targets represent potential segments of interests that might or might not be related with a specific anchor. The goal is to hyperlink relevant targets for each anchor by using multimodal approaches.”)
and inferring an intent of the multimodal content represented by the combined multimodal feature vector based on a proximity of the projected, combined multimodal feature vector to at least one other combined multimodal feature vector embedded in the embedding space having at least one assigned taxonomy class of intent. (Recall above that Schifanella discloses inferring an intent.  Vukotic as shown above discloses jointly embedding combined multimodal feature vectors of the first training set and respective multimodal targets, and obtaining a similarity measure of proximity (“cosine distance”)).
Schifanella and Vukotic are analogous art because they are both in the field of endeavor of machine learning to analyze multimodal media.
It would have been obvious before the effective filing date of the claimed invention to combine the assignment of taxonomy classes of intent (sarcasm and non-sarcasm) of Schifanella with the bidirectional deep neural networks for multimodal embedding of Vukotic.  The combination would result in Schifanella’s taxonomy classes of sarcasm and non-sarcasm being embedded by a sarcastic and non-sarcastic “target” media like Vukotic’s “target”, using Vukotic’s bidirectional deep neural networks, in which classification into an intent could be determined by the embedding of a multimodal media being closer in proximity to either one of the embedded sarcastic or non-sarcastic targets.  One of ordinary skill in the art would be motivated to do so in order to gain improved accuracy over Schifanella’s simple concatenation (Vukotic, Page 39 Section 2.2.1:  “A simple way to perform multimodal early fusion is by simply concatenating single-modal representations. This does not provide the best results, as each representation still belongs to its own representation space” and Vukotic Page 42 Final Paragraph:  “Multimodal embedding with bidirectional deep neural networks creates a common joint representation space where both modalities are projected from their initial representation spaces. This provides superior multimodal embeddings that bring significant improvement.”) and also to be able to perform future classifications by using a simple cosine distance rather than a more complex nonlinear classifier each time as in Schifanella (Vukotic Page 41 Top:  “The two multimodal embeddings are then simply compared with a cosine distance to obtain a similarity measure.”).
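For illustration only, the proximity-based inference described in the combination above (comparing an embedded multimodal vector against embedded targets labeled with an intent class, using a cosine distance) can be sketched as follows. The function names, the toy 4-dimensional vectors, and the target labels are hypothetical; the vectors stand in for already-embedded content, and the learned embedding network of Vukotic is not reproduced here.

```python
import math

def cosine_distance(a, b):
    """Return 1 minus the cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def infer_intent(embedded, labeled_targets):
    """Return the label of the embedded target nearest to `embedded`."""
    return min(labeled_targets,
               key=lambda label: cosine_distance(embedded, labeled_targets[label]))

# Hypothetical embedded targets, one per taxonomy class of intent.
toy_targets = {
    "sarcastic": [1.0, 0.0, 1.0, 0.0],
    "non-sarcastic": [0.0, 1.0, 0.0, 1.0],
}

# Hypothetical embedded multimodal vector for a new piece of content.
toy_embedded = [0.9, 0.1, 0.8, 0.2]
print(infer_intent(toy_embedded, toy_targets))
```

Under these assumptions, the inferred intent is simply the class of whichever labeled target lies at the smallest cosine distance from the projected vector.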

As per Claim 15, the combination of Schifanella and Vukotic teaches the method of claim 13. Schifanella teaches further comprising: inferring a semiotic relationship between a first modality represented by the first modality feature vector of the multimodal content and a second modality represented by the second modality feature vector of the multimodal content. (Schifanella, Page 1137 First Bullet, discloses:  “We study the interplay between textual and visual content in sarcastic multimodal posts for three main social media platforms, i.e., Instagram, Tumblr and Twitter, and discuss a categorization of the role of images in sarcastic posts.”  Semiotics is the study of signs.  Here, Schifanella discloses signs (“visual content”, “images”), and semiotics thereof (“study the interplay”).  Thus, Schifanella is inferring a semiotic relationship between the image and text.  Schifanella, Figure 3 and Page 1142 shown above, discloses a “Concatenation Layer”, for which “The concatenation layer has 4,608 neurons”, and thus the concatenation layer comprises a vector for each modality.)
As per Claim 17, this is a non-transitory computer-readable medium claim corresponding to Claim 1.  The difference is that it recites a non-transitory computer-readable medium and a processor.  Schifanella, Page 1144 Bottom Left, discloses a memory and a processor:  “The challenges that prevent us from using more advanced textual features…a higher dimensionality brings difficulties for a fast neural network training due to the limitations of the GPU memory.” Claim 17 is rejected for the same reasons as Claim 1.

As per Claim 19, this is a non-transitory computer-readable medium claim corresponding to Claim 9.  The difference is that it recites a non-transitory computer-readable medium and a processor.  Claim 19 is rejected for the same reasons as Claim 9.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 5, 14, and 18 are rejected under 35 U.S.C. 103 as being unpatentable over the combination of Schifanella and Vukotic in view of Arras et al. (“Explaining Recurrent Neural Network Predictions in Sentiment Analysis”; hereinafter “Arras”).
As per Claim 5, the combination of Schifanella and Vukotic teaches the method of Claim 4 as well as multimodal content and mapped to the embedding space (see Rejection to Claim 1).  However, the combination of Schifanella and Vukotic does not teach further comprising: suggesting alterations of the first multimodal content such that the altered first multimodal content, if mapped to the embedding space, would be closer to the desired intent.
Arras teaches further comprising: suggesting alterations of the first [multimodal] content such that the altered first [multimodal] content[, if mapped to the embedding space,] would be closer to the desired intent. (Arras, Page 4 Top Right, discloses:  “Furthermore, LRP explains well the two sentences that are mistakenly classified as “very positive” and “positive” (examples 11 and 17), by accentuating the negative relevance (blue colored) of terms speaking against the target class, i.e. the class “very negative”, such as must-see list, remember and future, whereas such understanding is not provided by the SA heatmaps. The same holds for the misclassified “very positive” sentence (example 21), where the word fails gets attributed a deep negatively signed relevance (blue colored).”  Here, Arras discloses a classifier that reveals explanations of the source of overall sentiment analysis of a sentence, and these explanations comprise suggestions as to how to change the classification of the sentence.  For example, Arras discloses above that avoiding the use of a “positive” word like “must-see list” will help a sentiment classifier map the sentence to the desired intent, which is to be negative.   Arras also shows on Page 6 Top Right, another example of making alterations in order to change the classification of a sentence:  “On initially correctly classified sentences we delete words in decreasing order of their relevance value, and on initially falsely classified sentences we delete words in increasing order of their relevance. We additionally perform a random word deletion as an uninformative variant for comparison. Our results in terms of tracking the classification accuracy over the number of word deletions per sentence are reported in Fig. 3. 
These results show that, in both considered cases, deleting words in decreasing or increasing order of their LRP relevance has the most pertinent effect, suggesting that this relevance decomposition method is the most appropriate for detecting words speaking for or against a classifier’s decision.”)
Arras and the combination of Schifanella and Vukotic are analogous art because they are in the field of endeavor of machine learning.
It would have been obvious to one of ordinary skill in the art before the effective filing date to combine the sentiment classifier of the combination of Schifanella and Vukotic, with the explainable classifier of Arras.  One of ordinary skill in the art would be motivated to do so in order to gain transparency, and better understand the reasons behind the intent classification, and therefore allow a user to more effectively get their point across (Arras, Page 1 Intro: “As these models become increasingly predictive, one also needs to make sure that they work as intended, in particular, their decisions should be made as transparent as possible”, and Arras Page 7 Conclusion:  “We applied the extended LRP version to a bi-directional LSTM model for the sentiment prediction of sentences, demonstrating that the resulting word relevances trustworthy reveal words supporting the classifier’s decision for or against a specific class, and perform better than those obtained by a gradient-based decomposition. Our technique helps understanding and verifying the correct behavior of recurrent classifiers, and can detect important patterns in text datasets.”)
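For illustration only, the word-deletion procedure quoted from Arras above (deleting words in decreasing or increasing order of their relevance) can be sketched as follows. The function name, the example sentence, and the relevance scores are hypothetical; in Arras the scores would come from LRP, which is not implemented here.

```python
def delete_by_relevance(words, relevance, num_deletions, decreasing=True):
    """Delete `num_deletions` words, most relevant first when `decreasing`
    is True, or least relevant first when it is False."""
    order = sorted(range(len(words)), key=lambda i: relevance[i],
                   reverse=decreasing)
    removed = set(order[:num_deletions])
    return [w for i, w in enumerate(words) if i not in removed]

# Hypothetical per-word relevance scores toward a target class.
toy_words = ["a", "must-see", "film", "that", "fails"]
toy_relevance = [0.0, 0.8, 0.1, 0.0, -0.9]

# Remove the single word with the highest relevance score.
print(delete_by_relevance(toy_words, toy_relevance, 1, decreasing=True))
# Remove the single word with the lowest (most negative) relevance score.
print(delete_by_relevance(toy_words, toy_relevance, 1, decreasing=False))
```

The altered word lists are candidate alterations of the content; whether an alteration actually moves the classification toward a desired class would have to be checked by re-running the classifier.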

As per Claim 14, the combination of Schifanella and Vukotic teaches the method of Claim 13 as well as multimodal content and mapped to a common geometric space (see Rejection to Claim 1).  Schifanella teaches determining if a first multimodal content associated with a first agent is in proximity to a desired intent (Schifanella Page 1142 Section 5.2 discloses:  “We use the rectify function as the activation function on all the nonlinear layers except for the last layer, which uses softmax over the two classes (sarcastic vs. non-sarcastic)”.  Here, Schifanella is embedding the multimodal vectors in a common geometric (4,608-dimensional) space and using a binary classifier to classify the predetermined intent (sarcasm or non-sarcasm) of the multimodal content. One of ordinary skill in the art will appreciate that a binary classifier draws a boundary that divides geometric space between two classes.  Thus, cumulatively, pieces of sarcastic content are closer to one another than to pieces of non-sarcastic content, and vice versa.  Thus, the intent is inferred based on a proximity of other mapped content, and the multimodal content will be in proximity to either other sarcastic or non-sarcastic content, depending on which side of the classification boundary it is on.  Schifanella, Page 1137 First Bullet, discloses:  “We study the interplay between textual and visual content in sarcastic multimodal posts for three main social media platforms, i.e., Instagram, Tumblr and Twitter, and discuss a categorization of the role of images in sarcastic posts.”  Here, Schifanella discloses that this is applied to social media posts, which are posted by a user, and thus a first agent.)
However, the combination of Schifanella and Vukotic does not teach further comprising: suggesting alterations of the first multimodal content such that the altered first multimodal content, if mapped to the common geometric space, would be closer to the desired intent.
Arras teaches further comprising: suggesting alterations of the first [multimodal] content such that the altered first [multimodal] content[, if mapped to the common geometric space,] would be closer to the desired intent. (Arras, Page 4 Top Right, discloses:  “Furthermore, LRP explains well the two sentences that are mistakenly classified as “very positive” and “positive” (examples 11 and 17), by accentuating the negative relevance (blue colored) of terms speaking against the target class, i.e. the class “very negative”, such as must-see list, remember and future, whereas such understanding is not provided by the SA heatmaps. The same holds for the misclassified “very positive” sentence (example 21), where the word fails gets attributed a deep negatively signed relevance (blue colored).”  Here, Arras discloses a classifier that reveals explanations of the source of overall sentiment analysis of a sentence, and these explanations comprise suggestions as to how to change the classification of the sentence.  For example, Arras discloses above that avoiding the use of a “positive” word like “must-see list” will help a sentiment classifier map the sentence to the desired intent, which is to be negative.   Arras also shows on Page 6 Top Right, another example of making alterations in order to change the classification of a sentence:  “On initially correctly classified sentences we delete words in decreasing order of their relevance value, and on initially falsely classified sentences we delete words in increasing order of their relevance. We additionally perform a random word deletion as an uninformative variant for comparison. Our results in terms of tracking the classification accuracy over the number of word deletions per sentence are reported in Fig. 3. 
These results show that, in both considered cases, deleting words in decreasing or increasing order of their LRP relevance has the most pertinent effect, suggesting that this relevance decomposition method is the most appropriate for detecting words speaking for or against a classifier’s decision.”)
Arras and the combination of Schifanella and Vukotic are analogous art because they are in the field of endeavor of machine learning.
It would have been obvious to one of ordinary skill in the art before the effective filing date to combine the sentiment classifier of Schifanella, with the explainable classifier of Arras.  One of ordinary skill in the art would be motivated to do so in order to gain transparency, and better understand the reasons behind the intent classification, and therefore allow a user to more effectively get their point across (Arras, Page 1 Intro: “As these models become increasingly predictive, one also needs to make sure that they work as intended, in particular, their decisions should be made as transparent as possible”, and Arras Page 7 Conclusion:  “We applied the extended LRP version to a bi-directional LSTM model for the sentiment prediction of sentences, demonstrating that the resulting word relevances trustworthy reveal words supporting the classifier’s decision for or against a specific class, and perform better than those obtained by a gradient-based decomposition. Our technique helps understanding and verifying the correct behavior of recurrent classifiers, and can detect important patterns in text datasets.”)

As per Claim 18, this is a non-transitory computer-readable medium claim corresponding to method claim 5. The difference is that it recites a non-transitory computer-readable medium, a processor, and associated with a first agent.  Schifanella, Page 1144 Bottom Left, discloses a memory and a processor:  “The challenges that prevent us from using more advanced textual features…a higher dimensionality brings difficulties for a fast neural network training due to the limitations of the GPU memory.” Schifanella, Page 1137 First Bullet, discloses:  “We study the interplay between textual and visual content in sarcastic multimodal posts for three main social media platforms, i.e., Instagram, Tumblr and Twitter, and discuss a categorization of the role of images in sarcastic posts.”  Here, Schifanella discloses that this is applied to social media posts, which are posted by a user, and thus a first agent.  Claim 18 is rejected for the same reasons as Claim 5.

Claims 6 and 16 are rejected under 35 U.S.C. 103 as being unpatentable over the combination of Schifanella and Vukotic in view of Cui et al. (“Recognize user intents in online interactions from massive social media data”; hereinafter “Cui”), Sharma et al. (“Degree based Classification of Harmful Speech using Twitter Data”; hereinafter “Sharma”), and Kofler et al. (“Uploader Intent for Online Video: Typology, Inference, and Applications”; hereinafter “Kofler”).
As per Claim 6, the combination of Schifanella and Vukotic teaches the method of Claim 1 as well as intent is classified (see Rejection to Claim 1).  However, the combination of Schifanella and Vukotic does not teach wherein intent is classified by a taxonomy comprising advocative, information, expressive, provocative, entertainment, and exhibitionist classes.
Cui teaches wherein intent is classified by a taxonomy comprising advocative, information, expressive[, provocative, entertainment, and exhibitionist] classes  (Cui, Page 12 Section III, discloses:  “We examine the taxonomy of speech acts manifested in online texts by reviewing a large number of microblogs, and identify 10 categories of user intents from online interactions.”  Cui includes in this taxonomy advocative (“Directive (D1): the user wants the listener (i.e. other users or organizations), (not) to do something, including subcategories such as advice, request and appeal.”), information (“Declarative (D3): the user announces objective information, like news posted by organizations”), and expressive (“Expressive (E1): the user expresses his/her attitudes or manners, including subcategories such as blessing, apology, comfort, appreciation and congratulation.”)).
Cui and the combination of Schifanella and Vukotic are analogous art because they are both in the field of endeavor of intent classification of social media.
It would have been obvious before the effective filing date of the claimed invention to combine the intent categories of Cui with the multimodal sarcasm classifier of the combination of Schifanella and Vukotic.  Doing so would enable one to use the fusion of multimodal properties to identify a broader range of intents.  One of ordinary skill in the art would be motivated to do so in order to improve one’s ability to make decisions based on social media posts. (Cui, Page 11 Intro Para 1:  “Recognizing intents in users’ online interactive behavior from social media data can effectively identify users’ motives behind communication and provide important information to aid monitoring, analysis and decision-making for a variety of applications”)
However, the combination of Schifanella, Vukotic and Cui does not explicitly teach wherein intent is classified by a taxonomy comprising provocative, entertainment, and exhibitionist classes.
Sharma teaches wherein intent is classified by a taxonomy comprising provocative classes. (Sharma, Page 1 Abstract, Last sentence, discloses:  “We also propose supervised classification system for recognizing these respective harmful speech classes in the texts hence.”  Sharma, Page 3 under Class II discloses provocation under the third bullet:  “Correlates between linguistic violence and nonlinguistic/demographic intimidating and trespassing someone in an online space. Can be highly provocative when addressing an individual rather than some ideology or community/group.”)
Sharma and the combination of Schifanella, Vukotic and Cui are analogous art because they are both in the field of endeavor of intent classification of social media.
It would have been obvious before the effective filing date of the claimed invention to combine the provocative category of Sharma with the multimodal intent classifier of the combination of Schifanella, Vukotic and Cui.  One of ordinary skill in the art would be motivated to do so in order to be able to crack down on and hold accountable those who post provocative or hateful content (Sharma, Page 1 Abstract:  “Harmful speech has various forms and it has been plaguing the social media in different ways. If we need to crackdown different degrees of hate speech and abusive behavior amongst it, the classification needs to be based on complex ramifications which needs to be defined and hold accountable for, other than racist, sexist or against some particular group and community.”)
However, the combination of Schifanella, Vukotic, Cui, and Sharma does not explicitly teach wherein intent is classified by a taxonomy comprising entertainment and exhibitionist classes.
Kofler teaches wherein intent is classified by a taxonomy comprising entertainment and exhibitionist classes. (Kofler, Page 1200 Abstract, begins:  “We investigate automatic inference of uploader intent for online video, i.e., prediction of the reason for which a user has uploaded a particular video to the Internet”.  Kofler, Page 1204 Table 1, discloses entertainment (“Entertaining (UIEN):  purely entertain its viewers”) and exhibitionist (“Sharing (UISH):  share a (real-life) experience or event to viewers of the video”)).
Kofler and the combination of Schifanella, Vukotic, Cui, and Sharma are analogous art because they are both in the field of endeavor of intent classification of social media.
It would have been obvious before the effective filing date of the claimed invention to combine the entertainment and exhibitionist categories of Kofler with the multimodal intent classifier of the combination of Schifanella, Vukotic, Cui, and Sharma.  One of ordinary skill in the art would be motivated to do so in order to improve searching for relevant posts and learn how to create more effective promotional social media posts (Kofler, Page 1200 Intro Para 2:  “Our investigation of uploader intent for online video is motivated by the wide variety of application areas that ultimately stand to benefit from information on the reasons which prompt users to upload videos. These areas cover a diverse spectrum including video production and video search. For example, in the area of video production, knowledge about uploader intent could improve video authoring tools, guiding the user in producing videos with a fitting ‘look and feel’, for instance, by automatically recommending editing templates or Instagram-like filters. Another example is that uploader intent could aid the automatic matching of advertisements to videos, by providing information concerning the intended target audience of a video”)

As per Claim 16, the combination of Schifanella and Vukotic teaches the method of Claim 13 as well as intent is classified (see Rejection to Claim 13).  However, the combination of Schifanella and Vukotic does not teach wherein intent is classified by a taxonomy comprising advocative, information, expressive, provocative, entertainment, and exhibitionist classes.
Cui teaches wherein intent is classified by a taxonomy comprising advocative, information, expressive[, provocative, entertainment, and exhibitionist] classes  (Cui, Page 12 Section III, discloses:  “We examine the taxonomy of speech acts manifested in online texts by reviewing a large number of microblogs, and identify 10 categories of user intents from online interactions.”  Cui includes in this taxonomy advocative (“Directive (D1): the user wants the listener (i.e. other users or organizations), (not) to do something, including subcategories such as advice, request and appeal.”), information (“Declarative (D3): the user announces objective information, like news posted by organizations”), and expressive (“Expressive (E1): the user expresses his/her attitudes or manners, including subcategories such as blessing, apology, comfort, appreciation and congratulation.”)).
Cui and the combination of Schifanella and Vukotic are analogous art because they are both in the field of endeavor of intent classification of social media.
It would have been obvious before the effective filing date of the claimed invention to combine the intent categories of Cui with the multimodal sarcasm classifier of the combination of Schifanella and Vukotic.  Doing so would enable one to use the fusion of multimodal properties to identify a broader range of intents.  One of ordinary skill in the art would be motivated to do so in order to improve one’s ability to make decisions based on social media posts. (Cui, Page 11 Intro Para 1:  “Recognizing intents in users’ online interactive behavior from social media data can effectively identify users’ motives behind communication and provide important information to aid monitoring, analysis and decision-making for a variety of applications”)
However, the combination of Schifanella, Vukotic and Cui does not explicitly teach wherein intent is classified by a taxonomy comprising provocative, entertainment, and exhibitionist classes.
Sharma teaches wherein intent is classified by a taxonomy comprising provocative classes. (Sharma, Page 1 Abstract, Last sentence, discloses:  “We also propose supervised classification system for recognizing these respective harmful speech classes in the texts hence.”  Sharma, Page 3 under Class II discloses provocation under the third bullet:  “Correlates between linguistic violence and nonlinguistic/demographic intimidating and trespassing someone in an online space. Can be highly provocative when addressing an individual rather than some ideology or community/group.”)
Sharma and the combination of Schifanella, Vukotic and Cui are analogous art because they are both in the field of endeavor of intent classification of social media.
It would have been obvious before the effective filing date of the claimed invention to combine the provocative category of Sharma with the multimodal intent classifier of the combination of Schifanella, Vukotic and Cui.  One of ordinary skill in the art would be motivated to do so in order to be able to crack down on and hold accountable those who post provocative or hateful content (Sharma, Page 1 Abstract:  “Harmful speech has various forms and it has been plaguing the social media in different ways. If we need to crackdown different degrees of hate speech and abusive behavior amongst it, the classification needs to be based on complex ramifications which needs to be defined and hold accountable for, other than racist, sexist or against some particular group and community.”)
However, the combination of Schifanella, Vukotic, Cui, and Sharma does not explicitly teach wherein intent is classified by a taxonomy comprising entertainment and exhibitionist classes.
Kofler teaches wherein intent is classified by a taxonomy comprising entertainment and exhibitionist classes. (Kofler, Page 1200 Abstract, begins:  “We investigate automatic inference of uploader intent for online video, i.e., prediction of the reason for which a user has uploaded a particular video to the Internet”.  Kofler, Page 1204 Table 1, discloses entertainment (“Entertaining (UIEN):  purely entertain its viewers”) and exhibitionist (“Sharing (UISH):  share a (real-life) experience or event to viewers of the video”)).
Kofler and the combination of Schifanella, Vukotic, Cui, and Sharma are analogous art because they are both in the field of endeavor of intent classification of social media.
It would have been obvious before the effective filing date of the claimed invention to combine the entertainment and exhibitionist categories of Kofler with the multimodal intent classifier of the combination of Schifanella, Vukotic, Cui, and Sharma.  One of ordinary skill in the art would be motivated to do so in order to improve searching for relevant posts and learn how to create more effective promotional social media posts (Kofler, Page 1200 Intro Para 2:  “Our investigation of uploader intent for online video is motivated by the wide variety of application areas that ultimately stand to benefit from information on the reasons which prompt users to upload videos. These areas cover a diverse spectrum including video production and video search. For example, in the area of video production, knowledge about uploader intent could improve video authoring tools, guiding the user in producing videos with a fitting ‘look and feel’, for instance, by automatically recommending editing templates or Instagram-like filters. Another example is that uploader intent could aid the automatic matching of advertisements to videos, by providing information concerning the intended target audience of a video”)

Claims 8, 10, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over the combination of Schifanella and Vukotic in view of Jaiswal et al. (“Multimedia Semantic Integrity Assessment Using Joint Embedding Of Images And Text”; hereinafter “Jaiswal”) and Zhang et al. (“Equal But Not The Same: Understanding the Implicit Relationship Between Persuasive Images and Text”; hereinafter “Zhang”)
As per Claim 8, the combination of Schifanella and Vukotic teaches the method of Claim 7 as well as contextual relationship (see Rejection to Claim 7).  However, the combination of Schifanella and Vukotic does not teach wherein the contextual relationship is classified by a taxonomy comprising minimal, close, and transcendent classes.
Jaiswal teaches wherein the contextual relationship is classified by a taxonomy comprising minimal, close[, and transcendent] classes.  (Recall above Schifanella teaches a contextual relationship.  Jaiswal, Page 1 Abstract, discloses:  “We construct a joint embedding of images and captions with deep multimodal representation learning on the reference dataset in a framework that also provides image-caption consistency scores (ICCSs). The integrity of query media packages is assessed as the inlierness of the query ICCSs with respect to the reference dataset.”  Here, Jaiswal discloses the “consistency” of the image with the caption.  This is used to classify as an inlier or outlier as shown in Jaiswal Page 6 Section 5:  “The inlier/outlier decisions of the ODMs in our system serve as the prediction of semantic information manipulation in query packages.”  Thus, Jaiswal discloses a classifier that determines a taxonomy of a minimal relationship (inconsistency of image and text, or “outlier”) or a close relationship (consistency of image and text, or “inlier”))
Jaiswal and the combination of Schifanella and Vukotic are analogous art because they are both in the field of endeavor of classification of social media.
It would have been obvious before the effective filing date of the claimed invention to combine the classification of the image-text relationship of Jaiswal with the multimodal intent classifier of the combination of Schifanella and Vukotic.  One of ordinary skill in the art would be motivated to do so in order to prevent an inaccurate interpretation of a multimodal post if one of the modalities has been made to be intentionally misleading (Jaiswal Page 1-2 Intro:  “Independent existence of each modality makes multimedia data packages vulnerable to tampering, where the data in a subset of modalities of a multimedia package can be modified to misrepresent or repurpose the multimedia package. Such tampering, with possible malicious intent, can be misleading, if not dangerous. The location information, for example, in the aforementioned caption could be modified without an easy way to detect such tampering.”)
However, the combination of Schifanella, Vukotic and Jaiswal does not explicitly teach wherein the contextual relationship is classified by a taxonomy comprising a transcendent class.
Zhang teaches wherein the contextual relationship is classified by a taxonomy comprising a transcendent class (Recall above Schifanella discloses a contextual relationship. Zhang, Page 6 Section 4.1 Subsection 3, discloses:  “If the viewer can deduce the correct message from either channel alone, then using either text or image would be sufficient for advertising, and the relationship might be parallel, but if the meaning is unclear with one channel disabled, then both channels are indispensable, and the relationship might be non-parallel.”  Here, Zhang discloses a taxonomy comprising a close relationship (either image or text alone conveys the same message, or “parallel”) and a transcendent relationship (the image and text are both necessary to convey more meaning than either could on their own, or “non-parallel”)).
Zhang and the combination of Schifanella, Vukotic and Jaiswal are analogous art because they are both in the field of endeavor of classification of social media.
It would have been obvious before the effective filing date of the claimed invention to combine the classification of the image-text relationship of Zhang with the multimodal intent classifier of the combination of Schifanella, Vukotic and Jaiswal.  One of ordinary skill in the art would be motivated to do so in order to improve the accuracy of an automatic media interpreter by leveraging the recognition of implicit relationships (Zhang Page 1 Intro:  “While recent work has made advances in making literal connections between image and text (e.g., image captioning, where the text describes what is seen in the image [1, 6, 9, 16, 18, 35, 38, 40]), recognizing implicit relationships between image and text (e.g., metaphorical, symbolic, explanatory, ironic, etc.) remains a research challenge”).

As per Claim 10, the combination of Schifanella and Vukotic teaches the method of Claim 9 as well as semiotic relationship (see Rejection to Claim 9).  However, the combination of Schifanella and Vukotic does not teach wherein the semiotic relationship is classified by a taxonomy comprising divergent, parallel, and additive classes.
Jaiswal teaches wherein the semiotic relationship is classified by a taxonomy comprising divergent, parallel[, and additive] classes.  (Recall above Schifanella teaches a semiotic relationship.  Jaiswal, Page 1 Abstract, discloses:  “We construct a joint embedding of images and captions with deep multimodal representation learning on the reference dataset in a framework that also provides image-caption consistency scores (ICCSs). The integrity of query media packages is assessed as the inlierness of the query ICCSs with respect to the reference dataset.”  Here, Jaiswal discloses the “consistency” of the image with the caption.  This is used to classify as an inlier or outlier as shown in Jaiswal Page 6 Section 5:  “The inlier/outlier decisions of the ODMs in our system serve as the prediction of semantic information manipulation in query packages.”  Thus, Jaiswal discloses a classifier that determines a taxonomy of a divergent relationship (inconsistency of image and text, or “outlier”) or a parallel relationship (consistency of image and text, or “inlier”))
Jaiswal and the combination of Schifanella and Vukotic are analogous art because they are both in the field of endeavor of classification of social media.
It would have been obvious before the effective filing date of the claimed invention to combine the classification of the image-text relationship of Jaiswal with the multimodal intent classifier of Schifanella and Vukotic.  One of ordinary skill in the art would be motivated to do so in order to prevent an inaccurate interpretation of a multimodal post if one of the modalities has been made to be intentionally misleading (Jaiswal Page 1-2 Intro:  “Independent existence of each modality makes multimedia data packages vulnerable to tampering, where the data in a subset of modalities of a multimedia package can be modified to misrepresent or repurpose the multimedia package. Such tampering, with possible malicious intent, can be misleading, if not dangerous. The location information, for example, in the aforementioned caption could be modified without an easy way to detect such tampering.”)
However, the combination of Schifanella, Vukotic and Jaiswal does not explicitly teach wherein the semiotic relationship is classified by a taxonomy comprising an additive class.
Zhang teaches wherein the semiotic relationship is classified by a taxonomy comprising an additive class (Recall above Schifanella teaches a semiotic relationship. Zhang, Page 6 Section 4.1 Subsection 3, discloses:  “If the viewer can deduce the correct message from either channel alone, then using either text or image would be sufficient for advertising, and the relationship might be parallel, but if the meaning is unclear with one channel disabled, then both channels are indispensable, and the relationship might be non-parallel.”  Here, Zhang discloses a taxonomy comprising a parallel relationship (either image or text alone conveys the same message, or “parallel”) and an additive relationship (the image and text are both necessary to convey more meaning than either could on their own, or “non-parallel”)).
Zhang and the combination of Schifanella, Vukotic and Jaiswal are analogous art because they are both in the field of endeavor of classification of social media.
It would have been obvious before the effective filing date of the claimed invention to combine the classification of the image-text relationship of Zhang with the multimodal intent classifier of the combination of Schifanella, Vukotic and Jaiswal.  One of ordinary skill in the art would be motivated to do so in order to improve the accuracy of an automatic media interpreter by leveraging the recognition of implicit relationships (Zhang Page 1 Intro:  “While recent work has made advances in making literal connections between image and text (e.g., image captioning, where the text describes what is seen in the image [1, 6, 9, 16, 18, 35, 38, 40]), recognizing implicit relationships between image and text (e.g., metaphorical, symbolic, explanatory, ironic, etc.) remains a research challenge”)

As per Claim 20, this is a non-transitory computer-readable medium claim corresponding to Claim 10.  The difference is that it recites a non-transitory computer-readable medium and a processor.  Schifanella, Page 1144 Bottom Left, discloses a memory and a processor:  “The challenges that prevent us from using more advanced textual features…a higher dimensionality brings difficulties for a fast neural network training due to the limitations of the GPU memory.” Claim 20 is rejected for the same reasons as Claim 10.

Claim 11 is rejected under 35 U.S.C. 103 as being unpatentable over the combination of Schifanella and Vukotic in view of Nickel et al. (“Poincaré Embeddings for Learning Hierarchical Representations”; hereinafter “Nickel”)
As per Claim 11, the combination of Schifanella and Vukotic teaches the method of Claim 1.  However, the combination of Schifanella and Vukotic does not teach wherein the embedding space is a non-Euclidean common geometric space.  
Nickel teaches wherein the embedding space is a non-Euclidean common geometric space (Nickel, Page 2 Para 2, discloses:  “To exploit this structural property for  learning more efficient representations, we propose to compute embeddings not in Euclidean but in hyperbolic space, i.e., space with constant negative curvature.”)
Nickel and the combination of Schifanella and Vukotic are analogous art because they are in the field of endeavor of machine learning.
It would have been obvious to one of ordinary skill in the art before the effective filing date to combine the multimodal embedding of the combination of Schifanella and Vukotic, with the embedding in hyperbolic space of Nickel. The modification would have been obvious because one of ordinary skill in the art would be motivated to capture hierarchical properties and outperform Euclidean embeddings (Nickel, Abstract: “Representation learning has become an invaluable approach for learning from symbolic data such as text and graphs. However, while complex symbolic datasets often exhibit a latent hierarchical structure, state-of-the-art methods typically learn embeddings in Euclidean vector spaces, which do not account for this property. For this purpose, we introduce a new approach for learning hierarchical representations of symbolic data by embedding them into hyperbolic space — or more precisely into an n-dimensional Poincaré ball. Due to the underlying hyperbolic geometry, this allows us to learn parsimonious representations of symbolic data by simultaneously capturing hierarchy and similarity. We introduce an efficient algorithm to learn the embeddings based on Riemannian optimization and show experimentally that Poincaré embeddings outperform Euclidean embeddings significantly on data with latent hierarchies, both in terms of representation capacity and in terms of generalization ability.”)
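For illustration only, the difference between a Euclidean embedding space and the hyperbolic (Poincaré ball) space quoted from Nickel can be sketched with the standard Poincaré ball distance. The function names and sample points below are hypothetical and are not taken from Nickel, Schifanella, Vukotic, or the claims; the sketch assumes points lying strictly inside the unit ball.

```python
import math


def euclidean_distance(u, v):
    """Ordinary straight-line distance, shown for comparison."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def poincare_distance(u, v):
    """Hyperbolic distance between two points strictly inside the unit ball.

    Uses the standard Poincare ball formula:
        d(u, v) = arcosh(1 + 2*|u - v|^2 / ((1 - |u|^2) * (1 - |v|^2)))
    """
    sq_norm_u = sum(x * x for x in u)
    sq_norm_v = sum(x * x for x in v)
    if sq_norm_u >= 1.0 or sq_norm_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_norm_u) * (1.0 - sq_norm_v)))


if __name__ == "__main__":
    # Hypothetical points: the same Euclidean gap (0.05) measured near the
    # center of the ball and near its boundary.
    near_center = ([0.0, 0.0], [0.05, 0.0])
    near_boundary = ([0.90, 0.0], [0.95, 0.0])
    for a, b in (near_center, near_boundary):
        print(a, b, euclidean_distance(a, b), poincare_distance(a, b))
```

In this sketch the same Euclidean gap corresponds to a much larger hyperbolic distance near the boundary of the ball than near its center, which is the geometric property that distinguishes the space from a Euclidean one.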

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
Zadeh et al. (“Tensor Fusion Network for Multimodal Sentiment Analysis”) discloses on Page 5 Figure 4, a multimodal tensor to classify sentiment:

[Image: media_image4.png (Zadeh, Page 5 Figure 4), not reproduced]

Wang et al. (“Convolutional neural networks and multimodal fusion for text aided image classification”) discloses on Page 2 Figure 1, multimodal fusion for classification:  
[Image: media_image5.png (Wang, Page 2 Figure 1), not reproduced]

Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to LEONARD A SIEGER whose telephone number is (571)272-9710. The examiner can normally be reached M-F 8:00 am - 5:00 pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Ann Lo, can be reached at (571) 272-9767. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.



/L.A.S./Examiner, Art Unit 2126  
/ANN J LO/Supervisory Patent Examiner, Art Unit 2126