DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.


The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.


Claims 1-20 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention.
Claims 1, 13 and 20 recite the limitation "the group" in lines 14, 17 and 17, respectively.  There is insufficient antecedent basis for this limitation in the claims.
Claims 2-12 and 14-19 are also rejected as being dependent upon a rejected base claim.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-3, 5-8, 11-15, and 17-20 are rejected under 35 U.S.C. 103 as being unpatentable over Chen (U.S. PATENT NO. 9965705 B2) in view of Izadinia (arXiv 1509.08075v1 2015), and further in view of He (arXiv 1703.06870v3 2018).
-Regarding claim 1, Chen discloses a computer-implemented method (FIG. 2; FIG. 5) for determining entailment between an input premise and an input hypothesis of different modalities (FIG. 2 box 210 question; box 205 image 206; Abstract), comprising: extracting, by a hardware processor (FIG. 8 units 801 807), features from the input hypothesis and an entirety of and regions of interest in the input premise (FIG. 2 question embedding 212; feature map 208; Abstract “relevant regions”); deriving, by the hardware processor, intra-modal relevant information (col 9 line 32-34 “focus on regions associated to the question”) while suppressing intra-modal irrelevant information (col 4, line 34-35 “filter out noise and unrelated information”), based on intra-modal interactions (FIG. 2 attention map 218) between elementary ones of the features of the input hypothesis and between elementary ones of the features of the input premise (FIG. 4 step 415-425; FIG. 2 box 215); attaching, by the hardware processor (FIG. 8 units 801 807), cross-modal relevant information (col 8 line 37-40 “dense question embedding, and the attention weighted feature map”) to the features from the input premise to the features from the input hypothesis to form a cross-modal representation (col 8 section 4 “answer generation”; equations (4)-(5)), based on cross-modal interactions between pairs (col 6 Section 1 attention extraction, “image-question pair”) of different elementary features from different modalities (FIG. 2 box 220; FIG. 4 steps 425-430); and classifying, by the hardware processor (FIG. 8 units 801 807), a relationship between the input premise and the input hypothesis using a label selected from the group consisting of entailment, neutral, and contradiction based on the cross-modal representation (FIG. 2 box 220; FIG. 5 step 525; col 8 line 67 – col 9 line 3; equation (6)).
Chen discloses a visual question answering (VQA) task that automatically generates answers for image-related (still or video images) questions. However, Chen does not teach that the sentence in box 210 can be an input hypothesis, determining entailment between an input premise and an input hypothesis of different modalities, or classifying a relationship between the input premise and the input hypothesis using a label selected from the group consisting of entailment, neutral, and contradiction based on the cross-modal representation.
In the same field of endeavor, Izadinia teaches that the phrases are an input hypothesis and teaches the computer-implemented method for determining entailment (Izadinia: Figure 1 visual entailment; Abstract) between an input premise and an input hypothesis of different modalities (Izadinia: Figure 1 image, phrases). Izadinia further teaches classifying a relationship between the input premise and the input hypothesis using a label selected from the group consisting of entailment, neutral, and contradiction based on the cross-modal representation (Izadinia: Figure 1; Figure 5; Section 4.1 Visual Entailment).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teaching of Chen with the teaching of Izadinia by introducing the input sentence in box 210 as a hypothesis in order to solve the closely related visual entailment problem with the modalities of vision and language.

However, He is analogous art pertinent to the problem to be solved in this application and further discloses wherein the hardware process extracts the features from the regions of interest in the input image premise using a Mask Region-based Convolutional Neural Network (He: Abstract; Page 1, Section 1 Introduction, paragraphs 3-4, Figure 1; Page 3, Section 3 Mask R-CNN).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teaching of Chen in view of Izadinia with the teaching of He by using a Mask Region-based Convolutional Neural Network to extract the features from the regions of interest in order to use a conceptually simple and flexible framework, and to preserve exact spatial locations when performing coarse spatial quantization for feature extraction.
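For illustration only, the difference between coarse spatial quantization and preserving exact spatial locations when sampling a feature map at a region-of-interest coordinate can be sketched as follows. This is a minimal sketch under stated assumptions and is not code from He or from the application: the function names, the list-of-rows feature map, and the restriction to coordinates inside the map are all hypothetical.

```python
import math


def bilinear_sample(feature_map, y, x):
    """Sample a 2-D feature map (list of rows) at a fractional (y, x).

    Assumes 0 <= y <= height - 1 and 0 <= x <= width - 1.
    """
    height, width = len(feature_map), len(feature_map[0])
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, height - 1), min(x0 + 1, width - 1)
    dy, dx = y - y0, x - x0
    # Interpolate along x on the two neighboring rows, then along y.
    top = (1 - dx) * feature_map[y0][x0] + dx * feature_map[y0][x1]
    bottom = (1 - dx) * feature_map[y1][x0] + dx * feature_map[y1][x1]
    return (1 - dy) * top + dy * bottom


def quantized_sample(feature_map, y, x):
    """Coarse alternative: snap the coordinate to the nearest cell."""
    return feature_map[int(round(y))][int(round(x))]
```

In this sketch the quantized version discards the fractional part of the coordinate, while the bilinear version keeps it, which is the sense in which exact spatial locations are preserved.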
-Regarding claim 2, the modification further discloses wherein the hardware process extracts the features from the entirety of the input image premise using a Convolutional Neural Network (Chen: FIG. 2 box 206 CNN 207).
-Regarding claim 5, the modification further discloses wherein said extracting step comprises extracting region specific feature vectors for the input premise (Chen: col 3 line 47-50 “question guided attention, … regions are determined by both images and image-related questions”).
-Regarding claim 6, the modification further discloses wherein the regions of interest are specified at a feature map level (Chen: FIG. 2 box 215).
Chen: FIG. 4 step 415; FIG. 2 kernel 215).
-Regarding claim 8, the modification further discloses wherein said extracting step comprises forming a visual corpus from an existing textual corpus that includes textual premises and textual hypothesis by replacing the textual premises in the existing textual corpus with visual premises (Chen: FIG. 4 step 415; FIG. 2 kernel 215).
-Regarding claim 11, the modification further discloses wherein the relationship between the input premise and the input hypothesis is classified using a softmax process (Chen: FIG. 2 box 220; FIG. 5 step 525; col 8 line 67 – col 9 line 3; equation (6)).
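For illustration only, a softmax process that selects one of the three labels recited in the claims can be sketched as follows. This is a minimal sketch and is not taken from Chen, Izadinia, or He: the function names, the label order, and the raw input scores are hypothetical.

```python
import math

# Label order is an assumption made for this sketch.
LABELS = ("entailment", "neutral", "contradiction")


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(scores):
    """Return the most probable label and the full probability list."""
    probs = softmax(scores)
    best = max(range(len(LABELS)), key=lambda i: probs[i])
    return LABELS[best], probs
```

A call such as classify([2.0, 0.5, -1.0]) returns the label whose score is largest, together with probabilities that sum to one.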
-Regarding claim 12, Chen in view of Izadinia, and further in view of He discloses the method of claim 1.
Chen discloses wherein the input premise comprises an input image premise (Chen: FIG. 2 box 210 question; box 205 image 206).
Chen does not teach that the input hypothesis comprises an input textual sequence hypothesis.
In the same field of endeavor, Izadinia teaches that the input hypothesis comprises an input textual sequence hypothesis (Izadinia: Figure 1 visual entailment; Abstract).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teaching of Chen with the teaching of Izadinia by introducing the input sentence in box 210 as a hypothesis in order to solve the closely related visual entailment problem with the modalities of vision and language.
-Regarding claim 13, Chen discloses a computer program product (FIG. 8) for determining entailment between an input premise and an input hypothesis of different modalities (FIG. 2 box 210 question; box 205 image 206; Abstract), the computer program product comprising a non-transitory computer readable storage medium (FIG. 8 device 808) having program instructions embodied therewith, the program instructions executable by a computer to cause the computer to perform a method comprising (FIG. 8 col 16: line 6-11, line 27-53): extracting, by a hardware processor (FIG. 8 units 801 807), features from the input hypothesis and an entirety of and regions of interest in the input premise (FIG. 2 question embedding 212; feature map 208; Abstract “relevant regions”); deriving, by the hardware processor (FIG. 8 units 801 807), intra-modal relevant information (col 9 line 32-34 “focus on regions associated to the question”) while suppressing intra-modal irrelevant information (col 4, line 34-35 “filter out noise and unrelated information”), based on intra-modal interactions (FIG. 2 attention map 218) between elementary ones of the features of the input hypothesis and between elementary ones of the features of the input premise (FIG. 4 step 415-425; FIG. 2 box 215); attaching, by the hardware processor (FIG. 8 units 801 807), cross-modal relevant information (col 8 line 37-40 “dense question embedding, and the attention weighted feature map”) to the features from the input premise to the features from the input hypothesis to form a cross-modal representation (col 8 section 4 “answer generation”; equations (4)-(5)), based on cross-modal interactions between pairs (col 6 Section 1 attention extraction, “image-question pair”) of different elementary features from different modalities (FIG. 2 box 220; FIG. 4 steps 425-430); and classifying, by the hardware processor (FIG. 8 units 801 807), a relationship between the input premise and the input hypothesis using a label selected from the group consisting of entailment, neutral, and contradiction based on the cross-modal representation (FIG. 2 box 220; FIG. 5 step 525; col 8 line 67 – col 9 line 3; equation (6)).
Chen discloses a visual question answering (VQA) task that automatically generates answers for image-related (still or video images) questions. However, Chen does not teach that the sentence in box 210 can be an input hypothesis, determining entailment between an input premise and an input hypothesis of different modalities, or classifying a relationship between the input premise and the input hypothesis using a label selected from the group consisting of entailment, neutral, and contradiction based on the cross-modal representation.
In the same field of endeavor, Izadinia teaches that the phrases are an input hypothesis and teaches the computer-implemented method for determining entailment (Izadinia: Figure 1 visual entailment; Abstract) between an input premise and an input hypothesis of different modalities (Izadinia: Figure 1 image, phrases). Izadinia further teaches classifying a relationship between the input premise and the input hypothesis using a label selected from the group consisting of entailment, neutral, and contradiction based on the cross-modal representation (Izadinia: Figure 1; Figure 5; Section 4.1 Visual Entailment).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teaching of Chen with the teaching of Izadinia by introducing the input sentence in box 210 as a hypothesis in order to solve the closely related visual entailment problem with the modalities of vision and language.

However, He is analogous art pertinent to the problem to be solved in this application and further discloses wherein the hardware process extracts the features from the regions of interest in the input image premise using a Mask Region-based Convolutional Neural Network (He: Abstract; Page 1, Section 1 Introduction, paragraphs 3-4, Figure 1; Page 3, Section 3 Mask R-CNN).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teaching of Chen in view of Izadinia with the teaching of He by using a Mask Region-based Convolutional Neural Network to extract the features from the regions of interest in order to use a conceptually simple and flexible framework, and to preserve exact spatial locations when performing coarse spatial quantization for feature extraction.
-Regarding claim 14, the modification further discloses wherein the hardware process extracts the features from the entirety of the input image premise using a Convolutional Neural Network (Chen: FIG. 2 box 206 CNN 207).
-Regarding claim 17, the modification further discloses wherein said extracting step comprises extracting region specific feature vectors for the input premise (Chen: col 3 line 47-50 “question guided attention, … regions are determined by both images and image-related questions”).
-Regarding claim 18, the modification further discloses wherein the regions of interest are specified at a feature map level (Chen: FIG. 2 box 215).
Chen: FIG. 4 step 415; FIG. 2 kernel 215).
-Regarding claim 20, Chen discloses a computer processing system (FIG. 8) for determining entailment between an input premise and an input hypothesis of different modalities (FIG. 2; FIGS. 4-5), comprising: a memory device (FIG. 8 device 808) including program code stored thereon; a hardware processor (FIG. 8 units 801 807), operatively coupled to the memory device (FIG. 8), and configured to run the program code stored on the memory device (col 15: line 9-16; col 16: line 6-11, line 27-53) to extract features from the input hypothesis and an entirety of and regions of interest in the input premise (FIG. 2 question embedding 212; feature map 208; Abstract “relevant regions”); derive intra-modal relevant information (col 9 line 32-34 “focus on regions associated to the question”) while suppressing intra-modal irrelevant information (col 4, line 34-35 “filter out noise and unrelated information”), based on intra-modal interactions (FIG. 2 attention map 218) between elementary ones of the features of the input hypothesis and between elementary ones of the features of the input premise (FIG. 4 step 415-425; FIG. 2 box 215); attach cross-modal relevant information (col 8 line 37-40 “dense question embedding, and the attention weighted feature map”) to the features from the input premise to the features from the input hypothesis to form a cross-modal representation (col 8 section 4 “answer generation”; equations (4)-(5)), based on cross-modal interactions between pairs (col 6 Section 1 attention extraction, “image-question pair”) of different elementary features from different modalities (FIG. 2 box 220; FIG. 4 steps 425-430); and classify a relationship between the input premise and the input hypothesis using a label selected from the group consisting of entailment, neutral, and contradiction based on the cross-modal representation (FIG. 2 box 220; FIG. 5 step 525; col 8 line 67 – col 9 line 3; equation (6)).
Chen discloses a visual question answering (VQA) task that automatically generates answers for image-related (still or video images) questions. However, Chen does not teach that the sentence in box 210 can be an input hypothesis, determining entailment between an input premise and an input hypothesis of different modalities, or classifying a relationship between the input premise and the input hypothesis using a label selected from the group consisting of entailment, neutral, and contradiction based on the cross-modal representation.
In the same field of endeavor, Izadinia teaches that the phrases are an input hypothesis and teaches the computer-implemented method for determining entailment (Izadinia: Figure 1 visual entailment; Abstract) between an input premise and an input hypothesis of different modalities (Izadinia: Figure 1 image, phrases). Izadinia further teaches classifying a relationship between the input premise and the input hypothesis using a label selected from the group consisting of entailment, neutral, and contradiction based on the cross-modal representation (Izadinia: Figure 1; Figure 5; Section 4.1 Visual Entailment).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teaching of Chen with the teaching of Izadinia by introducing the input sentence in box 210 as a hypothesis in order to solve the closely related visual entailment problem with the modalities of vision and language.

However, He is analogous art pertinent to the problem to be solved in this application and further discloses wherein the hardware process extracts the features from the regions of interest in the input image premise using a Mask Region-based Convolutional Neural Network (He: Abstract; Page 1, Section 1 Introduction, paragraphs 3-4, Figure 1; Page 3, Section 3 Mask R-CNN).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teaching of Chen in view of Izadinia with the teaching of He by using a Mask Region-based Convolutional Neural Network to extract the features from the regions of interest in order to use a conceptually simple and flexible framework, and to preserve exact spatial locations when performing coarse spatial quantization for feature extraction.
-Regarding claims 3 and 15, Chen in view of Izadinia discloses the methods of claims 2 and 14, respectively.
Chen in view of Izadinia does not teach wherein the hardware process extracts the features from the regions of interest in the input image premise using a Mask Region-based Convolutional Neural Network.
However, He is analogous art pertinent to the problem to be solved in this application and further discloses wherein the hardware process extracts the features from the regions of interest in the input image premise using a Mask Region-based Convolutional Neural Network (He: Abstract; Page 1, Section 1 Introduction, paragraphs 3-4, Figure 1; Page 3, Section 3 Mask R-CNN).

Claims 4, 9-10 and 16 are rejected under 35 U.S.C. 103 as being unpatentable over Chen (U.S. PATENT NO. 9965705 B2) in view of Izadinia (arXiv 1509.08075v1 2015), and further in view of He (arXiv 1703.06870v3 2018) and in view of Chen (U.S. PG-PUB NO. US 2018/0181592 A1).
-Regarding claims 4 and 16, Chen in view of Izadinia, and further in view of He discloses the methods of claims 1 and 13, respectively.
Chen in view of Izadinia, and further in view of He does not teach wherein said deriving step uses a self-attention process to identify the elementary ones of the features from an entirety of the features.
However, Chen (U.S. PG-PUB NO. US 2018/0181592 A1) is analogous art pertinent to the problem to be solved in this application and further discloses wherein said deriving step uses a self-attention process to identify the elementary ones of the features from an entirety of the features (Chen: [0020]-[0021] “intra-attention”; [0063] [0067] [0069] [0072]).
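Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teaching of Chen in view of Izadinia, and further in view of He with the teaching of Chen (U.S. PG-PUB NO. US 2018/0181592 A1) by using a self-attention process to identify the elementary ones of the features from an entirety of the features in order to better compute a representation of the sentence.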
-Regarding claim 9, Chen in view of Izadinia, and further in view of He discloses wherein the intra-modal relevant information is derived by performing a word embedding on the input textual sequence to obtain a vector of real numbers (Chen: FIG. 2 box 210 215; col 7 Section 2 Question Understanding).
Chen in view of Izadinia, and further in view of He does not teach wherein the intra-modal relevant information is derived by subjecting the vector of real numbers to a self-attention process.
However, Chen (U.S. PG-PUB NO. US 2018/0181592 A1) is analogous art pertinent to the problem to be solved in this application and further discloses wherein the intra-modal relevant information is derived by performing a word embedding on the input textual sequence to obtain a vector of real numbers, and subjecting the vector of real numbers to a self-attention process (Chen: [0069]-[0070]).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teaching of Chen in view of Izadinia, and further in view of He with the teaching of Chen (U.S. PG-PUB NO. US 2018/0181592 A1) by using a self-attention process to identify the elementary ones of the features from an entirety of the features in order to better compute a representation of the sentence.
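For illustration only, subjecting word-embedding vectors to a self-attention process can be sketched as follows. This is a minimal sketch of generic scaled dot-product self-attention with no learned projection matrices; it is an assumption made for illustration and is not represented to be the "intra-attention" of Chen (U.S. PG-PUB NO. US 2018/0181592 A1) or the process of the application. The function names and the toy embeddings are hypothetical.

```python
import math


def softmax(xs):
    """Convert raw scores into weights that sum to 1."""
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(embeddings):
    """Scaled dot-product self-attention over a sequence of word vectors.

    embeddings: list of equal-length lists of floats, one per word.
    Returns (contexts, weights): one context vector per word, plus the
    attention weights each word assigns to every word in the sequence.
    """
    dim = len(embeddings[0])
    scale = math.sqrt(dim)
    contexts, weights = [], []
    for query in embeddings:
        # Score this word against every word in the same sequence.
        scores = [sum(q * k for q, k in zip(query, key)) / scale
                  for key in embeddings]
        row = softmax(scores)
        weights.append(row)
        # Context vector: attention-weighted sum of all word vectors.
        contexts.append([sum(w * vec[d] for w, vec in zip(row, embeddings))
                         for d in range(dim)])
    return contexts, weights
```

In this sketch, words whose vectors are more similar to the query word receive larger weights, so the weighted sum emphasizes relevant words and de-emphasizes the rest.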
Chen: FIG. 8 GPU unit 817; col 10, Section 1 Implementation Details; col 11 line 5-8)
Chen in view of Izadinia, and further in view of He does not teach deriving overall sentence hypothesis features from an output of a text self-attention process.
However, Chen (U.S. PG-PUB NO. US 2018/0181592 A1) is analogous art pertinent to the problem to be solved in this application and further discloses deriving the cross-modal relevant information by deriving, using a gated recurrent unit, overall sentence hypothesis features from an output of a text self-attention process (FIG. 3).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teaching of Chen in view of Izadinia, and further in view of He with the teaching of Chen (U.S. PG-PUB NO. US 2018/0181592 A1) by deriving, using a gated recurrent unit, overall sentence hypothesis features from an output of a text self-attention process to derive the cross-modal relevant information in order to compute a representation of the sequence quickly and efficiently.
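For illustration only, deriving one overall sentence feature vector by running a gated recurrent unit over a sequence of per-word vectors can be sketched as follows. This is a minimal sketch under stated assumptions and is not code from any cited reference: bias terms are omitted, the weight names in the dictionary p are hypothetical, and the update rule uses one of the two common gate conventions.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def gru_step(x, h, p):
    """One GRU update. p maps hypothetical names to weight matrices:
    Wz, Wr, Wh act on the input x; Uz, Ur, Uh act on the hidden state."""
    z = [sigmoid(a + b)  # update gate
         for a, b in zip(matvec(p["Wz"], x), matvec(p["Uz"], h))]
    r = [sigmoid(a + b)  # reset gate
         for a, b in zip(matvec(p["Wr"], x), matvec(p["Ur"], h))]
    reset_h = [ri * hi for ri, hi in zip(r, h)]
    cand = [math.tanh(a + b)  # candidate hidden state
            for a, b in zip(matvec(p["Wh"], x), matvec(p["Uh"], reset_h))]
    # Convention assumed here: z weights the candidate, (1 - z) the old state.
    return [(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, cand)]


def sentence_features(sequence, p, hidden_size):
    """Run the GRU over the sequence; the final state is the sentence vector."""
    h = [0.0] * hidden_size
    for x in sequence:
        h = gru_step(x, h, p)
    return h
```

In this sketch the final hidden state serves as the overall sentence representation; the input sequence would be, for example, the per-word outputs of a self-attention step.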
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to XIAO LIU whose telephone number is (571)272-4539.  The examiner can normally be reached on Monday-Thursday and Alternate Fridays 8:30-4:30.

If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Nay Maung, can be reached on (571) 272-7882.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.






/XIAO LIU/Examiner, Art Unit 2664
/PING Y HSIEH/Primary Examiner, Art Unit 2664