Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

This Action is responsive to the Amendments and Remarks filed in the U.S. on 9/15/2022. Claims 26-55 are pending. Claims 26, 36, 46, and 51 are written in independent form. Claims 1-25 have been cancelled. Claims 51-55 are newly added.
Applicant’s amendments and remarks filed on 9/15/2022 have been fully considered but were not found to overcome the previously cited prior art. Accordingly, THIS ACTION IS MADE FINAL.

Double Patenting
Claims 51, 52, 53, 54, and 55 are objected to under 37 CFR 1.75 as being substantial duplicates of claims 26, 29, 33, 34, and 35, respectively. When two claims in an application are duplicates or else are so close in content that they both cover the same thing, despite a slight difference in wording, it is proper after allowing one claim to object to the other as being a substantial duplicate of the allowed claim. See MPEP § 608.01(m). In each pair, both claims are directed to an apparatus that performs substantially the same steps.


35 USC § 101 - Comments
Claim 46 has been amended to recite “at least one storage device”. The specification of the present application explicitly states “the storage device 420 is a physical memory” (Paragraph [0040]) and thus “storage device” in the claims is being interpreted as necessarily referring to statutory subject matter.


Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.

Claims 26, 29-31, 33, 36-39, 43, 46, and 50-53 are rejected under 35 U.S.C. 102(a)(2) as being anticipated by Trott et al. (U.S. Pre-Grant Publication No. 2019/0130206, hereinafter referred to as Trott).

Regarding Claim 26:
Trott teaches an apparatus for visual question answering, comprising:
an encoder to encode an input image and a question into a query representation, the query representation to include visual attention features;
Trott teaches receiving an image and a natural language question (Paras. [0020]-[0021] & Fig. 1). Trott teaches encoding each word of a question into a vector of features (Paras. [0030]-[0031]). Trott further teaches generating object coordinates and object embeddings for the input image where “object embeddings include an encoding representing the content of the image from a sub-region identified by a corresponding border in the generated object coordinates” (Para. [0048]).
a knowledge spotter to retrieve a knowledge entry from a visual knowledge base, the visual knowledge base pre-built on question-answer pairs;
Trott teaches receiving information on each object previously selected by object selector 760 (Para. [0041]) where training of the system for visually identifying objects is performed based on image-question pairs (Para. [0064]). Trott further teaches a COCO dataset used in assigning COCO categories or background to each of the candidate objects where each COCO category is associated with a question (Para. [0072]). Therefore, Trott teaches a pre-built visual knowledge base built on a set of question-answer pairs that is used for retrieving knowledge for object selection.
a joint embedder to jointly embed the visual attention features and the knowledge entry to generate visual knowledge features; and
Trott teaches scorer 550 for jointly embedding the question embedding and object embeddings to generate scoring vectors (Para. [0036]), where generating a scoring vector based on the object embeddings and question embeddings “includes a score indicating how well each of the objects in the generated object embeddings generated during process 820 matches the criteria encoded in the question embeddings generated during process 840” (Para. [0051]), thereby teaching a joint embedder to generate visual knowledge features.
an answer generator to generate an answer based on the query representation and the visual-knowledge features.
Trott teaches generating an answer to the question about the input image based on the combined query embedding and object embeddings (Para. [0019]).

Regarding Claim 29:
Trott further teaches:
wherein the encoder includes a convolutional neural network (CNN) model to be used to encode the input image into an image vector, the question vector to include image embedding features.
Trott teaches “image processing module 420 includes a Faster R-CNN that generates xy and object embeddings v” (Para. [0029]) thereby teaching using a convolutional neural network model to encode the input image into an image vector v comprising object features from the image.

Regarding Claim 30:
Trott further teaches:
wherein the encoder includes a long short-term memory (LSTM) model to be used to encode the question into a question vector, the question vector to include question embedding features.
Trott teaches using an LSTM network “for language processing” and “which generates question embedding q” (Para. [0031]) which is the question feature vector comprising question embedding features (Paras. [0032]-[0033] & Fig. 4).

Regarding Claim 31:
Trott further teaches:
wherein the encoder is to jointly embed an output of a convolutional neural network (CNN) model and a long short-term memory (LSTM) model to generate the query representation.
Trott teaches scorer 550 for jointly embedding the question embedding and object embeddings to generate scoring vectors (Para. [0036]), where generating a scoring vector based on the object embeddings and question embeddings “includes a score indicating how well each of the objects in the generated object embeddings generated during process 820 matches the criteria encoded in the question embeddings generated during process 840” (Para. [0051]), thereby teaching a joint embedder using the outputs of the image processing module (vector v) and the language processing module (vector q) as depicted in Figures 4 and 5.

Regarding Claim 33:
Trott further teaches:
wherein the answer generator includes a fully connected neural network, the fully connected neural network to receive a plurality of values related to the query representation from a visual knowledge memory network and output a single answer corresponding to a value with a higher score than other values in the plurality of values.
Trott teaches a counter 600 for receiving the scoring vector and coordinates and using trainable weights and bias to generate an answer to the question based on the received scoring vector and coordinates (Para. [0037] & Fig. 6). Trott further teaches “the match level of each of the objects provides an indication of the relative strength of each of the objects that match the question” and thus uses a match level to indicate an answer of counted objects with the greatest relative strength (Para. [0038]).
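For illustration only, the limitation of claim 33 (a fully connected layer that receives a plurality of values and outputs the single answer with the highest score) can be sketched as follows. This is a minimal sketch under stated assumptions and is not taken from Trott or from the present application; the function names, weights, bias, and answer labels are all hypothetical.

```python
def fully_connected(values, weights, bias):
    """One fully connected layer: score[i] = weights[i] . values + bias[i]."""
    return [sum(w * v for w, v in zip(row, values)) + b
            for row, b in zip(weights, bias)]


def pick_answer(values, weights, bias, answers):
    """Return the answer whose score is higher than all other scores."""
    scores = fully_connected(values, weights, bias)
    best = max(range(len(scores)), key=scores.__getitem__)
    return answers[best], scores


# Hypothetical inputs: two values related to the query, three candidate answers.
values = [1, 2]
weights = [[1, 0], [0, 1], [1, 1]]   # assumed trained weights
bias = [0, 0, -2]                    # assumed trained bias
answers = ["one", "two", "three"]

answer, scores = pick_answer(values, weights, bias, answers)
```

Here the three scores are 1, 2, and 1, so the second candidate is returned as the single answer.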

Regarding Claim 36:
All of the limitations herein are similar to some or all of the limitations of Claim 26.

Regarding Claim 37:
All of the limitations herein are similar to some or all of the limitations of Claim 29.

Regarding Claim 38:
All of the limitations herein are similar to some or all of the limitations of Claim 30.

Regarding Claim 39:
All of the limitations herein are similar to some or all of the limitations of Claim 31.

Regarding Claim 43:
All of the limitations herein are similar to some or all of the limitations of Claim 33.

Regarding Claim 46:
Some of the limitations herein are similar to some or all of the limitations of Claim 26.

Trott further teaches:
at least one storage device comprising instructions that, in response to being executed on a computing device, cause the computing device to perform steps (Para. [0046]).

Regarding Claim 50:
All of the limitations herein are similar to some or all of the limitations of Claim 31.

Regarding Claim 51:
Some of the limitations herein are similar to some or all of the limitations of Claim 26.

Trott further teaches:
interface circuitry (Paras. [0020]-[0021] “the image may be received from an imaging device, such as a camera, a video camera, and/or the like” and “the question may be typed in by a user, transcribed from an audio sample, and/or the like”);
executable instructions (Para. [0019]); and
programmable circuitry to be programmed by the executable instructions (Para. [0019]).

Regarding Claim 52:
All of the limitations herein are similar to some or all of the limitations of Claim 29.

Regarding Claim 53:
All of the limitations herein are similar to some or all of the limitations of Claim 33.


Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.


Claims 27, 28, 40, and 47 are rejected under 35 U.S.C. 103 as being unpatentable over Trott and further in view of Weston et al. (U.S. Pre-Grant Publication No. 2017/0200077, hereinafter referred to as Weston).

Regarding Claim 27:
Trott teaches all of the limitations as recited above except:
wherein the knowledge entry includes a knowledge triple or a subset of a knowledge triple.

However, in the related field of endeavor of question-answering, Weston teaches:
wherein the knowledge entry includes a knowledge triple or a subset of a knowledge triple.
Weston teaches storing statements organized as (subject, relation, object) triples where “the memory network has been trained using pseudo-labeled question-and-answer pairs including a question and an associated triple, and 35 million pairs of paraphrased questions from a website” (Para. [0087]).

Thus it would have been obvious to a person having ordinary skill in the art, having the teachings of Weston and Trott at the time that the claimed invention was effectively filed, to have combined the knowledge base hashing, as taught by Weston, with the system and method for answering questions about images, as taught by Trott.
One would have been motivated to make such combination because Weston teaches that using a hashing method to break down memory entries improves the searching efficiency of the knowledge base (Para. [0102]).

Regarding Claim 28:
Weston and Trott further teach:
wherein the knowledge spotter is to retrieve the knowledge entry from the visual knowledge base using the subgraph hashing.
Weston teaches “in order to improve the efficiency, the memory network may use a hashing method to break down the memory entries into multiple buckets and calculate the relevancy scores between the input feature vector and all memory entries in a relevant bucket” where “the memory network provides an output based on the output object as a response to the input…the output can include, e.g., a character, a word, a sentence, a paragraph, a string, an image, an audio, a video, or a user interface instruction” (Para. [0101]).
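For illustration only, the general idea of hashing memory entries into buckets and scoring only the entries in a relevant bucket can be sketched as follows. This is a minimal sketch under stated assumptions and is not Weston's implementation: the word-based bucketing, the shared-word relevancy score, and the sample triples are all hypothetical.

```python
from collections import defaultdict


def build_buckets(entries):
    """Hash each memory entry into one bucket per word it contains."""
    buckets = defaultdict(list)
    for entry in entries:
        for word in set(entry.lower().split()):
            buckets[word].append(entry)
    return buckets


def relevancy(query, entry):
    """Hypothetical relevancy score: number of words shared with the query."""
    return len(set(query.lower().split()) & set(entry.lower().split()))


def retrieve(query, buckets):
    """Score only the entries found in buckets matching a query word."""
    candidates = []
    for word in set(query.lower().split()):
        for entry in buckets.get(word, []):
            if entry not in candidates:
                candidates.append(entry)
    if not candidates:
        return None
    return max(candidates, key=lambda e: relevancy(query, e))


# Hypothetical (subject, relation, object) triples stored as strings.
memory = ["cat sits_on mat", "dog chases ball", "cat eats fish"]
buckets = build_buckets(memory)
best = retrieve("what does the cat eat fish", buckets)
```

Only the two entries sharing a bucket with the query are scored; the entry about the dog is never compared against the query.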

Regarding Claim 40:
All of the limitations herein are similar to some or all of the limitations of Claim 28.

Regarding Claim 47:
All of the limitations herein are similar to some or all of the limitations of Claim 28.

Claims 32, 44, 45, and 49 are rejected under 35 U.S.C. 103 as being unpatentable over Trott and further in view of Non-Patent Literature Kim et al., "HADAMARD PRODUCT FOR LOW-RANK BILINEAR POOLING", October 2016, ICLR 2017 (Year: 2016), hereinafter referred to as Kim.

Regarding Claim 32:
Trott teaches all of the limitations as recited above except:
wherein the encoder includes a multimodal low-rank bilinear attention network.

However, in the related field of endeavor of visual question-answering, Kim teaches:
wherein the encoder includes a multimodal low-rank bilinear attention network.
Kim teaches a “low-rank bilinear pooling…for an efficient attention mechanism of multimodal learning” used in visual question-answering tasks on a VQA dataset (Abstract).

Thus it would have been obvious to a person having ordinary skill in the art, having the teachings of Kim and Trott at the time that the claimed invention was effectively filed, to have combined the bilinear models, as taught by Kim, with the system and method for answering questions about images, as taught by Trott.
One would have been motivated to make such combination because Kim teaches “bilinear models provide richer representations than linear models” (Page 1 Section 1) and Trott only teaches linear activation (Para. [0041]).
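For illustration only, low-rank bilinear pooling with a Hadamard (element-wise) product can be sketched as follows. The sketch projects two input vectors into a shared low-rank space, multiplies them element-wise, and projects the result to an output vector; it omits the nonlinearities and bias terms. The matrices U, V, and P and the sample vectors are hypothetical values chosen for the example, not parameters disclosed in Kim or Trott.

```python
def matvec_t(matrix, vec):
    """Compute matrix^T @ vec for a row-major list-of-lists matrix."""
    cols = len(matrix[0])
    return [sum(matrix[i][j] * vec[i] for i in range(len(vec)))
            for j in range(cols)]


def low_rank_bilinear(x, y, U, V, P):
    """f = P^T ((U^T x) * (V^T y)), with * the element-wise product."""
    xu = matvec_t(U, x)                        # project x to rank-d space
    yv = matvec_t(V, y)                        # project y to rank-d space
    joint = [a * b for a, b in zip(xu, yv)]    # Hadamard product
    return matvec_t(P, joint)                  # project to output space


# Hypothetical question vector x, image vector y, and projections.
x = [1, 2]
y = [3, 4]
U = [[1, 0], [0, 1]]
V = [[1, 1], [0, 1]]
P = [[1], [1]]

pooled = low_rank_bilinear(x, y, U, V, P)
```

With these values the projections are [1, 2] and [3, 7], their element-wise product is [3, 14], and the pooled output is [17].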

Regarding Claim 44:
All of the limitations herein are similar to some or all of the limitations of Claim 32.

Regarding Claim 45:
Kim and Trott further teach:
wherein encoding the query representation includes using a multimodel low-rank bilinear pooling to extract a visual attentive feature from an output of a convolutional neural network (CNN) model and a long short term memory (LSTM) model.
Trott teaches encoding each word of a question into a vector of features where “the encoded question…is provided to a language processing module, which generates as output question embedding q that encodes semantic characteristics of question 430” using a recurrent long short-term memory (LSTM) network (Paras. [0030]-[0031]). Trott further teaches extracting a visual attentive feature from the output of the image processing module 420 (CNN) and language processing module 440 (LSTM) (Fig. 4 and Paras. [0029]-[0033]).
Kim teaches using multimodal low-rank bilinear (MLB) pooling to extract visual features from multiple feature vectors into a visual feature vector (Pages 3-4 Sections 4-4.2).

Regarding Claim 49:
All of the limitations herein are similar to some or all of the limitations of Claim 32.


Claims 34, 35, 41, 42, 48, 54, and 55 are rejected under 35 U.S.C. 103 as being unpatentable over Trott and further in view of Govindaraj et al. (U.S. Pre-Grant Publication No. 2019/0205706, hereinafter referred to as Govindaraj).

Regarding Claim 34:
Trott further teaches:
wherein the answer generator includes a visual knowledge memory network, the visual knowledge memory network to store the visual-knowledge, receive the query representation, and output a plurality of values related to the query representation.
Trott teaches receiving information on each object previously selected by object selector 760 (Para. [0041]) where training of the system for visually identifying objects is performed based on image-question pairs (Para. [0064]). Therefore, Trott teaches a pre-built visual knowledge base to store visual knowledge, receive a query, and output values related to the query. Trott further teaches a COCO dataset used in assigning COCO categories or background to each of the candidate objects where each COCO category is associated with a question (Para. [0072]).

Trott teaches all of the limitations as recited above except:
store the visual-knowledge features as key-value pairs.

However, in the related field of endeavor of visual object detection, Govindaraj teaches:
store the visual-knowledge features as key-value pairs.
Govindaraj teaches storing features of images in a knowledge base by breaking the individual elements into key/value pairs (Para. [0045]).

Thus it would have been obvious to a person having ordinary skill in the art, having the teachings of Govindaraj and Trott at the time that the claimed invention was effectively filed, to have combined the dynamically updated knowledge base engine, as taught by Govindaraj, with the system and method for answering questions about images, as taught by Trott.
One would have been motivated to make such combination because Govindaraj teaches a system that “learns and detects newly encountered objects and dynamically updates knowledge base engine 208 for future use” (Para. [0070]) and it would have been obvious to a person having ordinary skill in the art that learning about newly encountered objects and dynamically expanding the knowledge in the knowledge base with the newly encountered object would improve the capabilities of the question-answer system taught by Trott by being able to learn and account for questions related to new objects that weren’t previously known to the system.

Regarding Claim 35:
Govindaraj and Trott further teach:
wherein the answer generator is to generate the answer by reading a key-value pair of the visual knowledge features corresponding to the query representation and generating the answer based on the key-value pair.
Trott teaches generating an answer to the question about the input image based on the combined query embedding and object embeddings (Para. [0019]). Trott further teaches a COCO dataset used in assigning COCO categories or background to each of the candidate objects where each COCO category is associated with a question (Para. [0072]). Govindaraj teaches storing features of images in a knowledge base by breaking the individual elements into key/value pairs (Para. [0045]).
Therefore, Trott in combination with Govindaraj teaches generating the answer by reading a key-value pair of image features for objects in the image corresponding to a query and generating an answer based on the image features corresponding to the query.
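For illustration only, storing image features as key-value pairs and generating an answer by reading the pair that corresponds to a query can be sketched as follows. This is a minimal sketch under stated assumptions; the key layout (image identifier, attribute name), the function names, and the sample features are hypothetical and are not taken from Govindaraj or Trott.

```python
def to_key_value_pairs(image_id, features):
    """Break an image's features into (image_id, attribute) -> value pairs."""
    return {(image_id, name): value for name, value in features.items()}


def read_answer(store, image_id, attribute, default="unknown"):
    """Read the pair matching the query and use its value as the answer."""
    return store.get((image_id, attribute), default)


# Hypothetical features detected for one image.
store = to_key_value_pairs("img_001", {"object": "dog", "count": 2})

# Query: "how many objects are in img_001?"
answer = read_answer(store, "img_001", "count")
missing = read_answer(store, "img_001", "color")
```

A query with no matching key falls back to the default value rather than raising an error.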

Regarding Claim 41:
All of the limitations herein are similar to some or all of the limitations of Claim 34.

Regarding Claim 42:
All of the limitations herein are similar to some or all of the limitations of Claim 35.

Regarding Claim 48:
All of the limitations herein are similar to some or all of the limitations of Claim 34.

Regarding Claim 54:
All of the limitations herein are similar to some or all of the limitations of Claim 34.

Regarding Claim 55:
All of the limitations herein are similar to some or all of the limitations of Claim 35.



Response to Amendment
Applicant’s Amendments, filed on 9/15/2022, are acknowledged and accepted.
In light of the amendments filed on 9/15/2022, the 35 U.S.C. 101 rejection of claim 46 has been withdrawn.
As stated above and restated here for convenience, Applicant’s amendments and remarks filed on 9/15/2022 have been fully considered but were not found to overcome the previously cited prior art. Accordingly, THIS ACTION IS MADE FINAL.


Response to Arguments
On pages 10-11 of the Remarks filed on 9/15/2022, Applicant argues that “providing a question that is received in an encoded form in which each word is encoded into a vector to a language processing module, as described by Trott, does not teach the encoder of claim 26 to encode an input image and a question into a query representation, as set forth in claim 26”. Applicant further states that “the object embeddings of Trott that include an encoding representing the content of the image do not constitute the query representation of claim 26 into which an input image and a question are encoded.”
Applicant’s argument is not convincing because claim 26, when given its broadest reasonable interpretation, does not specify how the encodings of the input image and the question are represented in the query representation. Thus, Trott’s generated object embeddings and question embeddings are collectively understood to make up the claimed query representation (Paras. [0030]-[0031] and [0048]).



Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
Non-Patent Literature Yu et al. "Multi-modal Factorized Bilinear Pooling with Co-Attention Learning for Visual Question Answering", October 2017, 2017 IEEE International Conference on Computer Vision (ICCV). (Year: 2017) teaches a multi-modal factorized bilinear pooling approach to efficiently and effectively combine multi-modal features where, for fine-grained image and question representation, a ‘co-attention’ mechanism using an end-to-end deep network architecture is used to jointly learn both the image and question attentions.
Yang et al. (U.S. Patent No. 10,198,671) teaches a dense captioning system and method for processing an image to produce a feature map of the image, analyzing the feature map to generate proposed bounding boxes for a plurality of visual concepts within the image, analyzing the feature map to determine a plurality of region features of the image, and analyzing the feature map to determine a context feature for the image. For each region feature of the plurality of region features of the image, the dense captioning system further provides for analyzing the region feature to determine a detection score for the region feature, calculating a caption for a bounding box for a visual concept in the image using the region feature and the context feature, and localizing the visual concept by adjusting the bounding box around the visual concept based on the caption to generate an adjusted bounding box for the visual concept. The reference further teaches that the local descriptions provide a rich and dense semantic labeling of the visual elements, which can benefit other tasks such as semantic segmentation and visual question answering.
Ma et al. (U.S. Pre-Grant Publication No. 2017/0308531) teaches receiving a query question; performing a semantic analysis of the question; performing corresponding search processing for the question based on a result of the semantic analysis, wherein the search processing includes search processing performed for the question by at least one of a semantic relationship mining system, a text library search system, a knowledge base search system, and a question and answer library search system; and returning an answer based on a result of the search processing.

THIS ACTION IS MADE FINAL. Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to ROBERT F MAY whose telephone number is (571)272-3195. The examiner can normally be reached Monday-Friday 9:30am to 6pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Hosain Alam, can be reached at 571-272-3978. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/ROBERT F MAY/Examiner, Art Unit 2154
9/24/2022

/HOSAIN T ALAM/Supervisory Patent Examiner, Art Unit 2154