DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Amendment
Applicant's submission filed on 13 August 2021 has been entered.  Claims 1-5, 7-12 and 14-20 have been amended.  Claims 1-20 are currently pending and have been considered below.

Response to Arguments
The 35 U.S.C. §112(b) rejection of claims 1-20 has been withdrawn in view of Applicant’s amendments.
Applicant’s arguments with respect to claim(s) 1-20 have been fully considered but are moot in view of the new grounds of rejection necessitated by Applicant’s amendments.


Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
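A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.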

The text of those sections of Title 35, U.S. Code not included in this action can be found in a prior Office action.
Claims 1, 3, 5-8, 10, 12-15, 17, 19 and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Aditya, Somak, et al. "Image understanding using vision and reasoning through scene description graph." Computer Vision and Image Understanding 173 (2018): 33-45, hereinafter, “Aditya”, and further in view of Ma, Lin, et al. "Multimodal Convolutional Neural Networks for Matching Image and Sentence." 2015 IEEE International Conference on Computer Vision (ICCV). IEEE, 2015, hereinafter, “Ma”.

As per claim 1, Aditya discloses a method for processing an image (Aditya, Abstract, Scene Description Graph … a system that can represent both the content and underlying concepts of an image; Aditya, Figure 1; Aditya, Figure 2), comprising:
determining, based on an object type of an object in an image to be processed, a feature expression of the object in the image to be processed (Aditya, page 36, Section 4.1. Visual Detection, We use deep object recognition, deep scene (category) recognition and deep Observed Scene Constituent recognition as the components of the Visual Detection module (to primarily detect the semantic components) … For each image, we then use the pre trained CNN model ... to extract a 4096 dimensional feature vector. We then trained a multi-label SVM to recognize constituents using these deep features. The output from the detection system consists of object (Pr (n| x)), scene (Pr (s |x)) and constituent (Pr (c| x)) detection scores for the top 5 objects, top 5 scene categories, and top 10 constituents; for each image); and 
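determining an entity associated with the object by performing matching between the feature expression of the object and a feature expression of the entity in a knowledge graph, the feature expression of the object and the feature expression of the entity associated with the object matching with each other (Aditya, page 37, Figure 4; Aditya, page 37, Section 4.2.2, Knowledge Base: The knowledge-base is mainly a knowledge graph (G ), which is a collection of word1-relation-word2 triplets, where word1 and word2 can be Event (actions, linking-verbs present in Atr), Entity (fromN) or a Trait (adjectives, qualitative-nouns from Atr or WordNet-superclass of a word). The relation comes from a closed set of semantic relations from KM-Ontology. The graph contains the knowledge of i) all possible Entities (concrete nouns) participating in Events (actions and linking verbs), and ii) possible traits (properties, such as color, semantic role-labels) that the Entities have; Aditya, page 38, Section 4.2.3. Inference through knowledge and reasoning, we use the commonsense knowledge [Kb, Bn, SM] and the detections [Pr (n| x), Pr (s| x), Pr (c| x)] for an image (x ∈ I) to construct the different components of the SDG (a labeled graph) in the following way. We use Entities to denote objects, and Events to denote actions (and linking verbs); Aditya, page 38, Section 4.2.3.V. Inferring Scenes, Given the filtered Events and Entities (Oev), we consider a Scene in C as candidate if all edges from a detected valid Event, are present in it … we weight each candidate Scene ... using the remaining Entities ... We also calculate a joint confidence-score for each scene based on the Pr (n| x), Pr (s |x), Pr (c |x) values of the object, scene category and constituents (OSC) present in the Scene. Based on the counters and the joint confidence-score, we rank the Scenes; Aditya, page 40, Figure 6, matching image object with entity scene graph expression).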
Aditya does not explicitly disclose the following limitations as further recited; however, Ma discloses
performing matching calculation between the feature expression of the object and a feature expression of the entity (Ma, page 2623-2624, Introduction, The association between image and scores for the association between image and sentence (e.g., the likelihood of a sentence as the caption for a given image). It can thus be readily used for the bidirectional image and sentence retrieval; Ma, page 2624, Section 3. M-CNNs for Matching Image and Sentence, m-CNN takes the image and sentence as input and generates the matching score between them). 
It would have been obvious to one skilled in the art before the effective filing date of the claimed invention to modify the teachings of Aditya to include the matching / correlation scores as taught by Ma in order to model the relation between images and words or phrases for bidirectional image and sentence retrieval and automatic image captioning (Ma, page 2624, Section 2).
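For illustration only, the recited matching calculation between a feature expression of an object and feature expressions of candidate entities may be sketched as follows. The sketch is not taken from Aditya or Ma: it substitutes plain cosine similarity for the learned m-CNN matching score of Ma, and every name in it (cosine_similarity, best_entity, the toy vectors) is a hypothetical chosen for this example.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_entity(object_feature, entity_features):
    """Score every candidate entity against the object and return the top match.

    entity_features maps a hypothetical entity name to its feature vector.
    Cosine similarity is an assumed stand-in for a learned matching score.
    """
    scores = {
        name: cosine_similarity(object_feature, vector)
        for name, vector in entity_features.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


if __name__ == "__main__":
    # Toy two-dimensional features; real feature expressions would be much larger.
    entity, all_scores = best_entity([1.0, 0.0], {"dog": [0.9, 0.1], "ball": [0.0, 1.0]})
    print(entity, all_scores)
```

Under these assumptions, the entity whose feature expression scores highest against the object is treated as the associated entity; a learned scorer could replace cosine_similarity without changing the selection step.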

As per claim 3, Aditya and Ma disclose the method of claim 1, wherein determining the entity associated with the object in the image to be processed based on the feature expression of the object in the image to be processed and the feature expression of the entity in the knowledge graph comprises:
determining the entity associated with the object in the image to be processed based on the feature expression of the object in the image to be processed, a feature expression of the image to be processed, and a feature expression of text associated with the image to be processed, the feature expression of the entity in the knowledge graph and entity attribute information, wherein the entity attribute information comprises an essential attribute of the entity (Aditya, page 35, Section 3, The core attributes of objects and regions such as size, height, color of objects; color, shape of region; v) attention … large set of object and scene detection classifiers, relationship detection classifiers, attribute (color, shape, size) and relative attribute classifiers; Aditya, page 37, Section 4.2.2; Aditya, page 38, Section 4.2.3. II. Inferred Scene Constituents: We look-up the ISCs for the top 5 detected scenes; Aditya, page 38, Section 4.2.3. V. Inferring Scenes: Given the filtered Events and Entities (Oev), we consider a Scene in C as candidate if all edges from a detected valid Event, are present in it. Next, we weight each candidate Scene (Ccand) using the remaining Entities and ISCs; i.e., increase a counter if an Entity or ISC occurs in the graph (Ccand). We also calculate a joint confidence-score for each scene based on the values of the object, scene category and constituents (OSC) present in the Scene. Based on the counters and the joint confidence-score, we rank the Scenes; Aditya, page 36, 4. Predicting intermediate Scene Description Graphs, Scene Description Graph, we first robustly define the meaningful regions of images that capture relevant semantics. Let us assume that the fundamental semantic components of an image (denoted as F) are the objects and their observable attributes).

As per claim 5, Aditya and Ma disclose the method of claim 1, further comprising:
determining a first determination manner as a determination manner of determining the entity (Aditya, pages 38-39, Section 4.2.3, V, Inferring Scenes: We calculate a joint confidence score for each scene based on the values of the object, scene category and constituents present in the Scene.  Based on the counters and the joint confidence score, we rank the Scenes); 
values of the object, scene category and constituents present in the Scene.  Based on the counters and the joint confidence score, we rank the Scenes); and 
re-determining the entity associated with the object in the image to be processed based on a determination manner and determination frequency of each entity contained in the knowledge graph (Aditya, pages 38-39, Section 4.2.3. V. Inferring Scenes, increase a counter if an Entity or ISC occurs in the graph); 
the determination manner comprising a first determination manner and at least one secondary determination manner (Aditya, pages 38-39, Section 4.2.3. V. Inferring Scenes: Given the filtered Events and Entities (Oev), we consider a Scene in C as candidate if all edges from a detected valid Event, are present in it. Next, we weight each candidate Scene (Ccand) using the remaining Entities and ISCs; i.e., increase a counter if an Entity or ISC occurs in the graph (Ccand). We also calculate a joint confidence-score for each scene based on the values of the object, scene category and constituents (OSC) present in the Scene. Based on the counters and the joint confidence-score, we rank the Scenes).
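For illustration only, the scene inference quoted above from Aditya, Section 4.2.3. V (candidate filtering, the occurrence counter, the joint confidence-score, and the ranking) may be sketched as follows. The quoted passage does not state how the joint confidence-score is computed or how the counter and the score are combined, so this sketch assumes a product of detection scores and a sort on the counter first and the score second. The data layout and all names (rank_scenes, event_edges, remaining, detection_scores) are likewise hypothetical.

```python
from math import prod


def rank_scenes(scenes, event_edges, remaining, detection_scores):
    """Rank candidate scenes by occurrence counter, then by joint confidence.

    scenes: name -> {"edges": set of (source, relation, target), "nodes": set of labels}
    event_edges: edges of the detected valid Events
    remaining: remaining Entities and ISCs not already tied to those Events
    detection_scores: label -> detection probability from the visual module
    """
    ranked = []
    for name, scene in scenes.items():
        # A scene is a candidate only if every detected Event edge is present in it.
        if not event_edges <= scene["edges"]:
            continue
        # Increase a counter for each remaining Entity or ISC that occurs in the scene.
        counter = sum(1 for item in remaining if item in scene["nodes"])
        # Assumed joint confidence: product of the scores of detections present in the scene.
        present = [p for label, p in detection_scores.items() if label in scene["nodes"]]
        confidence = prod(present) if present else 0.0
        ranked.append((name, counter, confidence))
    # Assumed combination: higher counter first, joint confidence as the tie-breaker.
    ranked.sort(key=lambda row: (row[1], row[2]), reverse=True)
    return ranked


if __name__ == "__main__":
    toy_scenes = {
        "park": {"edges": {("dog", "agent", "play")}, "nodes": {"dog", "play", "ball", "grass"}},
        "yard": {"edges": {("dog", "agent", "play")}, "nodes": {"dog", "play"}},
        "kitchen": {"edges": {("person", "agent", "cook")}, "nodes": {"person", "cook", "stove"}},
    }
    print(rank_scenes(toy_scenes, {("dog", "agent", "play")}, ["ball", "grass"],
                      {"dog": 0.9, "ball": 0.5}))
```

In this toy input the "kitchen" scene is filtered out because it lacks the detected Event edge, and "park" ranks above "yard" because two remaining Entities occur in it.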

As per claim 6, Aditya and Ma disclose the method of claim 5, wherein determining the entity associated with the object in the image to be processed by the at least one secondary determination manner comprises:
matching the image to be processed with an image of a candidate entity to determine the entity associated with the image to be processed (Aditya, pages 38-39, Section 4.2.3. V. Inferring Scenes); and/or, 
matching text to which the image to be processed belongs with the knowledge graph to determine the entity associated with the image to be processed (Aditya, pages 38-39, Section 4.2.3. V. Inferring Scenes).

As per claim 7, Aditya and Ma disclose the method of claim 1, further comprising:
selecting new entities having edge relations with the entity associated with the object from the knowledge graph, wherein edge relation refers to an edge connecting a new entity with the entity associated with the object (Aditya, page 37, Figure 4; Aditya, page 38, 4.2.3. IV. Inferring Events: Given the Entities (Ox ), we first find connecting Events between each pair of Entities. To logically find a co-occurring Event for a pair of Entities (e1, e2 ∈ Ox), we consider the Event-nodes on the shortest path from one Entity to another in the graph G ... We retain Events only if they are connected to the Entities using compatible edge pairs in G ... We retain only those Events that are connected to Entities from the same pair of classes; Aditya, page 38, Section 4.2.3. V. Inferring Scenes); and 
selecting an updated entity associated with the image from the new entities based on an intersection of the new entities (Aditya, page 37, Figure 4; Aditya, page 38, Section 4.2.3. V. Inferring Scenes: Given the filtered Events and Entities, we consider a Scene in C as candidate if all edges from a detected valid Event, are present in it … we weight each candidate Scene using the remaining Entities in (Ox / Oev) and ISCs; i.e., increase a counter if an Entity or ISC occurs in the graph; Aditya, page 38, Figure 5; Aditya, page 38, Section 4.2.3. We use Entities to denote objects, and Events to denote actions (and linking verbs). All the notations and terms used in this paper are summarized in Fig. 5; Aditya, page 38, 

As per claim 8, Aditya discloses a server, comprising:
one or more processors (Aditya, page 33, Introduction, Computer vision system); 
a memory, configured to store one or more programs; and wherein when the one or more programs are executed by the one or more processors, the one or more processors (Aditya, page 43, Section 6. Conclusions, Scene Analysis called the Scene Description Graph (SDG), and an architecture that combines deep Visual Detection and Reasoning modules to infer such structures) are caused to: 
determine, based on an object type of an object in an image to be processed, a feature expression of the object in the image to be processed (Aditya, page 36, Section 4.1. Visual Detection, We use deep object recognition, deep scene (category) recognition and deep Observed Scene Constituent recognition as the components of the Visual Detection module (to primarily detect the semantic components) … For each image, we then use the pre trained CNN model ... to extract a 4096 dimensional feature vector. We then trained a multi-label SVM to recognize constituents using these deep features. The output from the detection system consists of object (Pr (n| x)), scene (Pr (s |x)) and constituent (Pr (c| x)) detection scores for the top 5 objects, top 5 scene categories, and top 10 constituents; for each image); and 
determine an entity associated with the object by performing matching between the feature expression of the object and a feature expression of the entity in a knowledge graph, the feature expression of the object and the feature expression of the entity associated with the object matching with each other (Aditya, page 37, Figure 4; Aditya, page 37, Section 4.2.2, Knowledge Base: The knowledge-base is mainly a knowledge graph (G ), which is a collection of word1-relation-word2 triplets, where word1 and word2 can be Event (actions, linking-verbs present in Atr), Entity (fromN) or a Trait (adjectives, qualitative-nouns from Atr or WordNet-superclass of a word). The relation comes from a closed set of semantic relations from KM-Ontology. The graph contains the knowledge of i) all possible Entities (concrete nouns) participating in Events (actions and linking verbs), and ii) possible traits (properties, such as color, semantic role-labels) that the Entities have; Aditya, page 38, Section 4.2.3. Inference through knowledge and reasoning, we use the commonsense knowledge [Kb, Bn, SM] and the detections [Pr (n| x), Pr (s| x), Pr (c| x)] for an image (x ∈ I) to construct the different components of the SDG (a labeled graph) in the following way. We use Entities to denote objects, and Events to denote actions (and linking verbs); Aditya, page 38, Section 4.2.3.V. Inferring Scenes, Given the filtered Events and Entities (Oev), we consider a Scene in C as candidate if all edges from a detected valid Event, are present in it … we weight each candidate Scene ... using the remaining Entities ... We also calculate a joint confidence-score for each scene based on the Pr (n| x), Pr (s |x), Pr (c |x) values of the object, scene category and constituents (OSC) present in the Scene. Based on the counters and the joint confidence-score, we rank the Scenes; Aditya, page 40, Figure 6, matching image object with entity scene graph expression).
Aditya does not explicitly disclose the following limitations as further recited; however, Ma discloses
performing matching calculation between the feature expression of the object and a feature expression of the entity (Ma, page 2623-2624, Introduction, The association between image and sentence can be formalized as a multimodal matching problem, where the semantically related image and sentence pairs should be assigned higher matching scores than unrelated ones … as shown in Figure 1. The words in the sentence, such as “grass”, “dog”, and “ball”, denote the objects in the image. The phrases describing the objects and their attributes or activities ... correspond to the image areas of their grounding meanings … encode the image, compose different semantic fragments from the words, and learn the matching relations between the image and the composed fragments; Ma, page 2624, Section 
It would have been obvious to one skilled in the art before the effective filing date of the claimed invention to modify the teachings of Aditya to include the matching / correlation scores as taught by Ma in order to model the relation between images and words or phrases for bidirectional image and sentence retrieval and automatic image captioning (Ma, page 2624, Section 2).

As per claim 15, Aditya discloses a non-transitory computer readable storage medium, having computer programs stored thereon, wherein when the programs are executed by a processor (Aditya, page 33, Introduction, Computer vision system; Aditya, page 43, Section 6. Conclusions, Scene Analysis called the Scene Description Graph (SDG), and an architecture that combines deep Visual Detection and Reasoning modules to infer such structures), a method for processing an image is implemented, the method comprising:
determining, based on an object type of an object in an image to be processed, a feature expression of the object in the image to be processed (Aditya, page 36, Section 4.1. Visual Detection, We use deep object recognition, deep scene (category) recognition and deep Observed Scene Constituent recognition as the components of the Visual Detection module (to primarily detect the semantic components) … For each image, we then use the pre trained CNN model ... to extract a 4096 dimensional feature vector. We then trained a multi-label SVM to recognize constituents using these deep features. The output from the detection system consists of object (Pr (n| x)), scene (Pr (s |x)) and constituent (Pr (c| x)) detection scores for the top 5 objects, top 5 scene categories, and top 10 constituents; for each image); and 
determining an entity associated with the object by performing matching between the feature expression of the object and a feature expression of the entity in a knowledge graph, the feature expression of the object and the feature expression of the entity associated with the object matching with each other (Aditya, page 37, Figure 4; Aditya, page 37, Section 4.2.2, Knowledge Base: The knowledge-base is mainly a knowledge graph (G ), which is a collection of word1-relation-word2 triplets, where word1 and word2 can be Event (actions, linking-verbs present in Atr), Entity (fromN) or a Trait (adjectives, qualitative-nouns from Atr or WordNet-superclass of a word). The relation comes from a closed set of semantic relations from KM-Ontology. The graph contains the knowledge of i) all possible Entities (concrete nouns) participating in Events (actions and linking verbs), and ii) possible traits (properties, such as color, semantic role-labels) that the Entities have; Aditya, page 38, Section 4.2.3. Inference through knowledge and reasoning, we use the commonsense knowledge [Kb, Bn, SM] and the detections [Pr (n| x), Pr (s| x), Pr (c| x)] for an image (x ∈ I) to construct the different components of the SDG (a labeled graph) in the following way. We use Entities to denote objects, and Events to denote actions (and linking verbs); Aditya, page 38, Section 4.2.3.V. Inferring Scenes, Given the filtered Events and Entities (Oev), we consider a Scene in C as candidate if all edges from a detected valid Event, are present in it … we weight each candidate Scene ... using the remaining Entities ... We also calculate a joint confidence-score for each scene based on the Pr (n| x), Pr (s |x), Pr (c |x) values of the object, scene category and constituents (OSC) present in the Scene. Based on the counters and the joint confidence-score, we rank the Scenes; Aditya, page 40, Figure 6, matching image object with entity scene graph expression).
Aditya does not explicitly disclose the following limitations as further recited; however, Ma discloses

It would have been obvious to one skilled in the art before the effective filing date of the claimed invention to modify the teachings of Aditya to include the matching / correlation scores as taught by Ma in order to model the relation between images and words or phrases for bidirectional image and sentence retrieval and automatic image captioning (Ma, page 2624, Section 2).

Regarding claim(s) 10 and 17: 
A corresponding reasoning as given earlier (see rejection of claim(s) 3) applies, mutatis mutandis, to the subject-matter of claim(s) 10 and 17, and therefore is/are also considered rejected under the grounds given in the rejection of claim(s) 3.

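Regarding claim(s) 12 and 19: 
A corresponding reasoning as given earlier (see rejection of claim(s) 5) applies, mutatis mutandis, to the subject-matter of claim(s) 12 and 19, and therefore is/are also considered rejected under the grounds given in the rejection of claim(s) 5.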


Regarding claim(s) 13: 
A corresponding reasoning as given earlier (see rejection of claim(s) 6) applies, mutatis mutandis, to the subject-matter of claim(s) 13, and therefore is/are also considered rejected under the grounds given in the rejection of claim(s) 6.

Regarding claim(s) 14 and 20: 
A corresponding reasoning as given earlier (see rejection of claim(s) 7) applies, mutatis mutandis, to the subject-matter of claim(s) 14 and 20, and therefore is/are also considered rejected under the grounds given in the rejection of claim(s) 7.


Claims 2, 4, 9, 11, 16 and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Aditya, Somak, et al. "Image understanding using vision and reasoning through scene description graph." Computer Vision and Image Understanding 173 (2018): 33-45, hereinafter, “Aditya”, in view of Ma, Lin, et al. "Multimodal Convolutional Neural Networks for Matching Image and Sentence." 2015 IEEE International Conference on Computer Vision (ICCV). IEEE, 2015, hereinafter, “Ma” as applied to claims 1, 8 and 15 above, and further in view of Deng, Li-Qiong, Gui-Xin Zhang, and Yuan Ren. "Image Semantic Analysis and Application Based on Knowledge Graph." 2018 11th International Congress on Image and Signal Processing, BioMedical Engineering and Informatics (CISP-BMEI). IEEE, 2018, hereinafter, “Deng”.


obtaining an object image area where the object is located in the image to be processed (Aditya, page 35, Section 3, Visual Detection: The “Visual Detection” module should be able to obtain the following basic quantities: i) Objects and regions).
Aditya and Ma do not explicitly disclose the following limitations as further recited; however, Deng discloses
obtaining pixel data corresponding to the object based on the object image area (Deng, page 2, IV. Image Structured Semantic Information Extraction Technology Based on Deep Expression Model, Image information extraction is a method that divides each pixel into different semantic categories and finally obtains different entities, relationships, attributes); and 
determining the feature expression of the object by inputting the pixel data into a deep learning model corresponding to the object type (Deng, page 2, IV. Image Structured Semantic Information Extraction Technology Based on Deep Expression Model, the features of the images can be extracted from a large number of samples data by DCNN).
It would have been obvious to one skilled in the art before the effective filing date of the claimed invention to modify the teachings of Aditya and Ma to extract the object using the deep neural network as taught by Deng in order to be able to segment a large number of training and test images autonomously (Deng, page 2, IV. Image Structured Semantic Information Extraction Technology Based on Deep Expression Model).

As per claim 4, Aditya and Ma disclose the method of claim 1, further comprising:

Aditya and Ma do not explicitly disclose the following limitations as further recited; however, Deng discloses
determining a feature expression of the article based on an article entity in the article (Deng, B. Entity Extraction, CNN model can generate image feature expression with discriminative ability ... the RNN model can predict the structural combinatorial relationship of image or natural language ... the feature expression generated by the CNN model is used as input. The RNN model is used to generate the structured configuration of the scene); and 
determining a relevance between the article and the image by performing the matching calculation between the feature expression of the article and the entity associated with the object (Deng, B. Entity Extraction, the feature expression generated by the CNN model is used as input. The RNN model is used to generate the structured configuration of the scene. Its algorithm is verified by experiments that the accuracy of image entity semantic extraction is high).
It would have been obvious to one skilled in the art before the effective filing date of the claimed invention to modify the teachings of Aditya and Ma to include the determining of the accuracy of the feature expression as taught by Deng in order to describe the features of an image (Deng, Abstract).

Regarding claim(s) 9 and 16: 
A corresponding reasoning as given earlier (see rejection of claim(s) 2) applies, mutatis mutandis, to the subject-matter of claim(s) 9 and 16, and therefore is/are also considered rejected under the grounds given in the rejection of claim(s) 2.

Regarding claim(s) 11 and 18: 
A corresponding reasoning as given earlier (see rejection of claim(s) 4) applies, mutatis mutandis, to the subject-matter of claim(s) 11 and 18, and therefore is/are also considered rejected under the grounds given in the rejection of claim(s) 4.

Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 
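For illustration only, the date arithmetic behind the reply periods stated above may be sketched as follows. The mailing date used is hypothetical and does not appear in this action. The sketch assumes that a period ending in a shorter month is clamped to that month's last day, and it does not account for weekends, federal holidays, the advisory action provision, or extension fees, so it is not a substitute for docketing under 37 CFR 1.136(a). The names add_months and reply_deadlines are chosen for this example.

```python
import calendar
from datetime import date


def add_months(start, months):
    """Return the date `months` calendar months after `start`.

    If the target month is shorter than the start day, the result is
    clamped to the last day of the target month (an assumption).
    """
    index = start.month - 1 + months
    year = start.year + index // 12
    month = index % 12 + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(start.day, last_day))


def reply_deadlines(mailing_date):
    """Compute the three-month shortened period and the six-month maximum."""
    return {
        "shortened_period_ends": add_months(mailing_date, 3),
        "statutory_maximum": add_months(mailing_date, 6),
    }


if __name__ == "__main__":
    # Hypothetical mailing date, used only to show the clamping behavior.
    for label, deadline in reply_deadlines(date(2021, 11, 30)).items():
        print(label, deadline.isoformat())
```

With the hypothetical mailing date of 30 November 2021, the three-month period is clamped to 28 February 2022 and the six-month maximum falls on 30 May 2022.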

Any inquiry concerning this communication or earlier communications from the examiner should be directed to TRACY MANGIALASCHI whose telephone number is (571)270-5189. The examiner can normally be reached M-F, 9:30AM TO 6:00PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.




/TRACY MANGIALASCHI/Examiner, Art Unit 2668                        
/VU LE/Supervisory Patent Examiner, Art Unit 2668