DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.


Claims 1, 15 and 20 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. The claims recite positioning a description statement in an image based on statement attention weights, image attention weights, and matching scores associated with detection weights and visual features.
The limitations of positioning a description statement in an image based on statement attention weights, image attention weights, matching scores, and visual features, as drafted, recite a process that, under its broadest reasonable interpretation, covers performance of the limitations in the mind but for the recitation of generic computer components. Such limitations are abstract ideas. That is, other than reciting “by a processor” in claims 15 and 20, nothing in the claim elements precludes the steps from practically being performed in the mind. For example, “analysis processing” in the context of the claims encompasses manually determining attention weights of a description statement and an image. Similarly, the limitation of matching scores based on the determined attention weights and visual features, as drafted, is a process that, 
This judicial exception is not integrated into a practical application because claims 15 and 20 recite only one additional element: using a processor to perform the analysis processing, score matching and determining steps. The processor in all the steps is recited at a high level of generality (i.e., as a generic processor performing the positioning of a description statement in an image based on statement attention weights, image attention weights, matching scores, and visual features) such that it amounts to no more than mere instructions to apply the exception using a generic computer component. Accordingly, this additional element does not integrate the abstract idea into a practical application because it does not impose any meaningful limits on practicing the abstract idea. The claims are directed to an abstract idea.
The claims do not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to integration of the abstract idea into a practical application, the additional element of using a processor to perform the analysis processing, score matching and determining steps amounts to no more than mere instructions to apply the exception using a generic computer component. Mere instructions to apply an exception using a generic computer component cannot provide an inventive concept. Claims 1, 15 and 20 are not patent eligible.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-6 and 15-20 are rejected under 35 U.S.C. 103 as being unpatentable over Liu et al. (U.S. PG-PUB No. 20190108411 A1) in view of Yu et al. (arXiv:1801.08186v3, 27 Mar 2018).
-Regarding claim 1, Liu discloses a method for positioning a description statement in an image ([0037], “image object localization … matching relationship between text feature data of an image”; FIGS. 1-9), comprising: performing analysis processing (FIG. 1, text feature data 114, word embedding, virtual attention 118; [0040]-[0041]; FIG. 2) on a to-be-analyzed description statement (FIG. 1, text 112) and a to-be-analyzed image (FIG. 1, image 116) to obtain a plurality of statement attention weights of the to-be-analyzed description statement and a plurality of image attention weights of the to-be-analyzed image (Abstract; FIG. 1, weight distribution map 108; [0037], “weight … to-be-identified image”; [0039], “attention model … different weights”; [0043]; [0066], “weight calculation … attention model”; [0067]; FIGS. 3, 5); obtaining a plurality of first matching scores based on the plurality of statement attention weights and a subject feature ([0035], “a subject … a subject area”; [0079]; FIG. 7), a location feature (Abstract; [0035]; [0038]-[0039]; FIG. 3, S308) and a relationship feature ([0058]) of the to-be-analyzed image (FIGS. 1-3, 7), wherein the to-be-analyzed image comprises a plurality of objects ([0035], “one image may include multiple objects”; FIG. 7), a subject object is an object with a highest attention weight in the plurality of objects ([0067]), the subject feature is a feature of the subject object ([0035]; [0079]; FIG. 7), the location feature is a location feature of the plurality of objects ([0061], “A feature vector … object area location”), and the relationship feature is a relationship feature between the plurality of objects; obtaining a second matching score between the to-be-analyzed description statement and the to-be-analyzed image ([0058], “image-text matching”; [0066], “the score … matching … a probability … object in the original image”; FIGS. 1-3) based on the plurality of first matching scores and the plurality of image attention weights; and determining a positioning result of the to-be-analyzed description statement in the to-be-analyzed image based on the second matching score (FIG. 1, result 110; FIGS. 2-3; [0085], “localizing an area location of the object”).
Liu is silent as to obtaining a plurality of statement attention weights of the to-be-analyzed description statement; obtaining a plurality of first matching scores based on the plurality of statement attention weights and a subject feature, a location feature and a relationship feature of the to-be-analyzed image, wherein a subject object is an object with a highest attention weight in the plurality of objects and the relationship feature is a relationship feature between the plurality of objects; and obtaining a second matching score based on the plurality of first matching scores and the plurality of image attention weights.
In the same field of endeavor, Yu discloses a method for localizing an image region described by a natural language expression (Yu: Abstract; Figures 1-8). Yu teaches obtaining a plurality of statement attention weights of the to-be-analyzed description statement and a plurality of image attention weights of the to-be-analyzed image (Yu: Figure 2; Page 3, 1st Col., 2nd paragraph, “word-level attention and module-level weights”, section 3.1, “attention … $a_{m,t}$ … $q^{m}$”). Yu teaches obtaining a plurality of first matching scores based on the plurality of statement attention weights and a subject feature, a location feature and a relationship feature of the to-be-analyzed image (Yu: Abstract; Figure 1, $SCORE_{subj}$, $SCORE_{loc}$, $SCORE_{rel}$; Page 2, 3rd paragraph, “These embeddings are used to trigger three separate visual modules (for subject, location, and relationship comprehension, each with a different attention model) to compute matching scores”), wherein the to-be-analyzed image comprises a plurality of objects (Yu: Figures 1, 6-8), a subject object is an object with a highest attention weight in the plurality of objects (Yu: Figure 3), the subject feature is a feature of the subject object (Yu: Page 4, 1st Col., 3rd paragraph, section: Phrase-guided Attention Pooling, 2nd Col., section: Matching Function), the location feature is a location feature of the plurality of objects (Yu: Figure 4), and the relationship feature is a relationship feature between the plurality of objects (Yu: Figure 5); obtaining a second matching score (Yu: Figure 1, $SCORE_{overall}$) between the to-be-analyzed description statement and the to-be-analyzed image based on the plurality of first matching scores and the plurality of image attention weights (Yu: Abstract; Figure 1; Page 5, 2nd Col., section Loss Function); and determining a positioning result of the to-be-analyzed description statement in the to-be-analyzed image based on the second matching score (Yu: Abstract; Page 2, 2nd Col., 3rd paragraph; Page 5, 2nd Col., section: Results; Table 1; Figures 6, 8).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teaching of Liu with the teaching of Yu by using Yu's modular attention network (MAttNet) for positioning a description statement in an image in order to improve performance on bounding box localization and precision on pixel segmentation.
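For illustration only, the scoring flow relied on above (a subject, a location and a relationship matching score per candidate object, combined into an overall score, with the highest-scoring candidate taken as the positioning result) can be sketched as follows. This is a minimal sketch under stated assumptions, not Yu's or Liu's implementation: the function names, the example values, and the use of a plain weighted sum with per-module weights are hypothetical and are not taken from either reference or from the claims.

```python
from typing import Dict, List

# Hypothetical module names mirroring the three scores cited from Yu's Figure 1.
MODULES = ("subj", "loc", "rel")


def overall_score(module_scores: Dict[str, float],
                  module_weights: Dict[str, float]) -> float:
    """Combine per-module matching scores into one overall score.

    Assumption: a simple weighted sum; the references may combine scores differently.
    """
    return sum(module_weights[m] * module_scores[m] for m in MODULES)


def locate(candidates: List[Dict[str, float]],
           module_weights: Dict[str, float]) -> int:
    """Return the index of the candidate object with the highest overall score."""
    totals = [overall_score(c, module_weights) for c in candidates]
    return max(range(len(totals)), key=totals.__getitem__)


if __name__ == "__main__":
    # Made-up numbers, for illustration only.
    weights = {"subj": 0.5, "loc": 0.3, "rel": 0.2}
    objects = [
        {"subj": 0.2, "loc": 0.9, "rel": 0.1},
        {"subj": 0.8, "loc": 0.4, "rel": 0.5},
    ]
    print(locate(objects, weights))
```

In this sketch the first matching scores correspond to the per-module entries of each candidate and the second matching score to the value returned by `overall_score`; that correspondence is an illustrative reading and not a claim construction.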
-Regarding claim 15, Liu discloses an apparatus for positioning a description statement in an image ([0037], “image object localization … matching relationship between text feature data of an image”; FIGS. 1-9), comprising: a memory storing processor-executable instructions (FIG. 8, memory 804, 806; [0089]; [0116]; [0128]; [0131]); and a processor arranged to execute the stored processor-executable instructions to perform operations of (FIG. 8, processor 802; [0091]; [0094]; [0116]; [0134]): performing analysis processing (FIG. 1, text feature data 114, word embedding, virtual attention 118; [0040]-[0041]; FIG. 2) on a to-be-analyzed description statement (FIG. 1, text 112) and a to-be-analyzed image (FIG. 1, image 116) to obtain a plurality of statement attention weights of the to-be-analyzed description statement and a plurality of image attention weights of the to-be-analyzed image (Abstract; FIG. 1, weight distribution map 108; [0037], “weight … to-be-identified image”; [0039], “attention model … different weights”; [0043]; [0066], “weight calculation … attention model”; [0067]; FIGS. 3, 5); obtaining a plurality of first matching scores based on the plurality of statement attention weights and a subject feature ([0035], “a subject … a subject area”; [0079]; FIG. 7), a location feature (Abstract; [0035]; [0038]-[0039]; FIG. 3, S308) and a relationship feature ([0058]) of the to-be-analyzed image (FIGS. 1-3, 7), wherein the to-be-analyzed image comprises a plurality of objects ([0035], “one image may include multiple objects”; FIG. 7), a subject object is an object with a highest attention weight in the plurality of objects ([0067]), the subject feature is a feature of the subject object ([0035]; [0079]; FIG. 7), the location feature is a location feature of the plurality of objects ([0061], “A feature vector … object area location”), and the relationship feature is a relationship feature between the plurality of objects; obtaining a second matching score between the to-be-analyzed description statement and the to-be-analyzed image ([0058], “image-text matching”; [0066], “the score … matching … a probability … object in the original image”; FIGS. 1-3) based on the plurality of first matching scores and the plurality of image attention weights; and determining a positioning result of the to-be-analyzed description statement in the to-be-analyzed image based on the second matching score (FIG. 1, result 110; FIGS. 2-3; [0085], “localizing an area location of the object”).
Liu is silent as to obtaining a plurality of statement attention weights of the to-be-analyzed description statement; obtaining a plurality of first matching scores based on the plurality of statement attention weights and a subject feature, a location feature and a relationship feature of the to-be-analyzed image, wherein a subject object is an object with a highest attention weight in the plurality of objects and the relationship feature is a relationship feature between the plurality of objects; and obtaining a second matching score based on the plurality of first matching scores and the plurality of image attention weights.
In the same field of endeavor, Yu discloses a method for localizing an image region described by a natural language expression (Yu: Abstract; Figures 1-8). Yu teaches obtaining a plurality of statement attention weights of the to-be-analyzed description statement and a plurality of image attention weights of the to-be-analyzed image (Yu: Figure 2; Page 3, 1st Col., 2nd paragraph, “word-level attention and module-level weights”, section 3.1, “attention … $a_{m,t}$ … $q^{m}$”). Yu teaches obtaining a plurality of first matching scores based on the plurality of statement attention weights and a subject feature, a location feature and a relationship feature of the to-be-analyzed image (Yu: Abstract; Figure 1, $SCORE_{subj}$, $SCORE_{loc}$, $SCORE_{rel}$; Page 2, 3rd paragraph, “These embeddings are used to trigger three separate visual modules (for subject, location, and relationship comprehension, each with a different attention model) to compute matching scores”), wherein the to-be-analyzed image comprises a plurality of objects (Yu: Figures 1, 6-8), a subject object is an object with a highest attention weight in the plurality of objects (Yu: Figure 3), the subject feature is a feature of the subject object (Yu: Page 4, 1st Col., 3rd paragraph, section: Phrase-guided Attention Pooling, 2nd Col., section: Matching Function), the location feature is a location feature of the plurality of objects (Yu: Figure 4), and the relationship feature is a relationship feature between the plurality of objects (Yu: Figure 5); obtaining a second matching score (Yu: Figure 1, $SCORE_{overall}$) between the to-be-analyzed description statement and the to-be-analyzed image based on the plurality of first matching scores and the plurality of image attention weights (Yu: Abstract; Figure 1; Page 5, 2nd Col., section Loss Function); and determining a positioning result of the to-be-analyzed description statement in the to-be-analyzed image based on the second matching score (Yu: Abstract; Page 2, 2nd Col., 3rd paragraph; Page 5, 2nd Col., section: Results; Table 1; Figures 6, 8).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teaching of Liu with the teaching of Yu by using Yu's modular attention network (MAttNet) for positioning a description statement in an image in order to improve performance on bounding box localization and precision on pixel segmentation.
-Regarding claim 20, Liu discloses a non-transitory computer readable storage medium having stored thereon computer program instructions that (FIG. 8, memory 804, 806; [0013]; [0022]; [0089]; [0116]; [0128]; [0131]-[0132]), when executed by a processor, cause the processor to perform operations (FIG. 8, processor 802; [0091]; [0094]; [0116]; [0134]) of a method for positioning a description statement in an image ([0037], “image object localization … matching relationship between text feature data of an image”; FIGS. 1-9), the method comprising: performing analysis processing (FIG. 1, text feature data 114, word embedding, virtual attention 118; [0040]-[0041]; FIG. 2) on a to-be-analyzed description statement (FIG. 1, text 112) and a to-be-analyzed image (FIG. 1, image 116) to obtain a plurality of statement attention weights of the to-be-analyzed description statement and a plurality of image attention weights of the to-be-analyzed image (Abstract; FIG. 1, weight distribution map 108; [0037], “weight … to-be-identified image”; [0039], “attention model … different weights”; [0043]; [0066], “weight calculation … attention model”; [0067]; FIGS. 3, 5); obtaining a plurality of first matching scores based on the plurality of statement attention weights and a subject feature ([0035], “a subject … a subject area”; [0079]; FIG. 7), a location feature (Abstract; [0035]; [0038]-[0039]; FIG. 3, S308) and a relationship feature ([0058]) of the to-be-analyzed image (FIGS. 1-3, 7), wherein the to-be-analyzed image comprises a plurality of objects ([0035], “one image may include multiple objects”; FIG. 7), a subject object is an object with a highest attention weight in the plurality of objects ([0067]), the subject feature is a feature of the subject object ([0035]; [0079]; FIG. 7), the location feature is a location feature of the plurality of objects ([0061], “A feature vector … object area location”), and the relationship feature is a relationship feature between the plurality of objects; obtaining a second matching score between the to-be-analyzed description statement and the to-be-analyzed image ([0058], “image-text matching”; [0066], “the score … matching … a probability … object in the original image”; FIGS. 1-3) based on the plurality of first matching scores and the plurality of image attention weights; and determining a positioning result of the to-be-analyzed description statement in the to-be-analyzed image based on the second matching score (FIG. 1, result 110; FIGS. 2-3; [0085], “localizing an area location of the object”).
Liu is silent as to obtaining a plurality of statement attention weights of the to-be-analyzed description statement; obtaining a plurality of first matching scores based on the plurality of statement attention weights and a subject feature, a location feature and a relationship feature of the to-be-analyzed image, wherein a subject object is an object with a highest attention weight in the plurality of objects and the relationship feature is a relationship feature between the plurality of objects; and obtaining a second matching score based on the plurality of first matching scores and the plurality of image attention weights.
In the same field of endeavor, Yu discloses a method for localizing an image region described by a natural language expression (Yu: Abstract; Figures 1-8). Yu teaches obtaining a plurality of statement attention weights of the to-be-analyzed description statement and a plurality of image attention weights of the to-be-analyzed image (Yu: Figure 2; Page 3, 1st Col., 2nd paragraph, “word-level attention and module-level weights”, section 3.1, “attention … $a_{m,t}$ … $q^{m}$”). Yu teaches obtaining a plurality of first matching scores based on the plurality of statement attention weights and a subject feature, a location feature and a relationship feature of the to-be-analyzed image (Yu: Abstract; Figure 1, $SCORE_{subj}$, $SCORE_{loc}$, $SCORE_{rel}$; Page 2, 3rd paragraph, “These embeddings are used to trigger three separate visual modules (for subject, location, and relationship comprehension, each with a different attention model) to compute matching scores”), wherein the to-be-analyzed image comprises a plurality of objects (Yu: Figures 1, 6-8), a subject object is an object with a highest attention weight in the plurality of objects (Yu: Figure 3), the subject feature is a feature of the subject object (Yu: Page 4, 1st Col., 3rd paragraph, section: Phrase-guided Attention Pooling, 2nd Col., section: Matching Function), the location feature is a location feature of the plurality of objects (Yu: Figure 4), and the relationship feature is a relationship feature between the plurality of objects (Yu: Figure 5); obtaining a second matching score (Yu: Figure 1, $SCORE_{overall}$) between the to-be-analyzed description statement and the to-be-analyzed image based on the plurality of first matching scores and the plurality of image attention weights (Yu: Abstract; Figure 1; Page 5, 2nd Col., section Loss Function); and determining a positioning result of the to-be-analyzed description statement in the to-be-analyzed image based on the second matching score (Yu: Abstract; Page 2, 2nd Col., 3rd paragraph; Page 5, 2nd Col., section: Results; Table 1; Figures 6, 8).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teaching of Liu with the teaching of Yu by using Yu's modular attention network (MAttNet) for positioning a description statement in an image in order to improve performance on bounding box localization and precision on pixel segmentation.
-Regarding claims 2 and 16, Liu in view of Yu discloses the method of claim 1 and apparatus of claim 15.
Liu discloses wherein the performing analysis processing (FIG. 1, text feature data 114, word embedding, virtual attention 118; [0040]-[0041]; FIG. 2) on a to-be-analyzed description statement (FIG. 1, text 112) and a to-be-analyzed image (FIG. 1, image 116) to obtain a plurality of statement attention weights of the to-be-analyzed description statement and a plurality of image attention weights of the to-be-analyzed image (Abstract; FIG. 1, weight distribution map 108; [0037], “weight … to-be-identified image”; [0039], “attention model … different weights”; [0043]; [0066], “weight calculation … attention model”; [0067]; FIGS. 3, 5) comprises: performing feature extraction on the to-be-analyzed image to obtain an image feature vector of the to-be-analyzed image (FIG. 1, image 116, image feature vector representation 104; [0039]; [0051]); performing feature extraction on the to-be-analyzed description statement to obtain word embedding vectors of a plurality of words of the to-be-analyzed description statement (FIG. 1, text 112, text feature vector representation 104; [0039]; FIG. 9; [0049]); and obtaining the plurality of statement attention weights (Abstract; FIG. 1, weight distribution map 108; [0037], “weight … to-be-identified image”; [0039], “attention model … different weights”; [0043]; [0066], “weight calculation … attention model”; [0067]; FIGS. 3, 5).
Liu is silent as to obtaining the plurality of statement attention weights of the to-be-analyzed description statement based on the word embedding vectors of the plurality of words.
In the same field of endeavor, Yu discloses a method for localizing an image region described by a natural language expression (Yu: Abstract; Figures 1-8). Yu teaches obtaining the plurality of statement attention weights of the to-be-analyzed description statement and the plurality of image attention weights of the to-be-analyzed image based on the image feature vector and the word embedding vectors of the plurality of words (Yu: Figures 1-5; Page 3, 1st Col., 2nd paragraph, “word-level attention and module-level weights”, section 3.1, “$e_t$ attention … $a_{m,t}$ … phrase embedding $q_m$”). Yu further teaches performing feature extraction on the to-be-analyzed image to obtain an image feature vector of the to-be-analyzed image (Yu: Page 3, 2nd Col., section 3.2, “feature extractor”; Figures 1, 3); performing feature extraction on the to-be-analyzed description statement to obtain word embedding vectors of a plurality of words of the to-be-analyzed description statement (Yu: Figures 1-2, 3-5).
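For illustration only, the word-level attention cited from Yu section 3.1 can be sketched as follows. This is a minimal sketch, not Yu's implementation: it assumes the per-word attention weights are obtained by a softmax over per-word scores and that the phrase embedding is the attention-weighted sum of the word embedding vectors. The function names `word_attention` and `phrase_embedding` and the input scores are hypothetical.

```python
import math

def word_attention(scores):
    """Softmax over per-word scores (assumed inputs); returns weights summing to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def phrase_embedding(weights, embeddings):
    """Attention-weighted sum of word embedding vectors (one vector per word)."""
    dim = len(embeddings[0])
    return [sum(w * vec[i] for w, vec in zip(weights, embeddings))
            for i in range(dim)]

# Two words with equal scores receive equal attention.
weights = word_attention([0.0, 0.0])
embedding = phrase_embedding(weights, [[1.0, 0.0], [0.0, 1.0]])
```

Under these assumptions, one such set of weights would be computed per module (subject, location, relationship), each over the same word embedding vectors.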
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teaching of Liu with the 
-Regarding claims 3 and 17, Liu in view of Yu discloses the method of claim 1 and apparatus of claim 15.
Liu discloses obtaining the plurality of statement attention weights of the to-be-analyzed description statement and the plurality of image attention weights of the to-be-analyzed image by using a neural network (Abstract; FIG. 1, weight distribution map 108; [0037], “weight … to-be-identified image”; [0039], “attention model … different weights”; [0043]; [0066], “weight calculation … attention model”; [0067]; FIGS. 3, 5).
Liu is silent as to obtaining the plurality of statement attention weights of the to-be-analyzed description statement by using a neural network.
In the same field of endeavor, Yu discloses a method for localizing an image region described by a natural language expression (Yu: Abstract; Figures 1-8). Yu teaches obtaining the plurality of statement attention weights of the to-be-analyzed description statement and the plurality of image attention weights of the to-be-analyzed image by using a neural network (Yu: Abstract; Figures 1-5; Page 3, section 3. Model, subsection 3.1. Language Attention Network).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teaching of Liu with the teaching of Yu by using a modular attention network (MAttNet) for positioning a 
-Regarding claims 4 and 18, Liu in view of Yu discloses the method of claim 3 and apparatus of claim 17.
Liu is silent as to wherein the plurality of statement attention weights comprises a statement subject weight, a statement location weight and a statement relationship weight; the neural network comprises an image attention network; the image attention network comprises a subject network, a location network and a relationship network; the plurality of first matching scores comprises a subject matching score, a location matching score and a relationship matching score, wherein the obtaining a plurality of first matching scores based on the plurality of statement attention weights and a subject feature, a location feature and a relationship feature of the to-be-analyzed image comprises: inputting the statement subject weight and the subject feature into the subject network for processing to obtain the subject matching score; inputting the statement location weight and the location feature into the location network for processing to obtain the location matching score; and inputting the statement relationship weight and the relationship feature into the relationship network for processing to obtain the relationship matching score.
In the same field of endeavor, Yu discloses a method for localizing an image region described by a natural language expression (Yu: Abstract; Figures 1-8). Yu teaches wherein the plurality of statement attention weights comprises a statement subject weight, a statement location weight and a statement relationship weight (Yu: Figures 1-2, Page 3, subsection 3.1 Language Attention Network); the neural network comprises an image attention network; the image attention network comprises a subject network, a location network and a relationship network (Yu: Figures 1, 3-5; Page 3, subsection 3.2 Visual Modules); the plurality of first matching scores comprises a subject matching score, a location matching score and a relationship matching score (Yu: Abstract; Figure 1, $\mathrm{SCORE}_{subj}$, $\mathrm{SCORE}_{loc}$, $\mathrm{SCORE}_{rel}$), wherein the obtaining a plurality of first matching scores based on the plurality of statement attention weights and a subject feature, a location feature and a relationship feature of the to-be-analyzed image comprises (Yu: Abstract; Figure 1): inputting the statement subject weight and the subject feature into the subject network for processing to obtain the subject matching score (Yu: Figure 3; Page 3, subsection 3.2.1 Subject Module); inputting the statement location weight and the location feature into the location network for processing to obtain the location matching score (Yu: Figure 4; Page 3, subsection 3.2.2 Location Module); and inputting the statement relationship weight and the relationship feature into the relationship network for processing to obtain the relationship matching score (Yu: Figure 3; Page 5, subsection 3.2.3 Relationship Module).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teaching of Liu with the teaching of Yu by using a modular attention network (MAttNet) for positioning a description statement in an image in order to improve performance on bounding box localization and precision on pixel segmentation.


Liu does teach processing the image feature data by using an attention model and outputting weight distribution data corresponding to the local blocks ([0021], Abstract; FIG. 1, weight distribution map 108; [0037], “weight … to-be-identified image”; [0039], “attention model … different weights”; [0043]; [0066], “weight calculation … attention model”; [0067]; FIGS. 3, 5), and a subject feature ([0035], “a subject … a subject area”; [0079]; FIG. 7), a location feature (Abstract; [0035]; [0038]-[0039]; FIG. 3, S308) and a relationship feature ([0058]). 
Liu is silent as to a subject object weight, an object location weight and an object relationship weight, wherein the obtaining a second matching score between the to-be-analyzed description statement and the to-be-analyzed image based on the plurality of first matching scores and the plurality of image attention weights comprises: performing weighted averaging on the subject matching score, the location matching score and the relationship matching score based on the subject object weight, the object location weight and the object relationship weight to determine the second matching score.
In the same field of endeavor, Yu discloses a method for localizing an image region described by a natural language expression (Yu: Abstract; Figures 1-8). Yu teaches wherein the plurality of image attention weights comprises a subject object weight, an object location weight and an object relationship weight (Yu: Abstract; Figure 1), wherein the obtaining a second matching score between the to-be-analyzed description statement and the to-be-analyzed image based on the plurality of first matching scores and the plurality of image attention weights comprises (Yu: Abstract; Figure 1, $\mathrm{SCORE}_{overall}$): performing weighted averaging on the subject matching score, the location matching score and the relationship matching score based on the subject object weight, the object location weight and the object relationship weight to determine the second matching score (Yu: Abstract; Figure 1; Page 6, subsection 3.3 Loss Function, equation (1)).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teaching of Liu with the teaching of Yu by using a modular attention network (MAttNet) for positioning a description statement in an image in order to improve performance on bounding box localization and precision on pixel segmentation.
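For illustration only, the claimed weighted averaging of the subject, location and relationship matching scores can be sketched as follows. This is a minimal sketch under stated assumptions, not the applicant's or Yu's implementation: the function name `second_matching_score` and the numeric values are hypothetical, and the weights are normalized by their sum so that weights already summing to 1 reduce to a plain weighted sum.

```python
def second_matching_score(first_scores, image_weights):
    """Weighted average of per-module matching scores.

    first_scores:  (subject, location, relationship) matching scores
    image_weights: (subject object, object location, object relationship) weights
    """
    if len(first_scores) != len(image_weights):
        raise ValueError("one weight is required per matching score")
    total_weight = sum(image_weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted = sum(s * w for s, w in zip(first_scores, image_weights))
    return weighted / total_weight

# Hypothetical scores and weights, chosen only to show the arithmetic:
# 0.8*0.5 + 0.5*0.25 + 0.2*0.25 = 0.575
overall = second_matching_score([0.8, 0.5, 0.2], [0.5, 0.25, 0.25])
```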
-Regarding claim 6, Liu in view of Yu discloses the method of claim 1.
The combination further discloses inputting the to-be-analyzed image into a feature extraction network for processing to obtain the subject feature, the location feature and the relationship feature (Liu: Abstract; FIGS. 1-3, 7, 9; [0035], “a subject … a subject area”; [0038]-[0039]; [0058]; [0079]; FIG. 7; Yu: Figures 3-5; Page 3, subsection 3.2 Visual Modules, 1st paragraph, “ResNet … main feature extractor”).
Claim 7 is rejected under 35 U.S.C. 103 as being unpatentable over Liu et al (U.S PG-PUB NO. 20190108411 A1) in view of Yu et al (arXiv:1801.08186v3 27 Mar 2018), and further in view of Mu et al (U.S PATENT NO. US 10643112 B1).
-Regarding claim 7, Liu in view of Yu discloses the method of claim 1.

However, Mu is analogous art pertinent to the problem to be solved in this application and discloses a system and method to extract features from a content item and provide the extracted features to a machine learning based model configured to generate a deceptive information score indicating a likelihood of a content item comprising deceptive information. The generated deceptive information score indicates whether a content item is deceptive (Mu: Abstract; FIG. 1). Mu further discloses determining that the content item is deceptive if the deceptive information score is above a threshold value (Mu: FIG. 1; Col 4, lines 53-60).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teaching of Liu in view of Yu with the teaching of Mu by determining a positioning result of the to-be-analyzed description statement in the to-be-analyzed image based on the second matching score comprises: responsive to that the second matching score is greater than or equal to a preset threshold, determining an image region of the subject object as a positioning location of the to-be-analyzed description statement in order to provide one of the methods to determine desired results.
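For illustration only, the threshold comparison recited in claim 7 can be sketched as follows. This is a minimal sketch: the function name `positioning_result`, the bounding-box tuple layout and the default threshold of 0.5 are hypothetical and are not taken from the application, Liu, Yu or Mu.

```python
from typing import Optional, Tuple

# Hypothetical region representation: (x, y, width, height) in pixels.
Box = Tuple[int, int, int, int]

def positioning_result(second_score: float,
                       subject_region: Box,
                       threshold: float = 0.5) -> Optional[Box]:
    """Return the subject object's region when the score meets the preset threshold."""
    if second_score >= threshold:
        return subject_region
    return None  # no positioning location is determined

located = positioning_result(0.575, (10, 20, 64, 128))
rejected = positioning_result(0.3, (10, 20, 64, 128))
```

The comparison uses "greater than or equal to" to mirror the claim language; Mu, as cited, describes a score "above a threshold value".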
Allowable Subject Matter
Claims 8-14 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the 
The following is a statement of reasons for the indication of allowable subject matter: Regarding claim 8, Liu, Yu and Mu appear to be the closest prior art of record. However, the cited references as a whole, either alone or in combination, do not teach or suggest the allowable subject matter or the claimed limitations in combination with the rest of the claims, such as, inter alia, and each negative sample pair comprises a first sample image and a second sample description statement obtained after a word is removed from the first sample description statement, or comprises a first sample description statement and a second sample image obtained after a region with a highest image attention weight is removed from the first sample image.
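For illustration only, the two kinds of negative sample pair described above can be sketched as follows. This is a minimal sketch under loud assumptions: the image is represented as a list of region labels with one attention weight per region, the removed word is selected by a caller-supplied index, and the names `remove_word` and `remove_top_region` are hypothetical. None of these details is taken from the application.

```python
def remove_word(statement, index):
    """Second sample description statement: the first statement with one word removed."""
    words = statement.split()
    del words[index]
    return " ".join(words)

def remove_top_region(regions, attention_weights):
    """Second sample image: the first image without its highest-attention region."""
    top = max(range(len(attention_weights)), key=attention_weights.__getitem__)
    return [region for i, region in enumerate(regions) if i != top]

# Negative pair type 1: (first sample image, second sample description statement)
second_statement = remove_word("the man on the left", 1)
# Negative pair type 2: (first sample description statement, second sample image)
second_image = remove_top_region(["man", "bench", "tree"], [0.7, 0.2, 0.1])
```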
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to XIAO LIU whose telephone number is (571)272-4539. The examiner can normally be reached Monday-Thursday and Alternate Fridays 8:30-4:30.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Nay Maung can be reached on (571) 272-7882. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.






/XIAO LIU/Examiner, Art Unit 2664
/NANCY BITAR/Primary Examiner, Art Unit 2664