DETAILED ACTION

Notice of Pre-AIA or AIA Status
1.	The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Remarks
2.	Claims 1-5 and 11-20 have been examined and rejected. This is the first Office action on the merits.

Claim Rejections - 35 USC § 102
3.	The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.

4.	Claims 1-3, 11-13, and 16-18 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Jing et al (“On the Automatic Generation of Medical Imaging Reports,” November 22, 2017), which incorporates teachings from Simonyan et al (“Very Deep Convolutional Networks for Large-Scale Image Recognition,” April 10, 2015), and with evidence provided by Saama (“Different Kinds of Convolutional Filters,” December 20, 2017).

4-1.	Regarding claims 1, 11, and 16, Jing teaches the claim comprising: receiving a medical image to be recognized; importing the medical image into a preset VGG neural network to acquire a visual feature vector and a keyword sequence of the medical image, by disclosing, for a given image, extracting visual features from the last convolutional layer of VGG-19 and adopting the last two fully connected layers of VGG-19 for multi-label classification (MLC) to generate predicted relevant tags for the image [Jing, 3.1. Overview; 3.2. Tag prediction]. The top M tags are used as semantic features for topic generation [Jing, 3.2. Tag prediction, paragraph 3].
Jing teaches importing the visual feature vector and the keyword sequence into a preset diagnostic item recognition model to determine diagnostic items corresponding to the medical image, by disclosing that the visual and semantic features are fed into a co-attention model to generate a context vector that simultaneously captures the visual and semantic information of the image [Jing, 3.1. Overview, lines 14-18]. The context vector is input into a sentence LSTM, which unrolls for a few steps, each producing a topic vector that represents the semantics of a sentence to be generated [Jing, 3.1. Overview, lines 26-32].
Jing teaches respectively constructing a paragraph for describing each of the diagnostic items based on a diagnostic item extension model, by disclosing generating a sequence of high-level topic vectors representing sentences, then generating a sentence (a sequence of words) from each topic vector [Jing, 3.1. Overview, lines 22-26].
Jing teaches generating a medical report for the medical image based on the paragraph, the keyword sequence and the diagnostic items, by disclosing generating a medical imaging report using a multi-task hierarchical model with co-attention for automatically predicting keywords and generating long paragraphs [Jing, 3.1. Overview, lines 1-7; 5. Conclusion]. The medical imaging report comprises both unstructured descriptions (in the form of sentences and paragraphs) and semi-structured tags (in the form of keyword lists), as shown in Figure 1 [Jing, 3.1. Overview, lines 1-4; figure 1]. This includes an impression section and a findings section, and a tags section of keywords [Jing, Figure 1 Caption].

4-2.	Regarding claims 2, 12, and 17, Jing teaches all the limitations of claims 1, 11, and 16 respectively, wherein the step of importing the medical image into a preset VGG neural network to acquire a visual feature vector and a keyword sequence of the medical image comprises: constructing a pixel matrix of the medical image based on pixel values of pixels in the medical image and position coordinates of the pixel values, by disclosing that the convolution layer used [Jing, 3.2. Tag prediction] requires moving a filter or kernel across a given image, as evidenced by Saama [see Saama - What is a Convolution?].
	Jing teaches performing dimensionality reduction on the pixel matrix through five pooling layers of the VGG neural network to acquire the visual feature vector, by disclosing that spatial pooling is carried out by five max-pooling layers [see Simonyan, 2.1 Architecture, paragraph 2, which Jing incorporates in 3.2. Tag prediction, paragraph 2 and which describes the ConvNet configuration used by Jing].
Jing teaches importing the visual feature vector into a fully connected layer of the VGG neural network and outputting an index sequence corresponding to the visual feature vector, by disclosing extracting visual features from the last convolutional layer of VGG-19 and using the last two fully connected layers of VGG-19 as the MLC to generate predicted relevant tags for the image [Jing, 3.1. Overview; 3.2. Tag prediction].
Jing teaches determining the keyword sequence corresponding to the index sequence according to a keyword index table, by disclosing that the top M tags are used as semantic features for topic generation [Jing, 3.2. Tag prediction, paragraph 3]. In the tag vocabulary, each tag is represented by a word-embedding vector and given the predicted tags for a specific image, their word-embedding vectors are retrieved to serve as the semantic features of this image [Jing, 3.1. Overview, lines 10-14].
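The mapping in section 4-2 can be illustrated with a minimal Python sketch. It is not code from Jing, Simonyan, or the application. It assumes a 224 x 224 input and 2 x 2 max-pooling windows with stride 2, omits the convolutional layers that sit between the pooling layers in VGG, and uses a hypothetical four-entry keyword index table and hypothetical tag scores. All function and variable names are invented for illustration.

```python
# Illustrative sketch only: names, toy data, and the tag table are hypothetical.

def max_pool_2x2(matrix):
    """Apply one 2x2 max-pooling step with stride 2 to a 2-D list."""
    rows, cols = len(matrix), len(matrix[0])
    return [
        [
            max(matrix[r][c], matrix[r][c + 1],
                matrix[r + 1][c], matrix[r + 1][c + 1])
            for c in range(0, cols - 1, 2)
        ]
        for r in range(0, rows - 1, 2)
    ]


def top_m_indices(scores, m):
    """Return the indices of the m highest scores, highest first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:m]


def lookup_keywords(indices, keyword_index_table):
    """Translate an index sequence into a keyword sequence via a lookup table."""
    return [keyword_index_table[i] for i in indices]


# Toy pixel matrix: each value is determined by its (row, column) position.
pixel_matrix = [[(r * 224 + c) % 256 for c in range(224)] for r in range(224)]

# Five pooling steps halve each spatial dimension five times.
pooled = pixel_matrix
for _ in range(5):
    pooled = max_pool_2x2(pooled)
pooled_shape = (len(pooled), len(pooled[0]))

# Hypothetical classifier scores and keyword index table.
keyword_index_table = {0: "cardiomegaly", 1: "effusion", 2: "normal", 3: "opacity"}
scores = [0.10, 0.70, 0.05, 0.15]
keywords = lookup_keywords(top_m_indices(scores, 2), keyword_index_table)
```

Under these assumptions each pooling step halves both spatial dimensions, so five steps reduce 224 x 224 to 7 x 7, and the two highest-scoring indices are translated into keywords through the table.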

4-3.	Regarding claims 3, 13, and 18, Jing teaches all the limitations of claims 1, 12, and 16 respectively, wherein the step of importing the visual feature vector and the keyword sequence into a preset diagnostic item recognition model to determine diagnostic items corresponding to the medical image comprises: generating a keyword feature vector corresponding to the keyword sequence based on sequence numbers of keywords in a preset text corpus, by disclosing that the top M tags are used as semantic features for topic generation [Jing, 3.2. Tag prediction, paragraph 3].
Jing teaches respectively importing the keyword feature vector and the visual feature vector into a preprocessing function to acquire a preprocessed keyword feature vector and a preprocessed visual feature vector, by disclosing using a single layer feed-forward network to compute soft visual and semantic attentions over input image features and tags [Jing, 3.3. Co-Attention]. 
Jing teaches wherein the preprocessing function is specifically as:
$$\sigma(z_j) = \frac{e^{z_j}}{\sum_{i=1}^{M} e^{z_j}}$$
where $\sigma(z_j)$ is a value of j-th element in the preprocessed keyword feature vector or in the preprocessed visual feature vector, $z_j$ is a value of j-th element in the keyword feature vector or in the visual feature vector, M is the number of elements corresponding to the keyword feature vector or the visual feature vector, by disclosing that the loss function used is a softmax cross-entropy loss [4.6. Tag prediction]. The equation above is simply the softmax function.
Jing teaches determining the preprocessed keyword feature vector and the preprocessed visual feature vector as an input of the diagnostic item recognition model, and outputting the diagnostic items, by disclosing combining the visual and semantic context vectors by first concatenating them and then using a fully connected layer to obtain a joint context vector [3.3. Co-Attention, last paragraph]. A sentence LSTM takes the joint context vector as its input, generates a topic vector for the word LSTM through a topic generator, and determines whether to continue or stop generating captions through a stop control component [3.4. Sentence LSTM].
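The softmax function referred to in section 4-3 can be illustrated with a minimal Python sketch. It implements the standard softmax, in which the j-th output is e^(z_j) divided by the sum of e^(z_i) over i = 1 to M. The toy vectors are hypothetical, and the plain concatenation at the end is only a stand-in for forming a joint input; the fully connected layer described by Jing is omitted. This is not code from Jing or the application.

```python
import math


def softmax(z):
    """Standard softmax: exp(z_j) divided by the sum of exp(z_i) over all i."""
    shift = max(z)  # subtracting the maximum avoids overflow; the result is unchanged
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical toy vectors standing in for the keyword and visual feature vectors.
keyword_features = [1.0, 2.0, 3.0]
visual_features = [0.5, 0.5]

pre_keyword = softmax(keyword_features)
pre_visual = softmax(visual_features)

# Plain concatenation as a stand-in for forming a joint input.
joint_input = pre_visual + pre_keyword
```

Each output vector sums to 1 and preserves the ordering of its inputs, which is why the function is commonly used to turn raw scores into normalized weights.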

Claim Rejections - 35 USC § 103
5.	The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

6.	Claims 4, 14, and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Jing et al (“On the Automatic Generation of Medical Imaging Reports,” November 22, 2017) in view of Poghosyan et al (“Long Short-Term Memory with Read-only Unit in Neural Image Caption Generator,” September 29, 2017).

6-1.	Regarding claims 4, 14, and 19, Jing teaches all the limitations of claims 1, 11, and 16 respectively, wherein the method further comprises: acquiring training visual vectors, training keyword sequences and training diagnostic items of a plurality of training images, by disclosing that each training example is a tuple (I, l, w) where I is an image, l denotes the ground-truth tag vector and w is the diagnostic paragraph, which is comprised of S sentences, and each sentence consists of Ts words [Jing, 3.6 Parameter learning, paragraph 1].
Jing teaches determining the training visual vectors and the training keyword sequences as an input of a LSTM neural network, determining the training diagnostic items as an output of the LSTM neural network, and adjusting learning parameters in the LSTM neural network so that the LSTM neural network meets a convergence condition;... determining the adjusted LSTM neural network as the diagnostic item recognition model, by disclosing that given a training example (I, l, w), multi-label classification is first performed for I, producing a distribution p_{l,pred} over all tags, where l is a binary vector which encodes the presence and absence of tags [Jing, 3.6. Parameter Learning, paragraph 2]. The sentence LSTM is then unrolled for S steps to produce topic vectors and distributions [3.6. Parameter learning, paragraph 3]. The training loss of caption generation is the combination of two cross-entropy losses [Jing, 3.6. Parameter learning, paragraph 3].
Jing does not expressly teach wherein the convergence condition is as:
$$\theta^{*} = \arg\max_{\theta} \sum_{\mathrm{Stc}} \log p(\mathrm{Visual}, \mathrm{Keyword} \mid \mathrm{Stc}; \theta)$$
where $\theta^{*}$ is the adjusted learning parameter, Visual is the training visual vector, Keyword is the training keyword sequence, Stc is the training diagnostic item, $p(\mathrm{Visual}, \mathrm{Keyword} \mid \mathrm{Stc}; \theta)$ represents an output result of a probability value of the training diagnostic item when the training visual vector and the training keyword sequence are imported into the LSTM neural network with the value of the learning parameter being $\theta$, and $\arg\max_{\theta} \sum_{\mathrm{Stc}} \log p(\mathrm{Visual}, \mathrm{Keyword} \mid \mathrm{Stc}; \theta)$ is the value of the learning parameter when the probability value takes a maximum value. Poghosyan discloses maximizing the probability of the correct caption for a given image using the convergence condition algorithm above [Poghosyan, II. Model]. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to use the convergence condition as taught by Poghosyan. This would maximize the probability of the correct caption for a given image.
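The form of the convergence condition discussed in section 6-1, choosing the parameter value that maximizes a sum of log-probabilities over training examples, can be illustrated with a minimal Python sketch. The sketch replaces the LSTM with a hypothetical one-parameter model and replaces gradient-based training with a search over a fixed grid of candidate values; both substitutions are assumptions made purely for illustration and are not taken from Jing, Poghosyan, or the application.

```python
import math


def log_likelihood(theta, outcomes):
    """Sum of log-probabilities of the observed outcomes under parameter theta.

    Toy model (hypothetical): each outcome is 1 with probability theta
    and 0 with probability 1 - theta.
    """
    return sum(math.log(theta if y == 1 else 1.0 - theta) for y in outcomes)


def arg_max_theta(outcomes, candidates):
    """Return the candidate parameter with the largest summed log-probability."""
    return max(candidates, key=lambda t: log_likelihood(t, outcomes))


# Hypothetical training outcomes and a grid of candidate parameter values.
outcomes = [1, 1, 1, 0]
candidates = [i / 100 for i in range(1, 100)]

theta_star = arg_max_theta(outcomes, candidates)
```

For this toy data the maximizing value is 0.75, the observed fraction of ones. A neural network would approach the same kind of maximum iteratively rather than by exhaustive search.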

7.	Claims 5, 15, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Jing et al (“On the Automatic Generation of Medical Imaging Reports,” November 22, 2017) in view of Craft of Coding (“Image Binarization (1) : Introduction,” February 13, 2017), hereinafter, Craft.

7-1.	Regarding claims 5, 15, and 20, Jing teaches all the limitations of claims 1, 11, and 16 respectively, wherein, after receiving a medical image to be recognized, the method further comprises:... dividing the medical image into a plurality of medical sub-images, wherein the step of importing the medical image into a preset VGG neural network to acquire a visual feature vector and a keyword sequence of the medical image comprises: respectively importing the medical sub-images into the VGG neural network to acquire visual feature components and keyword sub-sequences of the medical sub-images; generating the visual feature vector based on the visual feature components, and constructing the keyword sequence based on the keyword sub-sequences, by disclosing dividing the given image into a plurality of regions, using a CNN to learn visual features of these patches, and feeding these visual features into a multi-label classification network to predict the relevant tags [Jing, 3.1. Overview, paragraph 1].
	Jing does not expressly teach performing binaryzation on the medical image to acquire a binarized medical image; identifying a boundary of the binarized medical image, and dividing the medical image into a plurality of medical sub-images. Craft discloses that image binarization was well known and is a form of segmentation whereby an image is divided into constituent objects [Craft, paragraph 1]. This is performed when trying to extract an object from an image [Craft, paragraph 1]. Since Jing discloses learning visual features of an image, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to perform, on the medical image of Jing, binarization to divide the image into sub-images, as taught by Craft. This would improve identification of objects within an image.
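The binarization and division steps discussed in section 7-1 can be illustrated with a minimal Python sketch. It assumes a fixed global threshold, treats the bounding box of the foreground pixels as the boundary, and divides the bounded region into square tiles. Those choices, the function names, and the toy image are hypothetical; neither Craft nor Jing is represented here as using this particular procedure.

```python
def binarize(image, threshold):
    """Map each pixel to 1 (foreground) or 0 (background) using a fixed threshold."""
    return [[1 if px >= threshold else 0 for px in row] for row in image]


def bounding_box(binary):
    """Return (top, bottom, left, right) of the foreground region, inclusive.

    Assumes the binarized image contains at least one foreground pixel.
    """
    rows = [r for r, row in enumerate(binary) if any(row)]
    cols = [c for c in range(len(binary[0])) if any(row[c] for row in binary)]
    return rows[0], rows[-1], cols[0], cols[-1]


def split_into_tiles(image, box, tile):
    """Cut the region inside the bounding box into tile x tile sub-images."""
    top, bottom, left, right = box
    tiles = []
    for r in range(top, bottom + 1, tile):
        for c in range(left, right + 1, tile):
            tiles.append([
                row[c:min(c + tile, right + 1)]
                for row in image[r:min(r + tile, bottom + 1)]
            ])
    return tiles


# Hypothetical 6x6 grayscale image with a dark border around a bright region.
image = [
    [0, 0, 0, 0, 0, 0],
    [0, 90, 80, 70, 60, 0],
    [0, 85, 95, 75, 65, 0],
    [0, 88, 91, 72, 61, 0],
    [0, 87, 93, 71, 66, 0],
    [0, 0, 0, 0, 0, 0],
]

binary = binarize(image, 50)
box = bounding_box(binary)
sub_images = split_into_tiles(image, box, 2)
```

With the toy image, the boundary encloses the bright 4 x 4 region and the division yields four 2 x 2 sub-images, each of which could then be passed separately to a feature extractor.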

Conclusion
8.	Any inquiry concerning this communication or earlier communications from the examiner should be directed to ALVIN H TAN whose telephone number is (571)272-8595. The examiner can normally be reached M-F 10AM-6PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Stephen Hong, can be reached at 571-272-4124. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/ALVIN H TAN/Primary Examiner, Art Unit 2178