DETAILED ACTION

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection.  Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114.  Applicant's submission filed on 3 January 2022 has been entered.

Response to Amendment
Applicant’s response, filed 3 January 2022, to the last Office action has been entered and made of record.
The cancellation of claims 6 and 14 is acknowledged and made of record.
The amendments to the claims are acknowledged; they are supported by the original disclosure, and no new matter is added.
The addition of new claims 27-28 is acknowledged and made of record.

EXAMINER’S AMENDMENT
An examiner’s amendment to the record appears below. Should the changes and/or additions be unacceptable to applicant, an amendment may be filed as provided by 37 CFR 1.312. To ensure consideration of such an amendment, it MUST be submitted no later than the payment of the issue fee.
Authorization for this examiner’s amendment was given in an interview with Junqi Hang (Reg No. 54,615) on 7 March 2022.
The application has been amended as follows: 

Claim 1. (Currently Amended) A subtitle extraction method, comprising: 
decoding a video to obtain video frames; 
performing adjacency operation in a subtitle arrangement direction on pixels in the video frames to obtain adjacency regions in the video frames; 
determining certain video frames include a same subtitle based on the adjacency regions in the video frames, wherein the step of determining includes:
conducting a mode 1 of obtaining a difference value between pixels of the adjacency regions in the certain video frames;
conducting a mode 2 of extracting feature points from the adjacency regions in the certain video frames via a scale-invariant feature transform (SIFT) algorithm; and
combining the mode 1 and the mode 2 such that upon determining the difference value between the pixels of the adjacency regions is less than a difference threshold and that the feature points extracted from the adjacency regions are matched, determining the certain video frames include the same subtitle; 
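after determining the certain video frames include the same subtitle, superimposing subtitle regions of the certain video frames including the same subtitle and averaging the subtitle regions as superimposed to form a new subtitle region, wherein the subtitle regions are averaged to obtain a mean value as the new subtitle region;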
constructing a component tree for at least two channels of the new subtitle region in the certain video frames, and using the constructed component tree to extract contrasting extremal regions respectively corresponding to the at least two channels, wherein the component tree includes a series of nodes nested in order from bottom to top, and an area-change-rate $R_{\Delta S}$ between a node (N, i) and an ancestor node (N, i+Δ) is represented by formula $R_{\Delta S}(n_i, n_{i+\Delta}) = \frac{S_{n_{i+\Delta}} - S_{n_i}}{S_{n_i}}$, $S_{n_i}$ represents an area of the node (N, i), $S_{n_{i+\Delta}}$ represents an area of the ancestor node (N, i), and the contrasting extremal regions are extracted according to the area-change-rate $R_{\Delta S}$, wherein the node (N, i) and the ancestor node (N, i+Δ) of the component tree correspond to characters of the new subtitle region, and upon determining the area-change-rate $R_{\Delta S}$ is less than an area-change rate threshold, the node (N, i) is determined to belong to the contrasting extremal regions of the at least two channels; 
performing color enhancement processing on the contrasting extremal regions of the at least two channels to form color-enhanced contrasting extremal regions; and
extracting the same subtitle by merging the color-enhanced contrasting extremal regions of the at least two channels.

Claim 9. (Currently Amended) A subtitle extraction device, comprising: a memory storing computer program instructions; and a processor coupled to the memory and, upon executing the computer program instructions, configured to perform:
decoding a video to obtain video frames; 
performing adjacency operation in a subtitle arrangement direction on pixels in the video frames to obtain adjacency regions in the video frames; 
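determining certain video frames include a same subtitle based on the adjacency regions in the video frames, wherein the step of determining includes: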
conducting a mode 1 of obtaining a difference value between pixels of the adjacency regions in the certain video frames;
conducting a mode 2 of extracting feature points from the adjacency regions in the certain video frames via a scale-invariant feature transform (SIFT) algorithm; and
combining the mode 1 and the mode 2 such that upon determining the difference value between the pixels of the adjacency regions is less than a difference threshold, and that the feature points extracted from the adjacency regions are matched, determining the certain video frames include the same subtitle; 
after determining the certain video frames include the same subtitle, superimposing subtitle regions of the certain video frames including the same subtitle and averaging the subtitle regions as superimposed to form a new subtitle region, wherein the subtitle regions are averaged to obtain a mean value as the new subtitle region;
constructing a component tree for at least two channels of the new subtitle region in the certain video frames, and using the constructed component tree to extract contrasting extremal regions respectively corresponding to the at least two channels, wherein the component tree includes a series of nodes nested in order from bottom to top, and an area-change-rate $R_{\Delta S}$ between a node (N, i) and an ancestor node (N, i+Δ) is represented by formula $R_{\Delta S}(n_i, n_{i+\Delta}) = \frac{S_{n_{i+\Delta}} - S_{n_i}}{S_{n_i}}$, $S_{n_i}$ represents an area of the node (N, i), $S_{n_{i+\Delta}}$ represents an area of the ancestor node (N, i), and the contrasting extremal regions are extracted according to the area-change-rate $R_{\Delta S}$, wherein the node (N, i) and the ancestor node (N, i+Δ) of the component tree correspond to characters of the new subtitle region, and upon determining the area-change-rate $R_{\Delta S}$ is less than an area-change rate threshold, the node (N, i) is determined to belong to the contrasting extremal regions of the at least two channels; 
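performing color enhancement processing on the contrasting extremal regions of the at least two channels to form color-enhanced contrasting extremal regions; and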
extracting the same subtitle by merging the color-enhanced contrasting extremal regions of the at least two channels.

Claim 17. (Currently Amended) A non-transitory computer-readable storage medium storing computer program instructions executable by at least one processor to perform:
decoding a video to obtain video frames; 
performing adjacency operation in a subtitle arrangement direction on pixels in the video frames to obtain adjacency regions in the video frames; 
determining certain video frames include a same subtitle based on the adjacency regions in the video frames, wherein the step of determining includes:
conducting a mode 1 of obtaining a difference value between pixels of the adjacency regions in the certain video frames; 
conducting a mode 2 of extracting feature points from the adjacency regions in the certain video frames via a scale-invariant feature transform (SIFT) algorithm; and
combining the mode 1 and the mode 2 such that upon determining the difference value between the pixels of the adjacency regions is less than a difference threshold and that the feature points extracted from the adjacency regions are matched, determining the certain video frames include the same subtitle; 
after determining the certain video frames include the same subtitle, superimposing subtitle regions of the certain video frames including the same subtitle and averaging the subtitle regions as superimposed to form a new subtitle region, wherein the subtitle regions are averaged to obtain a mean value as the new subtitle region;
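constructing a component tree for at least two channels of the new subtitle region in the certain video frames, and using the constructed component tree to extract contrasting extremal regions respectively corresponding to the at least two channels, wherein the component tree includes a series of nodes nested in order from bottom to top, and an area-change-rate $R_{\Delta S}$ between a node (N, i) and an ancestor node (N, i+Δ) is represented by formula $R_{\Delta S}(n_i, n_{i+\Delta}) = \frac{S_{n_{i+\Delta}} - S_{n_i}}{S_{n_i}}$, $S_{n_i}$ represents an area of the node (N, i), $S_{n_{i+\Delta}}$ represents an area of the ancestor node (N, i), and the contrasting extremal regions are extracted according to the area-change-rate $R_{\Delta S}$, wherein the node (N, i) and the ancestor node (N, i+Δ) of the component tree correspond to characters of the new subtitle region, and upon determining the area-change-rate $R_{\Delta S}$ is less than an area-change rate threshold, the node (N, i) is determined to belong to the contrasting extremal regions of the at least two channels; 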
performing color enhancement processing on the contrasting extremal regions of the at least two channels to form color-enhanced contrasting extremal regions; and
extracting the same subtitle by merging the color-enhanced contrasting extremal regions of the at least two channels.

Allowable Subject Matter
Claims 1, 2, 5, 7-10, 13, 15-18, 20, 22-23, and 25-28 are allowed.
The following is an examiner’s statement of reasons for allowance: 
Regarding the subject matter of the amended independent claims 1, 9, and 17, the prior art of record, alone or in combination, fails to fairly teach or suggest, when combined with the other recited claimed subject matter, the following limitations:
“determining certain video frames include a same subtitle based on the adjacency regions in the video frames, wherein the step of determining includes:
conducting a mode 1 of obtaining a difference value between pixels of the adjacency regions in the certain video frames; 
conducting a mode 2 of extracting feature points from the adjacency regions in the certain video frames via a scale-invariant feature transform (SIFT) algorithm; and
combining the mode 1 and the mode 2 such that upon determining the difference value between the pixels of the adjacency regions is less than a difference threshold and that the feature points extracted from the adjacency regions are matched, determining the certain video frames include the same subtitle; 
after determining the certain video frames include the same subtitle, superimposing subtitle regions of the certain video frames including the same subtitle and averaging the subtitle regions as superimposed to form a new subtitle region, wherein the subtitle regions are averaged to obtain a mean value as the new subtitle region;
constructing a component tree for at least two channels of the new subtitle region in the certain video frames, and using the constructed component tree to extract contrasting extremal regions respectively corresponding to the at least two channels, wherein the component tree includes a series of nodes nested in order from bottom to top, and an area-change-rate $R_{\Delta S}$ between a node (N, i) and an ancestor node (N, i+Δ) is represented by formula $R_{\Delta S}(n_i, n_{i+\Delta}) = \frac{S_{n_{i+\Delta}} - S_{n_i}}{S_{n_i}}$, $S_{n_i}$ represents an area of the node (N, i), $S_{n_{i+\Delta}}$ represents an area of the ancestor node (N, i), and the contrasting extremal regions are extracted according to the area-change-rate $R_{\Delta S}$, wherein the node (N, i) and the ancestor node (N, i+Δ) of the component tree correspond to characters of the new subtitle region, and upon determining the area-change-rate $R_{\Delta S}$ is less than an area-change rate threshold, the node (N, i) is determined to belong to the contrasting extremal regions of the at least two channels”.
Previously cited Hirayama reference is relied upon to teach in a related and pertinent subtitle detection method, the known techniques of obtaining video frames from decoding video data streams 
Previously cited Yusufu reference is relied upon to teach a known technique for text tracking in video images, where SIFT features are detected in detected text regions and text disappearing frames are determined based on the change in the number of features in the text regions; where the number of feature points in corresponding regions of neighboring frames changes drastically, the current frame can be determined to be a text disappearing frame.
	While the combined teachings of Hirayama and Yusufu would suggest to one of ordinary skill in the art determining that subtitle regions of video frames have changed by satisfying at least one of two conditions, namely (1) taking the sum of luminance level differences of corresponding pixels between two neighboring frames and finding it less than a predetermined threshold, or (2) performing SIFT feature detection on the subtitle regions and matching the number of detected features to detect text disappearing frames, the combined teachings of Hirayama and Yusufu do not fairly teach or suggest using both conditions in combination to determine that subtitle regions of video frames have changed. 
	Thus, the combined teachings of Hirayama and Yusufu do not fairly teach or suggest, alone or in combination, “upon determining the difference value between the pixels of the adjacency regions is less than a difference threshold and that the feature points extracted from the adjacency regions are matched, determining the certain video frames include the same subtitle”.
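
	For illustration only, the distinction drawn above between satisfying either condition and satisfying both conditions can be sketched as follows. This is a minimal sketch under stated assumptions and is not taken from the application or the cited references: the names `mean_abs_difference` and `same_subtitle` are hypothetical, the difference value is assumed to be a mean absolute pixel difference, and the SIFT matching result is assumed to be computed elsewhere and passed in as the flag `features_matched`.

```python
from typing import Sequence

# Hypothetical type: a region is a list of rows of grayscale pixel values.
Region = Sequence[Sequence[float]]


def mean_abs_difference(region_a: Region, region_b: Region) -> float:
    """Return the mean absolute pixel difference of two same-sized regions.

    Assumption: the claimed "difference value" is modeled as a mean
    absolute difference; the source does not fix a particular measure.
    """
    if len(region_a) != len(region_b) or not region_a:
        raise ValueError("regions must be non-empty and have the same height")
    total = 0.0
    count = 0
    for row_a, row_b in zip(region_a, region_b):
        if len(row_a) != len(row_b):
            raise ValueError("regions must have the same width")
        for a, b in zip(row_a, row_b):
            total += abs(a - b)
            count += 1
    if count == 0:
        raise ValueError("regions must contain at least one pixel")
    return total / count


def same_subtitle(
    region_a: Region,
    region_b: Region,
    features_matched: bool,
    diff_threshold: float,
) -> bool:
    """Combine mode 1 and mode 2: both conditions must hold.

    features_matched is assumed to come from a separate SIFT matching
    step (mode 2); it is an input here, not computed by this sketch.
    """
    difference_small = mean_abs_difference(region_a, region_b) < diff_threshold
    return difference_small and features_matched
```

	Under these assumptions, two frames are treated as carrying the same subtitle only when the pixel difference is below the threshold and the feature points are matched; neither condition is sufficient by itself.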
	

Previously cited Sun reference is relied upon to teach the known technique of performing robust text detection in images based on generalized color-enhanced contrasting extremal regions (CER), where component trees are built for hue and saturation channel images, pruned, and CERs are extracted from the remaining extremal regions on each component tree. 
	Previously cited Sun2015 reference is relied upon to teach, in a related method of text detection based on color-enhanced CER, that component trees are built as max-tree types comprising sequences of nested extremal region nodes, and that the area variation between a node and its ancestor node is calculated to determine whether the node is considered to be a contrasting extremal region. Sun2015 further teaches that a node is considered to be a contrasting extremal region if the area variation between the node and its ancestor node is less than a threshold (see Sun2015 p. 2910, left column). 
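
	For illustration only, the area variation test described above, which corresponds to the area-change-rate formula recited in the claims, can be sketched as follows. This is a minimal sketch under stated assumptions: the names `area_change_rate` and `select_stable_nodes` are hypothetical, and the component tree is reduced to a single branch given as a list of node areas ordered from bottom to top, which is a simplification of the nested structure described in the claims and in Sun2015.

```python
from typing import List, Sequence


def area_change_rate(area_node: float, area_ancestor: float) -> float:
    """Return (S_ancestor - S_node) / S_node for one node/ancestor pair."""
    if area_node <= 0:
        raise ValueError("node area must be positive")
    return (area_ancestor - area_node) / area_node


def select_stable_nodes(
    areas: Sequence[float], delta: int, threshold: float
) -> List[int]:
    """Return indices i whose rate to the ancestor at i + delta is below threshold.

    Assumption: `areas` lists the areas of nested nodes along one branch,
    ordered from bottom to top, so areas[i + delta] is the ancestor of areas[i].
    """
    if delta < 1:
        raise ValueError("delta must be at least 1")
    return [
        i
        for i in range(len(areas) - delta)
        if area_change_rate(areas[i], areas[i + delta]) < threshold
    ]
```

	With hypothetical areas of 100, 104, 108, 300, and 310 pixels, a step of 1, and a threshold of 0.1, the nodes at indices 0, 1, and 3 pass the test, while the node at index 2 does not, because its ancestor is nearly three times larger.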
	While the teachings of previously cited Agnihotri, Sun, and Sun2015, combined with the teachings of previously cited Sang, Hirayama, and Yusufu references, would suggest to one of ordinary skill in the art that detected subtitle regions of successive frames may be integrated and averaged to improve the clarity of the text regions, and that text detection may be performed by constructing component trees to extract CER, computing the area variation between a node and its ancestor node, and determining that the area variation is less than a threshold; the combined teachings of the cited prior art references do not fairly teach or suggest that the determination that certain video frames include the same subtitle is based upon determining that the difference value between the pixels of the adjacency regions in the certain video frames is less than a difference threshold and that the feature points extracted from the adjacency regions in the certain video frames via a SIFT algorithm are matched.
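
	For illustration only, the integration and averaging of subtitle regions from successive frames referred to above can be sketched as a pixelwise mean. This is a minimal sketch under stated assumptions: the name `average_regions` is hypothetical, and each region is assumed to be a single-channel image of identical size, given as a list of rows.

```python
from typing import List, Sequence


def average_regions(regions: Sequence[Sequence[Sequence[float]]]) -> List[List[float]]:
    """Superimpose same-sized regions and return their pixelwise mean."""
    if not regions:
        raise ValueError("at least one region is required")
    height = len(regions[0])
    width = len(regions[0][0]) if height else 0
    for region in regions:
        if len(region) != height or any(len(row) != width for row in region):
            raise ValueError("all regions must have the same size")
    n = len(regions)
    return [
        [sum(region[y][x] for region in regions) / n for x in range(width)]
        for y in range(height)
    ]
```

	Averaging in this way keeps pixels that are stable across the frames, such as subtitle strokes, while smoothing background pixels that vary from frame to frame.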

	Khodadadi et al. (“Text Localization, Extraction and Inpainting in Color Images”) is pertinent in teaching a method for text localization based on computing the image gradients and extracting the text by performing color segmentation.
	Lim et al. (“Text Extraction in MPEG Compressed Video for Content-based Indexing”) is pertinent in teaching a method for extracting text from compressed video, where a text frame is detected based on satisfying two conditions: the pixel intensity is higher than a threshold, and the neighboring pixel difference is greater than a second threshold. 
	Tang et al. (“A Spatial-Temporal Approach for Video Caption Detection and Recognition”) is pertinent in teaching a video caption detection and recognition system, where a fuzzy-clustering neural network (FCNN) classifier is used to segment a video sequence into camera shots, caption transitions are detected based on a quantized spatial difference density metric which computes a pairwise pixel difference between adjacent frame pairs, and the FCNN classifier is used to locate caption regions in the transition frame difference. 
	Zhao et al. (“Text From Corners: A Novel Approach to Detect Text and Caption in Videos”) is pertinent in teaching a method for detecting text and caption in videos, where corner features are used to detect text regions and optical flow computed motion features are combined to detect moving captions. 

The above noted references fail to fairly teach or suggest the combination of claimed subject matter of “determining certain video frames include a same subtitle based on the adjacency regions in the video frames, wherein the step of determining includes: conducting a mode 1 of obtaining a difference value between pixels of the adjacency regions in the certain video frames; conducting a mode 2 of extracting feature points from the adjacency regions in the certain video frames via a scale-invariant feature transform (SIFT) algorithm; and combining the mode 1 and the mode 2 upon determining the difference value between the pixels of the adjacency regions is less than a difference threshold and that the feature points extracted from the adjacency regions are matched, determining the certain video frames include the same subtitle”.

Regarding claims 2, 5, 7-8, 10, 13, 15-16, 18, 20, 22-23, and 25-28, they are dependent claims of independent claims 1, 9, and 17, which incorporate the allowable subject matter of the independent claims they depend from, and are therefore allowed.

Any comments considered necessary by applicant must be submitted no later than the payment of the issue fee and, to avoid processing delays, should preferably accompany the issue fee.  Such submissions should be clearly labeled “Comments on Statement of Reasons for Allowance.”

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to TIMOTHY WING HO CHOI whose telephone number is (571)270-3814. The examiner can normally be reached 9:00 AM to 5:00 PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, VINCENT RUDOLPH, can be reached on (571) 272-8243. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.




/TIMOTHY CHOI/Examiner, Art Unit 2661

/VINCENT RUDOLPH/Supervisory Patent Examiner, Art Unit 2661