DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection.  Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114.  Applicant's submission filed on 08/17/2022 has been entered.
Response to Arguments
Applicant’s arguments with respect to claims 25-48 have been considered but are moot in view of the new ground of rejection set forth below.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claims 25, 28, 29, 31, 34, 35, 37, 40, 41, 43, 46 and 47 are rejected under 35 U.S.C. 103 as being unpatentable over Zhou et al. ("EAST: An Efficient and Accurate Scene Text Detector", published July 10, 2017) in view of Lin et al. (US Patent No. US 10445569 B1).
Regarding Claim 25, Zhou teaches an electronic processing system, comprising: a processor (Page 8, Left column, Paragraph 1, These experiments were conducted on a server using a single NVIDIA Titan X graphic card with Maxwell architecture and an Intel E5-2670 v3 @ 2.30GHz CPU.); memory communicatively coupled to the processor (Page 8, Left column, Paragraph 1, These experiments were conducted on a server using a single NVIDIA Titan X graphic card with Maxwell architecture and an Intel E5-2670 v3 @ 2.30GHz CPU. Memory would necessarily be coupled to the processor in order for the system to function properly.); and logic communicatively coupled to the processor (Page 3, Section 3.1. Pipeline, Paragraph 1, A high-level overview of our pipeline is illustrated in Fig. 2(e). The algorithm follows the general design of DenseBox [9], in which an image is fed into the FCN and multiple channels of pixel-level text score map and geometry are generated. The examiner interprets the CPU of the prior art as executing an algorithm that performs the scene text recognition.) to: apply a trained scene text detection network (Page 2, Section 3 Methodology, Paragraph 1, The key component of the proposed algorithm is a neural network model, which is trained to directly predict the existence of text instances and their geometries from full image.) to an image to identify a core text region based on the core text region being determined to include text, a supportive text region based on an identification that the supportive text region is associated with the core text region and is mixed with background information, (Page 2, Left Column, Paragraph 1, The contributions of this work are three-fold: we propose a scene text detection method that consists of two stages: a Fully Convolutional Network and an NMS merging stage. The FCN directly produces text regions, excluding redundant and time-consuming intermediate steps.
The pipeline is flexible to produce either word level or line level predictions, whose geometric shapes can be rotated boxes or quadrangles, depending on specific applications. Page 4, Section 3.3.1 Score Map Generation for Quadrangle, Paragraph 1, we only consider the case where the geometry is a quadrangle. The positive area of the quadrangle on the score map is designed to be roughly a shrunk version of the original one, illustrated in Fig. 4 (a). For a quadrangle Q = {p_i | i ∈ {1, 2, 3, 4}}, where p_i = {x_i, y_i} are vertices on the quadrangle in clockwise order. The examiner interprets that, as seen in Figure 4(a), the text quadrangle annotated with yellow dashed lines represents the supportive text region and the shrunk quadrangle represents the core text region.)
and a background region of the image (Page 1, Section 1 Introduction, Paragraph 1, The core of text detection is the design of features to distinguish text from backgrounds. The examiner interprets the prior art as being able to distinguish the text region from the background region of the image.),
generate a first border for the core text region (Page 4, Section 3.3.1 Score Map Generation for Quadrangle, Paragraph 1, The positive area of the quadrangle on the score map is designed to be roughly a shrunk version of the original one. Figure 4(a) shows a green shrunken quadrangle area corresponding to the core text.), and detect text in the image based on the identified core text region and the supportive text region (Page 8, Section 5 Conclusion and Future Work, Paragraph 1, we have presented a scene text detector that directly produces word or line level predictions from full images with a single neural network. By incorporating proper loss functions, the detector can predict either rotated rectangles or quadrangles for text regions, depending on specific applications. The examiner interprets the prior art as detecting text based on the identified text regions in the image.).
However, Zhou does not explicitly teach: expand the first border to generate a second border, determine that the second border is to be retained when the second border is positioned to encompass at least a portion of the supportive text region, and determine that the second border is to be rejected when the second border bypasses an encompassment of at least the portion of the supportive text region.
Lin teaches expand the first border to generate a second border (Col 7, Lines 50-55, The region refining component 228 can be configured to refine the bounding box coordinates to more accurately fit the text represented in the image. Refining the bounding box can include changing a size (e.g., bigger, smaller) of a bounding box, repositioning a bounding box, changing a shape of a bounding box, or a combination thereof. The examiner interprets the prior art as generating a border around text regions of an image and resizing the border around the text.), determine that the second border is to be retained when the second border is positioned to encompass at least a portion of the supportive text region (Col 7, Lines 63-67, The text recognizer component 230 is configured to analyze the bounding boxes proposed by the region refining component and can generate a classification vector or other categorization value that indicates the probability that a respective bounding box includes an instance of a certain word. The classification vector can include an entry (i.e., a probability) for each of the categories (e.g., words) the text recognizer component is trained to recognize. In various embodiments, the word recognizer can be a CNN (e.g., a trained word recognition neural network) that maps a bounding box with text to a word. The examiner interprets that the text recognizer determines whether the bounding box includes an instance of a word and then keeps the border if it does.), determine that the second border is to be rejected when the second border bypasses an encompassment of at least the portion of the supportive text region (Col 8, Lines 7-26, The post processing component 232 can be configured to suppress overlapping words to generate a final set of words. For example, in various embodiments, the output of region proposal filtering component contains a lot of overlapping words. A post processing step can be used to eliminate these duplications.
In accordance with various embodiments, the post processing component can perform non-maximum suppression (NMS) of overlapping words. In this example, two kinds of NMS can be used: per-word NMS and cross-word NMS, where NMS can be interleaved with the region refining process. As an example, a variant of bounding box regression called word-end regression can be used. For example, the networks employed are the same, but the extracted regions are only around the two ends of (long) words. In accordance with various embodiments, after several iterations of refinement, the position of the bounding boxes might change. Accordingly, the text recognizer component can be rerun to relabel the bounding boxes. Finally, a grouping step is performed to eliminate words that are contained inside other words. The examiner interprets that, if there are overlapping words or two ends of a long word, the border is rejected or split into another bounding box.)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the system taught by Zhou with the teachings of Lin in order to determine borders around text in an image. One skilled in the art would have been motivated to modify Zhou in this manner in order to improve text recognition precision (Lin, Col 1, Lines 26-27).
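For illustration only, the border limitations discussed above (expanding a first border into a second border, then retaining or rejecting the second border depending on whether it encompasses at least a portion of the supportive text region) can be sketched as follows. This is a minimal sketch under stated assumptions and is not the method of Zhou, of Lin, or of the instant application: borders are simplified to axis-aligned boxes, and the names `Box`, `expand`, `overlaps` and `keep_second_border`, as well as the fixed `margin`, are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned box (assumption: real borders may be rotated quadrangles)."""
    x0: float
    y0: float
    x1: float
    y1: float


def expand(box: Box, margin: float) -> Box:
    """Grow the first border by a fixed margin to obtain a second border."""
    return Box(box.x0 - margin, box.y0 - margin, box.x1 + margin, box.y1 + margin)


def overlaps(a: Box, b: Box) -> bool:
    """True when the two boxes share a region of positive area."""
    return min(a.x1, b.x1) > max(a.x0, b.x0) and min(a.y1, b.y1) > max(a.y0, b.y0)


def keep_second_border(first: Box, supportive: Box, margin: float) -> bool:
    """Retain the expanded border only if it covers part of the supportive region."""
    return overlaps(expand(first, margin), supportive)


# Hypothetical coordinates chosen only to exercise both outcomes.
core_border = Box(10, 10, 50, 20)
supportive_region = Box(52, 10, 60, 20)   # lies just outside the core border
far_region = Box(200, 200, 210, 210)      # unrelated area of the image

print(keep_second_border(core_border, supportive_region, margin=4))  # True
print(keep_second_border(core_border, far_region, margin=4))         # False
```

With a margin of 4 the expanded border reaches the nearby supportive region and is retained; the same border is rejected against the distant region because the two boxes share no area.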
Regarding Claim 28, the combination of Zhou and Lin teaches the system of claim 25, wherein the logic is further to: train the scene text detection network with a plurality of image training samples (Zhou, Page 5, Section 3.5 Training, Paragraph 1, The network is trained end-to-end using ADAM [18] optimizer. To speed up learning, we uniformly sample 512x512 crops from images to form a minibatch of size 24. Learning rate of ADAM starts from 1e-3, decays to one-tenth every 27300 minibatches, and stops at 1e-5. The network is trained until performance stops improving.), the scene text detection network including a dense features portion (Zhou, Page 3, Section 3.2 Network Design, Paragraph 3, A schematic view of our model is depicted in Fig. 3. The model can be decomposed in to three parts: feature extractor stem, feature-merging branch and output layer. The stem can be a convolutional network pre-trained on ImageNet [4] dataset, with interleaving convolution and pooling layers. The examiner interprets the feature extractor stem shown in Fig. 3 of the prior art as equivalent to the dense features portion shown in Fig. 10B of the instant application.), a reverse connections portion communicatively coupled to the dense features portion (Zhou, Page 3, Section 3.2 Network Design, Paragraph 3, A schematic view of our model is depicted in Fig. 3. The model can be decomposed in to three parts: feature extractor stem, feature-merging branch and output layer. In each merging stage, the feature map from the last stage is first fed to an unpooling layer to double its size, and then concatenated with the current feature map. Next, a conv1×1 bottleneck [8] cuts down the number of channels and reduces computation.
The examiner interprets the feature-merging branch as having the same functionality as the reverse connections portion, since ¶[0042] of the specification of the instant application states that the reverse connections portion leverages semantic information from previous layers or stages to increase information flow, and the feature-merging branch takes the input from the feature extractor stem and concatenates the data to improve the information flow; it is coupled to the feature extractor stem as shown in Figure 3 of the prior art.), and a stage losses portion communicatively coupled to the reverse connections portion (Zhou, Page 3, Section 3.2 Network Design, Paragraph 3, A schematic view of our model is depicted in Fig. 3. The model can be decomposed in to three parts: feature extractor stem, feature-merging branch and output layer. In the feature-merging branch, ... The examiner interprets the output layer as equivalent to the stage losses portion, and it is coupled to the feature-merging branch as shown in Figure 3 of the prior art. Page 4, Section 3.4 Loss functions, Paragraph 1, The loss can be formulated as L = L_s + λ_g L_g (4), where L_s and L_g represents the losses for the score map and the geometry, respectively, and λ_g weighs the importance between two losses. The examiner interprets the prior art’s output layer as using a loss function to calculate the loss between the score map and the geometry when performing scene text recognition, which is similar to the process of the stage losses portion shown in Figure 10E.)
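The training schedule and combined loss quoted above from Zhou can be illustrated numerically. The following is an illustrative sketch only: the function names `learning_rate`, `reached_final_rate` and `total_loss` are hypothetical, the step decay is one reading of "decays to one-tenth every 27300 minibatches", and the default `lambda_g` value is a placeholder rather than a value taken from the record.

```python
DECAY_EVERY = 27300   # minibatches between decays, as quoted from Zhou
INITIAL_LR = 1e-3     # starting learning rate, as quoted from Zhou
FINAL_LR = 1e-5       # learning rate at which the schedule stops, as quoted


def learning_rate(minibatch: int) -> float:
    """Step decay: divide the rate by ten every DECAY_EVERY minibatches."""
    return INITIAL_LR * 0.1 ** (minibatch // DECAY_EVERY)


def reached_final_rate(minibatch: int) -> bool:
    """True once the schedule has decayed to FINAL_LR (with float tolerance)."""
    return learning_rate(minibatch) <= FINAL_LR * (1 + 1e-9)


def total_loss(score_loss: float, geometry_loss: float, lambda_g: float = 1.0) -> float:
    """L = L_s + lambda_g * L_g; the default lambda_g is a placeholder assumption."""
    return score_loss + lambda_g * geometry_loss


for step in (0, 27300, 54600):
    print(step, learning_rate(step), reached_final_rate(step))

print(total_loss(0.25, 0.5, lambda_g=2.0))  # 1.25
```

Under this reading, the rate drops from 1e-3 to roughly 1e-4 after 27300 minibatches and reaches roughly 1e-5 after 54600 minibatches.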
Regarding Claim 29, the combination of Zhou and Lin teaches the system of claim 28, wherein the logic is further to: support large receptive field features for the dense features portion (Zhou, Page 3, Section 3.2 Network Design, Paragraph 3, The stem can be a convolutional network pre-trained on ImageNet [4] dataset, with interleaving convolution and pooling layers. Four levels of feature maps, denoted as f_i, are extracted from the stem, whose sizes are 1/32, 1/16, 1/8 and 1/4 of the input image, respectively. In Fig. 3, PVANet [17] is depicted. In our experiments, we also adopted the well-known VGG16 [32] model, where feature maps after pooling-2 to pooling-5 are extracted. The examiner interprets the feature extractor stem as extracting four levels of feature maps after the pooling layers, which is similar to the process of the Larger RF features shown in Figure 10C of the instant application.)
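As a worked illustration of the four feature-map scales quoted above (1/32, 1/16, 1/8 and 1/4 of the input image), the following sketch computes the spatial sizes for a 512x512 crop, the crop size mentioned in Zhou's training section. The function name `feature_map_sizes` is hypothetical, and requiring the input side to be a multiple of 32 is an assumption made so that every scale divides evenly.

```python
from fractions import Fraction

# Scales of the four feature maps relative to the input, as quoted from Zhou.
SCALES = (Fraction(1, 32), Fraction(1, 16), Fraction(1, 8), Fraction(1, 4))


def feature_map_sizes(side: int) -> list[int]:
    """Return the side length of each feature map for a square input image."""
    if side % 32 != 0:
        # Assumption for this sketch: keep every scale an exact integer.
        raise ValueError("input side must be a multiple of 32")
    return [int(side * scale) for scale in SCALES]


print(feature_map_sizes(512))  # [16, 32, 64, 128]
```

Each successive map is twice the side length of the previous one, which is consistent with the quoted merging stages that double the feature-map size by unpooling.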
Regarding Claim 31, Zhou teaches a semiconductor package apparatus, comprising: one or more substrates (Zhou, Page 8, Left column, Paragraph 1, These experiments were conducted on a server using a single NVIDIA Titan X graphic card with Maxwell architecture and an Intel E5-2670 v3 @ 2.30GHz CPU. The examiner interprets that the CPU would inherently contain a substrate inside the chip of the CPU.); and logic coupled to the one or more substrates, wherein the logic is at least partly implemented in one or more of configurable logic and fixed-functionality hardware logic, the logic coupled to the one or more substrates to (Zhou, Page 3, Section 3.1. Pipeline, Paragraph 1, A high-level overview of our pipeline is illustrated in Fig. 2(e). The algorithm follows the general design of DenseBox [9], in which an image is fed into the FCN and multiple channels of pixel-level text score map and geometry are generated. The examiner interprets the prior art as having an algorithm, coupled to the CPU, that performs the scene text recognition by an FCN.):
apply a trained scene text detection network (Page 2, Section 3 Methodology, Paragraph 1, The key component of the proposed algorithm is a neural network model, which is trained to directly predict the existence of text instances and their geometries from full image.) to an image to identify a core text region based on the core text region being determined to include text, a supportive text region based on a determination that the supportive text region is associated with the core text region and is mixed with background information (Page 2, Left Column, Paragraph 1, The contributions of this work are three-fold: we propose a scene text detection method that consists of two stages: a Fully Convolutional Network and an NMS merging stage. The FCN directly produces text regions, excluding redundant and time-consuming intermediate steps. The pipeline is flexible to produce either word level or line level predictions, whose geometric shapes can be rotated boxes or quadrangles, depending on specific applications. Page 4, Section 3.3.1 Score Map Generation for Quadrangle, Paragraph 1, we only consider the case where the geometry is a quadrangle. The positive area of the quadrangle on the score map is designed to be roughly a shrunk version of the original one, illustrated in Fig. 4 (a). For a quadrangle Q = {p_i | i ∈ {1, 2, 3, 4}}, where p_i = {x_i, y_i} are vertices on the quadrangle in clockwise order. The examiner interprets that, as seen in Figure 4(a), the text quadrangle annotated with yellow dashed lines represents the supportive text region and the shrunk quadrangle represents the core text region.)
and a background region of the image (Page 1, Section 1 Introduction, Paragraph 1, The core of text detection is the design of features to distinguish text from backgrounds. The examiner interprets the prior art as being able to distinguish the text region from the background region of the image.), generate a first border for the core text region (Page 4, Section 3.3.1 Score Map Generation for Quadrangle, Paragraph 1, The positive area of the quadrangle on the score map is designed to be roughly a shrunk version of the original one. Figure 4(a) shows a green shrunken quadrangle area corresponding to the core text.),
and detect text in the image based on the identified core text region and the supportive text region (Page 8, Section 5 Conclusion and Future Work, Paragraph 1, we have presented a scene text detector that directly produces word or line level predictions from full images with a single neural network. By incorporating proper loss functions, the detector can predict either rotated rectangles or quadrangles for text regions, depending on specific applications. The examiner interprets the prior art as detecting text based on the identified text regions in the image.).
However, Zhou does not explicitly teach: expand the first border to generate a second border, determine that the second border is to be retained when the second border is positioned to encompass at least a portion of the supportive text region, and determine that the second border is to be rejected when the second border bypasses an encompassment of at least the portion of the supportive text region.
Lin teaches expand the first border to generate a second border (Col 7, Lines 50-55, The region refining component 228 can be configured to refine the bounding box coordinates to more accurately fit the text represented in the image. Refining the bounding box can include changing a size (e.g., bigger, smaller) of a bounding box, repositioning a bounding box, changing a shape of a bounding box, or a combination thereof. The examiner interprets the prior art as generating a border around text regions of an image and resizing the border around the text.), determine that the second border is to be retained when the second border is positioned to encompass at least a portion of the supportive text region (Col 7, Lines 63-67, The text recognizer component 230 is configured to analyze the bounding boxes proposed by the region refining component and can generate a classification vector or other categorization value that indicates the probability that a respective bounding box includes an instance of a certain word. The classification vector can include an entry (i.e., a probability) for each of the categories (e.g., words) the text recognizer component is trained to recognize. In various embodiments, the word recognizer can be a CNN (e.g., a trained word recognition neural network) that maps a bounding box with text to a word. The examiner interprets that the text recognizer determines whether the bounding box includes an instance of a word and then keeps the border if it does.), determine that the second border is to be rejected when the second border bypasses an encompassment of at least the portion of the supportive text region (Col 8, Lines 7-26, The post processing component 232 can be configured to suppress overlapping words to generate a final set of words. For example, in various embodiments, the output of region proposal filtering component contains a lot of overlapping words. A post processing step can be used to eliminate these duplications.
In accordance with various embodiments, the post processing component can perform non-maximum suppression (NMS) of overlapping words. In this example, two kinds of NMS can be used: per-word NMS and cross-word NMS, where NMS can be interleaved with the region refining process. As an example, a variant of bounding box regression called word-end regression can be used. For example, the networks employed are the same, but the extracted regions are only around the two ends of (long) words. In accordance with various embodiments, after several iterations of refinement, the position of the bounding boxes might change. Accordingly, the text recognizer component can be rerun to relabel the bounding boxes. Finally, a grouping step is performed to eliminate words that are contained inside other words. The examiner interprets that, if there are overlapping words or two ends of a long word, the border is rejected or split into another bounding box.)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the apparatus taught by Zhou with the teachings of Lin in order to determine borders around text in an image. One skilled in the art would have been motivated to modify Zhou in this manner in order to improve text recognition precision (Lin, Col 1, Lines 26-27).
Regarding Claim 34, the combination of Zhou and Lin teaches the apparatus of claim 31, wherein the logic is further to: train the scene text detection network with a plurality of image training samples (Zhou, Page 5, Section 3.5 Training, Paragraph 1, The network is trained end-to-end using ADAM [18] optimizer. To speed up learning, we uniformly sample 512x512 crops from images to form a minibatch of size 24. Learning rate of ADAM starts from 1e-3, decays to one-tenth every 27300 minibatches, and stops at 1e-5. The network is trained until performance stops improving.), the scene text detection network including a dense features portion (Zhou, Page 3, Section 3.2 Network Design, Paragraph 3, A schematic view of our model is depicted in Fig. 3. The model can be decomposed in to three parts: feature extractor stem, feature-merging branch and output layer. The stem can be a convolutional network pre-trained on ImageNet [4] dataset, with interleaving convolution and pooling layers. The examiner interprets the feature extractor stem shown in Fig. 3 of the prior art as equivalent to the dense features portion shown in Fig. 10B of the instant application.), a reverse connections portion communicatively coupled to the dense features portion (Zhou, Page 3, Section 3.2 Network Design, Paragraph 3, A schematic view of our model is depicted in Fig. 3. The model can be decomposed in to three parts: feature extractor stem, feature-merging branch and output layer. In each merging stage, the feature map from the last stage is first fed to an unpooling layer to double its size, and then concatenated with the current feature map. Next, a conv1×1 bottleneck [8] cuts down the number of channels and reduces computation.
The examiner interprets the feature-merging branch as having the same functionality as the reverse connections portion, since ¶[0042] of the specification of the instant application states that the reverse connections portion leverages semantic information from previous layers or stages to increase information flow, and the feature-merging branch takes the input from the feature extractor stem and concatenates the data to improve the information flow; it is coupled to the feature extractor stem as shown in Figure 3 of the prior art.), and a stage losses portion communicatively coupled to the reverse connections portion (Zhou, Page 3, Section 3.2 Network Design, Paragraph 3, A schematic view of our model is depicted in Fig. 3. The model can be decomposed in to three parts: feature extractor stem, feature-merging branch and output layer. In the feature-merging branch, ... The examiner interprets the output layer as equivalent to the stage losses portion, and it is coupled to the feature-merging branch as shown in Figure 3 of the prior art. Page 4, Section 3.4 Loss functions, Paragraph 1, The loss can be formulated as L = L_s + λ_g L_g (4), where L_s and L_g represents the losses for the score map and the geometry, respectively, and λ_g weighs the importance between two losses. The examiner interprets the prior art’s output layer as using a loss function to calculate the loss between the score map and the geometry when performing scene text recognition, which is similar to the process of the stage losses portion shown in Figure 10E.)
Regarding Claim 35, the combination of Zhou and Lin teaches the apparatus of claim 34, wherein the logic is further to: support large receptive field features for the dense features portion (Zhou, Page 3, Section 3.2 Network Design, Paragraph 3, The stem can be a convolutional network pre-trained on ImageNet [4] dataset, with interleaving convolution and pooling layers. Four levels of feature maps, denoted as f_i, are extracted from the stem, whose sizes are 1/32, 1/16, 1/8 and 1/4 of the input image, respectively. In Fig. 3, PVANet [17] is depicted. In our experiments, we also adopted the well-known VGG16 [32] model, where feature maps after pooling-2 to pooling-5 are extracted. The examiner interprets the feature extractor stem as extracting four levels of feature maps after the pooling layers, which is similar to the process of the Larger RF features shown in Figure 10C of the instant application.)
Regarding Claim 37, Zhou teaches a method of detecting text, comprising: applying a trained scene text detection network (Page 2, Section 3 Methodology, Paragraph 1, The key component of the proposed algorithm is a neural network model, which is trained to directly predict the existence of text instances and their geometries from full image.) to an image to identify a core text region based on the core text region being determined to include text, a supportive text region based on a determination that the supportive text region is associated with the core text region and is mixed with background information, (Page 2, Left Column, Paragraph 1, The contributions of this work are three-fold: we propose a scene text detection method that consists of two stages: a Fully Convolutional Network and an NMS merging stage. The FCN directly produces text regions, excluding redundant and time-consuming intermediate steps. The pipeline is flexible to produce either word level or line level predictions, whose geometric shapes can be rotated boxes or quadrangles, depending on specific applications. Page 4, Section 3.3.1 Score Map Generation for Quadrangle, Paragraph 1, we only consider the case where the geometry is a quadrangle. The positive area of the quadrangle on the score map is designed to be roughly a shrunk version of the original one, illustrated in Fig. 4 (a). For a quadrangle Q = {p_i | i ∈ {1, 2, 3, 4}}, where p_i = {x_i, y_i} are vertices on the quadrangle in clockwise order. The examiner interprets that, as seen in Figure 4(a), the text quadrangle annotated with yellow dashed lines represents the supportive text region and the shrunk quadrangle represents the core text region.)
and a background region of the image (Page 1, Section 1 Introduction, Paragraph 1, The core of text detection is the design of features to distinguish text from backgrounds. The examiner interprets the prior art as being able to distinguish the text region from the background region of the image.),
generating a first border for the core text region (Page 4, Section 3.3.1 Score Map Generation for Quadrangle, Paragraph 1, The positive area of the quadrangle on the score map is designed to be roughly a shrunk version of the original one. Figure 4(a) shows a green shrunken quadrangle area corresponding to the core text.), and detecting text in the image based on the identified core text region and the supportive text region (Page 8, Section 5 Conclusion and Future Work, Paragraph 1, we have presented a scene text detector that directly produces word or line level predictions from full images with a single neural network. By incorporating proper loss functions, the detector can predict either rotated rectangles or quadrangles for text regions, depending on specific applications. The examiner interprets the prior art as detecting text based on the identified text regions in the image.).
However, Zhou does not explicitly teach: expanding the first border to generate a second border, determining that the second border is to be retained when the second border is positioned to encompass at least a portion of the supportive text region, and determining that the second border is to be rejected when the second border bypasses an encompassment of at least the portion of the supportive text region.
Lin teaches expanding the first border to generate a second border (Col 7, Lines 50-55, The region refining component 228 can be configured to refine the bounding box coordinates to more accurately fit the text represented in the image. Refining the bounding box can include changing a size (e.g., bigger, smaller) of a bounding box, repositioning a bounding box, changing a shape of a bounding box, or a combination thereof. The examiner interprets the prior art as generating a border around text regions of an image and resizing the border around the text.), determining that the second border is to be retained when the second border is positioned to encompass at least a portion of the supportive text region (Col 7, Lines 63-67, The text recognizer component 230 is configured to analyze the bounding boxes proposed by the region refining component and can generate a classification vector or other categorization value that indicates the probability that a respective bounding box includes an instance of a certain word. The classification vector can include an entry (i.e., a probability) for each of the categories (e.g., words) the text recognizer component is trained to recognize. In various embodiments, the word recognizer can be a CNN (e.g., a trained word recognition neural network) that maps a bounding box with text to a word. The examiner interprets that the text recognizer determines whether the bounding box includes an instance of a word and then keeps the border if it does.), determining that the second border is to be rejected when the second border bypasses an encompassment of at least the portion of the supportive text region (Col 8, Lines 7-26, The post processing component 232 can be configured to suppress overlapping words to generate a final set of words. For example, in various embodiments, the output of region proposal filtering component contains a lot of overlapping words. A post processing step can be used to eliminate these duplications.
In accordance with various embodiments, the post processing component can perform non-maximum suppression (NMS) of overlapping words. In this example, two kinds of NMS can be used: per-word NMS and cross-word NMS, where NMS can be interleaved with the region refining process. As an example, a variant of bounding box regression called word-end regression can be used. For example, the networks employed are the same, but the extracted regions are only around the two ends of (long) words. In accordance with various embodiments, after several iterations of refinement, the position of the bounding boxes might change. Accordingly, the text recognizer component can be rerun to relabel the bounding boxes. Finally, a grouping step is performed to eliminate words that are contained inside other words. The examiner interprets that, if there are overlapping words or two ends of a long word, the border is rejected or split into another bounding box.)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Zhou with those of Lin in order to determine borders around text in an image. One skilled in the art would have been motivated to modify Zhou in this manner in order to improve text recognition precision (Lin, Col 1, Lines 26-27).
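For illustration only, the border expansion and the retain/reject determination discussed above can be sketched as follows. This is a minimal sketch under stated assumptions and is not taken from Zhou, Lin, or the instant application: borders are assumed to be axis-aligned boxes, and the names `expand_box`, `intersection_area`, `retain_second_border`, and the `margin` parameter are hypothetical.

```python
from typing import Tuple

# Hypothetical box format: (x_min, y_min, x_max, y_max)
Box = Tuple[float, float, float, float]


def expand_box(box: Box, margin: float) -> Box:
    """Grow a border outward by `margin` on every side."""
    x0, y0, x1, y1 = box
    return (x0 - margin, y0 - margin, x1 + margin, y1 + margin)


def intersection_area(a: Box, b: Box) -> float:
    """Area shared by two axis-aligned boxes (0.0 if they do not overlap)."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return max(width, 0.0) * max(height, 0.0)


def retain_second_border(first_border: Box, supportive_region: Box, margin: float) -> bool:
    """Assumed rule: keep the expanded border only if it covers part of the supportive region."""
    second_border = expand_box(first_border, margin)
    return intersection_area(second_border, supportive_region) > 0.0
```

Under these assumptions, an expanded border that reaches into the supportive text region is retained, and one that reaches none of it is rejected.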
Regarding Claim 40, the combination of Zhou and Lin teaches the method of claim 37, further comprising: training the scene text detection network with a plurality of image training samples (Zhou, Page 5, Section 3.5 Training, Paragraph 1, The network is trained end-to-end using ADAM [18] optimizer. To speed up learning, we uniformly sample 512x512 crops from images to form a minibatch of size 24. Learning rate of ADAM starts from 1e-3, decays to one-tenth every 27300 minibatches, and stops at 1e-5. The network is trained until performance stops improving.), the scene text detection network including a dense features portion (Zhou, Page 3, Section 3.2 Network Design, Paragraph 3, A schematic view of our model is depicted in Fig. 3. The model can be decomposed in to three parts: feature extractor stem, feature-merging branch and output layer. The stem can be a convolutional network pre-trained on ImageNet [4] dataset, with interleaving convolution and pooling layers. The examiner interprets the feature extractor stem shown in Fig. 3 of the prior art as equivalent to the dense features portion shown in Fig. 10B of the instant application.), a reverse connections portion communicatively coupled to the dense features portion (Zhou, Page 3, Section 3.2 Network Design, Paragraph 3, A schematic view of our model is depicted in Fig. 3. The model can be decomposed in to three parts: feature extractor stem, feature-merging branch and output layer. In each merging stage, the feature map from the last stage is first fed to an unpooling layer to double its size, and then concatenated with the current feature map. Next, a conv1×1 bottleneck [8] cuts down the number of channels and reduces computation.
The examiner interprets the feature-merging branch to have the same functionality as the reverse connections portion, since ¶[0042] of the specification of the instant application states that the reverse connections portion leverages semantic information from previous layers or stages to increase information flow, and the feature-merging branch takes the input from the feature extractor stem and concatenates the data to improve the information flow, and it is coupled to the feature extractor stem as shown in Figure 3 of the prior art.), and a stage losses portion communicatively coupled to the reverse connections portion (Zhou, Page 3, Section 3.2 Network Design, Paragraph 3, A schematic view of our model is depicted in Fig. 3. The model can be decomposed in to three parts: feature extractor stem, feature-merging branch and output layer. In the feature-merging branch, The examiner interprets the output layer to be equivalent to the stage losses portion, and it is coupled to the feature-merging branch as shown in Figure 3 of the prior art. Page 4, Section 3.4 Loss functions, Paragraph 1, The loss can be formulated as L = L_s + λ_g L_g (4) where L_s and L_g represents the losses for the score map and the geometry, respectively, and λ_g weighs the importance between two losses. The examiner interprets that the prior art's output layer uses a loss function to calculate the loss between the score map and the geometry when performing scene text recognition, which is similar to the process of the stage losses portion shown in Figure 10E.)
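For illustration only, the learning rate schedule and the combined loss quoted from Zhou above can be sketched as follows. This is a minimal sketch, not Zhou's implementation: the function names are hypothetical, and treating 1e-5 as a floor on the learning rate is an assumption (the quoted passage says only that the schedule "stops at 1e-5").

```python
import math


def learning_rate(step: int, base_lr: float = 1e-3,
                  decay_every: int = 27300, floor_lr: float = 1e-5) -> float:
    """Step decay: divide the rate by ten every `decay_every` minibatches.

    Assumption: the rate is clamped at `floor_lr` rather than training halting.
    """
    lr = base_lr * (0.1 ** (step // decay_every))
    return max(lr, floor_lr)


def total_loss(score_loss: float, geometry_loss: float, lambda_g: float) -> float:
    """Combined loss L = L_s + lambda_g * L_g from the quoted Equation (4)."""
    return score_loss + lambda_g * geometry_loss


if __name__ == "__main__":
    for step in (0, 27300, 54600):
        print(step, learning_rate(step))
    print(total_loss(0.5, 0.25, 2.0))
```

The weight `lambda_g` is passed in explicitly because the quoted passage does not state its value.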
Regarding Claim 41, the combination of Zhou and Lin teaches the method of claim 40, further comprising: supporting large receptive field features for the dense features portion (Zhou, Page 3, Section 3.2 Network Design, Paragraph 3, The stem can be a convolutional network pre-trained on ImageNet [4] dataset, with interleaving convolution and pooling layers. Four levels of feature maps, denoted as f_i, are extracted from the stem, whose sizes are 1/32, 1/16, 1/8 and 1/4 of the input image, respectively. In Fig. 3, PVANet [17] is depicted. In our experiments, we also adopted the well-known VGG16 [32] model, where feature maps after pooling-2 to pooling-5 are extracted. The examiner interprets that the feature extractor stem extracts four levels of feature maps after the pooling layers, which is similar to the process of the larger RF features shown in Figure 10C of the instant application.)
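As a worked example of the feature map sizes quoted above, the following sketch computes the side length of each of the four feature maps for a square input. The function name is hypothetical, and integer division is an assumption for inputs that are not multiples of 32.

```python
from typing import List, Sequence


def feature_map_sizes(input_size: int, strides: Sequence[int] = (32, 16, 8, 4)) -> List[int]:
    """Side length of each feature map when it is 1/stride of the input size."""
    return [input_size // stride for stride in strides]


if __name__ == "__main__":
    # The 512x512 training crops quoted from Zhou, Section 3.5
    print(feature_map_sizes(512))
```

For the 512x512 crops mentioned in Zhou's training section, this gives side lengths of 16, 32, 64, and 128.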
Regarding Claim 43, Zhou teaches at least one non-transitory computer readable medium (Page 8, Left column, Paragraph 1, These experiments were conducted on a server using a single NVIDIA Titan X graphic card with Maxwell architecture and an Intel E5-2670 v3 @ 2.30GHz CPU. The examiner interprets that a memory would inherently be coupled to the processor in the prior art.), comprising a set of instructions, which when executed by a computing device, cause the computing device to (Page 3, Section 3.1. Pipeline, Paragraph 1, A high-level overview of our pipeline is illustrated in Fig. 2(e). The algorithm follows the general design of DenseBox [9], in which an image is fed into the FCN and multiple channels of pixel-level text score map and geometry are generated. The examiner interprets that the prior art has an algorithm that is coupled to the CPU which performs the scene text recognition by an FCN):
apply a trained scene text detection network (Page 2, Section 3 Methodology, Paragraph 1, The key component of the proposed algorithm is a neural network model, which is trained to directly predict the existence of text instances and their geometries from full image.) to an image to identify a core text region based on the core text region being determined to include text, a supportive text region based on an identification that the supportive text region is associated with the core text region and is mixed with background information, (Page 2, Left Column, Paragraph 1, The contributions of this work are three-fold: we propose a scene text detection method that consists of two stages: a Fully Convolutional Network and an NMS merging stage. The FCN directly produces text regions, excluding redundant and time-consuming intermediate steps. The pipeline is flexible to produce either word level or line level predictions, whose geometric shapes can be rotated boxes or quadrangles, depending on specific applications. Page 4, Section 3.3.1 Score Map Generation for Quadrangle, Paragraph 1, we only consider the case where the geometry is a quadrangle. The positive area of the quadrangle on the score map is designed to be roughly a shrunk version of the original one, illustrated in Fig. 4 (a). For a quadrangle Q = {p_i | i ∈ {1, 2, 3, 4}}, where p_i = {x_i, y_i} are vertices on the quadrangle in clockwise order. The examiner interprets that, as seen in Figure 4(a), the text quadrangle annotated with yellow dashed lines would represent the supportive text region and the shrunk quadrangle would represent the core text region.)
and a background region of the image (Page 1, Section 1 Introduction, Paragraph 1, The core of text detection is the design of features to distinguish text from backgrounds. The examiner interprets that the prior art is able to distinguish the text region and the background region of the image.),
generate a first border for the core text region (Page 4, Section 3.3.1 Score Map Generation for Quadrangle, Paragraph 1, The positive area of the quadrangle on the score map is designed to be roughly a shrunk version of the original one. Figure 4(a) shows a green shrunken quadrangle area pertaining to the core text.), and detect text in the image based on the identified core text region and the supportive text region (Page 8, Section 5 Conclusion and Future Work, Paragraph 1, we have presented a scene text detector that directly produces word or line level predictions from full images with a single neural network. By incorporating proper loss functions, the detector can predict either rotated rectangles or quadrangles for text regions, depending on specific applications. The examiner interprets that the prior art detects text based on the identified text regions in the image.).
However, Zhou does not explicitly teach expand the first border to generate a second border, determine that the second border is to be retained when the second border is positioned to encompass at least a portion of the supportive text region, determine that the second border is to be rejected when the second border bypasses an encompassment of at least the portion of the supportive text region.
Lin teaches expand the first border to generate a second border (Col 7, Lines 50-55, The region refining component 228 can be configured to refine the bounding box coordinates to more accurately fit the text represented in the image. Refining the bounding box can include changing a size (e.g., bigger, smaller) of a bounding box, repositioning a bounding box, changing a shape of a bounding box, or a combination thereof. The examiner interprets that the prior art generates a border around text regions of an image and resizes the border around the text), determine that the second border is to be retained when the second border is positioned to encompass at least a portion of the supportive text region (Col 7, Lines 63-67, The text recognizer component 230 is configured to analyze the bounding boxes proposed by the region refining component and can generate a classification vector or other categorization value that indicates the probability that a respective bounding box includes an instance of a certain word. The classification vector can include an entry (i.e., a probability) for each of the categories (e.g., words) the text recognizer component is trained to recognize. In various embodiments, the word recognizer can be a CNN (e.g., a trained word recognition neural network) that maps a bounding box with text to a word. The examiner interprets that the text recognizer will determine whether the bounding box includes an instance of a word and will then keep the border if it does.), determine that the second border is to be rejected when the second border bypasses an encompassment of at least the portion of the supportive text region (Col 8, Lines 7-26, The post processing component 232 can be configured to suppress overlapping words to generate a final set of words. For example, in various embodiments, the output of region proposal filtering component contains a lot of overlapping words. A post processing step can be used to eliminate these duplications.
In accordance with various embodiments, the post processing component can perform non-maximum suppression (NMS) of overlapping words. In this example, two kinds of NMS can be used: per-word NMS and cross-word NMS, where NMS can be interleaved with the region refining process. As an example, a variant of bounding box regression called word-end regression can be used. For example, the networks employed are the same, but the extracted regions are only around the two ends of (long) words. In accordance with various embodiments, after several iterations of refinement, the position of the bounding boxes might change. Accordingly, the text recognizer component can be rerun to relabel the bounding boxes. Finally, a grouping step is performed to eliminate words that are contained inside other words. The examiner interprets that if there are overlapping words or two ends of a long word, the borders are rejected or will be split into another bounding box.).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Zhou with those of Lin in order to determine borders around text in an image. One skilled in the art would have been motivated to modify Zhou in this manner in order to improve text recognition precision (Lin, Col 1, Lines 26-27).
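For illustration only, the suppression of overlapping boxes described in the quoted Lin passage can be sketched with a generic greedy non-maximum suppression. This is a sketch under stated assumptions: it is a textbook greedy NMS over axis-aligned boxes, not Lin's per-word or cross-word NMS, whose details are not given in the quoted text. The names `iou` and `greedy_nms` and the 0.5 threshold are hypothetical.

```python
from typing import List, Tuple

# Hypothetical box format: (x_min, y_min, x_max, y_max)
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(min(a[2], b[2]) - max(a[0], b[0]), 0.0)
    inter_h = max(min(a[3], b[3]) - max(a[1], b[1]), 0.0)
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes: List[Box], scores: List[float], iou_threshold: float = 0.5) -> List[int]:
    """Return indices of kept boxes, visiting boxes from highest to lowest score."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept: List[int] = []
    for i in order:
        # Keep a box only if it does not overlap any already-kept box too much.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in kept):
            kept.append(i)
    return kept
```

In this sketch a lower-scoring box is dropped whenever its overlap with a kept box exceeds the threshold, which corresponds to eliminating duplicate detections of the same word.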
Regarding Claim 46, the combination of Zhou and Lin teaches the at least one non-transitory computer readable medium of claim 43, comprising a further set of instructions, which when executed by the computing device, cause the computing device to (Zhou, Page 3, Section 3.1. Pipeline, Paragraph 1, A high-level overview of our pipeline is illustrated in Fig. 2(e). The algorithm follows the general design of DenseBox [9], in which an image is fed into the FCN and multiple channels of pixel-level text score map and geometry are generated. The examiner interprets that the prior art has an algorithm that is coupled to the CPU which performs the scene text recognition by an FCN): train the scene text detection network with a plurality of image training samples (Zhou, Page 5, Section 3.5 Training, Paragraph 1, The network is trained end-to-end using ADAM [18] optimizer. To speed up learning, we uniformly sample 512x512 crops from images to form a minibatch of size 24. Learning rate of ADAM starts from 1e-3, decays to one-tenth every 27300 minibatches, and stops at 1e-5. The network is trained until performance stops improving.), the scene text detection network including a dense features portion (Zhou, Page 3, Section 3.2 Network Design, Paragraph 3, A schematic view of our model is depicted in Fig. 3. The model can be decomposed in to three parts: feature extractor stem, feature-merging branch and output layer. The stem can be a convolutional network pre-trained on ImageNet [4] dataset, with interleaving convolution and pooling layers. The examiner interprets the feature extractor stem shown in Fig. 3 of the prior art as equivalent to the dense features portion shown in Fig. 10B of the instant application.), a reverse connections portion communicatively coupled to the dense features portion (Zhou, Page 3, Section 3.2 Network Design, Paragraph 3, A schematic view of our model is depicted in Fig. 3.
The model can be decomposed in to three parts: feature extractor stem, feature-merging branch and output layer. In each merging stage, the feature map from the last stage is first fed to an unpooling layer to double its size, and then concatenated with the current feature map. Next, a conv1×1 bottleneck [8] cuts down the number of channels and reduces computation. The examiner interprets the feature-merging branch to have the same functionality as the reverse connections portion, since ¶[0042] of the specification of the instant application states that the reverse connections portion leverages semantic information from previous layers or stages to increase information flow, and the feature-merging branch takes the input from the feature extractor stem and concatenates the data to improve the information flow, and it is coupled to the feature extractor stem as shown in Figure 3 of the prior art.), and a stage losses portion communicatively coupled to the reverse connections portion (Zhou, Page 3, Section 3.2 Network Design, Paragraph 3, A schematic view of our model is depicted in Fig. 3. The model can be decomposed in to three parts: feature extractor stem, feature-merging branch and output layer. In the feature-merging branch, The examiner interprets the output layer to be equivalent to the stage losses portion, and it is coupled to the feature-merging branch as shown in Figure 3 of the prior art. Page 4, Section 3.4 Loss functions, Paragraph 1, The loss can be formulated as L = L_s + λ_g L_g (4) where L_s and L_g represents the losses for the score map and the geometry, respectively, and λ_g weighs the importance between two losses. The examiner interprets that the prior art's output layer uses a loss function to calculate the loss between the score map and the geometry when performing scene text recognition, which is similar to the process of the stage losses portion shown in Figure 10E.)
Regarding Claim 47, the combination of Zhou and Lin teaches the at least one non-transitory computer readable medium of claim 46, comprising a further set of instructions, which when executed by the computing device, cause the computing device to (Zhou, Page 3, Section 3.1. Pipeline, Paragraph 1, A high-level overview of our pipeline is illustrated in Fig. 2(e). The algorithm follows the general design of DenseBox [9], in which an image is fed into the FCN and multiple channels of pixel-level text score map and geometry are generated. The examiner interprets that the prior art has an algorithm that is coupled to the CPU which performs the scene text recognition by an FCN): supporting large receptive field features for the dense features portion (Zhou, Page 3, Section 3.2 Network Design, Paragraph 3, The stem can be a convolutional network pre-trained on ImageNet [4] dataset, with interleaving convolution and pooling layers. Four levels of feature maps, denoted as f_i, are extracted from the stem, whose sizes are 1/32, 1/16, 1/8 and 1/4 of the input image, respectively. In Fig. 3, PVANet [17] is depicted. In our experiments, we also adopted the well-known VGG16 [32] model, where feature maps after pooling-2 to pooling-5 are extracted. The examiner interprets that the feature extractor stem extracts four levels of feature maps after the pooling layers, which is similar to the process of the larger RF features shown in Figure 10C of the instant application.)
Claims 26-27, 32-33, 38-39 and 44-45 are rejected under 35 U.S.C. 103 as being unpatentable over Zhou et al. ("EAST: An Efficient and Accurate Scene Text Detector", published July 10, 2017) in view of Lin et al., US Patent (US 10445569 B1), and further in view of Pao et al., US PG-Pub (US 20190019052 A1).
Regarding Claim 26, while the combination of Zhou and Lin teaches the system of claim 25, they do not explicitly teach wherein the logic is further to: split connected words into one or more word regions based on the identified core text region and the supportive text region.
Pao teaches wherein the logic is further to: split connected words into one or more word regions based on the identified core text region and supportive text region(¶[0100] The region detection module 116 determines distances between adjacent text region candidates (block 1106). This operation is part of splitting the text line. For instance, there may be multiple words in the same text line. In this case, if the text line is not split (e.g., broken according to the different words), a text recognition algorithm may erroneously recognize text from these words as a single word.).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to add the teaching of Pao to Zhou and Lin in order to split connected words into one or more text regions. One skilled in the art would have been motivated to modify Zhou and Lin in this manner in order to improve computational efficiency in text region detection (Pao, ¶[0019]).
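For illustration only, the distance-based splitting of a text line described in the quoted Pao passage can be sketched as follows. This is a minimal sketch, not Pao's implementation: the function name `split_text_line`, the box format, and the `max_gap` threshold are hypothetical, and the quoted text does not state how the distances are thresholded.

```python
from typing import List, Tuple

# Hypothetical box format: (x_min, y_min, x_max, y_max)
Box = Tuple[float, float, float, float]


def split_text_line(boxes: List[Box], max_gap: float) -> List[List[Box]]:
    """Group text region candidates on one line into words.

    Assumed rule: a horizontal gap larger than `max_gap` between adjacent
    candidates starts a new word.
    """
    if not boxes:
        return []
    ordered = sorted(boxes, key=lambda b: b[0])  # left to right
    words: List[List[Box]] = [[ordered[0]]]
    for previous, current in zip(ordered, ordered[1:]):
        gap = current[0] - previous[2]
        if gap > max_gap:
            words.append([current])
        else:
            words[-1].append(current)
    return words
```

With four candidates whose gaps are 1, 9, and 1 pixels and a threshold of 3, the line is split into two words of two candidates each.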
Regarding Claim 27, while the combination of Zhou and Lin teaches the system of claim 25, they do not explicitly teach wherein the logic is further to: remove a word region in response to a lack of core text region pixels in the word region.
Pao teaches wherein the logic is further to: remove a word region in response to a lack of core text region pixels in the word region (¶[0044] The region detection module 116 is then employed by the computing device 102 to detect text region candidates (block 704). This operation can include analysis of low-level pixel information of the digital image 106 by the region detection module 116 to extract various features, such as color, color consistency, stroke width, or other features and to group the pixels in text components accordingly. In an implementation, the region detection module 116 generates representations of the digital image 106 in multiple color spaces, analyzes the representations, determines regions, and classifies the regions into text and non-text region candidates 202. ¶[0067], The region detection module 116 also removes text region candidates that are not matched to candidate text lines as these regions may represent outlier regions. The examiner interprets that the prior art is using the pixel information of the digital image to remove regions that are not word regions.)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to add the teaching of Pao to Zhou and Lin in order to separate a word region and a background region by classifying the pixels of the image. One skilled in the art would have been motivated to modify Zhou and Lin in this manner in order to improve computational efficiency in text region detection (Pao, ¶[0019]).
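For illustration only, the limitation of removing a word region that lacks core text region pixels can be sketched as follows. This is a sketch of the claim language under stated assumptions and is not the method of Pao or of the instant application: the core text region is assumed to be a binary pixel mask, word regions are assumed to be integer, half-open, axis-aligned boxes, and the function name is hypothetical.

```python
from typing import List, Sequence, Tuple

# Hypothetical region format: (x_min, y_min, x_max, y_max), max-exclusive
Region = Tuple[int, int, int, int]


def filter_word_regions(core_mask: Sequence[Sequence[int]],
                        regions: List[Region]) -> List[Region]:
    """Drop every word region that contains no core text pixel (mask value 1)."""
    kept: List[Region] = []
    for x0, y0, x1, y1 in regions:
        has_core_pixel = any(
            core_mask[y][x] for y in range(y0, y1) for x in range(x0, x1)
        )
        if has_core_pixel:
            kept.append((x0, y0, x1, y1))
    return kept
```

A region covering at least one pixel marked as core text is kept; a region covering only background or supportive pixels is removed.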
Regarding Claim 32, while the combination of Zhou and Lin teaches the apparatus of claim 31, they do not explicitly teach wherein the logic is further to: split connected words into one or more word regions based on the identified core text region and supportive text region.
Pao teaches wherein the logic is further to: split connected words into one or more word regions based on the identified core text region and supportive text region (¶[0100] The region detection module 116 determines distances between adjacent text region candidates (block 1106). This operation is part of splitting the text line. For instance, there may be multiple words in the same text line. In this case, if the text line is not split (e.g., broken according to the different words), a text recognition algorithm may erroneously recognize text from these words as a single word.).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to add the teaching of Pao to Zhou and Lin in order to split connected words into one or more text regions. One skilled in the art would have been motivated to modify Zhou and Lin in this manner in order to improve computational efficiency in text region detection (Pao, ¶[0019]).
Regarding Claim 33, while the combination of Zhou and Lin teaches the apparatus of claim 31, they do not explicitly teach wherein the logic is further to: remove a word region in response to a lack of core text region pixels in the word region. 
Pao teaches wherein the logic is further to: remove a word region in response to a lack of core text region pixels in the word region (¶[0044] The region detection module 116 is then employed by the computing device 102 to detect text region candidates (block 704). This operation can include analysis of low-level pixel information of the digital image 106 by the region detection module 116 to extract various features, such as color, color consistency, stroke width, or other features and to group the pixels in text components accordingly. In an implementation, the region detection module 116 generates representations of the digital image 106 in multiple color spaces, analyzes the representations, determines regions, and classifies the regions into text and non-text region candidates 202. ¶[0067], The region detection module 116 also removes text region candidates that are not matched to candidate text lines as these regions may represent outlier regions. The examiner interprets that the prior art is using the pixel information of the digital image to remove regions that are not word regions.)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to add the teaching of Pao to Zhou and Lin in order to separate a word region and a background region by classifying the pixels of the image. One skilled in the art would have been motivated to modify Zhou and Lin in this manner in order to improve computational efficiency in text region detection (Pao, ¶[0019]).
Regarding Claim 38, while the combination of Zhou and Lin teaches the method of claim 37, they do not explicitly teach further comprising: splitting connected words into one or more word regions based on the identified core text region and supportive text region.
Pao teaches further comprising: splitting connected words into one or more word regions based on the identified core text region and supportive text region (¶[0100] The region detection module 116 determines distances between adjacent text region candidates (block 1106). This operation is part of splitting the text line. For instance, there may be multiple words in the same text line. In this case, if the text line is not split (e.g., broken according to the different words), a text recognition algorithm may erroneously recognize text from these words as a single word.).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to add the teaching of Pao to Zhou and Lin in order to split connected words into one or more text regions. One skilled in the art would have been motivated to modify Zhou and Lin in this manner in order to improve computational efficiency in text region detection (Pao, ¶[0019]).
Regarding Claim 39, while the combination of Zhou and Lin teaches the method of claim 37, they do not explicitly teach further comprising: removing a word region in response to a lack of core text region pixels in the word region.
Pao teaches further comprising: removing a word region in response to a lack of core text region pixels in the word region (¶[0044] The region detection module 116 is then employed by the computing device 102 to detect text region candidates (block 704). This operation can include analysis of low-level pixel information of the digital image 106 by the region detection module 116 to extract various features, such as color, color consistency, stroke width, or other features and to group the pixels in text components accordingly. In an implementation, the region detection module 116 generates representations of the digital image 106 in multiple color spaces, analyzes the representations, determines regions, and classifies the regions into text and non-text region candidates 202. ¶[0067], The region detection module 116 also removes text region candidates that are not matched to candidate text lines as these regions may represent outlier regions. The examiner interprets that the prior art is using the pixel information of the digital image to remove regions that are not word regions.)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to add the teaching of Pao to Zhou and Lin in order to separate a word region and a background region by classifying the pixels of the image. One skilled in the art would have been motivated to modify Zhou and Lin in this manner in order to improve computational efficiency in text region detection (Pao, ¶[0019]).
Regarding Claim 44, the combination of Zhou and Lin teaches the at least one non-transitory computer readable medium of claim 43, comprising a further set of instructions, which when executed by the computing device, cause the computing device to(Zhou, Page 3, Section 3.1. Pipeline, Paragraph 1, A high-level overview of our pipeline is illustrated in Fig. 2(e). The algorithm follows the general design of DenseBox [9], in which an image is fed into the FCN and multiple channels of pixel-level text score map and geometry are generated. The examiner interprets that prior art has an algorithm that is coupled to the CPU which performs the scene text recognition by an FCN):
They do not explicitly teach split connected words into one or more word regions based on the identified core text region and supportive text region.
Pao teaches split connected words into one or more word regions based on the identified core text region and supportive text region (¶[0100] The region detection module 116 determines distances between adjacent text region candidates (block 1106). This operation is part of splitting the text line. For instance, there may be multiple words in the same text line. In this case, if the text line is not split (e.g., broken according to the different words), a text recognition algorithm may erroneously recognize text from these words as a single word.).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to add the teaching of Pao to Zhou and Lin in order to split connected words into one or more text regions. One skilled in the art would have been motivated to modify Zhou and Lin in this manner in order to improve computational efficiency in text region detection (Pao, ¶[0019]).
Regarding Claim 45, the combination of  Zhou and Lin teaches the at least one non-transitory computer readable medium of claim 43, comprising a further set of instructions, which when executed by the computing device, cause the computing device to (Zhou, Page 3, Section 3.1. Pipeline, Paragraph 1, A high-level overview of our pipeline is illustrated in Fig. 2(e). The algorithm follows the general design of DenseBox [9], in which an image is fed into the FCN and multiple channels of pixel-level text score map and geometry are generated. The examiner interprets that prior art has an algorithm that is coupled to the CPU which performs the scene text recognition by an FCN):
They do not explicitly teach remove a word region in response to a lack of core text region pixels in the word region.
Pao teaches remove a word region in response to a lack of core text region pixels in the word region. (¶[0044] The region detection module 116 is then employed by the computing device 102 to detect text region candidates (block 704). This operation can include analysis of low-level pixel information of the digital image 106 by the region detection module 116 to extract various features, such as color, color consistency, stroke width, or other features and to group the pixels in text components accordingly. In an implementation, the region detection module 116 generates representations of the digital image 106 in multiple color spaces, analyzes the representations, determines regions, and classifies the regions into text and non-text region candidates 202. ¶[0067], The region detection module 116 also removes text region candidates that are not matched to candidate text lines as these regions may represent outlier regions. The examiner interprets that the prior art is using the pixel information of the digital image to remove regions that are not word regions.)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to add the teaching of Pao to Zhou and Lin in order to separate a word region and a background region by classifying the pixels of the image. One skilled in the art would have been motivated to modify Zhou and Lin in this manner in order to improve computational efficiency in text region detection (Pao, ¶[0019]).
Claims 30, 36, 42 and 48 are rejected under 35 U.S.C. 103 as being unpatentable over Zhou et al. ("EAST: An Efficient and Accurate Scene Text Detector", published July 10, 2017) in view of Lin et al., US Patent (US 10445569 B1), and further in view of Shrivastava et al. ("Training Region-based Object Detectors with Online Hard Example Mining", published April 12, 2016).
Regarding Claim 30, while the combination of Zhou and Lin teaches the system of claim 28, they do not explicitly teach wherein the logic is further to: train the scene text detection network with a plurality of online hard examples mining training samples.
Shrivastava teaches wherein the logic is further to: train the scene text detection network with a plurality of online hard examples mining training samples (Page 2, Left Column, Paragraph 2, In this paper, we propose a novel bootstrapping technique called online hard example mining (OHEM) for training state-of-the-art detection models based on deep ConvNets. The algorithm is a simple modification to SGD in which training examples are sampled according to a non-uniform, non-stationary distribution that depends on the current loss of each example under consideration.).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to add the teaching of Shrivastava to Zhou and Lin in order to train the neural network using online hard example mining training samples. One skilled in the art would have been motivated to modify Zhou and Lin in this manner in order to improve the effectiveness of the neural network as datasets become larger and more difficult (Shrivastava, Abstract).
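For illustration only, the idea of selecting training examples according to their current loss, as described in the quoted Shrivastava passage, can be sketched as follows. This is a simplified sketch and not Shrivastava's algorithm: picking the `k` highest-loss examples of a batch is an assumption, and the function name is hypothetical.

```python
from typing import List, Sequence


def select_hard_examples(losses: Sequence[float], k: int) -> List[int]:
    """Return the indices of the `k` examples with the highest current loss."""
    ranked = sorted(range(len(losses)), key=lambda i: losses[i], reverse=True)
    return ranked[:k]


if __name__ == "__main__":
    batch_losses = [0.1, 2.0, 0.5, 1.5]
    # Only the selected indices would contribute to the next gradient update.
    print(select_hard_examples(batch_losses, 2))
```

Because the losses are recomputed as the network changes, the selected subset changes from one iteration to the next, which matches the non-stationary sampling the quoted passage describes.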
Regarding Claim 36, while the combination of Zhou and Lin teaches the apparatus of claim 34, they do not explicitly teach wherein the logic is further to: train the scene text detection network with a plurality of online hard examples mining training samples.
Shrivastava teaches wherein the logic is further to: train the scene text detection network with a plurality of online hard examples mining training samples (Page 2, Left Column, Paragraph 2, In this paper, we propose a novel bootstrapping technique called online hard example mining (OHEM) for training state-of-the-art detection models based on deep ConvNets. The algorithm is a simple modification to SGD in which training examples are sampled according to a non-uniform, non-stationary distribution that depends on the current loss of each example under consideration.).
It would have been obvious at the time of filing to one of ordinary skill in the art to add the teaching of Shrivastava to Zhou and Lin in order to train the neural network using online hard example mining training samples. One skilled in the art would have been motivated to modify Zhou and Lin in this manner in order to improve the effectiveness of the neural network as datasets become larger and more difficult (Shrivastava, Abstract).
Regarding Claim 42, while the combination of Zhou and Lin teaches the method of claim 40, they do not explicitly teach further comprising: training the scene text detection network with a plurality of online hard examples mining training samples.
Shrivastava teaches further comprising: training the scene text detection network with a plurality of online hard examples mining training samples (Page 2, Left Column, Paragraph 2, In this paper, we propose a novel bootstrapping technique called online hard example mining (OHEM) for training state-of-the-art detection models based on deep ConvNets. The algorithm is a simple modification to SGD in which training examples are sampled according to a non-uniform, non-stationary distribution that depends on the current loss of each example under consideration.).
It would have been obvious at the time of filing to one of ordinary skill in the art to add the teaching of Shrivastava to Zhou and Lin in order to train the neural network using online hard example mining training samples. One skilled in the art would have been motivated to modify Zhou and Lin in this manner in order to improve the effectiveness of the neural network as datasets become larger and more difficult (Shrivastava, Abstract).
Regarding Claim 48, while the combination of Zhou and Lin teaches the at least one non-transitory computer readable medium of claim 46, comprising a further set of instructions, which when executed by the computing device, cause the computing device to (Zhou, Page 3, Section 3.1. Pipeline, Paragraph 1, A high-level overview of our pipeline is illustrated in Fig. 2(e). The algorithm follows the general design of DenseBox [9], in which an image is fed into the FCN and multiple channels of pixel-level text score map and geometry are generated. The examiner interprets that the prior art has an algorithm that is coupled to the CPU which performs the scene text recognition by an FCN), they do not explicitly teach: train the scene text detection network with a plurality of online hard examples mining training samples.
Shrivastava teaches: train the scene text detection network with a plurality of online hard examples mining training samples (Page 2, Left Column, Paragraph 2, In this paper, we propose a novel bootstrapping technique called online hard example mining (OHEM) for training state-of-the-art detection models based on deep ConvNets. The algorithm is a simple modification to SGD in which training examples are sampled according to a non-uniform, non-stationary distribution that depends on the current loss of each example under consideration.).
It would have been obvious at the time of filing to one of ordinary skill in the art to add the teaching of Shrivastava to Zhou and Lin in order to train the neural network using online hard example mining training samples. One skilled in the art would have been motivated to modify Zhou and Lin in this manner in order to improve the effectiveness of the neural network as datasets become larger and more difficult (Shrivastava, Abstract).
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to HAN D HOANG whose telephone number is (571)272-4344. The examiner can normally be reached Monday-Friday 8-5.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Claire X. Wang, can be reached at (571) 270-1051. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/HAN HOANG/Examiner, Art Unit 2663
/CLAIRE X WANG/Supervisory Patent Examiner, Art Unit 2663