DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Status of Claims
The present application is being examined under the claims filed 03/21/2018. 
Claims 1-20 are pending.

Drawings
The drawings are objected to as failing to comply with 37 CFR 1.84(p)(4) because reference character “104” has been used to designate both “Document Management System” and “Analytics Engine”. It is unclear whether these are the same component or different components. The specification describes “104” as the “Document Management System”; there is no mention of “Analytics Engine” in the specification.
Corrected drawing sheets in compliance with 37 CFR 1.121(d) are required in reply to the Office action to avoid abandonment of the application. Any amended replacement drawing sheet should include all of the figures appearing on the immediate prior version of the sheet, even if only one figure is being amended. Each drawing sheet submitted after the filing date of an application must be labeled in the top margin as either “Replacement Sheet” or “New Sheet” pursuant to 37 CFR 1.121(d). If the changes are not accepted by the examiner, the applicant will be notified and informed of any required corrective action in the next Office action. The objection to the drawings will not be held in abeyance.

Specification
The disclosure is objected to because of the following informalities: 
The claim language “memory component” from claim 15 is not present in the specification. 
On page 14, the paragraph after ¶44 is labelled as ¶100 and the paragraph after ¶100 is labelled ¶45. All subsequent paragraph numbers are one less than they should be. 
Appropriate correction is required.
The use of the terms “BLUETOOTH”, “WI-FI”, “WI-MAX”, and “GSM” in ¶131 of the instant specification, which are trade names or marks used in commerce, has been noted in this application. Each term should be accompanied by the generic terminology; furthermore, each term should be capitalized wherever it appears or, where appropriate, include a proper symbol indicating use in commerce such as ™, SM, or ® following the term.
Although the use of trade names and marks used in commerce (i.e., trademarks, service marks, certification marks, and collective marks) is permissible in patent applications, the proprietary nature of the marks should be respected and every effort made to prevent their use in any manner which might adversely affect their validity as commercial marks.

Claim Interpretation
The following is a quotation of 35 U.S.C. 112(f):
(f) Element in Claim for a Combination. – An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof. 

The following is a quotation of pre-AIA  35 U.S.C. 112, sixth paragraph:
An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof.

This application includes one or more claim limitations that do not use the word “means,” but are nonetheless being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph. Such claim limitation is: “a memory component comprising” in claim 15.
Because this/these claim limitation(s) is/are being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, it/they is/are being interpreted to cover the corresponding structure described in the specification as performing the claimed function, and equivalents thereof.
If applicant does not intend to have this/these limitation(s) interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, applicant may:  (1) amend the claim limitation(s) to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph (e.g., by reciting sufficient structure to perform the claimed function); or (2) present a sufficient showing that the claim limitation(s) recite(s) sufficient structure to perform the claimed function so as to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph.
 
Claim Rejections - 35 USC § 112(b)
The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.


The following is a quotation of 35 U.S.C. 112 (pre-AIA ), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.


Claim limitation “a memory component comprising” in claim 15 invokes 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph. However, the written description fails to disclose the corresponding structure, material, or acts for performing the entire claimed function and to clearly link the structure, material, or acts to the function. The instant specification does not contain the term “memory component.”

Therefore, the claim is indefinite and is rejected under 35 U.S.C. 112(b) or pre-AIA  35 U.S.C. 112, second paragraph. 
For examination purposes, the “memory component” is interpreted to be memory used for storing data, metadata, and programs for execution by the processor as per ¶127 of the specification.

Applicant may:
(a)        Amend the claim so that the claim limitation will no longer be interpreted as a limitation under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph; 
(b)        Amend the written description of the specification such that it expressly recites what structure, material, or acts perform the entire claimed function, without introducing any new matter (35 U.S.C. 132(a)); or 
(c)        Amend the written description of the specification such that it clearly links the structure, material, or acts disclosed therein to the function recited in the claim, without introducing any new matter (35 U.S.C. 132(a)).
If applicant is of the opinion that the written description of the specification already implicitly or inherently discloses the corresponding structure, material, or acts and clearly links them to the function so that one of ordinary skill in the art would recognize what structure, material, or acts perform the claimed function, applicant should clarify the record by either: 
(a)        Amending the written description of the specification such that it expressly recites the corresponding structure, material, or acts for performing the claimed function and clearly links or associates the structure, material, or acts to the claimed function, without introducing any new matter (35 U.S.C. 132(a)); or 
(b)        Stating on the record what the corresponding structure, material, or acts, which are implicitly or inherently set forth in the written description of the specification, perform the claimed function. For more information, see 37 CFR 1.75(d) and MPEP §§ 608.01(o) and 2181.

Claim Rejections - 35 USC § 103

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claim 1 is rejected under 35 U.S.C. 103 as being unpatentable over Maggiori (“High-Resolution Semantic Labeling with Convolutional Neural Networks”) (hereinafter Maggiori) in view of Ajward et al. (“Converting Printed Sinhala Documents to Formatted Editable Text”) (hereinafter Ajward).

Regarding Claim 1:
Maggiori teaches semantic labeling using convolutional neural networks. Maggiori teaches:
performing a step for training a form conversion neural network to determine low-level semantic characteristics and high-level semantic characteristics of digitized paper forms (Examiner notes that “digitized paper forms” are merely images. Examiner notes that a “form conversion neural network” is interpreted as a convolutional neural network (CNN) as per ¶38 of the instant specification. Low-level and high-level semantic characteristics are interpreted as low-level features and high-level features, respectively. Maggiori teaches an encoder-decoder CNN to extract hierarchical features (i.e. low-level features and high-level features) of images. Maggiori teaches training this encoder-decoder network in order to find the optimal CNN for classification. [Maggiori Abstract: “In this paper we address the problem of dense semantic labeling, which consists in assigning a semantic label to every pixel in an image. […] Out of these observations, we then derive a CNN framework specifically adapted to the semantic labeling problem. In addition to learning features at different resolutions, it learns how to combine these features.”, sec II ¶4: “This enforces the networks to learn hierarchical features, performing low-level reasoning in the first layers (such as edge detection) and higher-level tasks in the last layers (e.g., assembling object parts). For this reason, the first and last layers are often referred to as lower and upper layers, respectively.”, sec. III B. ¶1-2: “a more advanced approach is to attach a multi-layer network to learn a complex upsampling function. This idea was simultaneously presented by different research groups [6], […] The convolutional layers are reflected as deconvolutional layers, and the pooling layers as unpooling layers (see Fig. 4). […] This concept can be thought of as an “encoder–decoder”, where the middle layer is seen as a common representation to images and classification maps, while the “encoder” and “decoder” ensure the translation between this representation and the two modalities”, sec B ¶4: “We also created a deconvolution network that exactly reflects the base FCN (as in [6]). This is straightforward, with deconvolutional and unpooling layers associated to every convolutional and pooling layer.”, sec II ¶5: “Finding the optimal neural network classifier reduces to finding the weights and biases that minimize a loss L between the predicted labels and the target labels in a training set.”, sec II ¶6: “The loss function L quantifies the misclassification by comparing the target label vectors y(i) and the predicted label vectors ŷ(i), for n training samples i = 1 … n. In this work we use the common cross-entropy loss, defined as:
\[ L = -\frac{1}{n} \sum_{i=1}^{n} \sum_{k=1}^{|\mathcal{L}|} y_k^{(i)} \log \hat{y}_k^{(i)}. \]
Training neural networks by optimizing this criterion converges faster”]).
by taking into account a plurality of losses from a plurality of decoder branches; (Examiner notes that decoder branches are defined as multiple runs of the decoder. Maggiori teaches that the training of the encoder-decoder network takes into account losses. The losses are calculated from the result of the classifications which occurs in the decoder. Maggiori teaches softmax normalization to obtain probabilities in the last layer which is part of the decoder. This is equivalent to the instant application in which (as per ¶80 of the instant specifications) the last layer is a softmax layer to obtain probabilities. The instant specification states, “As can be seen in Table 5, the low-level semantic decoder 610 ends with a neural network layer employing a SoftMax classifier. In particular, the SoftMax classifier provides a predicted classification of a pixel based on the probabilities of the element type of that pixel.” [Maggiori sec II ¶5: “Finding the optimal neural network classifier reduces to finding the weights and biases that minimize a loss L between the predicted labels and the target labels in a training set. Let 𝓛 be the set of possible semantic classes; labels are typically encoded as a vector of length |𝓛| with value ‘1’ at the position of the correct label and ‘0’ elsewhere. The network contains thus as many output neurons as possible labels. A softmax normalization is performed on top of the last layer to guarantee that the output is a probability distribution, i.e. the label values are between zero and one and sum to one.”, ¶7: “In practice, instead of averaging over the full dataset, the loss (2) is estimated from a random small subset of the training set, referred to as a mini-batch.”]). 
performing a step for determining low-level semantic characteristics and high-level semantic characteristics of a digitized paper form using the trained form conversion neural network; (After the CNN is trained, Maggiori teaches using encoder-decoder convolutional neural network (i.e. the trained form conversion neural network) to learn hierarchical features of an image: low level and high level. Examiner notes that a digitized paper form is merely an image. [Maggiori Abstract: “In this paper we address the problem of dense semantic labeling, which consists in assigning a semantic label to every pixel in an image. […] Out of these observations, we then derive a CNN framework specifically adapted to the semantic labeling problem. In addition to learning features at different resolutions, it learns how to combine these features.”, sec II ¶4: “This enforces the networks to learn hierarchical features, performing low-level reasoning in the first layers (such as edge detection) and higher-level tasks in the last layers (e.g., assembling object parts). For this reason, the first and last layers are often referred to as lower and upper layers, respectively.”, sec. III B. ¶1-2: “a more advanced approach is to attach a multi-layer network to learn a complex upsampling function. This idea was simultaneously presented by different research groups [6], […] The convolutional layers are reflected as deconvolutional layers, and the pooling layers as unpooling layers (see Fig. 4). […] This concept can be thought of as an “encoder–decoder”, where the middle layer is seen as a common representation to images and classification maps, while the “encoder” and “decoder” ensure the translation between this representation and the two modalities”, sec B ¶4: “We also created a deconvolution network that exactly reflects the base FCN (as in [6]). This is straightforward, with deconvolutional and unpooling layers associated to every convolutional and pooling layer.”]).  
and generating a [fillable] digital form based on the determined low-level semantic characteristics and the determined high-level semantic characteristics of the digitized paper form. (Examiner notes that digital form is merely an image. Maggiori teaches generating a digital segmentation map based on the determined high level and low-level features of the image (i.e. digitized paper). As seen in Fig. 9, the pixels are labeled as building (blue), tree (green), and low vegetation (cyan). The segmentation is based on edge detection (low-level features) and object assembly (high-level features) such as assembling a building if the pixel is red and is part of a larger red rectangular structure that is surrounded by vegetation and a road. [Maggiori sec II ¶4: “This enforces the networks to learn hierarchical features, performing low-level reasoning in the first layers (such as edge detection) and higher-level tasks in the last layers (e.g., assembling object parts).”, sec. IV ¶9: “An example of the type of relation we are able to convey in this scheme is as follows: “label a pixel as building if it is red and belongs to a larger red rectangular structure, which is surrounded by areas of green vegetation and near a road”, Fig. 8 caption: “Classification of closeups of Vahingen (1–3) and Potsdam (4–6) validation sets. Classes: Impervious surface (white), Building (blue), Low veget. (cyan), Tree (green), Car (yellow), Clutter (red).” Fig. 9: 
[Image: media_image1.png]
]). 
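For illustration only, the softmax normalization and the cross-entropy loss quoted above from Maggiori sec. II can be sketched as follows. This is a minimal Python sketch under stated assumptions, not code from Maggiori or from the instant application; the function names, the three-class setup, and the numeric scores are hypothetical.

```python
import math

def softmax(scores):
    """Normalize raw scores so the outputs lie in (0, 1) and sum to one."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(targets, predictions):
    """Mean cross-entropy over n samples; each target is a one-hot vector."""
    n = len(targets)
    loss = 0.0
    for y, y_hat in zip(targets, predictions):
        # Only the position of the correct label (y_k = 1) contributes.
        loss -= sum(y_k * math.log(p_k) for y_k, p_k in zip(y, y_hat) if y_k > 0)
    return loss / n

# Hypothetical raw scores for two pixels over three semantic classes.
scores = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]]
# One-hot target labels: class 0 for the first pixel, class 2 for the second.
targets = [[1, 0, 0], [0, 0, 1]]

predictions = [softmax(s) for s in scores]
loss = cross_entropy(targets, predictions)
print(predictions, loss)
```

In this sketch the loss is averaged over the listed samples, which corresponds to estimating the loss from a mini-batch rather than from the full training set, as the quoted passage describes.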
Maggiori does not teach “In a digital media environment in which paper forms are converted into corresponding fillable digital forms” or “and generating a fillable digital form based on the determined low-level semantic characteristics and the determined high-level semantic characteristics of the digitized paper form.”
Ajward teaches a method of digitizing printed documents to editable text based on combining optical character recognition (OCR) and layout reconstruction. Ajward teaches: 
In a digital media environment in which paper forms are converted into corresponding fillable digital forms, a method comprising: (Ajward teaches a method for converting scanned forms into editable text (i.e. fillable digital forms). [Ajward sec. 1 ¶1: “The typical process of digitizing text document is performed by scanning the printed copies to images and converting them to editable text”, Fig. 1: 
[Image: media_image2.png]
]). 
and generating a fillable digital form based on the determined [low-level] semantic characteristics [and the determined high-level] semantic characteristics of the digitized paper form. (Ajward teaches generating editable text (i.e. a fillable digital form) based on the semantic characteristics of digitized paper forms. As seen above, the determined low-level and high-level semantic characteristics were taught by Maggiori and this limitation is taught by the combination of Maggiori and Ajward. [Ajward sec. 1 ¶5: “As depicted in Fig 1, the project could be divided into two phases, (1) character recognition using an OCR technique and (2) extracting and preserving the layout (formatting) information of the document. As illustrated, the outcome, an editable Sinhala document that preserve formatting could be achieved by integrating the outcome of phases one and two.”, Fig. 1: 
[Image: media_image3.png]
]). 
Maggiori, Ajward, and the instant application are analogous art because they are all directed to using neural networks for classification.
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the invention to modify the encoder-decoder CNN disclosed by Maggiori to include conversion from paper forms to fillable digital forms as taught by Ajward. One would be motivated to do so to help users easily modify printed documents and perform word searches in them, as suggested by Ajward (Ajward Abstract: “Digitization of text not only allows users to easily modify and reprint printed documents, but also is a need of the day due to the use of word-search capability available at disposal in this era.”).

Claim 2 is rejected under 35 U.S.C. 103 as being unpatentable over Maggiori in view of Ajward and further in view of Zhang et al. (“Augmenting Supervised Neural Networks with Unsupervised Objectives for Large-scale Image Classification”) (hereinafter Zhang).

Regarding Claim 2:
Maggiori in view of Ajward teaches “The method of claim 1” as seen above. 
Ajward further teaches:  
further comprising: generating a reconstructed layout corresponding to the digitized paper form [using a trained form conversion neural network reconstruction decoder], wherein generating the fillable digital form is further based on the reconstructed layout (Ajward teaches reconstructing the layout of the digitized paper form and generating the fillable digital form by combining the reconstructed layout and recognized text. [Ajward sec. 1 ¶5: “As depicted in Fig 1, the project could be divided into two phases, (1) character recognition using an OCR technique and (2) extracting and preserving the layout (formatting) information of the document. As illustrated, the outcome, an editable Sinhala document that preserve formatting could be achieved by integrating the outcome of phases one and two.”, Fig. 1:  
[Image: media_image3.png]
]). 

Neither Maggiori nor Ajward teaches “generating a reconstructed layout corresponding to the digitized paper form using a trained form conversion neural network reconstruction decoder.”
Zhang teaches a method for classification using neural networks. Zhang teaches:
generating a reconstructed layout corresponding to the digitized paper [form] using a trained form conversion neural network reconstruction decoder, (Examiner notes that digitized paper is merely an image. Examiner notes that a “form conversion neural network” is interpreted as a convolutional neural network as per ¶38 of the instant specification. Examiner further notes that a layout is the way in which parts of something are arranged, and that by reconstructing the image, the layout of the image is also reconstructed. Zhang teaches augmenting an encoder-decoder convolutional neural network (i.e. a conversion neural network) with an auxiliary decoding pathway used for reconstruction of an image. [Zhang sec. 1 ¶4: “we augment challenge-winning neural networks with decoding pathways for reconstruction, demonstrating the feasibility of improving high-capacity networks for large-scale image classification. Specifically, we take a segment of the classification network as the encoder and use the mirrored architecture as the decoding pathway to build several autoencoder variants.”, sec. 3.2 ¶3: “The auxiliary training signals of SAE-first emerge from the bottom of the decoding pathway, and they get merged with the top-down signals for classification at the last convolution-pooling macro-layer into the encoder pathway.”, Fig. 5:
[Image: media_image4.png]
]).  
The combined system of Maggiori and Ajward, Zhang, and the instant application are analogous art because they are all directed to using neural networks for classification.
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the invention to modify the document digitization method disclosed by Maggiori in view of Ajward to include a convolutional neural network reconstruction decoder as taught by Zhang. One would be motivated to do so to improve performance by reducing error, as suggested by Zhang (Zhang sec. 5 ¶1: “We proposed a simple and effective way to incorporate unsupervised objectives into large-scale classification network learning by augmenting the existing network with reconstructive decoding pathways. […] This method improved the performance of the 16-layer VGGNet, one of the best existing networks for image classification by a noticeable margin.”, sec. 4.3 ¶6: “In particular, compared to the VGGNet baseline, the SWWAE-all model reduced the top-1 errors by 1.66% and 1.18% for the single-crop and convolution schemes, respectively. It also reduced the top-5 errors by 1.01% and 0.81%, which are 10% and 9% relative to the baseline errors.”).

Claim 3 is rejected under 35 U.S.C. 103 as being unpatentable over Maggiori in view of Ajward and further in view of Corbelli et al. (“Historical Document Digitization through Layout Analysis and Deep Content Classification”) (hereinafter Corbelli).

Regarding Claim 3:
Maggiori in view of Ajward teaches “The method of claim 1” as seen above. 
Maggiori teaches:
wherein the low-level semantic characteristics indicates one or more of a text run, a widget, an image, or an element border, (Maggiori teaches low-level features being edges. Element borders are equivalent to edges. Maggiori also teaches parts of images (such as areas of green or red) being features. [Maggiori sec II ¶4: “This enforces the networks to learn hierarchical features, performing low-level reasoning in the first layers (such as edge detection) and higher-level tasks in the last layers (e.g., assembling object parts). For this reason, the first and last layers are often referred to as lower and upper layers, respectively.”, sec. IV ¶9: “An example of the type of relation we are able to convey in this scheme is as follows: “label a pixel as building if it is red and belongs to a larger red rectangular structure, which is surrounded by areas of green vegetation and near a road”.”]). 
While Maggiori teaches low-level semantic characteristics, Maggiori does not explicitly teach the recited high-level semantic characteristics. Ajward also does not explicitly teach the recited high-level semantic characteristics.
Corbelli teaches a system for document digitization. Corbelli teaches:
wherein the high-level semantic characteristics indicate one or more of a text block, a field, a list, or a table. (Corbelli teaches extracting features such as text blocks and tables in documents. [Corbelli sec. III. B ¶1-2: “In the case of the “Enciclopedia Treccani”, there are seven different classes: text, tables with border, borderless tables, images, graphics, scores and mathematical formulas. […] Given an input region, a Convolutional Neural Network (CNN) is used to produce local features from squared nxn blocks,”, Fig. 3: 

[Image: media_image5.png]
]). 
The combined system of Maggiori and Ajward, Corbelli, and the instant application are analogous art because they are all directed to document digitization using convolutional neural networks.
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the invention to modify the document digitization method disclosed by Maggiori in view of Ajward to include high-level semantic characteristics indicating text blocks as taught by Corbelli. One would be motivated to do so to improve performance, as suggested by Corbelli (Corbelli sec. IV ¶4: “our hybrid approach outperforms other classic algorithms by a large margin.”).

Claims 4-5 and 8-14 are rejected under 35 U.S.C. 103 as being unpatentable over Maggiori in view of Ajward and Simantov et al. (US10936863) (hereinafter Simantov).

Regarding Claim 4:
Maggiori teaches semantic labeling using convolutional neural networks. Maggiori teaches:
use a form conversion neural network trained to determine semantic structures of digitized paper forms by: (Examiner notes that a conversion neural network is interpreted as a convolutional neural network as per ¶38 of the instant specification, and that semantic structures are interpreted to be semantic classes. Examiner further notes that digitized paper forms are merely images. Maggiori teaches using a convolutional neural network (CNN) for semantic labeling of images. [Maggiori Abstract: “In this paper we address the problem of dense semantic labeling, which consists in assigning a semantic label to every pixel in an image. […] Out of these observations, we then derive a CNN framework specifically adapted to the semantic labeling problem. In addition to learning features at different resolutions, it learns how to combine these features.”]).
generating a feature map by processing a digitized paper form comprising a plurality of pixels using a neural network encoder; (Examiner notes that a “digitized paper form” is merely an image. Maggiori teaches processing an image comprising pixels using an encoder-decoder convolutional neural network for semantic labeling. The encoder of the encoder-decoder generates feature maps. [Maggiori Abstract ¶2: “In this paper we address the problem of dense semantic labeling, which consists in assigning a semantic label to every pixel in an image.”, sec. III B. ¶1-2: “a more advanced approach is to attach a multi-layer network to learn a complex upsampling function. This idea was simultaneously presented by different research groups [6], […] The convolutional layers are reflected as deconvolutional layers, and the pooling layers as unpooling layers (see Fig. 4). […] This concept can be thought of as an “encoder–decoder”, where the middle layer is seen as a common representation to images and classification maps, while the “encoder” and “decoder” ensure the translation between this representation and the two modalities. […] The first layer of the encoder takes as input as many channels there are in the input image”, sec B ¶4: “We also created a deconvolution network that exactly reflects the base FCN (as in [6]). This is straightforward, with deconvolutional and unpooling layers associated to every convolutional and pooling layer.”, sec. II. A. ¶2: “Multiple convolution kernels are usually learned in every layer, interpreted as a set of spatial feature detectors. The responses to every learned filter are thus referred to as feature maps.”, sec. II. C. ¶4: “This gave birth to the so-called deconvolutional (or upconvolutional) layer, which upsamples a feature map by interpolating neighboring elements (as the last layer in Fig. 1).”]).
determining one or more low-level semantic characteristics for the plurality of pixels by processing the feature map using a neural network low-level semantic decoder; (Examiner notes that low-level semantic characteristics are interpreted as low-level features and low-level semantic decoder is interpreted as a decoder that processes low-level feature maps. Maggiori teaches an encoder-decoder convolutional neural network. The lower and upper layers result in low-level and high-level feature maps, respectively. These low-level feature maps are processed by a decoder to produce segmentation maps.  [Maggiori sec. III B. ¶1-2: “a more advanced approach is to attach a multi-layer network to learn a complex upsampling function. This idea was simultaneously presented by different research groups [6], […] The convolutional layers are reflected as deconvolutional layers, and the pooling layers as unpooling layers (see Fig. 4). […] This concept can be thought of as an “encoder–decoder”, where the middle layer is seen as a common representation to images and classification maps, while the “encoder” and “decoder” ensure the translation between this representation and the two modalities”, sec B ¶4: “We also created a deconvolution network that exactly reflects the base FCN (as in [6]). This is straightforward, with deconvolutional and unpooling layers associated to every convolutional and pooling layer.”, sec. II. A. ¶2: “Multiple convolution kernels are usually learned in every layer, interpreted as a set of spatial feature detectors. The responses to every learned filter are thus referred to as feature maps.”, sec II ¶4: “This enforces the networks to learn hierarchical features, performing low-level reasoning in the first layers (such as edge detection) and higher-level tasks in the last layers (e.g., assembling object parts). For this reason, the first and last layers are often referred to as lower and upper layers, respectively.”, sec. III. C. 
¶5: “For example, it combines how a layer evaluates that an object is a building by using low-level information, with how another layer evaluates whether the same object is a building by using higher-level information.”, Fig. 9: 
 
    (image: Maggiori Fig. 9)
 ]).
and determining one or more high-level semantic characteristics for the plurality of pixels by processing the feature map using one or more neural network high-level semantic decoders; (Examiner notes that high-level semantic characteristics are interpreted as high-level features and high-level semantic decoder is interpreted as a decoder that processes high-level feature maps. Maggiori teaches an encoder-decoder convolutional neural network. The lower and upper layers result in low-level and high-level feature maps, respectively. These high-level feature maps are processed by a decoder to produce segmentation maps. [Maggiori sec. III B. ¶1-2: “a more advanced approach is to attach a multi-layer network to learn a complex upsampling function. This idea was simultaneously presented by different research groups [6], […] The convolutional layers are reflected as deconvolutional layers, and the pooling layers as unpooling layers (see Fig. 4). […] This concept can be thought of as an “encoder–decoder”, where the middle layer is seen as a common representation to images and classification maps, while the “encoder” and “decoder” ensure the translation between this representation and the two modalities”, sec B ¶4: “We also created a deconvolution network that exactly reflects the base FCN (as in [6]). This is straightforward, with deconvolutional and unpooling layers associated to every convolutional and pooling layer.”, sec. II. A. ¶2: “Multiple convolution kernels are usually learned in every layer, interpreted as a set of spatial feature detectors. The responses to every learned filter are thus referred to as feature maps.”, sec II ¶4: “This enforces the networks to learn hierarchical features, performing low-level reasoning in the first layers (such as edge detection) and higher-level tasks in the last layers (e.g., assembling object parts). For this reason, the first and last layers are often referred to as lower and upper layers, respectively.”, sec. III. C. 
¶5: “For example, it combines how a layer evaluates that an object is a building by using low-level information, with how another layer evaluates whether the same object is a building by using higher-level information.”, Fig. 9: see above).  
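The arrangement recited in the limitations above (one encoder producing a feature map that is then processed by separate low-level and high-level decoders) can be pictured with a minimal sketch. The sketch assumes toy stand-ins: the pooling "encoder", the thresholds, and every function name are hypothetical and are not taken from Maggiori, Ajward, Simantov, or the instant specification.

```python
# Illustrative sketch only: toy stand-ins for an encoder and two decoder heads.
# None of these functions reproduces a network from the cited references.

def encode(image):
    """Toy 'encoder': 2x2 average pooling giving a half-resolution feature map."""
    h, w = len(image), len(image[0])
    return [
        [(image[r][c] + image[r][c + 1] + image[r + 1][c] + image[r + 1][c + 1]) / 4.0
         for c in range(0, w, 2)]
        for r in range(0, h, 2)
    ]

def upsample(feature_map):
    """Nearest-neighbour 2x upsampling back to pixel resolution."""
    out = []
    for row in feature_map:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out

def low_level_decoder(feature_map):
    """Hypothetical head: per-pixel label 1 where the feature response is strong."""
    return [[1 if v > 0.5 else 0 for v in row] for row in upsample(feature_map)]

def high_level_decoder(feature_map):
    """Hypothetical head: per-pixel label 1 where any response is present."""
    return [[1 if v > 0.2 else 0 for v in row] for row in upsample(feature_map)]

image = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
features = encode(image)             # a single shared feature map
low = low_level_decoder(features)    # first per-pixel output
high = high_level_decoder(features)  # second per-pixel output
```

Both heads read the same `features` value, so the image is encoded once and each decoder yields its own per-pixel result.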
Maggiori does not teach “A non-transitory computer readable storage medium including a set of instructions that, when executed by at least one processor, cause a computing device to” or “and generate a fillable digital form based on the determined one or more low-level semantic characteristics and the determined one or more high-level semantic characteristics.”  
Ajward teaches a method of digitizing printed documents to editable text based on combining optical character recognition (OCR) and layout reconstruction. Ajward teaches: 
and generate a fillable digital form based on the determined one or more [low-level] semantic characteristics and [the determined one or more high-level] semantic characteristics. (Ajward teaches generating editable text (i.e. a fillable digital form) based on the semantic characteristics of digitized paper forms. As seen above, the determined low-level and high-level semantic characteristics were taught by Maggiori and this limitation is taught by the combination of Maggiori and Ajward. [Ajward sec. 1 ¶5: “As depicted in Fig 1, the project could be divided into two phases, (1) character recognition using an OCR technique and (2) extracting and preserving the layout (formatting) information of the document. As illustrated, the outcome, an editable Sinhala document that preserve formatting could be achieved by integrating the outcome of phases one and two.”, Fig. 1: 
    (image: Ajward Fig. 1)
]). 
Maggiori, Ajward, and the instant application are analogous art because they are all directed to using neural networks for classification.
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the invention to modify the encoder-decoder CNN disclosed by Maggiori to include conversion from paper forms to fillable digital forms as taught by Ajward. One would be motivated to do so to help users easily modify printed documents and search for words in printed documents, as suggested by Ajward (Ajward Abstract: “Digitization of text not only allows users to easily modify and reprint printed 
Ajward does not teach “A non-transitory computer readable storage medium including a set of instructions that, when executed by at least one processor, cause a computing device to:”
	Simantov teaches a system to determine semantic information from scanned documents using convolutional neural networks. Simantov teaches:
A non-transitory computer readable storage medium including a set of instructions that, when executed by at least one processor, cause a computing device to: (Simantov teaches non-volatile storage storing instructions that are executed by a processor to perform tasks. [Simantov col 6 lines 6-15: “In various embodiments of the disclosure, one or more tasks as described herein may be performed by a data processor, such as a computing platform or distributed computing system for executing a plurality of instructions. Optionally, the data processor includes or accesses a volatile memory for storing instructions, data or the like. Additionally, or alternatively, the data processor may access a non-volatile storage, for example, a magnetic hard-disk, flash-drive, removable media or the like, for storing instructions and/or data.”]). 
The combined system of Maggiori and Ajward, Simantov, and the instant application are analogous art because they are all directed to using neural networks for semantic analysis of scanned documents.
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the invention to modify the document digitization method disclosed by Maggiori in view of Ajward to include convolutional neural networks and hierarchical features as taught by Simantov. One would be motivated to do so to improve classification accuracy, as suggested by Simantov (Simantov col 22 lines 

Regarding Claim 5:
Maggiori in view of Ajward and Simantov teach “The non-transitory computer readable storage medium of claim 4” as seen above. 
Simantov further teaches: 
further comprising instructions that, when executed by the at least one processor, cause the computing device to: (Simantov teaches non-volatile storage storing instructions that are executed by a processor to perform tasks. [Simantov col 6 lines 6-15: “In various embodiments of the disclosure, one or more tasks as described herein may be performed by a data processor, such as a computing platform or distributed computing system for executing a plurality of instructions. Optionally, the data processor includes or accesses a volatile memory for storing instructions, data or the like. Additionally, or alternatively, the data processor may access a non-volatile storage, for example, a magnetic hard-disk, flash-drive, removable media or the like, for storing instructions and/or data.”]). 
Maggiori further teaches:
generate a low-level semantic segmentation map based on the determined one or more low-level semantic characteristics for the plurality of pixels; (Examiner notes that low-level semantic characteristics are interpreted as low-level features. Examiner further notes that “low-level semantic segmentation map” is not defined and is interpreted as a segmentation map that was created using low-level features. Maggiori teaches a resulting segmentation map based on the low-level features. [Maggiori sec II ¶4: “This enforces the networks to learn hierarchical features, performing low-level reasoning in the first layers (such as edge detection) and higher-level tasks in the last layers (e.g., assembling object parts). For this reason, the first and last layers are often referred to as lower and upper layers, respectively.”, sec. III. C. ¶5: “For example, it combines how a layer evaluates that an object is a building by using low-level information, with how another layer evaluates whether the same object is a building by using higher-level information”, Fig. 9: 
    (image: Maggiori Fig. 9)
]).  
and generate one or more high-level semantic segmentation maps based on the determined one or more high-level semantic characteristics for the plurality of pixels. (Examiner notes that high-level semantic characteristics are interpreted as high-level features. Examiner notes that high level semantic segmentation map is not defined and is interpreted as the digitized paper form showing at least the segmentation of high-level semantic characteristics. Maggiori teaches a resulting segmentation map based on the high-level features.  [Maggiori sec II ¶4: “This enforces the networks to learn hierarchical features, performing low-level reasoning in the first layers (such as edge detection) and higher-level tasks in the last layers (e.g., assembling object parts). For this reason, the first and last layers are often referred to as lower and upper layers, respectively.”, sec. III. C. ¶5: “For example, it combines how a layer evaluates that an object is a building by using low-level information, with how another layer evaluates whether the same object is a building by using higher-level information”, Fig. 9: see above]). 
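A segmentation map of the kind discussed above assigns one label to every pixel. As a minimal sketch, assuming per-pixel class scores are already available, the map can be formed by taking the highest-scoring class at each pixel. The class names and score values below are hypothetical and are not taken from Maggiori or the instant specification.

```python
# Illustrative sketch only; the class names and scores are hypothetical.

def segmentation_map(scores):
    """Turn per-pixel class scores, stored as [row][col][class], into a label map."""
    return [
        [max(range(len(pixel)), key=lambda k: pixel[k]) for pixel in row]
        for row in scores
    ]

CLASSES = ["background", "text", "widget"]  # hypothetical labels

scores = [
    [[0.9, 0.1, 0.0], [0.2, 0.7, 0.1]],
    [[0.1, 0.2, 0.7], [0.6, 0.3, 0.1]],
]
labels = segmentation_map(scores)                      # class index per pixel
named = [[CLASSES[k] for k in row] for row in labels]  # readable form
```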


Regarding Claim 8:
Maggiori in view of Ajward and Simantov teach “The non-transitory computer readable storage medium of claim 4” as seen above. 
Maggiori further teaches: 
wherein determining one or more low-level semantic characteristics for the plurality of pixels by processing the feature map using a neural network low-level semantic decoder comprises: (Examiner notes that low-level semantic characteristics are interpreted as low-level features and low-level semantic decoder is interpreted as a decoder that processes low-level feature maps. Maggiori teaches an encoder-decoder convolutional neural network. The lower and upper layers result in low-level and high-level feature maps, respectively. These low-level feature maps are processed by a decoder to produce segmentation maps.  [Maggiori sec. III B. ¶1-2: “a more advanced approach is to attach a multi-layer network to learn a complex upsampling function. This idea was simultaneously presented by different research groups [6], […] The convolutional layers are reflected as deconvolutional layers, and the pooling layers as unpooling layers (see Fig. 4). […] This concept can be thought of as an “encoder–decoder”, where the middle layer is seen as a common representation to images and classification maps, while the “encoder” and “decoder” ensure the translation between this representation and the two modalities”, sec B ¶4: “We also created a deconvolution network that exactly reflects the base FCN (as in [6]). This is straightforward, with deconvolutional and unpooling layers associated to every convolutional and pooling layer.”, sec. II. A. ¶2: “Multiple convolution kernels are usually learned in every layer, interpreted as a set of spatial feature detectors. The responses to every learned filter are thus referred to as feature maps.”, sec II ¶4: “This enforces the networks to learn hierarchical features, performing low-level reasoning in the first layers (such as edge detection) and higher-level tasks in the last layers (e.g., assembling object parts). For this reason, the first and last layers are often referred to as lower and upper layers, respectively.”, sec. III. C. 
¶5: “For example, it combines how a layer evaluates that an object is a building by using low-level information, with how another layer evaluates whether the same object is a building by using higher-level information.”, Fig. 9: 
 
    (image: Maggiori Fig. 9)
 ]).
Simantov further teaches:
passing an encoder-layer-level feature map from a layer of the neural network encoder directly to a corresponding layer of the neural network low-level semantic decoder using a skip connection; (Examiner notes that low-level semantic decoder is interpreted as a decoder that processes low-level feature maps. Simantov teaches using a U-Net, which is an encoder-decoder convolutional neural network (CNN). The U-Net was first introduced and described in “U-Net: Convolutional Networks for Biomedical Image Segmentation” by Ronneberger et al. (hereinafter Ronneberger) and is the architecture used by Simantov. As per “A New Era for Feature Extraction in Remotely Sensed Images by The Use of Machine Learning” by Orstavik et al., low-level features are extracted in the earlier layers of a CNN and higher-level features (i.e. semantic characteristics) are extracted in the deeper layers. As per Ronneberger and “Gated Convolutional Neural Network for Semantic Segmentation in High-Resolution Images” by Wang et al., the low-level feature maps from the encoder are processed by the decoder. Simantov teaches using a U-Net which uses skip connections to pass encoder-layer-level feature maps directly to corresponding layers of the decoder (shown highlighted in Fig. 10 of Simantov and Fig. 1 of Ronneberger). [Simantov col 21 lines 66-67: “a U - Net comprising an encoder 1014 and a decoder 1016;”, Fig. 10: 
    (image: Simantov Fig. 10)

Orstavik et al. sec. 2.2 ¶3: “The convolution layer is the core building block of CNNs, and contains filters. […] Each of the filters look for certain features in the image. Such features may, for example, be a curve, an edge or a feature of a specific color. Higher level filters (filters deeper into the network) look for combinations of these simpler features. The deeper into the network, the more complex the features become (Figure 7).”, Fig 7: 
    (image: Orstavik et al. Fig. 7)

Wang et al. sec. 1 ¶3: “For example, U-Net [13] modifies and extends the FCN by introducing concatenation structures between the corresponding encoder and decoder layers. The concatenation structure enables the decoder layers to reuse low-level feature maps with more details to achieve a more precise pixel-wise classification.”, Ronneberger Fig. 1: 
    (image: Ronneberger Fig. 1)
]). 
and combining the encoder-layer-level feature map with a previous feature map of a previous layer of the neural network low-level semantic decoder. (Examiner notes that low-level semantic decoder is interpreted as a decoder that processes low-level feature maps. Simantov teaches using U-Net for segmentation. An instance of combining the encoder-layer-level feature map with a previous feature map of a previous layer of the neural network low-level semantic decoder is shown highlighted below in Fig. 1 of Ronneberger. In Fig. 1, it can be seen that the resulting feature map from an encoder layer is copied and combined with the feature map from a previous decoder layer via concatenation. [Ronneberger sec. 2 ¶1: “The network architecture is illustrated in Figure 1. It consists of a contracting path (left side) and an expansive path (right side). […] Every step in the expansive path consists of an upsampling of the feature map followed by a 2x2 convolution (“up-convolution”) that halves the number of feature channels, a concatenation with the correspondingly cropped feature map from the contracting path, and two 3x3 convolutions, each followed by a ReLU.”, Fig. 1: 
    (image: Ronneberger Fig. 1)
    ]). 
It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the system of Maggiori with the teachings of Ajward and Simantov for at least the same reasons as discussed above in claim 4.
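The skip connection and combination discussed for this claim can be pictured with a minimal sketch, assuming toy feature maps stored as nested lists in [channel][row][column] order. The function names and values are hypothetical and are not taken from Simantov or Ronneberger; nearest-neighbour upsampling is used here only as a simple stand-in for a learned up-convolution.

```python
# Illustrative sketch only: toy feature maps stored as [channel][row][col].

def upsample2x(feature_map):
    """Nearest-neighbour 2x upsampling of every channel."""
    out = []
    for channel in feature_map:
        rows = []
        for row in channel:
            wide = [v for v in row for _ in range(2)]
            rows.append(wide)
            rows.append(list(wide))
        out.append(rows)
    return out

def concat_channels(a, b):
    """Channel-wise concatenation; the spatial sizes must agree."""
    if (len(a[0]), len(a[0][0])) != (len(b[0]), len(b[0][0])):
        raise ValueError("spatial sizes differ")
    return a + b

# Encoder-layer feature map (1 channel, 2x2), kept aside for the skip connection.
encoder_layer_map = [[[1, 2], [3, 4]]]
# Feature map of the previous decoder layer (1 channel, 1x1), at lower resolution.
previous_decoder_map = [[[9]]]

upsampled = upsample2x(previous_decoder_map)              # 1 channel, 2x2
combined = concat_channels(encoder_layer_map, upsampled)  # 2 channels, 2x2
```

The encoder map reaches the decoder unchanged, and the combined map carries the channels of both inputs side by side.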

Regarding Claim 9:
Maggiori in view of Ajward and Simantov teach “The non-transitory computer readable storage medium of claim 4” as seen above. 
Maggiori further teaches: 
wherein determining one or more high-level semantic characteristics for the plurality of pixels by processing the feature map using the one or more neural network high-level semantic decoders comprises: (Examiner notes that high-level semantic characteristics are interpreted as high-level features and high-level semantic decoder is interpreted as a decoder that processes high-level feature maps. Maggiori teaches an encoder-decoder convolutional neural network. The lower and upper layers result in low-level and high-level feature maps, respectively. These high-level feature maps are processed by a decoder to produce segmentation maps.  [Maggiori sec. III B. ¶1-2: “a more advanced approach is to attach a multi-layer network to learn a complex upsampling function. This idea was simultaneously presented by different research groups [6], […] The convolutional layers are reflected as deconvolutional layers, and the pooling layers as unpooling layers (see Fig. 4). […] This concept can be thought of as an “encoder–decoder”, where the middle layer is seen as a common representation to images and classification maps, while the “encoder” and “decoder” ensure the translation between this representation and the two modalities”, sec B ¶4: “We also created a deconvolution network that exactly reflects the base FCN (as in [6]). This is straightforward, with deconvolutional and unpooling layers associated to every convolutional and pooling layer.”, sec. II. A. ¶2: “Multiple convolution kernels are usually learned in every layer, interpreted as a set of spatial feature detectors. The responses to every learned filter are thus referred to as feature maps.”, sec II ¶4: “This enforces the networks to learn hierarchical features, performing low-level reasoning in the first layers (such as edge detection) and higher-level tasks in the last layers (e.g., assembling object parts). 
For this reason, the first and last layers are often referred to as lower and upper layers, respectively.”, sec. III. C. ¶5: “For example, it combines how a layer evaluates that an object is a building by using low-level information, with how another layer evaluates whether the same object is a building by using higher-level information.”, Fig. 9: 
 
    (image: Maggiori Fig. 9)
 ]).  
Simantov further teaches:
passing an encoder-layer-level feature map from a layer of the neural network encoder directly to a corresponding layer of a neural network high-level semantic decoder of the one or more neural network high-level semantic decoders using a skip connection; (Examiner notes that high-level semantic characteristics are interpreted as high-level features and high-level semantic decoder is interpreted as a decoder that processes high-level feature maps. Simantov teaches using a U-Net, which is an encoder-decoder convolutional neural network (CNN). The U-Net was first introduced and described in “U-Net: Convolutional Networks for Biomedical Image Segmentation” by Ronneberger et al. (hereinafter Ronneberger), and the architecture described by Ronneberger is the architecture used by Simantov. As per “A New Era for Feature Extraction in Remotely Sensed Images by The Use of Machine Learning” by Orstavik et al., low-level features are extracted in the earlier layers of a CNN and higher-level features (i.e. semantic characteristics) are extracted in the deeper layers. As per “Gated Convolutional Neural Network for Semantic Segmentation in High-Resolution Images” by Wang et al. and Ronneberger, the high-level feature maps from the encoder are processed by the decoder. [Simantov col 21 lines 66-67: “a U - Net comprising an encoder 1014 and a decoder 1016;”, Fig. 10: 
    (image: Simantov Fig. 10)

Orstavik et al. sec. 2.2 ¶3: “The convolution layer is the core building block of CNNs, and contains filters. […] Each of the filters look for certain features in the image. Such features may, for example, be a curve, an edge or a feature of a specific color. Higher level filters (filters deeper into the network) look for combinations of these simpler features. The deeper into the network, the more complex the features become (Figure 7).”, Fig 7: 
    (image: Orstavik et al. Fig. 7)

Wang et al. sec. 1 ¶3: “For example, U-Net [13] modifies and extends the FCN by introducing concatenation structures between the corresponding encoder and decoder layers. The concatenation structure enables the decoder layers to reuse low-level feature maps with more details to achieve a more precise pixel-wise classification.”, Ronneberger Fig. 1: 
    (image: Ronneberger Fig. 1)
]). 
and combining the encoder-layer-level feature map with a previous feature map of a previous layer of the neural network high-level semantic decoder. (Examiner notes that high-level semantic decoder is interpreted as a decoder that processes high-level feature maps. Simantov teaches using U-Net for segmentation. An instance of combining the encoder-layer-level feature map with a previous feature map of a previous layer of the neural network high-level semantic decoder is shown highlighted below in Fig. 1 of Ronneberger. In Fig. 1, it can be seen that the resulting feature map from an encoder layer is copied and combined with the feature map from a previous decoder layer via concatenation. [Ronneberger sec. 2 ¶1: “The network architecture is illustrated in Figure 1. It consists of a contracting path (left side) and an expansive path (right side). […] Every step in the expansive path consists of an upsampling of the feature map followed by a 2x2 convolution (“up-convolution”) that halves the number of feature channels, a concatenation with the correspondingly cropped feature map from the contracting path, and two 3x3 convolutions, each followed by a ReLU.”, Fig. 1: 

    (image: Ronneberger Fig. 1)
]). 
It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the system of Maggiori with the teachings of Ajward and Simantov for at least the same reasons as discussed above in claim 4.

Regarding Claim 10:
Maggiori in view of Ajward and Simantov teach “The non-transitory computer readable storage medium of claim 4” as seen above. 
Simantov further teaches:
further comprising instructions that, when executed by the at least one processor, cause the computing device to (Simantov teaches non-volatile storage storing instructions that are executed by a processor to perform tasks. [Simantov col 6 lines 6-15: “In various embodiments of the disclosure, one or more tasks as described herein may be performed by a data processor, such as a computing platform or distributed computing system for executing a plurality of instructions. Optionally, the data processor includes or accesses a volatile memory for storing instructions, data or the like. Additionally, or alternatively, the data processor may access a non-volatile storage, for example, a magnetic hard-disk, flash-drive, removable media or the like, for storing instructions and/or data.”]). 
use the form conversion neural network trained to determine semantic structures of digitized paper forms by generating a common trunk feature map by processing the feature map using a form conversion neural network common semantic segmentation trunk. (Examiner notes that a conversion neural network is interpreted as a convolutional neural network (CNN) as per ¶38 of the instant specifications and that a digitized paper form is merely an image. Examiner notes that a “form conversion neural network common semantic segmentation trunk” or “common trunk feature map” is not defined. As per ¶106 of the instant specifications, the form conversion neural network passes an encoder-layer-level feature map from a layer of the neural network encoder directly to a corresponding layer of the form conversion neural network common semantic segmentation trunk using the skip connection and combines the encoder-layer-level feature map with a previous feature map of a previous layer of the form conversion neural network common semantic segmentation trunk. The form conversion neural network common semantic segmentation trunk is interpreted to be the decoder and the common trunk feature map is interpreted to be the feature map of a layer of the decoder, as decoders perform the same function. Simantov teaches using U-Net, comprised of encoder and decoder, to determine semantic information. In the highlighted portion of Fig. 1 of Ronneberger, it can be seen that the feature map resulting from an encoder layer is processed by a decoder layer by convolution to generate a new feature map (i.e. a common trunk feature map). [Simantov col 14 lines 52-56: “Thus, the original invoice image 642 and the embedding channels 646 are passed into a convolutional neural network (CNN), such that it may extract deeper relational meanings from locations, image properties and the word-content of the invoice image.”, col 1 lines 47-49: “FIG. 10 is a schematic block diagram representing one possible system architecture of an encoder-decoder patterned convolutional network (U-Net), according to one embodiment of the current disclosure”, col 21 lines 66-67: “a U - Net comprising an encoder 1014 and a decoder 1016;”, 
Fig. 10: 
    (image: Simantov Fig. 10)

Ronneberger Fig. 1: 
    (image: Ronneberger Fig. 1)
]).


Regarding Claim 11:
Maggiori in view of Ajward and Simantov teach “The non-transitory computer readable storage medium of claim 10” as seen above. 
Simantov further teaches:
wherein generating a common trunk feature map by processing the feature map using the form conversion neural network common semantic segmentation trunk comprises: (Examiner notes that a “form conversion neural network common semantic segmentation trunk” or “common trunk feature map” is not defined. As per ¶106 of the instant specifications, the form conversion neural network passes an encoder-layer-level feature map from a layer of the neural network encoder directly to a corresponding layer of the form conversion neural network common semantic segmentation trunk using the skip connection and combines the encoder-layer-level feature map with a previous feature map of a previous layer of the form conversion neural network common semantic segmentation trunk. The form conversion neural network common semantic segmentation trunk is interpreted to be the decoder and the common trunk feature map is interpreted to be the feature map of a layer of the decoder, as decoders perform the same function. Simantov teaches using U-Net, comprised of encoder and decoder, to determine semantic information. In the highlighted portion of Fig. 1 of Ronneberger, it can be seen that the feature map resulting from an encoder layer is processed by a decoder layer by convolution to generate a new feature map (i.e. a common trunk feature map). [Simantov col 14 lines 52-56: “Thus, the original invoice image 642 and the embedding channels 646 are passed into a convolutional neural network (CNN), such that it may extract deeper relational meanings from locations, image properties and the word-content of the invoice image.”, col 1 lines 47-49: “FIG. 10 is a schematic block diagram representing one possible system architecture of an encoder-decoder patterned convolutional network (U-Net), according to one embodiment of the current disclosure”, col 21 lines 66-67: “a U - Net comprising an encoder 1014 and a decoder 1016;”, 
Fig. 10: 
    (image: Simantov Fig. 10)

Ronneberger Fig. 1:  
    (image: Ronneberger Fig. 1)
]).
passing an encoder-layer-level feature map from a layer of the neural network encoder directly to a corresponding layer of the form conversion neural network common semantic segmentation trunk using a skip connection; (Examiner notes that a “form conversion neural network common semantic segmentation trunk” is not defined and is interpreted as a decoder as per ¶106 of the instant specifications because decoders perform the same function (see above). Simantov teaches using a U-Net which uses skip connections to pass encoder-layer-level feature maps directly to corresponding layers of the decoder (shown highlighted in Fig. 10 of Simantov and Fig. 1 of Ronneberger). [Simantov col 21 lines 66-67: “a U - Net comprising an encoder 1014 and a decoder 1016;”, Fig. 10: 
    (image: Simantov Fig. 10)
, 
Ronneberger Fig. 1: 
    (image: Ronneberger Fig. 1)
]). 
and combining the encoder-layer-level feature map with a previous feature map of a previous layer of the form conversion neural network common semantic segmentation trunk to generate a combined feature map. (Examiner notes that a “form conversion neural network common semantic segmentation trunk” is not defined and is interpreted as a decoder as per ¶106 of the instant specifications because decoders perform the same function (see above). Simantov teaches using U-Net comprising an encoder and a decoder. An instance of combining the encoder-layer-level feature map with a previous feature map of a previous layer of the decoder is shown highlighted below in Fig. 1 of Ronneberger. In Fig. 1, it can be seen that the resulting feature map from an encoder layer is copied and combined with the feature map from a previous decoder layer via concatenation. [Ronneberger sec. 2 ¶1: “The network architecture is illustrated in Figure 1. It consists of a contracting path (left side) and an expansive path (right side). […] Every step in the expansive path consists of an upsampling of the feature map followed by a 2x2 convolution (“up-convolution”) that halves the number of feature channels, a concatenation with the correspondingly cropped feature map from the contracting path, and two 3x3 convolutions, each followed by a ReLU.”, Fig. 1: 
[Figure: media_image9.png]
]). 
It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the system of Maggiori with the teachings of Ajward and Simantov for at least the same reasons as discussed above in claim 4.
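For illustration only, the skip-connection and concatenation steps described in the Ronneberger passage quoted above can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation of Simantov or Ronneberger: feature maps are plain nested lists indexed [channel][row][column], the learned 2x2 “up-convolution” is replaced by nearest-neighbour upsampling, and the function names `center_crop`, `upsample2x`, and `skip_concat` are hypothetical.

```python
def center_crop(fmap, out_h, out_w):
    """Crop each channel of a [C][H][W] feature map to out_h x out_w around its center."""
    h, w = len(fmap[0]), len(fmap[0][0])
    top, left = (h - out_h) // 2, (w - out_w) // 2
    return [[row[left:left + out_w] for row in ch[top:top + out_h]] for ch in fmap]


def upsample2x(fmap):
    """Nearest-neighbour 2x upsampling (an assumed stand-in for the learned up-convolution)."""
    out = []
    for ch in fmap:
        rows = []
        for row in ch:
            wide = [v for v in row for _ in range(2)]  # repeat each value horizontally
            rows.append(wide)
            rows.append(list(wide))  # repeat each row vertically
        out.append(rows)
    return out


def skip_concat(encoder_fmap, decoder_fmap):
    """Combine an encoder feature map with the upsampled previous decoder
    feature map by concatenating along the channel axis."""
    up = upsample2x(decoder_fmap)
    cropped = center_crop(encoder_fmap, len(up[0]), len(up[0][0]))
    return cropped + up  # list concatenation == channel concatenation


# Hypothetical data: a 2-channel 6x6 encoder map and a 1-channel 2x2 decoder map.
encoder_fmap = [[[c * 100 + r * 10 + col for col in range(6)] for r in range(6)]
                for c in range(2)]
decoder_fmap = [[[1, 2], [3, 4]]]
combined = skip_concat(encoder_fmap, decoder_fmap)
print(len(combined), len(combined[0]), len(combined[0][0]))
```

The combined feature map carries the channels of both inputs (here 2 + 1 = 3), which is what allows later decoder layers to reuse the more detailed encoder features.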

Regarding Claim 12:
Maggiori in view of Ajward and Simantov teaches “The non-transitory computer readable storage medium of claim 11” as seen above. 
Simantov further teaches:
wherein combining the encoder-layer-level feature map with the previous feature map comprises concatenating the encoder-layer-level feature map with the previous feature map. (Simantov teaches using a U-Net comprising an encoder and a decoder. In Fig. 1 of Ronneberger, it can be seen that the resulting feature map from an encoder layer is copied and combined with the feature map from a previous decoder layer via concatenation. [Ronneberger sec. 2 ¶1: “The network architecture is illustrated in Figure 1. It consists of a contracting path (left side) and an expansive path (right side). […] Every step in the expansive path consists of an upsampling of the feature map followed by a 2x2 convolution (“up-convolution”) that halves the number of feature channels, a concatenation with the correspondingly cropped feature map from the contracting path, and two 3x3 convolutions, each followed by a ReLU.”, Fig. 1: 
[Figure: media_image9.png]
]). 
It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the system of Maggiori with the teachings of Ajward and Simantov for at least the same reasons as discussed above in claim 4.

Regarding Claim 13:
Maggiori in view of Ajward and Simantov teaches “The non-transitory computer readable storage medium of claim 11” as seen above.
Maggiori further teaches:
wherein: determining one or more low-level semantic characteristics for the plurality of pixels by processing the feature map using a neural network low-level semantic decoder comprises (Examiner notes that low-level semantic characteristics are interpreted as low-level features and low-level semantic decoder is interpreted as a decoder that processes low-level feature maps. Maggiori teaches an encoder-decoder convolutional neural network. The lower and upper layers result in low-level and high-level feature maps, respectively. These low-level feature maps are processed by a decoder to produce segmentation maps.  [Maggiori sec. III B. ¶1-2: “a more advanced approach is to attach a multi-layer network to learn a complex upsampling function. This idea was simultaneously presented by different research groups [6], […] The convolutional layers are reflected as deconvolutional layers, and the pooling layers as unpooling layers (see Fig. 4). […] This concept can be thought of as an “encoder–decoder”, where the middle layer is seen as a common representation to images and classification maps, while the “encoder” and “decoder” ensure the translation between this representation and the two modalities”, sec B ¶4: “We also created a deconvolution network that exactly reflects the base FCN (as in [6]). This is straightforward, with deconvolutional and unpooling layers associated to every convolutional and pooling layer.”, sec. II. A. ¶2: “Multiple convolution kernels are usually learned in every layer, interpreted as a set of spatial feature detectors. The responses to every learned filter are thus referred to as feature maps.”, sec II ¶4: “This enforces the networks to learn hierarchical features, performing low-level reasoning in the first layers (such as edge detection) and higher-level tasks in the last layers (e.g., assembling object parts). For this reason, the first and last layers are often referred to as lower and upper layers, respectively.”, sec. III. C. 
¶5: “For example, it combines how a layer evaluates that an object is a building by using low-level information, with how another layer evaluates whether the same object is a building by using higher-level information.”, Fig. 9: see above]). 
processing the common trunk feature map using the neural network low-level semantic decoder, (Examiner notes that low-level semantic decoder is interpreted as a decoder that processes low-level feature maps. Examiner notes that “common trunk feature map” is interpreted as a feature map resulting from a layer of the decoder (see above for further detail). Maggiori teaches an encoder-decoder convolutional neural network. The lower and upper layers result in low-level and high-level feature maps, respectively. Corresponding layers of the decoder process the feature maps to generate a new feature map. The resulting decoder feature maps are then processed by the next layer of the decoder. [Maggiori sec. III B. ¶1-2: “a more advanced approach is to attach a multi-layer network to learn a complex upsampling function. This idea was simultaneously presented by different research groups [6], […] The convolutional layers are reflected as deconvolutional layers, and the pooling layers as unpooling layers (see Fig. 4). […] This concept can be thought of as an “encoder–decoder”, where the middle layer is seen as a common representation to images and classification maps, while the “encoder” and “decoder” ensure the translation between this representation and the two modalities”, sec B ¶4: “We also created a deconvolution network that exactly reflects the base FCN (as in [6]). This is straightforward, with deconvolutional and unpooling layers associated to every convolutional and pooling layer.”, sec. II. A. ¶2: “Multiple convolution kernels are usually learned in every layer, interpreted as a set of spatial feature detectors. The responses to every learned filter are thus referred to as feature maps.”, sec. II. C. ¶4: “This gave birth to the so-called deconvolutional (or upconvolutional) layer, which upsamples a feature map by interpolating neighboring elements (as the last layer in Fig. 
1).”, sec II ¶4: “Instead of directly connecting a huge set of neurons to the input, it is common to organize them in groups of stacked layers that transform the outputs of the previous layer and feed it to the next layer. This enforces the networks to learn hierarchical features, performing low-level reasoning in the first layers (such as edge detection) and higher-level tasks in the last layers (e.g., assembling object parts). For this reason, the first and last layers are often referred to as lower and upper layers, respectively.”, sec. III. C. ¶5: “For example, it combines how a layer evaluates that an object is a building by using low-level information, with how another layer evaluates whether the same object is a building by using higher-level information.”, Fig. 9: see above, Fig. 4: 
[Figure: media_image12.png]
]). 
determining one or more high-level semantic characteristics for the plurality of pixels by processing the feature map using one or more neural network high-level semantic decoders comprises (Examiner notes that high-level semantic characteristics are interpreted as high-level features and high-level semantic decoder is interpreted as a decoder that processes high-level feature maps. Maggiori teaches an encoder-decoder convolutional neural network. The lower and upper layers result in low-level and high-level feature maps, respectively. These high-level feature maps are processed by a decoder to produce segmentation maps. [Maggiori sec. III B. ¶1-2: “a more advanced approach is to attach a multi-layer network to learn a complex upsampling function. This idea was simultaneously presented by different research groups [6], […] The convolutional layers are reflected as deconvolutional layers, and the pooling layers as unpooling layers (see Fig. 4). […] This concept can be thought of as an “encoder–decoder”, where the middle layer is seen as a common representation to images and classification maps, while the “encoder” and “decoder” ensure the translation between this representation and the two modalities”, sec B ¶4: “We also created a deconvolution network that exactly reflects the base FCN (as in [6]). This is straightforward, with deconvolutional and unpooling layers associated to every convolutional and pooling layer.”, sec. II. A. ¶2: “Multiple convolution kernels are usually learned in every layer, interpreted as a set of spatial feature detectors. The responses to every learned filter are thus referred to as feature maps.”, sec II ¶4: “This enforces the networks to learn hierarchical features, performing low-level reasoning in the first layers (such as edge detection) and higher-level tasks in the last layers (e.g., assembling object parts). For this reason, the first and last layers are often referred to as lower and upper layers, respectively.”, sec. III. C. ¶5: “For example, it combines how a layer evaluates that an object is a building by using low-level information, with how another layer evaluates whether the same object is a building by using higher-level information.”, Fig. 9: see above]).  
processing the common trunk feature map using the one or more neural network high- level semantic decoders. (Examiner notes that high-level semantic decoder is interpreted as a decoder that processes high-level feature maps. Examiner notes that “common trunk feature map” is interpreted as a feature map resulting from a layer of the decoder (see above for further detail). Maggiori teaches an encoder-decoder convolutional neural network. The lower and upper layers result in low-level and high-level feature maps, respectively. Corresponding layers of the decoder process the feature maps to generate a new feature map. The resulting decoder feature maps are then processed by the next layer of the decoder. [Maggiori sec. III B. ¶1-2: “a more advanced approach is to attach a multi-layer network to learn a complex upsampling function. This idea was simultaneously presented by different research groups [6], […] The convolutional layers are reflected as deconvolutional layers, and the pooling layers as unpooling layers (see Fig. 4). […] This concept can be thought of as an “encoder–decoder”, where the middle layer is seen as a common representation to images and classification maps, while the “encoder” and “decoder” ensure the translation between this representation and the two modalities”, sec B ¶4: “We also created a deconvolution network that exactly reflects the base FCN (as in [6]). This is straightforward, with deconvolutional and unpooling layers associated to every convolutional and pooling layer.”, sec. II. A. ¶2: “Multiple convolution kernels are usually learned in every layer, interpreted as a set of spatial feature detectors. The responses to every learned filter are thus referred to as feature maps.”, sec. II. C. ¶4: “This gave birth to the so-called deconvolutional (or upconvolutional) layer, which upsamples a feature map by interpolating neighboring elements (as the last layer in Fig. 
1).”, sec II ¶4: “Instead of directly connecting a huge set of neurons to the input, it is common to organize them in groups of stacked layers that transform the outputs of the previous layer and feed it to the next layer. This enforces the networks to learn hierarchical features, performing low-level reasoning in the first layers (such as edge detection) and higher-level tasks in the last layers (e.g., assembling object parts). For this reason, the first and last layers are often referred to as lower and upper layers, respectively.”, sec. III. C. ¶5: “For example, it combines how a layer evaluates that an object is a building by using low-level information, with how another layer evaluates whether the same object is a building by using higher-level information.”, Fig. 9: see above, Fig. 4: 
[Figure: media_image12.png]
]).  
It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the system of Maggiori with the teachings of Ajward and Simantov for at least the same reasons as discussed above in claim 4.

Regarding Claim 14:
Maggiori in view of Ajward and Simantov teaches “The non-transitory computer readable storage medium of claim 4” as seen above.
Maggiori further teaches: 
generate [the fillable digital form] based on the determined one or more low-level semantic characteristics and the determined one or more high-level semantic characteristics by classifying each pixel of the plurality of pixels as one of a plurality of element types. (Examiner notes that low-level and high-level semantic characteristics are interpreted as low-level and high-level features, respectively. Maggiori teaches semantic labeling which is the classification of every pixel in an image. To do this, Maggiori teaches using encoder-decoder convolutional neural network (i.e. the trained form conversion neural network) to learn hierarchical features of an image: low level and high level. Maggiori teaches generating a segmentation digital map based on those low-level and high-level features. In an example shown in Fig. 8, pixels are classified as elements such as building, tree, car, etc. [Maggiori Abstract: “In this paper we address the problem of dense semantic labeling, which consists in assigning a semantic label to every pixel in an image. […] Out of these observations, we then derive a CNN framework specifically adapted to the semantic labeling problem. In addition to learning features at different resolutions, it learns how to combine these features.”, sec II ¶4: “This enforces the networks to learn hierarchical features, performing low-level reasoning in the first layers (such as edge detection) and higher-level tasks in the last layers (e.g., assembling object parts). For this reason, the first and last layers are often referred to as lower and upper layers, respectively.”, sec. IV. ¶9: “The proposed technique is intended to learn how to combine information at different resolutions, not how to upsample a low-resolution classification. 
An example of the type of relation we are able to convey in this scheme is as follows: “label a pixel as building if it is red and belongs to a larger red rectangular structure, which is surrounded by areas of green vegetation and near a road”.”, Fig. 8: 

[Figure: media_image13.png]
]).

Ajward further teaches:
generate the fillable digital form based on the determined one or more [low-level] semantic characteristics and [the determined one or more high-level] semantic characteristics by classifying [each pixel of the plurality of pixels] as one of a plurality of element types. (Ajward teaches a method for converting scanned forms into editable text (i.e. fillable digital forms). Characters of text are examples of semantic characteristics. To convert to editable text, Ajward teaches classifying each connected component as Sinhala characters. As seen above, the determined low-level/high-level semantic characteristics and classifying each pixel were taught by Maggiori and this limitation is taught by the combination of Maggiori and Ajward.  [Ajward sec. 1 ¶1: “The typical process of digitizing text document is performed by scanning the printed copies to images and converting them to editable text”, sec. 1 ¶5: “As depicted in Fig 1, the project could be divided into two phases, (1) character recognition using an OCR technique and (2) extracting and preserving the layout (formatting) information of the document. As illustrated, the outcome, an editable Sinhala document that preserve formatting could be achieved by integrating the outcome of phases one and two.”, sec. III. B. ¶2-4: “Neural Network was trained for characters obtained from the pre-processed image. […] The designed neural network was trained for ten input alphabets to get higher accuracy trained system.”]).  
Simantov further teaches:
wherein the instructions, when executed by the at least one processor, cause the computing device to (Simantov teaches non-volatile storage storing instructions that are executed by a processor to perform tasks. [Simantov col 6 lines 6-15: “In various embodiments of the disclosure, one or more tasks as described herein may be performed by a data processor, such as a computing platform or distributed computing system for executing a plurality of instructions. Optionally, the data processor includes or accesses a volatile memory for storing instructions, data or the like. Additionally, or alternatively, the data processor may access a non-volatile storage, for example, a magnetic hard-disk, flash-drive, removable media or the like, for storing instructions and/or data.”]). 
It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the system of Maggiori with the teachings of Ajward and Simantov for at least the same reasons as discussed above in claim 4.
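For illustration only, the per-pixel classification discussed for claim 14 (assigning each pixel one of a plurality of element types) can be sketched as picking, for every pixel, the class with the highest score. This is a minimal sketch under stated assumptions, not the method of Maggiori or Ajward: the function name `classify_pixels`, the element-type labels, and the score values are all hypothetical, and the per-class score maps are assumed to have already been produced by a segmentation network.

```python
def classify_pixels(score_maps, labels):
    """Given per-class score maps indexed [class][row][column], return a
    [row][column] map holding the label of the highest-scoring class per pixel."""
    height, width = len(score_maps[0]), len(score_maps[0][0])
    label_map = []
    for r in range(height):
        row = []
        for c in range(width):
            # Index of the class whose score is largest at pixel (r, c).
            best = max(range(len(score_maps)), key=lambda k: score_maps[k][r][c])
            row.append(labels[best])
        label_map.append(row)
    return label_map


# Hypothetical element types and scores for a 2x2 image.
labels = ["background", "text", "widget"]
score_maps = [
    [[0.7, 0.1], [0.2, 0.3]],  # scores for "background"
    [[0.2, 0.8], [0.1, 0.3]],  # scores for "text"
    [[0.1, 0.1], [0.7, 0.4]],  # scores for "widget"
]
label_map = classify_pixels(score_maps, labels)
print(label_map)
```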

Claims 6-7 are rejected under 35 U.S.C. 103 as being unpatentable over Maggiori in view of Ajward, Simantov, and Zhang. 

Regarding Claim 6:
Maggiori in view of Ajward and Simantov teaches “The non-transitory computer readable storage medium of claim 5” as seen above.
Ajward further teaches:
process the feature map [using a neural network reconstruction decoder] to generate a reconstructed layout corresponding to the digitized paper form (Ajward teaches decoding a feature map to reconstruct the digitized paper document. [Ajward sec. V. ¶1: “The text file with recognized characters and the encoded file with extracted features are used to generate an html file mapped to the original scanned document as shown in Fig 8. In this html file, the encoded features are decoded and applied to the corresponding recognized characters. The html file can be loaded to the editor and can be converted to RTF file which facilitate any advance modifications.”, sec. I. ¶8: “The recognized characters (the outcome of phase one) are embedded with identified features (the outcome of phase two) to reconstruct the original document in Rich Text Format (RTF [10]) format in an editor.”]). 
wherein, the instructions, when executed by the at least one processor, cause the computing device to generate the fillable digital form further based on the reconstructed layout. (Ajward teaches generating the fillable digital form by combining the reconstructed layout and recognized text. [Ajward sec. 1 ¶5: “As depicted in Fig 1, the project could be divided into two phases, (1) character recognition using an OCR technique and (2) extracting and preserving the layout (formatting) information of the document. As illustrated, the outcome, an editable Sinhala document that preserve formatting could be achieved by integrating the outcome of phases one and two.”, Fig. 1: 
[Figure: media_image3.png]
]). 
Simantov further teaches: 
further comprising instructions that, when executed by the at least one processor, cause the computing device to (Simantov teaches non-volatile storage storing instructions that are executed by a processor to perform tasks. [Simantov col 6 lines 6-15: “In various embodiments of the disclosure, one or more tasks as described herein may be performed by a data processor, such as a computing platform or distributed computing system for executing a plurality of instructions. Optionally, the data processor includes or accesses a volatile memory for storing instructions, data or the like. Additionally, or alternatively, the data processor may access a non-volatile storage, for example, a magnetic hard-disk, flash-drive, removable media or the like, for storing instructions and/or data.”]). 
Ajward, Simantov, and Maggiori do not explicitly teach “process the feature map using a neural network reconstruction decoder to generate a reconstructed layout corresponding to the digitized paper form.”
Zhang teaches a method for classification using neural networks. Zhang teaches:
process the feature map using a neural network reconstruction decoder to generate a reconstructed layout corresponding to the digitized paper [form] (Examiner notes that digitized paper is merely an image. Examiner further notes that a layout is the way in which parts of something are arranged and through reconstructing the image, the layout of the image will also be reconstructed. Zhang teaches augmenting an encoder-decoder convolutional neural network (i.e. a conversion neural network) with an auxiliary decoding pathway used for reconstruction of an input image. [Zhang sec. 1 ¶4: “we augment challenge-winning neural networks with decoding pathways for reconstruction, demonstrating the feasibility of improving high-capacity networks for largescale image classification. Specifically, we take a segment of the classification network as the encoder and use the mirrored architecture as the decoding pathway to build several autoencoder variants.”, sec. 3.2 para 3: “The auxiliary training signals of SAE-first emerge from the bottom of the decoding pathway, and they get merged with the top-down signals for classification at the last convolution-pooling macro-layer into the encoder pathway.”, Fig. 5: 
[Figure: media_image4.png]
]).   
The combined system of Maggiori, Ajward, and Simantov; Zhang; and the instant application are analogous art because they are all directed to using neural networks for classification.
It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the document digitization method disclosed by Maggiori in view of Ajward and Simantov to include a convolutional neural network reconstruction decoder as taught by Zhang. One would be motivated to do so to improve performance by reducing error, as suggested by Zhang (Zhang sec. 5 ¶1: “We proposed a simple and effective way to incorporate unsupervised objectives into large-scale classification network learning by augmenting the existing network with reconstructive decoding pathways. […] This method improved the performance of the 16-layer VGGNet, one of the best existing networks for image classification by a noticeable margin.”, sec. 4.3 ¶6: “In particular, compared to the VGGNet baseline, the SWWAE-all model reduced the top-1 errors by 1.66% and 1.18% for the single-crop and convolution schemes, respectively. It also reduced the top-5 errors by 1.01% and 0.81%, which are 10% and 9% relative to the baseline errors.”).

Regarding Claim 7:
Maggiori in view of Ajward and Simantov teaches “The non-transitory computer readable storage medium of claim 6” as seen above.
Maggiori further teaches:
[generate the fillable digital form by] combining the low-level semantic segmentation map, the one or more high-level semantic segmentation maps, [and the reconstructed layout] (Maggiori teaches a segmentation map based on the low- and high-level features. [Maggiori sec II ¶4: “This enforces the networks to learn hierarchical features, performing low-level reasoning in the first layers (such as edge detection) and higher-level tasks in the last layers (e.g., assembling object parts). For this reason, the first and last layers are often referred to as lower and upper layers, respectively.”, sec. III. C. ¶5: “For example, it combines how a layer evaluates that an object is a building by using low-level information, with how another layer evaluates whether the same object is a building by using higher-level information.”, Fig. 9: 

[Figure: media_image1.png]
]).

Ajward further teaches:
generate the fillable digital form by combining the [low-level semantic segmentation map, the one or more high-level] semantic [segmentation] maps, and the reconstructed layout. (Ajward teaches generating the fillable digital form by combining the reconstructed layout and mapped semantic characteristics. As seen above, the determined low-level/high-level semantic characteristics and segmentation maps were taught by Maggiori and this limitation is taught by the combination of Maggiori and Ajward. [Ajward sec. 1 ¶5: “As depicted in Fig 1, the project could be divided into two phases, (1) character recognition using an OCR technique and (2) extracting and preserving the layout (formatting) information of the document. As illustrated, the outcome, an editable Sinhala document that preserve formatting could be achieved by integrating the outcome of phases one and two.”, sec. III. B para 5: “When simulating, each and every character in the image file is recognized and written to a text file. The resulted text file consists with mapped English characters and converted in to Sinhala at the final stage.”, Fig. 3: “Example for mapping between inputs and target of the neural network”, Fig. 1: 
[Figure: media_image3.png]
]). 
Simantov further teaches: 
wherein the instructions, when executed by the at least one processor, cause the computing device to (Simantov teaches non-volatile storage storing instructions that are executed by a processor to perform tasks. [Simantov col 6 lines 6-15: “In various embodiments of the disclosure, one or more tasks as described herein may be performed by a data processor, such as a computing platform or distributed computing system for executing a plurality of instructions. Optionally, the data processor includes or accesses a volatile memory for storing instructions, data or the like. Additionally, or alternatively, the data processor may access a non-volatile storage, for example, a magnetic hard-disk, flash-drive, removable media or the like, for storing instructions and/or data.”]). 
Zhang further teaches: 
[generate the fillable digital form by] combining [the low-level semantic segmentation map, the one or more high-level semantic] segmentation maps, and the reconstructed layout (Examiner notes that a layout is the way in which parts of something are arranged and through reconstructing the image, the layout of the image will also be reconstructed. Zhang teaches classification using an encoder-decoder convolutional neural network with an auxiliary reconstruction decoder. As seen above, the low-level/high-level segmentation maps were taught by Maggiori and the generation of a fillable digital form was taught by Ajward, and the combination of Maggiori, Zhang, and Ajward teaches this limitation. [Zhang sec I. ¶5: “Specifically, we take a segment of the classification network as the encoder and use the mirrored architecture as the decoding pathway”, sec. 5 ¶1: “We proposed a simple and effective way to incorporate unsupervised objectives into large-scale classification network learning by augmenting the existing network with reconstructive decoding pathways.”, sec. 3.2 para 3: “The auxiliary training signals of SAE-first emerge from the bottom of the decoding pathway, and they get merged with the top-down signals for classification at the last convolution-pooling macro-layer into the encoder pathway.”, Fig. 5: 
[Figure: media_image4.png]
]).  
It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the system of Maggiori in view of Ajward and Simantov with the teachings of Zhang for at least the same reasons as discussed above in claim 6.

Claims 15-20 are rejected under 35 U.S.C. 103 as being unpatentable over Simantov in view of Ajward and Zhang. 

Regarding Claim 15:
Simantov teaches a system to determine semantic information from scanned documents using convolutional neural networks. Simantov teaches:
a memory component comprising: (The “memory component” is interpreted to be memory used for storing data, metadata, and programs for execution by the processor as per ¶127 of the specification. Simantov teaches non-volatile storage storing instructions that are executed by a processor to perform tasks. [Simantov col 6 lines 6-15: “In various embodiments of the disclosure, one or more tasks as described herein may be performed by a data processor, such as a computing platform or distributed computing system for executing a plurality of instructions. Optionally, the data processor includes or accesses a volatile memory for storing instructions, data or the like. Additionally, or alternatively, the data processor may access a non-volatile storage, for example, a magnetic hard-disk, flash-drive, removable media or the like, for storing instructions and/or data.”]). 
analytics training data associated with a plurality of pixels of training digitized paper forms; (Examiner notes that “analytics training data” is interpreted to be training data. Simantov teaches training data comprised of images of invoices (i.e. digitized paper forms). [Simantov col 17 lines 38-44: “We train our model using a mix of supervised and unsupervised techniques, with several innovations to control over- and under-fitting, as well as the sparse loss function. Since datasets for receipt image analysis are in a dire short supply, we built our own proprietary dataset that consists of 5,094 images of invoices with 23,013 human-tagged bits of information.”]). 
and a neural network encoder that outputs feature maps to [a form conversion neural network reconstruction decoder,] a neural network low-level semantic decoder, and one or more neural network high-level semantic decoders; (Examiner notes that a “form conversion neural network” is interpreted as a convolutional neural network as per ¶38 of the instant specification. Examiner notes that low-level and high-level semantic decoders are interpreted as decoders that process low-level and high-level feature maps, respectively. Simantov teaches using a U-Net, which is an encoder-decoder convolutional neural network (CNN). As per “A New Era for Feature Extraction in Remotely Sensed Images by The Use of Machine Learning” by Orstavik et al., higher-level features (i.e. semantic characteristics) are extracted deeper in the CNN and low-level features are extracted in earlier layers. As per “Gated Convolutional Neural Network for Semantic Segmentation in High-Resolution Images” by Wang et al., the low-level feature maps from the encoder are processed by the decoder. [Simantov col 21 lines 58-62: “there is provided a general schematic block diagram representing one possible system architecture, which is generally indicated at 1000, of an encoder-decoder patterned convolutional network (U-Net)”, Orstavik et al. sec. 2.2 ¶3: “The convolution layer is the core building block of CNNs, and contains filters. […] Each of the filters look for certain features in the image. Such features may, for example, be a curve, an edge or a feature of a specific color. Higher level filters (filters deeper into the network) look for combinations of these simpler features. The deeper into the network, the more complex the features become (Figure 7).”, Fig. 7: 
[Figure: media_image7.png]
,
Wang et al. sec. 1 ¶3: “For example, U-Net [13] modifies and extends the FCN by introducing concatenation structures between the corresponding encoder and decoder layers. The concatenation structure enables the decoder layers to reuse low-level feature maps with more details to achieve a more precise pixel-wise classification.”]).  
at least one server; (Simantov teaches a server. [Simantov col 9 lines 36-37: “Optionally, the computing server hosting the expense management system is protected by a firewall 115.”]). 
and at least one non-transitory computer readable storage medium storing instructions thereon that, when executed by the at least one server, cause the system to: (Simantov teaches non-volatile storage storing instructions that are executed by a distributed computing system (which is comprised of servers). [Simantov col 6 lines 6-15: “In various embodiments of the disclosure, one or more tasks as described herein may be performed by a data processor, such as a computing platform or distributed computing system for executing a plurality of instructions. Optionally, the data processor includes or accesses a volatile memory for storing instructions, data or the like. Additionally, or alternatively, the data processor may access a non-volatile storage, for example, a magnetic hard-disk, flash-drive, removable media or the like, for storing instructions and/or data.”]). 
generate a feature map by processing a digitized paper form comprising a plurality of pixels using the neural network encoder; (Simantov teaches using a U-Net to process digitized receipts (i.e. digitized paper forms comprising pixels) to determine semantic characteristics. U-Nets were first introduced and described in “U-Net: Convolutional Networks for Biomedical Image Segmentation” by Ronneberger et al. As shown in Ronneberger Fig. 1, the layers of the encoder (on the left side of the U) generate feature maps (shown as the blue boxes in Fig. 1). [Simantov col 1 lines 47-49: “FIG. 10 is a schematic block diagram representing one possible system architecture of an encoder-decoder patterned convolutional network (U-Net), according to one embodiment of the current disclosure”, col 21 lines 66-67: “a U-Net comprising an encoder 1014 and a decoder 1016;”, Fig. 10: 
(image: media_image10.png),
Ronneberger Fig. 1:  
(image: media_image14.png)
]). 
determine one or more low-level semantic characteristics for the plurality of pixels by processing the feature map using the neural network low-level semantic decoder; (Examiner notes that low-level semantic characteristics are interpreted as low-level features and the low-level semantic decoder is interpreted as a decoder that processes low-level feature maps. Simantov teaches using a U-Net, which is an encoder-decoder convolutional neural network (CNN). As per “A New Era for Feature Extraction in Remotely Sensed Images by The Use of Machine Learning” by Orstavik et al., higher-level features (i.e. semantic characteristics) are extracted in deeper layers of the CNN and low-level features are extracted in earlier layers. As per Ronneberger and “Gated Convolutional Neural Network for Semantic Segmentation in High-Resolution Images” by Wang et al., the low-level feature maps from the encoder are processed by the decoder. [Simantov Fig. 10: see above, Orstavik et al. sec. 2.2 ¶3: “The convolution layer is the core building block of CNNs, and contains filters. […] Each of the filters look for certain features in the image. Such features may, for example, be a curve, an edge or a feature of a specific color. Higher level filters (filters deeper into the network) look for combinations of these simpler features. The deeper into the network, the more complex the features become (Figure 7).”, Fig. 7: 
(image: media_image7.png)
 , Wang et al. sec. 1 ¶3: “For example, U-Net [13] modifies and extends the FCN by introducing concatenation structures between the corresponding encoder and decoder layers. The concatenation structure enables the decoder layers to reuse low-level feature maps with more details to achieve a more precise pixel-wise classification.”, Ronneberger Fig. 1: see above]). 
and determine one or more high-level semantic characteristics for the plurality of pixels by processing the feature map using the one or more neural network high-level semantic decoders. (Examiner notes that high-level semantic characteristics are interpreted as high-level features and a high-level semantic decoder is interpreted as a decoder that processes high-level feature maps. Simantov teaches using a U-Net, which is an encoder-decoder convolutional neural network (CNN). As per “A New Era for Feature Extraction in Remotely Sensed Images by The Use of Machine Learning” by Orstavik et al., higher-level features (i.e. semantic characteristics) are extracted in deeper layers of the CNN and low-level features are extracted in earlier layers. As per Ronneberger, the high-level feature maps from the encoder are processed by the decoder. [Simantov Fig. 10: see above, Orstavik et al. sec. 2.2 ¶3: “The convolution layer is the core building block of CNNs, and contains filters. […] Each of the filters look for certain features in the image. Such features may, for example, be a curve, an edge or a feature of a specific color. Higher level filters (filters deeper into the network) look for combinations of these simpler features. The deeper into the network, the more complex the features become (Figure 7).”, Fig. 7: 
(image: media_image7.png)
 , Ronneberger Fig. 1: see above]). 
Simantov does not teach “A system for converting paper forms into corresponding fillable digital forms, comprising:” or “and a neural network encoder that outputs feature maps to a form conversion neural network reconstruction decoder, a neural network low-level semantic decoder, and one or more neural network high-level semantic decoders;”
Ajward teaches a method of digitizing printed documents to editable text based on combining optical character recognition (OCR) and layout reconstruction. Ajward teaches: 
A system for converting paper forms into corresponding fillable digital forms, comprising: (Ajward teaches a method for converting scanned forms into editable text (i.e. fillable digital forms). [Ajward sec. 1 ¶1: “The typical process of digitizing text document is performed by scanning the printed copies to images and converting them to editable text”, sec. 1 ¶5: “As depicted in Fig 1, the project could be divided into two phases, (1) character recognition using an OCR technique and (2) extracting and preserving the layout (formatting) information of the document. As illustrated, the outcome, an editable Sinhala document that preserve formatting could be achieved by integrating the outcome of phases one and two.”]). 

It would have been obvious to a person of ordinary skill in the art before the effective filing date of the invention to modify the encoder-decoder CNN disclosed by Simantov to include conversion from paper forms to fillable digital forms as taught by Ajward. One would be motivated to do so to help users easily modify printed documents and word-search in printed documents, as suggested by Ajward (Ajward Abstract: “Digitization of text not only allows users to easily modify and reprint printed documents, but also is a need of the day due to the use of word-search capability available at disposal in this era.”). 
Ajward does not teach “and a neural network encoder that outputs feature maps to a form conversion neural network reconstruction decoder, a neural network low-level semantic decoder, and one or more neural network high-level semantic decoders;”
Zhang teaches a method for classification using neural networks. Zhang teaches:
and a neural network encoder that outputs feature maps to a form conversion neural network reconstruction decoder, a neural network low-level semantic decoder, and one or more neural network high-level semantic decoders; (Examiner notes that a “form conversion neural network” is interpreted as a convolutional neural network as per ¶38 of the instant specification. Zhang teaches a reconstruction decoder. Zhang teaches augmenting an encoder-decoder convolutional neural network (i.e. a conversion neural network) with an auxiliary decoding pathway used for reconstruction. Examiner notes that the low-level and high-level semantic decoders are interpreted as decoders that process low-level and high-level feature maps, respectively. As per “A New Era for Feature Extraction in Remotely Sensed Images by The Use of Machine Learning” by Orstavik et al., higher-level features (i.e. semantic characteristics) are extracted in deeper layers of CNNs and low-level features are extracted in earlier layers. Thus, Zhang teaches a decoder that processes low-level and high-level features. [Zhang sec. 1 ¶4: “we augment challenge-winning neural networks with decoding pathways for reconstruction, demonstrating the feasibility of improving high-capacity networks for largescale image classification. Specifically, we take a segment of the classification network as the encoder and use the mirrored architecture as the decoding pathway to build several autoencoder variants.”, Orstavik et al. sec. 2.2 ¶3: “The convolution layer is the core building block of CNNs, and contains filters. […] Each of the filters look for certain features in the image. Such features may, for example, be a curve, an edge or a feature of a specific color. Higher level filters (filters deeper into the network) look for combinations of these simpler features. The deeper into the network, the more complex the features become (Figure 7).”, Fig. 7: 
(image: media_image15.png)
]). 
The combined system of Simantov and Ajward, Zhang, and the instant application are analogous art because they are all directed to using neural networks for classification.
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the invention to modify the document digitization method disclosed by Simantov in view of Ajward to include a convolutional neural network reconstruction decoder as taught by Zhang. One would be motivated to do so to improve performance by reducing error, as suggested by Zhang (Zhang sec. 5 ¶1: “We proposed a simple and effective way to incorporate unsupervised objectives into large-scale classification network learning by augmenting the existing network with reconstructive decoding pathways. […] This method improved the performance of the 16-layer VGGNet, one of the best existing 

Regarding Claim 16:
Simantov in view of Ajward and Zhang teaches “The system of claim 15” as seen above. 
Simantov further teaches:
wherein the instructions, when executed by the at least one server, cause the system to: (Simantov teaches non-volatile storage storing instructions that are executed by a distributed computing system (which is comprised of servers). [Simantov col 6 lines 6-15: “In various embodiments of the disclosure, one or more tasks as described herein may be performed by a data processor, such as a computing platform or distributed computing system for executing a plurality of instructions. Optionally, the data processor includes or accesses a volatile memory for storing instructions, data or the like. Additionally, or alternatively, the data processor may access a non-volatile storage, for example, a magnetic hard-disk, flash-drive, removable media or the like, for storing instructions and/or data.”]). 
generate the feature map by processing the digitized paper form comprising the plurality of pixels using the neural network encoder by generating a training feature map by processing the training digitized paper form using the neural network encoder; (Simantov teaches using a U-Net to process digitized receipts to determine semantic characteristics. U-Nets were first introduced and described in “U-Net: Convolutional Networks for Biomedical Image Segmentation” by Ronneberger et al. As shown in Ronneberger Fig. 1, the layers of the encoder (on the left side of the U) generate feature maps (shown as the blue boxes in Fig. 1). Before a U-Net can be used, it is first trained. Training the U-Net involves generating training feature maps by processing training data images. The encoder generates the training feature maps. [Ronneberger sec. 3 ¶1: “The input images and their corresponding segmentation maps are used to train the network with the stochastic gradient descent implementation of Caffe [6].”, sec. 3 ¶5: “Ideally the initial weights should be adapted such that each feature map in the network has approximately unit variance.”, Fig. 1: 
(image: media_image14.png)
]). 
determine one or more low-level semantic characteristics for the plurality of pixels by processing the feature map using the neural network low-level semantic decoder by processing the training feature map using the neural network low-level semantic decoder and comparing a low-level output to a low-level ground truth to determine a low-level loss; (Examiner notes that “low-level semantic characteristics” is interpreted to be low-level features. “Low-level semantic decoder” is interpreted to be a decoder that processes low-level feature maps. “Low-level output” is interpreted to be an output based on low-level feature maps. “Low-level ground truth” is interpreted to be a ground truth that shows at least low-level features such as edges. “Low-level loss” is interpreted to be a loss that is based on at least low-level features. Simantov teaches using a U-Net, an encoder-decoder convolutional neural network. U-Nets were first introduced and described in “U-Net: Convolutional Networks for Biomedical Image Segmentation” by Ronneberger et al. As per “A New Era for Feature Extraction in Remotely Sensed Images by The Use of Machine Learning” by Orstavik et al., higher-level features (i.e. semantic characteristics) are extracted in deeper layers of the CNN and low-level features are extracted in earlier layers. As per “Gated Convolutional Neural Network for Semantic Segmentation in High-Resolution Images” by Wang et al. and Fig. 1 of Ronneberger et al., the low-level feature maps from the encoder are processed by the decoder layers, thus making it a low-level decoder. Before a U-Net can be used, it is first trained. Training the U-Net involves the decoder processing training feature maps from the encoder. The output of the decoder is compared with the ground truth (shown in Ronneberger Fig. 3, which shows at least low-level features such as edges) to determine the loss. [Simantov Fig. 10: see above, Orstavik et al. sec. 2.2 ¶3: “The convolution layer is the core building block of CNNs, and contains filters. […] Each of the filters look for certain features in the image. Such features may, for example, be a curve, an edge or a feature of a specific color. Higher level filters (filters deeper into the network) look for combinations of these simpler features. The deeper into the network, the more complex the features become (Figure 7).”, Fig. 7: see above, Wang et al. sec. 1 ¶3: “For example, U-Net [13] modifies and extends the FCN by introducing concatenation structures between the corresponding encoder and decoder layers. The concatenation structure enables the decoder layers to reuse low-level feature maps with more details to achieve a more precise pixel-wise classification.”, Ronneberger sec. 3 ¶1: “The input images and their corresponding segmentation maps are used to train the network with the stochastic gradient descent implementation of Caffe [6].”, sec. 3 ¶2: “The energy function is computed by a pixel-wise soft-max over the final feature map combined with the cross entropy loss function.”, Fig. 3:
(image: media_image16.png)
]). 
determine one or more high-level semantic characteristics for the plurality of pixels by processing the feature map using the one or more neural network high-level semantic decoders by processing the training feature map using the one or more neural network high-level semantic decoders and comparing one or more high-level outputs to a corresponding high-level ground truth to determine one or more high-level losses; (Examiner notes that “high-level semantic characteristics” is interpreted to be high-level features. “High-level semantic decoder” is interpreted to be a decoder that processes high-level feature maps. “High-level output” is interpreted to be an output based on high-level feature maps. “High-level ground truth” is interpreted to be a ground truth that shows at least high-level features. “High-level loss” is interpreted to be a loss that is based on at least high-level features. Simantov teaches using a U-Net, an encoder-decoder convolutional neural network. U-Nets were first introduced and described in “U-Net: Convolutional Networks for Biomedical Image Segmentation” by Ronneberger et al. As per “A New Era for Feature Extraction in Remotely Sensed Images by The Use of Machine Learning” by Orstavik et al., higher-level features (i.e. semantic characteristics) are extracted in deeper layers of the CNN and low-level features are extracted in earlier layers. As per “Gated Convolutional Neural Network for Semantic Segmentation in High-Resolution Images” by Wang et al. and Fig. 1 of Ronneberger et al., the high-level feature maps from the encoder are processed by the decoder layers, thus making it a high-level decoder. Before a U-Net can be used, it is first trained. Training the U-Net involves the decoder processing training feature maps from the encoder. The output of the decoder is compared with the ground truth (shown in Ronneberger Fig. 3, which shows at least high-level features) to determine the loss. [Simantov Fig. 10: see above, Orstavik et al. sec. 2.2 ¶3: “The convolution layer is the core building block of CNNs, and contains filters. […] Each of the filters look for certain features in the image. Such features may, for example, be a curve, an edge or a feature of a specific color. Higher level filters (filters deeper into the network) look for combinations of these simpler features. The deeper into the network, the more complex the features become (Figure 7).”, Fig. 7: see above, Wang et al. sec. 1 ¶3: “For example, U-Net [13] modifies and extends the FCN by introducing concatenation structures between the corresponding encoder and decoder layers. The concatenation structure enables the decoder layers to reuse low-level feature maps with more details to achieve a more precise pixel-wise classification.”, Ronneberger sec. 3 ¶1: “The input images and their corresponding segmentation maps are used to train the network with the stochastic gradient descent implementation of Caffe [6].”, sec. 3 ¶2: “The energy function is computed by a pixel-wise soft-max over the final feature map combined with the cross entropy loss function.”, Fig. 3:
(image: media_image16.png)
]). 
generate a combined loss comprising the low-level loss and the one or more high-level losses; (Examiner notes that “low-level loss” and “high-level loss” are interpreted to be losses that depend on at least low-level features and high-level features, respectively. Simantov teaches U-Net, an encoder-decoder CNN. As per “A New Era for Feature Extraction in Remotely Sensed Images by The Use of Machine Learning” by Orstavik et al., higher-level features (i.e. semantic characteristics) are extracted in deeper layers of the CNN and low-level features are extracted in earlier layers. Thus, the per-pixel loss taught by Simantov would be based on both high- and low-level features. Simantov teaches combining the losses between the classification result and the ground truth over all the pixels. [Simantov col 22 lines 15-20: “The U-Net network architecture 1000 may be trained using a softmax cross-entropy loss over the per-pixel field-class prediction.”, Ronneberger sec. 3 ¶2: “The cross entropy then penalizes at each position the deviation of $p_{l(x)}(x)$ from 1 using
$E = \sum_{x \in \Omega} w(x) \log(p_{l(x)}(x))$”]).
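For illustration only, the weighted pixel-wise softmax cross entropy quoted from Ronneberger above can be sketched as follows. This sketch is not taken from any cited reference: the function names, the list-based data layout, and the use of the negative log-likelihood sign convention (so that a lower value indicates a better prediction) are assumptions made for the example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over one pixel's class scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_pixel_cross_entropy(scores, labels, weights):
    """Sum over pixels x of -w(x) * log(p_{l(x)}(x)).

    scores:  one list of class scores per pixel (hypothetical layout)
    labels:  true class index l(x) for each pixel
    weights: weight w(x) for each pixel
    """
    loss = 0.0
    for pixel_scores, label, weight in zip(scores, labels, weights):
        p = softmax(pixel_scores)
        # Penalize the deviation of the true-class probability from 1.
        loss -= weight * math.log(p[label])
    return loss
```

Under these assumptions, a pixel whose true class receives a high score contributes little to the sum, while a pixel with a larger weight w(x) contributes proportionally more.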
and [back propagate] the combined loss to modify parameters of the neural network low-level semantic decoder, the one or more neural network high-level semantic decoders, and the neural network encoder. (Simantov teaches using the loss to modify the shared weights (i.e. parameters) of the encoder and decoder. [Simantov col 22 lines 3-11: “The system architecture 1000 consists of a set of repeated convolutional and max pooling steps, followed by a set of upscaling steps with transpose convolutions that share weights with their convolution counterparts. The U-Net's output (a set of 35 features per pixel) from the U-Net component 1012 is passed through a 1×1 fully convolutional layer that outputs a set of nout predictions per pixel, where nout is the number of output classes (the possible field labels).”, col 22 lines 15-20:“The U-Net network architecture 1000 may be trained using a softmax cross-entropy loss over the per-pixel field-class prediction.”, col 22 lines 24-27: “The U-Net network architecture 1000 is further trained using a softmax cross-entropy loss over the per-pixel field-class prediction, where contribution of each pixel to the loss function is further weighted by its correct class.”]).

Ajward teaches:
and back propagate the combined loss to modify parameters of the neural network [low-level semantic decoder, the one or more neural network high-level semantic decoders, and the neural network encoder]. (Ajward teaches using back propagation to adjust the weights (i.e. parameters) of a neural network. As seen above, the determined low-level/high-level semantic characteristics and the encoder-decoder architecture are taught by Simantov, so this limitation is taught by the combination of Simantov and Ajward. [Ajward sec. III. B ¶1: “Using back-propagation neural network errors can be propagated backward through the network to control weight adjustment and by the feed forward information moves in only one direction. So the result could be obtained efficiently and with higher accuracy.”]). 
It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to modify Simantov with the teachings of Ajward and Zhang for at least the same reasons as discussed above in claim 15.
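
For illustration only, the limitation of back propagating a combined loss into a shared encoder and multiple decoders can be sketched with a toy linear model. This sketch is not the architecture of Simantov, Ajward, Zhang, or the instant application: the linear encoder, the two single-output heads, the squared-error losses, and all names are assumptions chosen to keep the gradients easy to verify by hand.

```python
def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


def train_step(enc, low, high, x, t_low, t_high, lr=0.01):
    """One gradient step on combined = low_loss + high_loss.

    enc:  shared encoder weights, a list of rows (hidden x input)
    low:  weights of the hypothetical low-level head (length hidden)
    high: weights of the hypothetical high-level head (length hidden)
    Returns the combined loss computed before the update.
    """
    # Forward pass: shared features, then one output per head.
    h = [dot(row, x) for row in enc]
    err_low = dot(low, h) - t_low
    err_high = dot(high, h) - t_high
    combined = err_low ** 2 + err_high ** 2

    # Backward pass: both heads send gradient into the shared features.
    grad_h = [2 * err_low * wl + 2 * err_high * wh
              for wl, wh in zip(low, high)]

    # Update each head from its own loss term.
    for j, hj in enumerate(h):
        low[j] -= lr * 2 * err_low * hj
        high[j] -= lr * 2 * err_high * hj

    # Update the shared encoder from the summed gradient.
    for j, row in enumerate(enc):
        for i, xi in enumerate(x):
            row[i] -= lr * grad_h[j] * xi
    return combined
```

In this sketch the encoder weights receive gradient from both loss terms, whereas each head is updated only from its own term, which is the sense in which a combined loss modifies the parameters of all three components.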

Regarding Claim 17:
Simantov in view of Ajward and Zhang teaches “The system of claim 16” as seen above. 
Simantov further teaches: 
wherein the low-level output comprises a low-level semantic segmentation map (Examiner notes that the low-level output is interpreted to be an output that is based on at least low-level features and the low-level semantic segmentation map is interpreted to be a segmentation map that shows at least low-level features such as edges. Simantov teaches using a U-Net, which is an encoder-decoder convolutional neural network (CNN). U-Nets were first introduced and described in “U-Net: Convolutional Networks for Biomedical Image Segmentation” by Ronneberger et al. (hereinafter Ronneberger) and are the architecture used by Simantov. The output of a U-Net is a segmentation map, an example of which is shown in Fig. 3 of Ronneberger. The segmentation map shows edges, which are low-level features. [Simantov col 21 lines 66-67: “a U-Net comprising an encoder 1014 and a decoder 1016;”, Ronneberger Fig. 3: 
(image: media_image16.png)
]). 
and the one or more high-level outputs comprise one or more high-level semantic segmentation maps. (Examiner notes that a high-level output is interpreted to be an output that is based on at least high-level features and a high-level semantic segmentation map is interpreted to be a segmentation map that shows at least high-level features. Simantov teaches using a U-Net, which is an encoder-decoder convolutional neural network (CNN). U-Nets were first introduced and described in “U-Net: Convolutional Networks for Biomedical Image Segmentation” by Ronneberger et al. (hereinafter Ronneberger) and are the architecture used by Simantov. The output of a U-Net is a segmentation map, an example of which is shown in Fig. 3 of Ronneberger. The segmentation map shows objects (the cells), which are high-level features. [Simantov col 21 lines 66-67: “a U-Net comprising an encoder 1014 and a decoder 1016;”, Ronneberger Fig. 3: 
(image: media_image16.png)
]). 
It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to modify Simantov with the teachings of Ajward and Zhang for at least the same reasons as discussed above in claim 15.

Regarding Claim 18:
Simantov in view of Ajward and Zhang teaches “The system of claim 16” as seen above. 
Simantov further teaches:
further comprising instructions that, when executed by the at least one server, cause the system to (Simantov teaches non-volatile storage storing instructions that are executed by a distributed computing system (which is comprised of servers). [Simantov col 6 lines 6-15: “In various embodiments of the disclosure, one or more tasks as described herein may be performed by a data processor, such as a computing platform or distributed computing system for executing a plurality of instructions. Optionally, the data processor includes or accesses a volatile memory for storing instructions, data or the like. Additionally, or alternatively, the data processor may access a non-volatile storage, for example, a magnetic hard-disk, flash-drive, removable media or the like, for storing instructions and/or data.”]). 
Zhang teaches:
process the training feature map using the neural network reconstruction decoder and comparing a reconstruction output to a reconstruction ground truth to determine a reconstruction loss, (Zhang teaches a reconstruction loss calculated by taking the difference between the reconstruction output and the original training image. The training feature map is processed by the decoder to obtain the reconstruction output. [Zhang sec. 4.2 ¶7: “Assuming the ability of preserving information as a helpful property for deep neural network, we took the reconstruction loss as an auxiliary objective function for training the classification network”, sec. 3.2 ¶3: “we propose the “SAE-all” model by replacing the unsupervised loss by $U_{\text{SAE-all}}(x) = \sum_{l=0}^{L-1} \gamma \| \hat{a}_l - a_l \|_2^2$”]). 
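For illustration only, the layer-wise reconstruction loss quoted from Zhang above can be sketched as follows. This sketch is not Zhang's implementation: the function name, the representation of each layer's activation as a flat list of numbers, and the use of a single weight gamma for every layer are assumptions made for the example.

```python
def reconstruction_loss(activations, reconstructions, gamma=1.0):
    """Sum over layers l of gamma * ||a_hat_l - a_l||_2^2.

    activations:     one flat list a_l per layer (hypothetical layout)
    reconstructions: one flat list a_hat_l per layer, same shapes
    gamma:           weight applied to every layer's term
    """
    total = 0.0
    for a, a_hat in zip(activations, reconstructions):
        # Squared L2 distance between a layer and its reconstruction.
        total += gamma * sum((r - v) ** 2 for v, r in zip(a, a_hat))
    return total
```

Under these assumptions the loss is zero when every layer is reconstructed exactly and grows with the squared error of each layer.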
wherein the combined loss further comprises the reconstruction loss and the instructions, (Zhang teaches the combined loss function to be the classification loss combined with the reconstruction loss. [Zhang sec. 4.2 ¶7: “Assuming the ability of preserving information as a helpful property for deep neural network, we took the reconstruction loss as an auxiliary objective function for training the classification network”, sec. 3.1 ¶3: “A solution to both problems is to incorporate auxiliary unsupervised training objectives to the intermediate layers. More specifically, the objective function becomes $\frac{1}{N}\sum_{i=1}^{N}\left(C(x_i, y_i) + \lambda U(x_i)\right)$ where $U(\cdot)$ is the unsupervised objective function associating with one or more auxiliary pathways that are attached to the convolution-pooling macro-layers in the original classification network.”]). 
when executed by the at least one server, [back propagate] the combined loss to further modify parameters of the reconstruction decoder. (Zhang teaches using the combined loss to modify the weights (i.e. parameters) of the reconstruction decoder. [Zhang sec. 4.1 ¶2-6: “1. We initialized the encoding pathway with the pretrained classification network, and the decoding pathways with Gaussian random initialization. 2. For any variant of the augmented network, we fixed the parameters for the classification pathway and trained the layer-wise decoding pathways of the SAElayerwise network. […] Up to Step 3, we trained the decoding pathways with the classification pathway fixed. For all the four steps, we trained the networks by mini-batch stochastic gradient descent (SGD) with the momentum 0.9.”]). 
Ajward further teaches:
when executed by the at least one server, back propagate the [combined] loss to further modify parameters [of the reconstruction decoder]. (Ajward teaches back propagation to modify parameters of the neural network. As seen above, the combined loss and reconstruction decoder were taught by Zhang and this limitation is taught by the combination of Zhang and Ajward. [Ajward sec. III. B ¶1: “Using back-propagation neural network errors can be propagated backward through the network to control weight adjustment and by the feed forward information moves in only one direction. So the result could be obtained efficiently and with higher accuracy.”]). 
It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to modify Simantov with the teachings of Ajward and Zhang for at least the same reasons as discussed above in claim 15.
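For illustration only, and not as part of the cited references, the combined objective quoted from Zhang sec. 3.1 and a single parameter update of a reconstruction decoder may be sketched as follows. The sketch assumes a cross-entropy classification loss for C, a squared-error reconstruction loss for U, a hypothetical one-weight decoder (x_hat = w * h), and plain gradient descent rather than the momentum SGD described by Zhang; all function and variable names are hypothetical.

```python
import math

def classification_loss(probs, label):
    """Cross-entropy C(x_i, y_i) for one sample, given predicted class probabilities."""
    return -math.log(probs[label])

def reconstruction_loss(x, x_hat):
    """Squared-error U(x_i) between an input vector and its reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat))

def combined_loss(batch, lam):
    """(1/N) * sum_i (C(x_i, y_i) + lam * U(x_i)).

    batch: list of (probs, label, x, x_hat) tuples, one per sample.
    """
    total = 0.0
    for probs, label, x, x_hat in batch:
        total += classification_loss(probs, label) + lam * reconstruction_loss(x, x_hat)
    return total / len(batch)

def decoder_sgd_step(w, codes, inputs, lam, lr):
    """One gradient-descent step on a scalar decoder weight w (assumed x_hat = w * h).

    Only the lam * U term depends on w, so the gradient of the combined loss
    with respect to w is lam * mean(dU/dw), where dU/dw = -2 * h * (x - w * h).
    """
    grad = lam * sum(-2.0 * h * (x - w * h) for h, x in zip(codes, inputs)) / len(codes)
    return w - lr * grad
```

In this sketch the classification term contributes no gradient to the decoder weight, so the decoder parameters are driven by the weighted reconstruction term of the combined loss.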

Regarding Claim 19:
Simantov in view of Ajward and Zhang teaches “The system of claim 15” 
Simantov further teaches:
wherein the instructions, when executed by the at least one server, cause the system to: (Simantov teaches non-volatile storage storing instructions that are executed by a distributed computing system (which comprises servers). [Simantov col 6 lines 6-15: “In various embodiments of the disclosure, one or more tasks as described herein may be performed by a data processor, such as a computing platform or distributed computing system for executing a plurality of instructions. Optionally, the data processor includes or accesses a volatile memory for storing instructions, data or the like. Additionally, or alternatively, the data processor may access a non-volatile storage, for example, a magnetic hard-disk, flash-drive, removable media or the like, for storing instructions and/or data.”]). 
determine one or more low-level semantic characteristics for the plurality of pixels by processing the feature map using a trained neural network low-level semantic decoder; (Examiner notes that low-level semantic characteristics are interpreted as low-level features and low-level semantic decoder is interpreted as a decoder that processes low-level feature maps. Simantov teaches using a trained U-Net, which is an encoder-decoder convolutional neural network (CNN). As per “A New Era for Feature Extraction in Remotely Sensed Images by The Use of Machine Learning” by Orstavik et al., higher-level features (i.e. semantic characteristics) are extracted in deeper layers of the CNN and low-level features are extracted in earlier layers. As per Ronneberger and “Gated Convolutional Neural Network for Semantic Segmentation in High-Resolution Images” by Wang et al., the low-level feature maps from the encoder are processed by the decoder. [Simantov Fig. 10: see above, col 22 lines 15-20: “The U-Net network architecture 1000 may be trained using a softmax cross-entropy loss over the per-pixel field-class prediction.”, Orstavik et al. sec. 2.2 ¶3: “The convolution layer is the core building block of CNNs, and contains filters. […] Each of the filters look for certain features in the image. Such features may, for example, be a curve, an edge or a feature of a specific color. Higher level filters (filters deeper into the network) look for combinations of these simpler features. The deeper into the network, the more complex the features become (Figure 7).”, Fig. 7: 
    [Image: media_image7.png (Orstavik et al. Fig. 7)]
 , Wang et al. sec. 1 ¶3: “For example, U-Net [13] modifies and extends the FCN by introducing concatenation structures between the corresponding encoder and decoder layers. The concatenation structure enables the decoder layers to reuse low-level feature maps with more details to achieve a more precise pixel-wise classification.”, Ronneberger Fig. 1: see above]). 
and determine one or more high-level semantic characteristics for the plurality of pixels by processing the feature map using one or more trained neural network high-level semantic decoders. (Examiner notes that high-level semantic characteristics are interpreted as high-level features and high-level semantic decoder is interpreted as a decoder that processes high-level feature maps. Simantov teaches using a trained U-Net, which is an encoder-decoder convolutional neural network (CNN). As per “A New Era for Feature Extraction in Remotely Sensed Images by The Use of Machine Learning” by Orstavik et al., higher-level features (i.e. semantic characteristics) are extracted in deeper layers of the CNN and low-level features are extracted in earlier layers. As per Ronneberger, the high-level feature maps from the encoder are processed by the decoder. [Simantov Fig. 10: see above, col 22 lines 15-20: “The U-Net network architecture 1000 may be trained using a softmax cross-entropy loss over the per-pixel field-class prediction.”, Orstavik et al. sec. 2.2 ¶3: “The convolution layer is the core building block of CNNs, and contains filters. […] Each of the filters look for certain features in the image. Such features may, for example, be a curve, an edge or a feature of a specific color. Higher level filters (filters deeper into the network) look for combinations of these simpler features. The deeper into the network, the more complex the features become (Figure 7).”, Fig. 7: see above, Ronneberger Fig. 1: see above]). 
It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to modify Simantov with the teachings of Ajward and Zhang for at least the same reasons as discussed above in claim 15.
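For illustration only, and not as part of the cited references, the concatenation structure described in the Wang et al. quotation (decoder layers reusing low-level encoder feature maps alongside deeper, high-level feature maps) may be sketched as follows. Feature maps are assumed to be nested lists shaped [channels][height][width]; nearest-neighbour upsampling is swapped in here for the up-convolution of an actual U-Net, and all function names are hypothetical.

```python
def upsample2x(fmap):
    """Nearest-neighbour 2x upsampling of a feature map shaped [channels][height][width]."""
    out = []
    for channel in fmap:
        rows = []
        for row in channel:
            wide = [v for v in row for _ in range(2)]  # repeat each column twice
            rows.append(wide)
            rows.append(list(wide))                    # repeat each row twice
        out.append(rows)
    return out

def skip_concat(low_level, high_level):
    """Concatenate an early (low-level) encoder map with an upsampled deep
    (high-level) map along the channel axis, as a decoder input."""
    up = upsample2x(high_level)
    if len(up[0]) != len(low_level[0]) or len(up[0][0]) != len(low_level[0][0]):
        raise ValueError("spatial sizes must match after upsampling")
    return low_level + up

# A 1-channel 2x2 low-level map joined with a 1-channel 1x1 high-level map.
merged = skip_concat([[[1, 2], [3, 4]]], [[[9]]])
```

Under these assumptions, the merged map carries both the fine-detail channel from the early layer and the upsampled channel from the deeper layer, which is the input a decoder layer would then process.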

Regarding Claim 20:
Simantov in view of Ajward and Zhang teaches “The system of claim 15”
Simantov further teaches:
further comprising instructions that, when executed by the at least one server, cause the system to (Simantov teaches non-volatile storage storing instructions that are executed by a distributed computing system (which comprises servers). [Simantov col 6 lines 6-15: “In various embodiments of the disclosure, one or more tasks as described herein may be performed by a data processor, such as a computing platform or distributed computing system for executing a plurality of instructions. Optionally, the data processor includes or accesses a volatile memory for storing instructions, data or the like. Additionally, or alternatively, the data processor may access a non-volatile storage, for example, a magnetic hard-disk, flash-drive, removable media or the like, for storing instructions and/or data.”]). 
[generate a fillable digital form] based on the determined one or more low-level semantic characteristics and the determined one or more high-level semantic characteristics. (Examiner notes that low-level and high-level semantic characteristics are interpreted as low-level and high-level features, respectively. Simantov teaches determining low-level and high-level features by using a U-Net, which is an encoder-decoder convolutional neural network (CNN). As per “A New Era for Feature Extraction in Remotely Sensed Images by The Use of Machine Learning” by Orstavik et al., higher-level features (i.e. semantic characteristics) are extracted in deeper layers of the CNN and low-level features are extracted in earlier layers. [Simantov Fig. 10: see above, col 22 lines 15-20: “The U-Net network architecture 1000 may be trained using a softmax cross-entropy loss over the per-pixel field-class prediction.”, Orstavik et al. sec. 2.2 ¶3: “The convolution layer is the core building block of CNNs, and contains filters. […] Each of the filters look for certain features in the image. Such features may, for example, be a curve, an edge or a feature of a specific color. Higher level filters (filters deeper into the network) look for combinations of these simpler features. The deeper into the network, the more complex the features become (Figure 7).”, Fig. 7: 
    [Image: media_image7.png (Orstavik et al. Fig. 7)]
]). 
Ajward further teaches:
 generate a fillable digital form based on the determined one or more [low-level] semantic characteristics and [the determined one or more high-level] semantic characteristics. (Ajward teaches generating editable text (i.e. a fillable digital form) based on the semantic characteristics of digitized paper forms. As seen above, the determined low-level/high-level semantic characteristics were taught by Simantov and this limitation is taught by the combination of Simantov and Ajward. [Ajward sec. 1 ¶5: “As depicted in Fig 1, the project could be divided into two phases, (1) character recognition using an OCR technique and (2) extracting and preserving the layout (formatting) information of the document. As illustrated, the outcome, an editable Sinhala document that preserve formatting could be achieved by integrating the outcome of phases one and two.”, Fig. 1: 
    [Image: media_image3.png (Ajward Fig. 1)]
]). 
It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to modify Simantov with the teachings of Ajward and Zhang for at least the same reasons as discussed above in claim 15.

Prior Art of Record
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. 
“Fully Convolutional Neural Networks for Page Segmentation of Historical Document Images” by Wick et al. teaches using a convolutional neural network comprising an encoder and a decoder to segment documents by classifying every pixel of an image (Wick et al. sec. I ¶2-3: “The basic approach 
“DeepSIC: Deep Semantic Image Compression” by Luo et al. teaches an encoder-decoder neural network in which the encoder generates feature maps and the decoder consists of two branches: image reconstruction and semantic analysis (Luo et al. Abstract: “In this paper, we propose a concept called Deep Semantic Image Compression (DeepSIC) and put forward two novel architectures that aim to reconstruct the compressed image and generate corresponding semantic representations at the same time.”, Fig. 1, Fig. 2).  

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Somie Park whose telephone number is (571)272-1056. The examiner can normally be reached 9:00am - 5:00pm, Monday-Friday.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Ann Lo, can be reached at (571)272-9767. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and 





/SOMIE PARK/Examiner, Art Unit 2126     
/ANN J LO/Supervisory Patent Examiner, Art Unit 2126