DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Amendment
The amendment filed 12/14/2020 has been entered. Claims 1, 2, 4-17 and 22-24 remain pending in the application. 

Response to Arguments
Applicant’s arguments, filed 12/14/2020, with respect to the rejections of claims 1 and 22 under 35 U.S.C. 103 have been fully considered and are not persuasive. Applicant’s arguments with respect to the rejections of claims 15 and 24 under 35 U.S.C. 103 have been fully considered and are persuasive in view of the amendments. Therefore, those rejections have been withdrawn. However, upon further consideration, a new ground(s) of rejection is made over McKinney et al. (US Patent 9,411,790) in view of Forman et al. (US Pub. 2004/0024769) and further in view of Filimonova (US Pub. 2016/0307067).

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.


Claims 1, 2, 4-7, 9-14 and 22-23 are rejected under 35 U.S.C. 103 as being unpatentable over McKinney et al. (US Patent 9,411,790) in view of Forman et al. (US Pub. 2004/0024769).
As per claim 1, McKinney teaches a document structure extraction method [abstract, method for generating structured documents] comprising: 
receiving, by a document structure analytics server, an untagged document [Col. 4, lines 41-52, receiving unstructured digital source content; Col. 4, lines 13-15, digital source content may include image files, XML files, portable document files (PDF), etc.] that comprises a plurality of document parts [Col. 4, lines 48-49, header and body text], wherein certain of the document parts have a visual appearance [Col. 4, lines 50-52, font size, font type, bold, etc.] that is defined by formatting information included in the untagged document [Col. 4, lines 41-52, parses the digital source content for source data elements, such as text, images, video, executable-code segments, and the like, “these source data elements may include one or more attributes that define or represent a function for a particular source data element with a structured document. For example, a function may include "header", which defines a set of characters which are typically placed on top of body text, the header text includes a font size that is larger than that of the body text, and may be bold or have a font type that is different than the body text”]; 
receiving, by the document structure analytics server, a command to generate a table of contents for the untagged document [Col. 2, lines 18-19, “a method for converting unstructured content into a structured document”; Col. 6, lines 29-34, “a set of words may include attributes such as: (a) a font size that is 40 percent larger than other text element on the same page; and (b) an italicized font. Knowing these attributes the assignment module 140 may tag the set of words (e.g., source data element) with an identifier, such as "Title Text"”; Col. 8, lines 2-9, “the assignment module 140 may assign or tag each source data element with a particular identifier … an identifier may include "body" or "header"”; Col. 13, lines 62-64, “The arrangement of the structured document, including constituent source elements being tagged with identifiers (table of contents)”; Fig. 4, Col. 2, line 20, “receiving unstructured digital source content”; It can be seen that when the system receives unstructured content, it assigns/tags each source data element in order to transform the unstructured content into a structured document; thus, receiving the unstructured content is equivalent to receiving the command to generate a structured document (table of contents) for the unstructured content];
paragraph 0034 in the specification of the Application recites “Each of tagged documents ... includes metadata that assigns a particular categorization to a particular document part. Such metadata may be provided in the form of a defined tagging structure, an existing table of contents … or any other construct that associates a particular categorization with a particular document part”.
paragraph 0041 in the specification of the Application recites “applying predictive model 264 to identify the various headings, sub-headings, and so forth in a given document, and then build an appropriate table of contents based on such identification.”.
in response to receiving the command to generate the table of contents [Col. 2, lines 18-19, “a method for converting unstructured content into a structured document”; Fig. 4, Col. 2, line 20, “receiving unstructured digital source content”], invoking a document tagging process that comprises: 
extracting at least a portion of the formatting information from the untagged document [Col. 6, lines 7-34, receiving the digital source content, extracting source data elements from the source content, for example, a set of words and the font size, type]; 
for each of two or more of the plurality of document parts [Col. 4, lines 48-49, header and body text], generating one or more feature-value pairs [Col. 6, lines 29-34, font size is 40 percent larger (font is a feature, size or type is a value)] using the extracted formatting information [Col. 6, line 34, a set of words], wherein each of the generated feature-value pairs characterizes the visual appearance of the corresponding document part by associating a particular value with a particular formatting feature [Col. 6, lines 29-34, “a set of words may include attributes such as: (a) a font size that is 40 percent larger than other text element on the same page; and (b) an italicized font. Knowing these attributes the assignment module 140 may tag the set of words (e.g., source data element) with an identifier, such as "Title Text"”];
using the particular predictive model [Col. 6, lines 34-35, the assignment module 140] to predict a categorization for each of the two or more document parts [Col. 1, lines 27-28, title text and body text] that form part of the untagged document based on the corresponding one or more feature-value pairs [Col. 6, lines 29-34, “a set of words may include attributes such as: (a) a font size that is 40 percent larger than other text element on the same page; and (b) an italicized font. Knowing these attributes the assignment module 140 may tag the set of words (e.g., source data element) with an identifier, such as "Title Text"”; Col. 8, lines 2-9, “the assignment module 140 may assign or tag each source data element with a particular identifier. More specifically, the assignment module 140 may compare a weighting of attributes for a source data element to a plurality of identifiers that define specific content types or functions for an element within a page or a document. For example, an identifier may include "body" or "header" where body text is typically smaller in size relative to header text”], wherein the particular predictive model “…” make predictions based on a collection of categorized feature-value pairs [Col. 6, lines 29-34, font is a feature, size or type is a value] aggregated from, and characterizing document parts included in, the corpus of tagged training documents [Col. 6, lines 32-38, “knowing these attributes (feature and value) the assignment module 140 may tag the set of words (e.g., source data element) with an identifier, such as "Title Text". The assignment module 140 tags this element based upon comparisons (comparing the attributes) between the source data element and previously tagged source data elements in a training corpus”; Col. 11, lines 61-67, “Each time a source container is processed by the system 105 to extract, process, and tag source data elements with identifiers, the tagged source data elements may be stored in the database 150 in a training corpus”]; and
defining tag metadata that associates each of the two or more document parts with the corresponding predicted categorization generated by the particular predictive model [col. 4, lines 53-61, determine the attributes of all source data elements in the source container and tag each of these elements with their corresponding attributes; Col. 6, lines 32-34, “the assignment module 140 may tag the set of words (e.g., source data element) with an identifier, such as "Title Text"”; Col. 5, lines 64-65, “an identifier may also herein be referred to as a classification for a source data element”; Col. 8, lines 2-9, “the assignment module 140 may assign or tag each source data element with a particular identifier … an identifier may include "body" or "header" where body text is typically smaller in size relative to header text”].  
generating the table of contents based on the defined tag metadata [Col. 2, lines 18-19, “a method for converting unstructured content into a structured document”; Col. 6, lines 29-34, “a set of words may include attributes such as: (a) a font size that is 40 percent larger than other text element on the same page; and (b) an italicized font. Knowing these attributes the assignment module 140 may tag the set of words (e.g., source data element) with an identifier, such as "Title Text"”; Col. 8, lines 2-9, “the assignment module 140 may assign or tag each source data element with a particular identifier … an identifier may include "body" or "header"”; Col. 13, lines 62-64, “The arrangement of the structured document, including constituent source elements being tagged with identifiers (table of contents)”], wherein the table of contents correlates a document part identified as a heading by the particular predictive model with a location of the heading within the untagged document [Col. 12, lines 1-18, “each of the exemplary data elements preferably includes one or more attributes that define an identifier. Thus, by comparison, if the attributes of the source data element substantially match the attributes of an exemplary data element, the source data element may be tagged with the same identifier … if the source data element is compared to exemplary data elements that include a "header" identifier … If the size and location of the source data elements are identical, the system 105 may proceed with tagging the source data element with the header identifier”]; and  
modifying the untagged document to include the generated table of contents [Col. 2, lines 18-19, “converting unstructured content into a structured document”].  
McKinney does not teach
identifying a document type category to which the untagged document belongs; 
making a selection of a particular predictive model, from amongst a plurality of predictive models hosted by the document structure analytics server, wherein the selection is made based on the particular predictive model having been trained using a corpus of tagged training documents belonging to the identified document type category to which the untagged document belongs, and wherein each of the predictive models is configured to categorize document parts for documents sharing a common document type categorization for a respective predictive model; 
the predictive model applies a machine learning algorithm to make predictions.
Forman teaches
identifying a document type category to which the untagged document belongs [Fig. 5, paragraph 0045, “assume that books are being categorized into dozens of categories. Assume that category 504A in FIG. 5 is "fiction" and category 504B is "non-fiction," … If it is known that books from a particular technical publisher (e.g., Publisher X) are nonfiction (i.e., we have a prior that books from Publisher X belong in any of the non-fiction topics, but not the fiction topics) … Thus, when training the top categorizer 500 to choose between categories 504A and 504B, all of the books from Publisher X can be used as training examples for category 504B”]; 

[Image: media_image1.png, greyscale]

making a selection of a particular predictive model [Fig. 5, sub-categorizer 502A], from amongst a plurality of predictive models [Fig. 5, sub-categorizer 502A, sub-categorizer 502B, …] hosted by the document structure analytics server [paragraph 0030, “processor 302 executes inducer module 306, which induces top-down hierarchical categorizer 308 based on training items (documents) with labels 402A (shown in FIG. 4) and based on training items with priors 402B (shown in FIG. 4), as described in further detail below. Processor 302 executes the induced hierarchical categorizer 308 to categorize a set of unlabeled items”], wherein the selection is made based on the particular predictive model having been trained using a corpus of tagged training documents belonging to the identified document type category to which the untagged document belongs [paragraph 0011, “Categorizers may be built manually by people authoring rules/heuristics, or else built automatically via machine learning, wherein categorizers are induced based on a large training set of items. Each item in the training set is typically labeled with its correct category assignment. The use of predefined categories implies a supervised learning approach to categorization, where already-categorized items are used as training data to build a model for categorizing new items. Appropriate labels can then be assigned automatically by the model to new, unlabeled items depending on which category they fall into”; Claim 8, “an inducer for inducing a plurality of categorizers corresponding to the plurality of categories, the inducer configured to induce each categorizer based on the features of labeled training items assigned to categories under that categorizer and based on the features of unlabeled training items with prior information representing category assignments that map to a category under that categorizer”; Fig. 5, paragraph 0045, “assume that books are being categorized into dozens of categories. Assume that category 504A in FIG. 5 is "fiction" and category 504B is "non-fiction"”; It can be seen in Fig. 5 that when the document type falls in the 504A category (fiction), the categorizer 502A is selected to perform the further process, and when the document type falls in the 504B category (non-fiction), the categorizer 502B is selected; paragraph 0043, “during training of sub-categorizer 502A, training items 402A having labels corresponding to category 504A-11 or 504A-12 would be temporarily mapped to category 504A-1, and would be training examples for category 504A-1. Similarly, training items 402A having labels corresponding to category 504A-21 or 504A-22 would be temporarily mapped to category 504A-2”; paragraph 0025, “training a categorizer 106 from a training set 102 of labeled records. Each record in training set 102 is labeled with its correct category assignment”; It can be understood that, based on the document type (in this case, “fiction”), a particular categorizer is selected (in this case, sub-categorizer 502A), and the selected sub-categorizer 502A is trained using the labeled examples associated with the document type (fiction category)], and wherein each of the predictive models is configured to categorize document parts for documents sharing a common document type categorization [non-fiction] for a respective predictive model [Fig. 5, paragraph 0045, “assume that books are being categorized into dozens of categories. Assume that category 504A in FIG. 5 is "fiction" and category 504B is "non-fiction", and that the categories 504 below category 504A specify fiction topics, and the categories below category 504B specify non-fiction topics. If it is known that books from a particular technical publisher (e.g., Publisher X) are nonfiction … when training the top categorizer 500 to choose between categories 504A and 504B, all of the books from Publisher X can be used as training examples for category 504B”; It can be seen in Fig. 5 that the categorizer 502B is selected to categorize the books if they are non-fiction; since Forman teaches “each of the predictive models is configured to categorize “…” documents sharing a common document type categorization for a respective predictive model”, and McKinney teaches the predictive model is configured to categorize document parts for documents [Col. 6, lines 29-35; Col. 1, lines 27-28; Col. 8, lines 2-9], the combination of McKinney and Forman reads on the claim limitation]; 
the predictive model applies a machine learning algorithm to make predictions [paragraph 0025, “training a categorizer 106 from a training set 102 of labeled records. Each record in training set 102 is labeled with its correct category assignment. Inducer 104 receives the training set 102 and constructs the categorizer 106 using a machine learning algorithm.”].
It would have been obvious to one of ordinary skill in the art before the effective filing date of the invention to have incorporated the process of Forman (identifying a document type category to which the untagged document belongs, and making a selection of a particular predictive model, wherein the selection is made based on the particular predictive model having been trained using a corpus of tagged training documents belonging to the identified document type category) into the method of generating structured documents of McKinney. Doing so would help increase the categorization accuracy of the categorizer with relatively minor added cost (Forman, paragraph 0081).
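
For illustration only, the combination relied upon above can be sketched as follows. This is a minimal sketch under stated assumptions, not an implementation disclosed by McKinney, Forman, or the present application: the names `DocumentPart`, `MODELS_BY_DOCUMENT_TYPE`, `build_table_of_contents`, and the two toy "models" are hypothetical stand-ins for trained categorizers. The sketch shows a model being selected by document type category, applied to the feature-value pairs of each document part, and the predicted headings being collected with their locations.

```python
from dataclasses import dataclass


@dataclass
class DocumentPart:
    text: str
    location: int   # hypothetical: page number of the part
    features: dict  # feature-value pairs, e.g. {"font_size_rank": "largest"}


# Hypothetical stand-ins for models trained on different document type categories.
def non_fiction_model(features):
    return "heading" if features.get("font_size_rank") == "largest" else "body"


def fiction_model(features):
    return "heading" if features.get("italic") else "body"


# One model per document type category (assumed registry, for illustration).
MODELS_BY_DOCUMENT_TYPE = {
    "non-fiction": non_fiction_model,
    "fiction": fiction_model,
}


def build_table_of_contents(parts, document_type):
    """Select the model for the document type, tag each part, list headings."""
    model = MODELS_BY_DOCUMENT_TYPE[document_type]
    tag_metadata = [(part, model(part.features)) for part in parts]
    return [
        (part.text, part.location)
        for part, category in tag_metadata
        if category == "heading"
    ]
```

In this sketch the same document parts may yield different tables of contents depending on which document type category is identified, which is the effect attributed to the model-selection step.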

As per claim 2, McKinney and Forman teach the document structure extraction method of Claim 1.
McKinney further teaches
one of the generated feature-value pairs associates a font size formatting feature with a particular font size value [Col. 6, lines 30-31, “a font size that is 40 percent larger than other text element on the same page”]. 

As per claim 4, McKinney and Forman teach the document structure extraction method of Claim 1.
McKinney further teaches
the untagged document is received from a client computing device [Col. 5, lines 54-56, receive digital source content (source containers) from the client device]; and 
the method further comprises applying the tag metadata to the untagged document to produce a tagged document that includes the table of contents [Col. 6, lines 32-34, “the assignment module 140 may tag the set of words (e.g., source data element) with an identifier, such as "Title Text"”; Col. 5, lines 56-65, an identifier may also herein be referred to as a classification for a source data element; Col. 2, lines 18-19, “converting unstructured content into a structured document”; Col. 13, lines 62-64, “The arrangement of the structured document, including constituent source elements being tagged with identifiers (table of contents)”], and sending the tagged document that includes the table of contents to the client computing device [abstract, generating a structured document from the tagged source data elements; Col. 5, lines 56-65, the interface 120 may be utilized to communicate structured (tagged) documents to a client device, users may view or edit these structured documents].

As per claim 5, McKinney and Forman teach the document structure extraction method of Claim 1.
McKinney further teaches
one of the generated feature-value pairs [Col. 6, lines 29-34, font size is 40 percent larger (font is a feature, size or type is a value)] associates a font size formatting feature with a particular value that is selected from a group consisting of a largest font in the untagged document, an intermediate-sized font in the untagged document, and a smallest font in the untagged document [Col. 6, lines 29-34, “a set of words may include attributes such as: a font size that is 40 percent larger than other text element on the same page … knowing these attributes the assignment module 140 may tag the set of words (e.g., source data element) with an identifier, such as "Title Text"”; It can be understood that the font size for this set of words is the largest font on the page]. 

As per claim 6, McKinney and Forman teach the document structure extraction method of Claim 1.
McKinney further teaches
one of the generated feature-value pairs [Col. 6, lines 29-34, font size is 40 percent larger (font is a feature, size or type is a value)] associates a font size formatting feature with a particular value that is selected from a group consisting of a font size that is larger than a preceding paragraph, a font size that is smaller than the preceding paragraph, a font size that is larger than a following paragraph, and a font size that is smaller than the following paragraph [Col. 6, lines 29-34, “a set of words may include attributes such as: a font size that is 40 percent larger than other text element on the same page … knowing these attributes the assignment module 140 may tag the set of words (e.g., source data element) with an identifier, such as "Title Text"”; It can be understood that the font size for this set of words (title) is 40 percent larger than a following paragraph].
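
For illustration only, the relative font-size values recited in claims 5 and 6 can be sketched as below. This is a minimal sketch under stated assumptions and is not taken from McKinney or from the application: the function names and the feature names `rank`, `vs_preceding`, and `vs_following` are hypothetical, and each paragraph is assumed to be represented by a single font size.

```python
def compare(size, other):
    """Return how `size` relates to `other`, or None if there is no neighbour."""
    if other is None:
        return None
    if size > other:
        return "larger"
    if size < other:
        return "smaller"
    return "same"


def font_size_features(sizes):
    """Build font-size feature-value pairs for each paragraph, in document order."""
    largest, smallest = max(sizes), min(sizes)
    features = []
    for i, size in enumerate(sizes):
        preceding = sizes[i - 1] if i > 0 else None
        following = sizes[i + 1] if i + 1 < len(sizes) else None
        # If every paragraph has the same size, all are reported as "largest".
        if size == largest:
            rank = "largest"
        elif size == smallest:
            rank = "smallest"
        else:
            rank = "intermediate"
        features.append({
            "rank": rank,
            "vs_preceding": compare(size, preceding),
            "vs_following": compare(size, following),
        })
    return features
```

Under this sketch, a title set in a font larger than the paragraph that follows it would carry the pair `vs_following: "larger"`, which corresponds to the reading of McKinney's "40 percent larger" attribute given above.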

As per claim 7, McKinney and Forman teach the document structure extraction method of Claim 1.
McKinney further teaches
wherein the particular value defines the particular formatting feature in relation to a formatting feature for a second document part [Col. 4, lines 50-52, “the header text includes a font size that is larger than that of the body text, and may be bold or have a font type that is different than the body text”; Col. 6, lines 29-34, “font size that is 40 percent larger than other text element”].

As per claim 9, McKinney and Forman teach the document structure extraction method of Claim 1.
McKinney further teaches
using the particular predictive model to determine a confidence level in the categorization for at least some of the two or more document parts that form part of the untagged document [Col. 7, lines 64-65, assign a weighting scale or score to each source data element; Col. 8, lines 32-35, “to be attributed or assigned the "header" identifier, a source data element may be required to have a weight or score a plurality of attributes that meets the expected value for the header identifier”; Col. 8, lines 1-9, “After a weighting or scoring has occurred for the various source data elements, the assignment module 140 may assign or tag each source data element with a particular identifier. More specifically, the assignment module 140 may compare a weighting of attributes for a source data element to a plurality of identifiers that define specific content types or functions for an element within a page or a document. For example, an identifier may include "body" or "header" where body text is typically smaller in size relative to header text”].
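
For illustration only, one way a weighted attribute score could serve as a confidence level is sketched below. This is a minimal sketch under stated assumptions, not McKinney's disclosed scoring: the profile table `IDENTIFIER_PROFILES`, its weights, the attribute names, and the `threshold` parameter are all hypothetical.

```python
# Hypothetical weighted attribute profiles, one per identifier.
IDENTIFIER_PROFILES = {
    "header": {"bold": 2.0, "larger_than_body": 3.0},
    "body": {"regular_weight": 1.0, "baseline_size": 1.0},
}


def score(attributes, profile):
    """Fraction of the profile's total weight matched by the observed attributes."""
    total = sum(profile.values())
    matched = sum(weight for name, weight in profile.items() if name in attributes)
    return matched / total


def categorize_with_confidence(attributes, threshold=0.5):
    """Return (identifier, confidence); identifier is None below the threshold."""
    scores = {
        identifier: score(attributes, profile)
        for identifier, profile in IDENTIFIER_PROFILES.items()
    }
    best = max(scores, key=scores.get)
    confidence = scores[best]
    if confidence < threshold:
        return None, confidence
    return best, confidence
```

Here the score of the best-matching identifier doubles as the confidence in the categorization, and a part whose score falls below the assumed threshold receives no identifier.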

As per claim 10, McKinney and Forman teach the document structure extraction method of Claim 1.
McKinney further teaches
receiving, from a document viewer executing on a client computing device, the plurality of document parts and the formatting information [Col. 13, lines 3-20, the tagging of source data elements may be overseen by an end user, an end user may be allowed to review, change, or delete an assigned identifier for a source data element, when the designation is updated, the source data element (document part) and corresponding attributes (formatting information) may be used in future analyses (by the assignment module) to correctly identify source data elements, the updated source data element may be stored in the training corpus in the database 150, the system 105 may utilize the user-defined source data elements to arrive at a suitable identifier for a source data element; It can be understood that the assignment module receives the updated information from the user (by accessing the training corpus) including the source data element and corresponding attributes to determine the identifier].

As per claim 11, McKinney and Forman teach the document structure extraction method of Claim 1.
McKinney further teaches
receiving, by the document structure analytics server, a plurality of untagged documents from a document management system [Col. 5, lines 53-55, “system 105 to couple with a network … to receive digital source content (source containers) from various other systems or devices, such as the client device”; Col. 4, lines 10-15, convert unstructured source containers into structured documents, digital source content may include image files, XML files, portable document files (PDF), etc.]. 

As per claim 12, McKinney and Forman teach the document structure extraction method of Claim 1.
McKinney further teaches
Embedding the tag metadata into the untagged document to produce a tagged document that also includes the table of contents [Col. 12, lines 1-6, “each of the exemplary data elements preferably includes one or more attributes that define an identifier. Thus, by comparison, if the attributes of the source data element substantially match the attributes of an exemplary data element, the source data element may be tagged with the same identifier”; Col. 6, lines 32-34, “the assignment module 140 may tag the set of words (e.g., source data element) with an identifier, such as "Title Text"”; Col. 5, lines 56-65, an identifier may also herein be referred to as a classification for a source data element; Col. 2, lines 18-19, “converting unstructured content into a structured document”; Col. 13, lines 62-64, “The arrangement of the structured document, including constituent source elements being tagged with identifiers (table of contents)”].  

As per claim 13, McKinney and Forman teach the document structure extraction method of Claim 1.
McKinney further teaches
Embedding the tag metadata into the untagged document to produce a tagged document that also includes the table of contents [Col. 12, lines 1-6, “each of the exemplary data elements preferably includes one or more attributes that define an identifier. Thus, by comparison, if the attributes of the source data element substantially match the attributes of an exemplary data element, the source data element may be tagged with the same identifier”; Col. 6, lines 32-34, “the assignment module 140 may tag the set of words (e.g., source data element) with an identifier, such as "Title Text"”; Col. 5, lines 56-65, an identifier may also herein be referred to as a classification for a source data element; Col. 2, lines 18-19, “converting unstructured content into a structured document”; Col. 13, lines 62-64, “The arrangement of the structured document, including constituent source elements being tagged with identifiers (table of contents)”], and
sending the tagged document to a client computing device [abstract, generating a structured document from the tagged source data elements; Col. 5, lines 56-65, the interface 120 may be utilized to communicate structured documents (including tagged document) to a client device, users may view or edit these structured documents]. 

As per claim 14, McKinney and Forman teach the document structure extraction method of Claim 1.
McKinney further teaches
modifying the untagged document [Col. 12, lines 34-35, “the assignment module 140 may also adjust or modify the structure of a source data element”] such that the visual appearance of the particular document part is further defined by the predicted categorization generated by the particular predictive model [Col. 12, lines 34-55, the assignment module 140 may also adjust the order of source data elements based upon their attributes and determined identifiers, as well as a styling for each of the source data elements, the assignment module 140 may consolidate attributes or identifiers for elements that have similar styles, metadata, or any of the other described attributes; Col. 4, lines 47-52, for example, "header" defines a set of characters which are typically placed on top of body text … or the body text has a font type that is different than the header text].

As per claim 22, McKinney teaches a document structure analytics server [Fig. 5, server] that comprises a memory device and a processor that is operatively coupled to the memory device [Fig. 5, Col. 14, lines 35-37, server, a processor 510 and main memory 520], wherein the processor is [Fig. 5, Col. 14, lines 37-38, “Main memory 520 stores, in part, instructions and data for execution by processor 510”] that comprises: 
accessing, by a document structure analytics server, an untagged document [Col. 4, lines 41-52, receiving unstructured digital source content; Col. 4, lines 13-15, digital source content may include image files, XML files, portable document files (PDF), etc.] that comprises a plurality of document parts [Col. 4, lines 48-49, header and body text], wherein certain of the document parts have a visual appearance [Col. 4, lines 50-52, font size, font type, bold, etc.] that is defined by formatting information included in the untagged document [Col. 4, lines 41-52, parses the digital source content for source data elements, such as text, images, video, executable-code segments, and the like, “these source data elements may include one or more attributes that define or represent a function for a particular source data element with a structured document. For example, a function may include "header", which defines a set of characters which are typically placed on top of body text, the header text includes a font size that is larger than that of the body text, and may be bold or have a font type that is different than the body text”]; 
extracting at least a portion of the formatting information from the untagged document [Col. 6, lines 7-34, receiving the digital source content, extracting source data elements from the source content, for example, a set of words and the font size, type]; 
for each of two or more of the plurality of document parts [Col. 4, lines 48-49, header and body text], generating one or more feature-value pairs [Col. 6, lines 29-34, font size is 40 percent larger (font is a feature, size or type is a value)] using the extracted formatting information [Col. 6, line 34, a set of words], wherein each of the generated feature-value pairs characterizes the visual appearance of the corresponding document part by associating a particular value with a particular formatting feature [Col. 6, lines 29-34, “a set of words may include attributes such as: (a) a font size that is 40 percent larger than other text element on the same page; and (b) an italicized font. Knowing these attributes the assignment module 140 may tag the set of words (e.g., source data element) with an identifier, such as "Title Text"”];
using a predictive model [Col. 6, lines 34-35, the assignment module 140] to predict a categorization for each of the two or more document parts [Col. 1, lines 27-28, title text and body text] that form part of the untagged document based on the corresponding one or more feature-value pairs [Col. 6, lines 29-34, “a set of words may include attributes such as: (a) a font size that is 40 percent larger than other text element on the same page; and (b) an italicized font. Knowing these attributes the assignment module 140 may tag the set of words (e.g., source data element) with an identifier, such as "Title Text"”; Col. 8, lines 2-9, “the assignment module 140 may assign or tag each source data element with a particular identifier. More specifically, the assignment module 140 may compare a weighting of attributes for a source data element to a plurality of identifiers that define specific content types or functions for an element within a page or a document. For example, an identifier may include "body" or "header" where body text is typically smaller in size relative to header text”], wherein the predictive model “…” make predictions based on a collection of categorized feature-value pairs [Col. 6, lines 29-34, font is a feature, size or type is a value] aggregated from, and characterizing document parts included in, a corpus of tagged training documents [Col. 6, lines 32-38, “knowing these attributes (feature and value) the assignment module 140 may tag the set of words (e.g., source data element) with an identifier, such as "Title Text". The assignment module 140 tags this element based upon comparisons (comparing the attributes) between the source data element and previously tagged source data elements in a training corpus”; Col. 11, lines 61-67, “Each time a source container is processed by the system 105 to extract, process, and tag source data elements with identifiers, the tagged source data elements may be stored in the database 150 in a training corpus”];
[Col. 4, lines 53-61, determine the attributes of all source data elements in the source container and tag each of these elements with their corresponding attributes; Col. 6, lines 32-34, “the assignment module 140 may tag the set of words (e.g., source data element) with an identifier, such as "Title Text"”; Col. 5, lines 64-65, “an identifier may also herein be referred to as a classification for a source data element”; Col. 8, lines 2-9, “the assignment module 140 may assign or tag each source data element with a particular identifier … an identifier may include "body" or "header" where body text is typically smaller in size relative to header text”]; and
sending the tag metadata from the document structure analytics server to a client computing device, wherein the tag metadata that is sent to the client computing device is unapplied to a document [Col. 3, lines 3-17, “the source data element would be assigned a default identifier … the tagging of source data elements may be overseen by an end user. That is, an end user may be allowed to review, change, or delete an assigned identifier for a source data element … If this default identifier is incorrect, the user may change the default designation. When the default designation is updated … the source data element and corresponding attributes may be used in future analyses to correctly identify source data elements … the updated source data element may be stored in the training corpus in the database 150”; It can be seen that the elements are tagged and then sent to the end user, who evaluates whether the assigned identifier is correct], but is configured to be applied to an untagged document stored at the client computing device based on user input that (a) is received via the client computing device, and (b) confirms at least a portion of the tag metadata [Col. 5, lines 54-56, “receive digital source content (e.g., source containers) from various other systems or devices, such as the client device 110” (receiving untagged document from the client device); Col. 6, lines 7-38, “After receiving the digital source content, the system 105 may execute the parsing module 135 to extract source data elements from the source content … The assignment module 140 tags this element based upon comparisons between the source data element and previously tagged source data elements in a training corpus, or as compared against other source data elements extracted from the source container”; abstract, “generating a structured document from the tagged source data elements”; It can be seen that the tagged elements are first sent to the user for review and update, and the updated tagged elements are then stored. When the untagged document is received from the user, the stored tagged elements are used to label the elements in the received untagged document, and a structured document is generated based on the labeled elements].
McKinney does not teach
the predictive model applies a machine learning algorithm to make predictions.
Forman teaches
the predictive model applies a machine learning algorithm to make predictions [paragraph 0025, “training a categorizer 106 from a training set 102 of labeled records. Each record in training set 102 is labeled with its correct category assignment. Inducer 104 receives the training set 102 and constructs the categorizer 106 using a machine learning algorithm.”].
Claim 22 is rejected using the same rationale as claim 1.
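For illustration only, the claim 1 mapping above can be pictured with a minimal sketch: formatting information is turned into feature-value pairs, categorized pairs are aggregated from tagged training parts, and an untagged part is categorized by comparison against that aggregate. Every name here (`feature_value_pairs`, `aggregate`, `predict`, `LARGER_RATIO`) and the simple vote-counting rule are assumptions made for the example; none of them is taken from McKinney, Forman, or the application.

```python
from collections import Counter, defaultdict

# Hypothetical threshold echoing the "40 percent larger" attribute quoted above.
LARGER_RATIO = 1.4


def feature_value_pairs(part, typical_size):
    """Turn raw formatting information into a set of (feature, value) pairs."""
    ratio = part["font_size"] / typical_size
    return {
        ("relative_font_size", "larger" if ratio >= LARGER_RATIO else "typical"),
        ("italic", bool(part.get("italic", False))),
    }


def aggregate(tagged_parts):
    """Count how often each feature-value pair occurs under each category."""
    counts = defaultdict(Counter)
    for pairs, category in tagged_parts:
        for pair in pairs:
            counts[pair][category] += 1
    return counts


def predict(counts, pairs):
    """Pick the category that the part's pairs were most often tagged with."""
    votes = Counter()
    for pair in pairs:
        votes.update(counts.get(pair, Counter()))
    return votes.most_common(1)[0][0] if votes else None


# Tiny made-up training corpus of already tagged document parts.
training = [
    (feature_value_pairs({"font_size": 15, "italic": True}, 10), "Title Text"),
    (feature_value_pairs({"font_size": 10, "italic": False}, 10), "Body Text"),
]
model = aggregate(training)

# An untagged part whose font is 50 percent larger than typical and italic.
untagged = feature_value_pairs({"font_size": 18, "italic": True}, 12)
print(predict(model, untagged))
```

The vote count stands in for whatever comparison or learning algorithm is actually used; it only shows how a category can follow from feature-value pairs alone.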

As per claim 23, McKinney and Forman teach the document structure analytics server of Claim 22.
McKinney further teaches
wherein a particular one of the generated feature-value pairs [Col. 6, lines 29-34, font size of title text (font is a feature, size is a value)] defines a proportion [40 percent larger than other text] of content [body text or other text element] comprising the particular training document having a particular visual appearance [Col. 7, lines 38-41, “an element property may include size properties such as aspect ratio, shape, and/or proportion (specific or relative to other elements )”; Col. 6, lines 29-34, “a set of words may include attributes such as: (a) a font size that is 40 percent larger than other text element on the same page; and (b) an italicized font … tag the set of words (e.g., source data element) with an identifier, such as "Title Text"”];

Claim 8 is rejected under 35 U.S.C. 103 as being unpatentable over McKinney et al. in view of Forman et al. and further in view of Yang et al. (US Pub. 2012/0278705).
As per claim 8, McKinney and Forman teach the document structure extraction method of Claim 1.
McKinney and Forman do not teach
the particular value is selected from a group consisting of left justification, center justification, right justification, and full justification; and 
the particular formatting feature is a paragraph alignment formatting feature.  
Yang teaches
the particular value is selected from a group consisting of left justification, center justification, right justification, and full justification [paragraph 0020, “a title text line(s) may be centrally aligned”]; and 
the particular formatting feature is a paragraph alignment formatting feature [paragraph 0020, “a title text line(s) may be centrally aligned and may have a large font size, and a bold font type”].  
It would have been obvious to one of ordinary skill in the art before the effective filing date of the invention to have incorporated the teaching of Yang, in which the particular value is selected from a group consisting of left justification, center justification, right justification, and full justification and the particular formatting feature is a paragraph alignment formatting feature, into the method of generating structured documents of .

Claims 15-17 are rejected under 35 U.S.C. 103 as being unpatentable over McKinney et al. in view of Forman et al., in view of Kaasila et al. (US Patent 9,317,777), and further in view of Filimonova (US Pub. 2016/0307067).
As per claim 15, McKinney teaches a non-transitory computer readable medium encoded with instructions that, when executed by one or more processors, cause a document structure analysis process to be invoked [Col. 14, lines 8-13, “a processor and a memory for storing executable instructions. The processor executes the instructions to perform the method”], the process comprising: 
identifying a plurality of training documents [Col. 11, lines 61-67, “Each time a source container is processed by the system 105 to extract, process, and tag source data elements with identifiers, the tagged source data elements may be stored in the database 150 in a training corpus”]; 
accessing a particular one of the training documents [Col. 11, lines 61-67, “the tagged source data elements may be stored in the database 150 in a training corpus”; Col. 13, lines 3-6, the tagging of source data elements (training documents) may be overseen by an end user, an end user may be allowed to review, change, or delete an assigned identifier for a source data element, especially when the parsing module is unable to determine any attributes for a source data element], the particular training document comprising a plurality of document parts [Col. 4, lines 48-49, header and body text], wherein a particular one of the document parts [Col. 4, lines 48-49, header] has (a) a visual appearance defined by formatting information included in the particular training document [Col. 4, lines 50-52, font size, font type, bold, etc.], and (b) a document part categorization [Col. 4, lines 50-52, “the header text includes a font size that is larger than that of the body text, and may be bold or have a font type that is different than the body text”; Col. 4, lines 53-58, determine the attributes of all source data elements in the source container and compare the attributes of a source data element with previously tagged source data elements in a training corpus; It can be understood that the system not only determines the attributes (visual appearance, feature-value pair, etc.) of the source elements, but also identifies the attributes of the source data in the training corpus in order to compare them]; 
generating, for the particular document part [Col. 6, lines 29-34, a set of words/title/header], one or more feature-value pairs using the formatting information [Col. 6, lines 29-34, font size is 40 percent larger (font is a feature, size or type is a value)], wherein each of the generated one or more feature-value pairs characterizes the visual appearance of the particular document part by correlating a particular value with a particular formatting feature [Col. 4, lines 49-52, “the header text includes a font size that is larger than that of the body text, and may be bold or have a font type that is different than the body text”; Col. 6, lines 29-34, “a set of words may include attributes such as: (a) a font size that is 40 percent larger than other text element on the same page; and (b) an italicized font … knowing these attributes the assignment module 140 may tag the set of words (e.g., source data element) with an identifier, such as "Title Text"”], and wherein a particular one of the generated feature-value pairs [Col. 6, lines 29-34, font size of title text (font is a feature, size is a value)] defines a proportion [40 percent larger than other text] of content [body text or other text element] comprising the particular training document having a particular visual appearance [Col. 7, lines 38-41, “an element property may include size properties such as aspect ratio, shape, and/or proportion (specific or relative to other elements )”; Col. 6, lines 29-34, “a set of words may include attributes such as: (a) a font size that is 40 percent larger than other text element on the same page; and (b) an italicized font … tag the set of words (e.g., source data element) with an identifier, such as "Title Text"”];
“…” the particular predictive model [Col. 6, lines 34-35, the assignment module 140] is configured to establish a predicted document part categorization based on at least one feature-value [Col. 6, lines 29-34, “a set of words may include attributes such as: (a) a font size that is 40 percent larger than other text element on the same page; and (b) an italicized font … knowing these attributes the assignment module 140 may tag the set of words (e.g., source data element) with an identifier, such as "Title Text"”] received from a client computing device [Col. 5, lines 54-56, receive digital source content (e.g., source containers) from the client device].
McKinney does not teach
identifying a plurality of training documents, each of which is associated with a document type category (emphasis added);
defining a document part feature vector that links the generated one or more feature-value pairs with the document part categorization, wherein the document part feature vector links 
a feature-value pair that correlates a document part comprising 90% or more of document content with a body paragraph categorization, and 
a feature-value pair that correlates a document part comprising less than 0.1 % of document content with a title categorization; 
storing the document part feature vector in a memory resource hosted by a document structure analytics server; 
using the document part feature vector to train a particular predictive model in a supervised learning framework; 
associating the particular predictive model with the particular document type category; and 
storing the particular predictive model in the memory resource hosted by the document structure analytics server, wherein the memory resource stores a plurality of predictive models, each of which is associated with at least one of a plurality of document type categories.  
Forman teaches
[Fig. 5, paragraph 0045, “assume that books are being categorized into dozens of categories. Assume that category 504A in FIG. 5 is "fiction" and category 504B is "non-fiction," … If it is known that books from a particular technical publisher (e.g., Publisher X) are nonfiction (i.e., we have a prior that books from Publisher X belong in any of the non-fiction topics, but not the fiction topics) … Thus, when training the top categorizer 500 to choose between categories 504A and 504B, all of the books from Publisher X can be used as training examples for category 504B”]; 

associating the particular predictive model with the particular document type category [Fig. 5, paragraph 0045, “assume that books are being categorized into dozens of categories. Assume that category 504A in FIG. 5 is "fiction" and category 504B is "non-fiction"”; It can be seen in Fig. 5 that when the document type falls in the 504A category (fiction), the categorizer 502A is selected to perform the further process, and when the document type falls in the 504B category (non-fiction), the categorizer 502B is selected]; and  
storing the particular predictive model in the memory resource hosted by the document structure analytics server, wherein the memory resource stores a plurality of predictive models [Fig. 5, paragraph 0030, “an inducer module 306 and a hierarchical categorizer module 308 are stored in main memory 316 … processor 302 executes inducer module 306, which induces top-down hierarchical categorizer 308 based on training items (documents) with labels 402A (shown in FIG. 4) and based on training items with priors 402B (shown in FIG. 4), as described in further detail below. Processor 302 executes the induced hierarchical categorizer 308 to categorize a set of unlabeled items”; paragraph 0041, “categorizer 308 includes a top or root categorizer 500, and seven sub-categorizers 502A, 502B, 502A-1, 502A-2, 502A-3, 502B-1, and 502B-2 (collectively referred to as sub-categorizers 502)”], each of which is associated with at least one of a plurality of document type categories [Fig. 5, paragraph 0045, “assume that books are being categorized into dozens of categories. Assume that category 504A in FIG. 5 is "fiction" and category 504B is "non-fiction"”; It can be seen in Fig. 5 that when the document type falls in the 504A category (fiction), the categorizer 502A is selected to perform the further process, and when the document type falls in the 504B category (non-fiction), the categorizer 502B is selected].  
using the same rationale as claim 1.
McKinney and Forman do not explicitly teach
defining a document part feature vector that links the generated one or more feature-value pairs with the document part categorization, wherein the document part feature vector links 
a feature-value pair that correlates a document part comprising 90% or more of document content with a body paragraph categorization, and 
a feature-value pair that correlates a document part comprising less than 0.1 % of document content with a title categorization; 
storing the document part feature vector in a memory resource hosted by a document structure analytics server; 

Kaasila teaches 
defining a document part feature vector that links the generated one or more feature-value pairs with the document part categorization, wherein the document part feature vector links [Col. 22, lines 14-25, “receiving data representing features of a first font and data representing features of a second font … The first font and the second font are capable of representing one or more glyphs … The features for each font may be represented as a vector of font features, each vector may include numerical values that represent the features … of the corresponding font”]; 
storing the document part feature vector in a memory resource hosted by a document structure analytics server [Col. 15, line 67 to Col. 16, line 2, “store font features (e.g., calculated feature vectors) in a font feature database … for later retrieval and use”]; and 
using the document part feature vector to train a particular predictive model in a supervised learning framework [Col. 18, lines 14-18, “one input vector may represent the features of "Font A" while a second input vector may represent the features of "Font B" and a third input vector may represent the features of "Font C". Other types of data may also be input to represent the fonts used for training the learning machine”; Col. 8, lines 31-34, “supervised learning techniques may be implemented in which training is based on a desired output that is known for an input”].
It would have been obvious to one of ordinary skill in the art before the effective filing date of the invention to have incorporated Kaasila's process of using a feature vector associated with the one or more feature-value pairs to train a learning machine into the method of generating structured documents of McKinney. Doing so would help determine the similarity between fonts (Kaasila, abstract).
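As a purely illustrative aid, the limitation of using a feature vector to train a predictive model in a supervised learning framework can be sketched as follows. The sketch encodes each part's feature-value pairs as a numeric vector, pairs the vector with its known categorization, and fits a nearest-centroid classifier. The nearest-centroid technique, the vocabulary, the training examples, and all function names are assumptions chosen for brevity; they are not drawn from Kaasila, McKinney, or the application.

```python
# Hypothetical, fixed ordering of the feature-value pairs used as vector slots.
VOCABULARY = [
    ("font_size", "large"),
    ("font_size", "normal"),
    ("bold", True),
    ("alignment", "center"),
]


def to_vector(pairs):
    """Encode a set of feature-value pairs as a 0/1 vector over VOCABULARY."""
    return [1.0 if pair in pairs else 0.0 for pair in VOCABULARY]


def train_centroids(labeled_vectors):
    """Supervised step: average the vectors that share a known categorization."""
    sums, counts = {}, {}
    for vector, label in labeled_vectors:
        acc = sums.setdefault(label, [0.0] * len(vector))
        for i, x in enumerate(vector):
            acc[i] += x
        counts[label] = counts.get(label, 0) + 1
    return {label: [x / counts[label] for x in acc] for label, acc in sums.items()}


def classify(centroids, vector):
    """Return the label whose centroid is closest in squared Euclidean distance."""
    def distance(label):
        return sum((a - b) ** 2 for a, b in zip(centroids[label], vector))
    return min(centroids, key=distance)


# Made-up labeled examples: each links feature-value pairs to a categorization.
examples = [
    ({("font_size", "large"), ("bold", True), ("alignment", "center")}, "title"),
    ({("font_size", "large"), ("alignment", "center")}, "title"),
    ({("font_size", "normal")}, "body paragraph"),
    ({("font_size", "normal"), ("bold", True)}, "body paragraph"),
]
centroids = train_centroids([(to_vector(p), label) for p, label in examples])

print(classify(centroids, to_vector({("font_size", "large"), ("bold", True)})))
```

Any supervised learner could take the place of the centroid step; the point is only that the vector and its known label are presented together during training.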
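McKinney, Forman and Kaasila do not teach
a feature-value pair that correlates a document part comprising 90% or more of document content with a body paragraph categorization, and 
a feature-value pair that correlates a document part comprising less than 0.1 % of document content with a title categorization; 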
Filimonova teaches
a feature-value pair that correlates a document part comprising 90% or more of document content with a body paragraph categorization [Fig. 4, paragraph 0168, “classifier 204 is configured … to determine the location and size of the logotype portion 402”; paragraph 0109, “The content can be said to be split into a logotype portion 402 (also sometimes called a "document header") and a main body portion 404 … the logotype portion 402 is a letterhead header … the logotype portion 402 … can be generally considered to be a certain pre-defined header portion of the content, such as but not limited to top ten percent of the page size”; Fig. 4 shows that the content of the digital document is split into a logotype portion 402 ("document header") and a main body portion 404, and since the header portion is ten percent of the document, the body portion is 90 percent of the document], and 
a feature-value pair that correlates a document part comprising less than 0.1 % of document content with a title categorization [Fig. 4, paragraph 0168, “classifier 204 is configured … to determine the location and size of the logotype portion 402”; paragraph 0109, “The content can be said to be split into a logotype portion 402 (also sometimes called a "document header") and a main body portion 404 … the logotype portion 402 is a letterhead header … the logotype portion 402 … can be generally considered to be a certain pre-defined header portion of the content, such as but not limited to top ten percent of the page size”]; 
It would have been obvious to one of ordinary skill in the art before the effective filing date of the invention to have included a feature-value pair that correlates a document part comprising 90% or more of document content with a body paragraph categorization, and a feature-value pair that 

As per claim 16, McKinney, Forman, Kaasila and Filimonova teach the non-transitory computer readable medium of Claim 15.
McKinney further teaches
one of the generated feature-value pairs associates a font size formatting feature with a particular font size value [Col. 6, lines 29-34, “a set of words may include attributes such as: (a) a font size that is 40 percent larger than other text element on the same page; and (b) an italicized font … knowing these attributes the assignment module 140 may tag the set of words (e.g., source data element) with an identifier, such as "Title Text"”].

As per claim 17, McKinney, Forman, Kaasila and Filimonova teach the non-transitory computer readable medium of Claim 15.
Forman further teaches
the plurality of training documents are identified based on a common characteristic that defines the particular document type category [Fig. 5, paragraph 0045, “assume that books are being categorized into dozens of categories. Assume that category 504A in FIG. 5 is "fiction" and category 504B is "non-fiction"”], and that is selected from a group consisting of an author and a topic keyword [paragraph 0003, “Categorization involves assigning items (e.g., documents, products, patients, etc.) into categories based on features of the items (e.g., which words appear in a document), and possibly subject to a degree of confidence”].  
Claim 17 is rejected using the same rationale as claim 15.

Claim 24 is rejected under 35 U.S.C. 103 as being unpatentable over McKinney et al. in view of Forman et al. and further in view of Filimonova (US Pub. 2016/0307067).
As per claim 24, McKinney and Forman teach the document structure analytics server of Claim 22.
McKinney further teaches
a particular one of the generated feature-value pairs [Col. 6, lines 29-34, font size of title text (font is a feature, size is a value)] defines a proportion [40 percent larger than other text] of content [body text or other text element] comprising the particular training document having a particular visual appearance [Col. 7, lines 38-41, “an element property may include size properties such as aspect ratio, shape, and/or proportion (specific or relative to other elements )”; Col. 6, lines 29-34, “a set of words may include attributes such as: (a) a font size that is 40 percent larger than other text element on the same page; and (b) an italicized font … tag the set of words (e.g., source data element) with an identifier, such as "Title Text"”];
McKinney further teaches in Col. 8 “elements, the assignment module 140 may assign or tag each source data element with a particular identifier … an identifier may include "body" or "header"”.
McKinney and Forman do not teach
the tag metadata associates a document part that constitutes less than 0.1% of content comprising the untagged document with a predicted categorization of title; and 
the tag metadata associates a document part that constitutes more than 90% of content comprising the untagged document with a predicted categorization of body paragraph. 
Filimonova teaches
the tag metadata associates a document part that constitutes less than 0.1% of content comprising the untagged document with a predicted categorization of title [Fig. 4, paragraph 0168, “classifier 204 is configured … to determine the location and size of the logotype portion 402”; paragraph 0109, “The content can be said to be split into a logotype portion 402 (also sometimes called a "document header") and a main body portion 404 … the logotype portion 402 is a letterhead header … the logotype portion 402 … can be generally considered to be a certain pre-defined header portion of the content, such as but not limited to top ten percent of the page size”]; and 
the tag metadata associates a document part that constitutes more than 90% of content comprising the untagged document with a predicted categorization of body paragraph [Fig. 4, paragraph 0168, “classifier 204 is configured … to determine the location and size of the logotype portion 402”; paragraph 0109, “The content can be said to be split into a logotype portion 402 (also sometimes called a "document header") and a main body portion 404 … the logotype portion 402 is a letterhead header … the logotype portion 402 … can be generally considered to be a certain pre-defined header portion of the content, such as but not limited to top ten percent of the page size”; Fig. 4 shows that the content of the digital document is split into a logotype portion 402 ("document header") and a main body portion 404, and since the header portion is ten percent of the document, the body portion is 90 percent of the document]. 
Claim 24 is rejected using the same rationale as claim 15.

Prior Art

The prior art made of record and not relied upon is considered pertinent to applicant’s disclosure.
Richardson et al. (US Pub. 2013/0174017) describes a method for document content reconstruction in digital content.
McKinney et al. (US Pub. 2016/0342578) describes a method for generating structured documents.

Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to TRI T NGUYEN whose telephone number is 571-272-0103.  The examiner can normally be reached M-F, 8 AM-5 PM (CT).
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, ALEXEY SHMATOV can be reached on 571-270-3428.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available 




/T. N./Examiner, Art Unit 2123
/ALEXEY SHMATOV/Supervisory Patent Examiner, Art Unit 2123