Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

DETAILED ACTION
This non-final office action is in response to the Application filed on 1/23/2021, claiming priority to 62/965,516, 62/965,523, and 62/965,520, filed 1/24/2020, and 62/975,514, filed 2/12/2020.
Claim(s) 1-14 are pending for examination. Claim(s) 1 is/are independent claim(s).

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claim(s) 1-7, 9, 11 is/are rejected under 35 U.S.C. 103 as being unpatentable over Ilić; Andreja et al. US Pub. No. 2020/0184013 (Ilić) in view of Sim; David Alexander et al. US Pub. No. 2020/0219481 (Sim).

Claim 1: 
	Ilić teaches: 
A method for extracting headers [¶ 0005-06] (detect headings), comprising:
receiving an input body of text containing a plurality of chunks of text [¶ 0032-34] (paragraph by paragraph analysis, a paragraph is a “chunk”);
identifying a set of features of each chunk [¶ 0036-44] (direct formatting features) [¶ 0045-54] (relative formatting features) [¶ 0055-69] (syntactical features);
classifying each text chunk as a potential header depending on whether the chunk includes a mark or title text [¶ 0073] (portion provided to level classifier) [¶ 0082] (clustering detected headings);
… ; and
comparing the … potential headers to each other and to a remainder of the input body of text not included in the … potential headers to confirm whether each cleaned potential header is a header [¶ 0074-81] (linear regression on possible headings) [¶ 0082] (clustering component on possible headings) [¶ 0050-51] (indentation, compare to previous, compare to next) [¶ 0068-69] (text length, compare to previous, compare to next) [¶ 0079] (normalized font size comparison) [¶ 0080] (normalized indentation comparison).

	Ilić fails to teach, but Sim teaches: 
identifying any boilerplate in each potential header and removing it to form cleaned potential headers [abstract, ¶ 0002, 06, 15-16] (using the similarity scores to determine labels associated with text elements comprising boilerplate text) [¶ 0032-34] (remove boilerplate and pass the remaining text to indexer, trimmer removes boilerplate); and
{comparing the} cleaned {potential headers to each other and to a remainder of the input body of text not included in} the cleaned {potential headers to confirm whether each cleaned potential header is a header} [¶ 0013, 16-17, 43, 50, 55] (comparing local language models for text elements having the same label to derive similarity indicators, and using the similarity indicators to derive a similarity score for that label).
	Text would be passed through the method of Sim, cleaned, and then passed to the method of Ilić, just as Sim passes text to an indexer. 

	It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to combine the method of document heading detection in Ilić and the method of removing boilerplate in Sim, with a reasonable expectation of success. 
	The motivation for this combination would have been “improved identification of boilerplate” and “improved information retrieval” [Sim: ¶ 0016].
One of ordinary skill in the art would recognize that the increased functionality resulting from the combination of features would have been a predictable result.
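
For illustration only, the proposed ordering of the combination (text cleaned of boilerplate first, then passed to heading detection) can be sketched as below. This is a minimal sketch and not the implementation of either reference: the `Chunk` type, the single `bold` feature, and the `is_boilerplate` predicate are hypothetical stand-ins for the feature sets and similarity scoring described in Ilić and Sim.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Chunk:
    text: str
    bold: bool = False  # hypothetical stand-in for one formatting feature


def strip_boilerplate(chunks: List[Chunk],
                      is_boilerplate: Callable[[str], bool]) -> List[Chunk]:
    """Cleaning stage: drop chunks that the supplied predicate flags as boilerplate."""
    return [c for c in chunks if not is_boilerplate(c.text)]


def detect_headings(chunks: List[Chunk]) -> List[str]:
    """Detection stage: flag chunks whose (assumed) formatting feature is set."""
    return [c.text for c in chunks if c.bold]


def combined(chunks: List[Chunk],
             is_boilerplate: Callable[[str], bool]) -> List[str]:
    # Cleaned text is handed to the heading detector, analogous to
    # trimmed text being handed to an indexer.
    return detect_headings(strip_boilerplate(chunks, is_boilerplate))


doc = [
    Chunk("Page 1 of 9", bold=True),
    Chunk("1. Introduction", bold=True),
    Chunk("Body text."),
]
headings = combined(doc, lambda t: t.startswith("Page "))
```

In this sketch the bold page marker never reaches the detection stage, so only the remaining bold chunk is reported as a heading.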

Claim 2: 
	Ilić teaches: 
The method of claim 1, wherein the features include typography characteristics [¶ 0041] (font size) [¶ 0038] (italic) [¶ 0039] (underline) [¶ 0037] (bold).

Claim 3: 
	Ilić teaches: 
The method of claim 2, wherein the features include at least two or more of font family, font size, italic, bold, underline, space above, space left, space left first line, and justification [¶ 0041] (font size) [¶ 0038] (italic) [¶ 0039] (underline) [¶ 0037] (bold) [¶ 0002] (justification).

Claim 4: 
	Ilić teaches: 
[Examiner’s interpretation: orthography includes uppercase, as disclosed in applicant’s specification, see published specification ¶ 0116.]
The method of claim 1, wherein the features include orthography characteristics [¶ 0040, 78] (upper case) [¶ 0054] (all caps).

Claim 5: 
	Ilić teaches: 
The method of claim 1, wherein the features include page layout [¶ 0053] (distance to neighbor) [¶ 0042, 50-52, 80] (indentation) [¶ 0044] (alignment).

Claim 6: 
	Ilić teaches: 
The method of claim 1, wherein the features include at least two or more of typography characteristics, orthography characteristics and page layout [¶ 0041] (font size) [¶ 0038] (italic) [¶ 0039] (underline) [¶ 0037] (bold) [¶ 0053] (distance to neighbor) [¶ 0040, 78] (upper case) [¶ 0054] (all caps).

Claim 7: 
	Ilić teaches: 
The method of claim 1 further comprising determining if a chunk includes title text by at least comparing features of the chunk to features of a remainder of the input body of text and identifying title text if its features differ from those of a majority of the remainder [¶ 0049-52, 79-80] (compared to normalized).
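
For illustration only, a comparison of a chunk's features against those of the majority of the text can be sketched as below. The feature tuple (font size, bold) is a hypothetical simplification, and the sketch takes the majority over the whole document rather than over the remainder excluding the chunk under test; neither choice is drawn from Ilić.

```python
from collections import Counter
from typing import List, Tuple

Features = Tuple[int, bool]  # hypothetical: (font size, bold)


def differs_from_majority(features: List[Features]) -> List[bool]:
    """Mark each chunk whose features differ from the most common feature set."""
    if not features:
        return []
    majority, _ = Counter(features).most_common(1)[0]
    return [f != majority for f in features]


flags = differs_from_majority([(14, True), (11, False), (11, False), (11, False)])
```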

Claim 9: 
	Ilić teaches: 
The method of claim 1, wherein the comparison of cleaned potential headers includes determining a similarity among all of the cleaned potential headers based on their features [¶ 0074-81] (linear regression on possible headings) [¶ 0082] (clustering component on possible headings) [¶ 0050-51] (indentation, compare to previous, compare to next) [¶ 0068-69] (text length, compare to previous, compare to next) [¶ 0079] (normalized font size comparison) [¶ 0080] (normalized indentation comparison).

Claim 11: 
	Ilić teaches [¶ 0059-67, 82, 87] (threshold). 
	Sim teaches: 
The method of claim 1, wherein identifying boilerplate includes comparing an average number of characters in a group of potential headers with similar features to a threshold [¶ 0047, 54] (similarity threshold).
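
For illustration only, the claimed comparison of an average number of characters in a group to a threshold can be sketched as below. The function name, the threshold value, and the direction of the comparison (a group is flagged when its average exceeds the threshold) are assumptions and are not taken from the claim or from either reference.

```python
from statistics import mean
from typing import List


def group_exceeds_length_threshold(headers: List[str], max_avg_chars: float) -> bool:
    """Return True if the group's average character count is above the threshold."""
    if not headers:
        return False  # an empty group has no average to compare
    return mean(len(h) for h in headers) > max_avg_chars


short_group = group_exceeds_length_threshold(["Introduction", "Conclusion"], 40)
long_group = group_exceeds_length_threshold(["x" * 100, "y" * 80], 40)
```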

Claim(s) 8, 10, 12, 13, 14 is/are rejected under 35 U.S.C. 103 as being unpatentable over Ilić; Andreja et al. US Pub. No. 2020/0184013 (Ilić) in view of Sim; David Alexander et al. US Pub. No. 2020/0219481 (Sim), and further in view of Gelosi; Patrizio US Pub. No. 2019/0114479 (Gelosi).
Claim 8: 
	Ilić and Sim teach all the elements shown above. 
	Ilić teaches [¶ 0056-69] (number of characters in paragraph, sentence count, word count). 
	Ilić and Sim fail to teach, but Gelosi teaches: 
The method of claim 1, wherein the comparison of cleaned headers includes comparing the number of characters included in the cleaned potential headers and chunks of text in the input body of text covered by the cleaned potential headers to a total number of characters in the input body of text [¶ 0192, 234, 386] (max length and length of the document string are the “total number of characters”) [¶ 0014, 156-157, 256] (removing page number, a page header, a page footer, and a footnote, excluding page number, page numbers, headers and footers are all boilerplate and removing them from the document means it is “cleaned”).


	The motivation for this combination would have been “improve the document navigability” and “improved performance” [Gelosi: ¶ 0005, 249, 383].
One of ordinary skill in the art would recognize that the increased functionality resulting from the combination of features would have been a predictable result.

Claim 10: 
	Gelosi teaches: 
The method of claim 1, wherein the comparison of cleaned potential headers includes discounting groups of similar cleaned potential headers based on an average number of characters among the cleaned potential headers [¶ 0344] (normal distribution function, normalizing factor, means the eligibility values are compared to the average) [¶ 0370] (similarity string functions, using the Levenshtein distance would be based on a statistical average).

Claim 12: 
	Gelosi teaches: 
The method of claim 1, wherein identifying boilerplate includes comparing an average number of characters in a group of potential headers with similar features to a number of character edits required to transform each potential header in the group into a subsequent potential header in the group [¶ 0074] (marker sequence) [¶ 0130] (subsequent character) [¶ 066-267] (sequential number) [¶ 0368-371] (similarity comparison based on sequential number is a threshold edit of 1).

Claim 13: 
	Gelosi teaches: 
The method of claim 1, wherein identifying boilerplate includes comparing an average number of characters in a group of potential headers with similar features to a threshold and to a number of character edits required to transform each potential header in the group into a subsequent potential header in the group [¶ 0074] (marker sequence) [¶ 0130] (subsequent character) [¶ 066-267] (sequential number) [¶ 0368-371] (similarity comparison based on sequential number is a threshold edit of 1). 
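
For illustration only, the comparison recited in claims 12 and 13 (an average number of characters in a group compared to the number of character edits needed to transform each potential header into the next) can be sketched as below using the Levenshtein distance. The `ratio` parameter and the rule that a group is repetitive when every consecutive edit count is small relative to the average length are assumptions, not taken from the claims or from Gelosi.

```python
from typing import List


def levenshtein(a: str, b: str) -> int:
    """Number of single-character insertions, deletions, or substitutions from a to b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def consecutive_edits(headers: List[str]) -> List[int]:
    """Edit distance from each header to the subsequent header in the group."""
    return [levenshtein(x, y) for x, y in zip(headers, headers[1:])]


def looks_repetitive(headers: List[str], ratio: float = 0.2) -> bool:
    """Assumed rule: every consecutive edit count is at most ratio * average length."""
    if len(headers) < 2:
        return False
    avg_len = sum(len(h) for h in headers) / len(headers)
    return all(e <= ratio * avg_len for e in consecutive_edits(headers))


page_markers = ["Page 1 of 9", "Page 2 of 9", "Page 3 of 9"]
section_titles = ["Introduction", "Related Work", "Conclusion"]
```

Under these assumptions, a run of page markers that differ by a single character is flagged as repetitive, while ordinary section titles are not.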

Claim 14: 
	Gelosi teaches: 
The method of claim 1 wherein identifying boilerplate includes comparing potential headers to a set of one or more predetermined non-boilerplate words [¶ 0344] (normal distribution function, normalizing factor, means the eligibility values are compared to the whole) [¶ 0370] (similarity string functions, using the Levenshtein distance would include non-boilerplate words).

Prior Art
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. 
Please See PTO-892: Notice of References Cited.

Evidence of the level of ordinary skill in the art for Claim 1: 
Mehra; Ashutosh et al. US 20210117667 teaches: identifying document structural elements and correcting errors in the classification and/or location of the identified structural elements; identification based on font style. 
Khan; Shahzad US 20130311169 teaches: document structure analysis; identifying salient text; removing boilerplate text. 
Jovanovic; Vuk et al. US 20130191366 teaches: identifying headers, footers, and watermarks; a pattern matching engine determines when the upper or lower parts of a certain number of pages contain the same or similar content at the same position. 
Dejean; Herve et al. US 20060156226 teaches: identifying header/footer content of a document; similarity; edit distance. 

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to BENJAMIN J SMITH whose telephone number is (571)270-3825.  The examiner can normally be reached on Monday - Friday 11:00 - 7:30 EST.
To schedule an interview, applicant may use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Scott Baderman, can be reached on (571)272-3644. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.




/Benjamin Smith/
Examiner, Art Unit 2144
Direct Phone: 571-270-3825
Direct Fax: 571-270-4825
Email: benjamin.smith@uspto.gov