DETAILED ACTION
This communication is in response to the after final response filed 1/21/22 in which claims 1, 2, 8, 13-15, 19, and 20 were amended, and claim 12 was canceled. Claims 1-11 and 13-20 are pending.

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection.  Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114.  Applicant's submission filed on 1/21/22 has been entered.
 
Response to Arguments
Applicant’s arguments with respect to claims 1, 19, and 20 have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.

Claim Rejections - 35 USC § 102
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.


Claims 1, 3-11, 14-16, 19, and 20 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Rajan (US 2012/0005686 A1; published Jan. 5, 2012).

Regarding claim 1, Rajan discloses [a] method implemented on a computer system executing instructions for analyzing and annotating documents, the method comprising: 
accessing a document set that contains a plurality of documents; (see paragraph 8 (set of documents))
automatically identifying chunks within individual documents in the document set and also automatically identifying the semantic roles within the individual documents played by the chunks; (see paragraphs 8 (machine learning model learns the characteristics (features) of a set of documents that are correlated), 15 (web page is decomposed into a set of functional segments, takes each page segment and assigns a functional label of a type))
wherein both identifying the chunks and identifying their semantic roles are (a) based on the content, geometric layout and contexts in the individual document; and (b) based on patterns of content, geometric layout and contexts across the documents in the document set; and (see paragraphs 15 (functional labels provide information about the role of a segment on the page), 19 (main content is usually located in the middle of the page and can be recognized by HTML tags that specify a long and fat text box containing a large fraction of the total number of text sentences and the use of a variety of font types), 20 (user generated content includes content such as user comments, forum posts, posts on boards, product reviews, is obtained from users interacting with the web page rather than provided by the initial author, is often found at the bottom of the page and is recognized by HTML tags that specify repeated element but with different content))
annotating documents in the document set based on analysis of the identified chunks from documents within the document set, wherein the annotations include locations of the identified chunks and the semantic roles played by the identified chunks at those locations (see paragraph 15 (a functional label is assigned to each segment, the functional label (e.g., main content, user-generated content) providing information about the role of a segment on the page)). 
Claims 19 and 20 are apparatus and CRM claims corresponding to claim 1 and are similarly rejected.
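
For illustration only, the kind of location- and content-based role labeling summarized above from Rajan, paragraphs 15, 19, and 20, might be sketched as follows. This is a minimal sketch under stated assumptions, not Rajan's implementation and not the claimed method: the `Segment` fields, the numeric thresholds, and the function names are hypothetical and were chosen only to make the example runnable.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """A page segment; every field here is a hypothetical stand-in."""
    name: str
    vertical_position: float  # assumed scale: 0.0 = top of page, 1.0 = bottom
    sentence_count: int
    repeated_elements: bool   # repeated element, different content each time


def label_segment(seg, total_sentences):
    """Assign a functional label using simple, assumed heuristics."""
    share = seg.sentence_count / total_sentences if total_sentences else 0.0
    # Main content: middle of the page, holding a large fraction of the text.
    if 0.25 <= seg.vertical_position <= 0.75 and share >= 0.5:
        return "main content"
    # User-generated content: near the bottom, built from repeated elements.
    if seg.vertical_position > 0.75 and seg.repeated_elements:
        return "user-generated content"
    return "other"


def annotate(segments):
    """Return one annotation per segment: its location and its role."""
    total = sum(s.sentence_count for s in segments)
    return [
        {
            "segment": s.name,
            "location": s.vertical_position,
            "role": label_segment(s, total),
        }
        for s in segments
    ]


page = [
    Segment("nav", 0.05, 1, False),
    Segment("article", 0.50, 30, False),
    Segment("comments", 0.90, 10, True),
]
annotations = annotate(page)
```

Each annotation in this sketch pairs a location with a role, which is the pairing the rejection maps to the functional labels of Rajan, paragraph 15.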

Regarding claim 3, Rajan discloses the invention of claim 1 as discussed above. Rajan further discloses wherein automatically identifying chunks within individual documents in the document set is further (c) based on identifying counterpart chunks in different documents in the document set, wherein counterpart chunks play a same semantic role in different documents (see paragraphs 26 (machine learning mechanism may be used to create classifier modules that recognize the features associated with a particular category label), 32 (machine learning module for each segment type is generated by analyzing all of the training data segments that have been classified with the same segment type; features that are relevant to the presentation of the segments on the screen are extracted into a feature vector and correlated with the functional category; a classifier takes a segment and a set of segment features as input and outputs a probability that the segment with these features should be classified as the particular type)).

Regarding claim 4, Rajan discloses the invention of claim 3 as discussed above. Rajan further discloses wherein identifying counterpart chunks in different documents comprises: identifying content that is different in different documents but occurs within substantially similar contexts within the different documents (see paragraphs 26 (machine learning mechanism may be used to create classifier modules that recognize the features associated with a particular category label), 32 (machine learning module for each segment type is generated by analyzing all of the training data segments that have been classified with the same segment type; features that are relevant to the presentation of the segments on the screen are extracted into a feature vector and correlated with the functional category; a classifier takes a segment and a set of segment features as input and outputs a probability that the segment with these features should be classified as the particular type)). 

Regarding claim 5, Rajan discloses the invention of claim 3 as discussed above. Rajan further discloses wherein identifying counterpart chunks in different documents comprises: identifying content that is substantially the same in different documents (see paragraphs 26 (machine learning mechanism may be used to create classifier modules that recognize the features associated with a particular category label), 32 (machine learning module for each segment type is generated by analyzing all of the training data segments that have been classified with the same segment type; features that are relevant to the presentation of the segments on the screen are extracted into a feature vector and correlated with the functional category; a classifier takes a segment and a set of segment features as input and outputs a probability that the segment with these features should be classified as the particular type)).

Regarding claim 6, Rajan discloses the invention of claim 1 as discussed above. Rajan further discloses annotating some of the identified chunks with metadata describing the chunk, wherein identifying counterpart chunks in different documents is based on similarity of the metadata (see paragraphs 25 (machine learning mechanism uses training data to learn the correlation between web page features), 26 (machine learning mechanism is used to create classifier modules that recognize the features associated with a particular category label), 29 (the presentation features in the HTML elements and attributes associated with each segment are extracted and stored as metadata associated with each segment), 30 (each segment is associated with metadata that includes a feature vector that describes features of the segment)).

Regarding claim 7, Rajan discloses the invention of claim 1 as discussed above. Rajan further discloses wherein identifying chunks based on patterns across the documents in the document set comprises: 
identifying, in an individual document, a chunk that is commonly occurring in the documents of the document set but does not occur in the individual document (see paragraphs 30-35 (a segment of a web page is classified to belong (or not to belong) to a particular segment type based on a machine learning model trained on a set of web pages annotated by human editors)).

Regarding claim 8, Rajan discloses the invention of claim 1 as discussed above. Rajan further discloses wherein the identified chunks comprise: structural chunks that contain content comprising structures within the geometric layout of the documents, wherein the identified chunks comprise: field chunks that contain content within the documents suitable for use as fields in document templates (see FIG. 4, paragraph 29 (elements and attributes define both content type and presentation layout, initially partitioned into segments, the presentation features in the HTML elements and attributes associated with each segment are extracted and stored as metadata associated with each segment; Table 1 – location of segment as displayed on the screen, height and length of text box, user input forms, boiler plate, etc.)). 

Regarding claim 9, Rajan discloses the invention of claim 8 as discussed above. Rajan further discloses wherein some of the field chunks are hierarchical and contain other chunks as sub-chunks (see FIG. 4, paragraph 29 (elements and attributes define both content type and presentation layout, initially partitioned into segments, the presentation features in the HTML elements and attributes associated with each segment are extracted and stored as metadata associated with each segment; Table 1 - location of segment as displayed on the screen, height and length of text box, user input forms, boiler plate and advertisement phrases, element type(s) (site navigation)); see also paragraph 33 (segmentation constructs a DOM tree from the web page and works to group each of the nodes in the DOM tree into a segment; location-based segmentation defines regions of a web page such as top, middle, bottom, left, and right and uses presentation information in HTML tag attributes to group together portions of the web page)).

Regarding claim 10, Rajan discloses the invention of claim 1 as discussed above. Rajan further discloses wherein some of the identified chunks contain content that is descriptive of semantic roles played by other chunks (see paragraph 35 (each classifier outputs a probability that the segment should be assigned a particular category)).

Regarding claim 11, Rajan discloses the invention of claim 1 as discussed above. Rajan further discloses wherein the annotations further comprise datatypes of the identified chunks (see Table 1 (segment features of a particular segment role include the presence of user input forms such as search boxes, radio buttons, submit buttons, and height and length of text boxes)). 

Regarding claim 14, Rajan further discloses wherein identifying chunks based on geometric layout comprises: identifying spatial boundaries of structural chunks using artificial intelligence-based visual recognition of geometric patterns of the layout (see FIG. 2, Table 1, paragraph 29 (examples of features that can be extracted and later correlated with functional categories, also included in the table are some heuristics for establishing a correlation between feature metadata and a functional category), paragraph 31 (main content is in the middle of the page and describes what Riley’s Place is. The Links to Donor’s Web Sites, classified as content pointers 240, are in a smaller font on the right margin of the page, and a boiler plate on the bottom left), paragraph 33 (location based segmentation defines regions; vision based segmentation breaks the web page down into segments based on organizing HTML tags)).

Regarding claim 15, Rajan discloses the invention of claim 1 as discussed above. Rajan further discloses wherein identifying chunks based on geometric layout comprises: identifying structural chunks based on geometric layout of non-text structural features, wherein the non-text structural features comprise at least one of a figure, a table, a sidebar, a footnote, and a page header or footer (see Table 1 (features include the presence of boiler plate segments)).

Regarding claim 16, Rajan discloses the invention of claim 1 as discussed above. Rajan further discloses wherein identifying chunks based on content comprises: identifying chunks using AI techniques for topic estimation (see paragraph 25).

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or non-obviousness.

Claim 2 is rejected under 35 U.S.C. 103 as being unpatentable over Rajan as applied to claim 1 above, and further in view of Musgrove (US 2006/0235870 A1; published Oct. 19, 2006).

Regarding claim 2, Rajan discloses the invention of claim 1 as discussed above. Rajan teaches a document analysis framework comprising a machine learning mechanism that uses training data to learn the correlation between web page features, i.e., pre-assembled collections of web pages. See, e.g., paragraph 25. Rajan does not specifically disclose assembling the document set by clustering documents into the document set based on similarity of content and/or geometric layout. However, Musgrove, paragraph 39, teaches a taxonomy interlinking system that includes a clustering module used to group, i.e., classify, a plurality of documents into clusters based on how they relate to one another, for example, using semantic resemblance. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Rajan to incorporate the teachings of Musgrove to cluster the received set of documents. Doing so would enable the document set to serve as nodes for allowing interlinking of the clusters together (Musgrove, paragraph 39).
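
For illustration only, grouping documents into a document set by similarity of content might be sketched as follows. This is a minimal sketch under stated assumptions and is not drawn from Musgrove or Rajan: the Jaccard word-overlap measure, the greedy single-pass grouping, the 0.5 threshold, and the function names are all hypothetical stand-ins for whatever resemblance measure a clustering module would actually use.

```python
def jaccard(a, b):
    """Jaccard similarity of two token sets (assumed resemblance measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def cluster_documents(docs, threshold=0.5):
    """Greedily group documents whose word overlap meets the threshold.

    Each document is compared with the first member of every existing
    cluster and joins the first cluster that is similar enough; otherwise
    it starts a new cluster. Returns clusters as lists of document indices.
    """
    clusters = []  # each cluster is a list of (index, token set) pairs
    for i, text in enumerate(docs):
        tokens = set(text.lower().split())
        for cluster in clusters:
            if jaccard(tokens, cluster[0][1]) >= threshold:
                cluster.append((i, tokens))
                break
        else:
            # No existing cluster was similar enough.
            clusters.append([(i, tokens)])
    return [[i for i, _ in cluster] for cluster in clusters]


docs = [
    "lease agreement between landlord and tenant",
    "lease agreement between owner and tenant",
    "quarterly sales report for region",
]
groups = cluster_documents(docs)
```

In this sketch the two lease agreements fall into one group and the sales report into another, so each group could then serve as a document set for the chunk identification of claim 1.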

Claim 13 is rejected under 35 U.S.C. 103 as being unpatentable over Rajan as applied to claim 1 above, and further in view of Dejean (US 2011/0276874 A1; published Nov. 10, 2011).

Regarding claim 13, Rajan discloses the invention of claim 1 as discussed above. Rajan does not disclose wherein identifying chunks based on geometric layout comprises: identifying spatial boundaries of structural chunks using machine learning inference trained on tiles of page images. However, Dejean, paragraph 3, teaches using geometric page analysis to recognize the different elements of a page as they are laid out on a document image (i.e., layout objects) based on exploiting the geometric or layout features. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Rajan to incorporate the teachings of Dejean to infer the boundaries of the tables of images in the document. Doing so would enable reducing the amount of noise data present within a received document.

Claim 17 is rejected under 35 U.S.C. 103 as being unpatentable over Rajan as applied to claim 1 above, and further in view of Fritzler, “Few-shot classification in named entity recognition task,” SAC ’19 (published Apr. 2019).

Regarding claim 17, Rajan discloses the invention of claim 1 as discussed above. Although Rajan teaches using a machine learning mechanism that uses training data to learn the correlation between web page features and a conclusion about the relevance or importance of a web page segment with those features, Rajan does not expressly disclose wherein identifying chunks based on content comprises: using few-shot Named Entity recognition techniques to identify chunks within the set of documents. However, Fritzler teaches using few-shot NER to label entities. See Introduction. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Rajan to apply few .

Claim 18 is rejected under 35 U.S.C. 103 as being unpatentable over Rajan as applied to claim 1 above, and further in view of Brugger, A DTD Extension for Document Structure Recognition, Lecture Notes in Computer Science book series (LNCS, volume 1375), May 22, 2006, pages 1-12.

Regarding claim 18, Rajan discloses the invention of claim 1 as discussed above. Rajan does not disclose receiving user corrections for incorrectly identified chunks; and improving the step of automatically identifying chunks in response to the user corrections. However, Brugger, section 3 (page 4), teaches a machine learning approach to building a document model where the model is generated interactively by the user such that if the recognition on a new document fails, corrections are manually performed and the corrected tree is passed to the learning algorithm. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Rajan to incorporate the teachings of Brugger to update the model generation based on user corrections. Doing so would enable the user to update the model interactively. 

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to SHAHID K KHAN whose telephone number is (571)270-0419. The examiner can normally be reached M-F, 9-5 EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Stephen Hong, can be reached on (571)272-4124. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.