DETAILED ACTION
This communication is in response to the amendment filed 9/13/22 in which claims 1, 3, 19, and 20 were amended. Claims 1-11 and 13-20 are pending.

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Information Disclosure Statement
The information disclosure statement (IDS) submitted on 9/13/22 is in compliance with the provisions of 37 CFR 1.97.  Accordingly, the information disclosure statement is being considered by the examiner.

Response to Arguments
Applicant’s arguments with respect to claims 1, 19, and 20 have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.

Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.


Claims 1-11 and 13-20 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. The independent claims recite a method implemented on a computer system executing instructions for analyzing and annotating documents, the method comprising accessing a set of documents, automatically identifying chunks within individual documents and their semantic roles in a transaction described by the documents, wherein the chunks and the semantic roles are identified based on the content, geometric layout and contexts in the individual document and based on patterns of content, geometric layout and context across the documents in the document set. The independent claims further recite annotating the documents based on an analysis of the identified chunks wherein the annotations include locations of the identified chunks and the semantic roles played by the identified chunks at those locations. 
The courts consider a mental process (thinking) that “can be performed in the human mind, or by a human using a pen and paper” to be an abstract idea. CyberSource Corp. v. Retail Decisions, Inc., 654 F.3d 1366, 1372, 99 USPQ2d 1690, 1695 (Fed. Cir. 2011). As the Federal Circuit explained, “methods which can be performed mentally, or which are the equivalent of human mental work, are unpatentable abstract ideas--the ‘basic tools of scientific and technological work’ that are open to all.” 654 F.3d at 1371, 99 USPQ2d at 1694 (citing Gottschalk v. Benson, 409 U.S. 63, 175 USPQ 673 (1972)). Accordingly, the “mental processes” abstract idea grouping is defined as concepts performed in the human mind, and examples of mental processes include observations, evaluations, judgments, and opinions. The courts do not distinguish between claims that recite mental processes performed by humans and claims that recite mental processes performed on a computer. As the Federal Circuit has explained, “[c]ourts have examined claims that required the use of a computer and still found that the underlying, patent-ineligible invention could be performed via pen and paper or in a person’s mind.” Versata Dev. Group v. SAP Am., Inc., 793 F.3d 1306, 1335, 115 USPQ2d 1681, 1702 (Fed. Cir. 2015).
The limitations of the independent claims, as drafted, are processes that, under their broadest reasonable interpretation, cover performance of the limitations in the mind but for the recitation of generic computer components. That is, other than reciting “computer system executing instructions,” nothing in the claim elements precludes the steps from practically being performed in the mind. For example, but for the “computer system executing instructions” language, the limitation of accessing a document set that contains a plurality of documents in the context of these claims encompasses the user manually accessing or obtaining a set of documents. Similarly, the limitation of automatically identifying chunks within individual documents and their semantic roles in a transaction described by the documents, wherein the chunks and the semantic roles are identified based on the content, geometric layout and contexts in the individual document and based on patterns of content, geometric layout and context across the documents in the document set, as drafted, is a process that, under its broadest reasonable interpretation, covers performance of the limitation in the mind. Further, similarly, the limitation of annotating the documents based on an analysis of the identified chunks wherein the annotations include locations of the identified chunks and the semantic roles played by the identified chunks at those locations, as drafted, is a process that, under its broadest reasonable interpretation, covers performance of the limitation in the mind. If a claim limitation, under its broadest reasonable interpretation, covers performance of the limitation in the mind but for the recitation of generic computer components, then it falls within the “Mental Processes” grouping of abstract ideas. Accordingly, the independent claims recite an abstract idea.
The dependent claim limitations of assembling the document set by clustering documents based on similarity of content and geometric layout (claim 2), identifying counterpart chunks in different documents where the chunks are different but play the same semantic role (claim 3), identifying content that is different in different documents but occurs within substantially similar contexts within the different documents (claim 4), identifying content that is substantially the same in different documents (claim 5), annotating some of the identified chunks with metadata describing the chunk wherein identifying the counterpart chunks in different documents is based on a similarity of the metadata (claim 6), identifying a chunk in an individual document that is commonly occurring in the documents but does not occur in the individual document (claim 7), wherein the identified chunks comprise field chunks that contain content within the documents suitable for use as fields in document templates and structural chunks that contain content comprising structures within the geometric layout of the documents (claim 8), wherein some of the field chunks are hierarchical and contain other chunks as sub-chunks (claim 9), wherein some of the identified chunks contain content that is descriptive of semantic roles played by other chunks (claim 10), wherein the annotations further comprise datatypes of the identified chunks (claim 11), wherein identifying chunks based on geometric layout comprises identifying spatial boundaries of structural chunks using machine learning inference trained on tiles of page images (claim 13), wherein identifying chunks based on geometric layout comprises identifying spatial boundaries of structural chunks using artificial intelligence-based visual recognition of geometric patterns of the geometric layout (claim 14), wherein identifying chunks based on geometric layout comprises identifying structural chunks based on geometric layout of non-text structural features, wherein the non-text structural features comprise at least one of a figure, a table, a sidebar, a footnote, and a page header or footer (claim 15), wherein identifying chunks based on content comprises identifying chunks using AI techniques for topic estimation (claim 16), wherein identifying chunks based on content comprises using few-shot Named Entity recognition techniques to identify chunks within the set of documents (claim 17), and receiving user corrections for incorrectly identified chunks and improving the step of automatically identifying chunks in response to the user corrections (claim 18), as drafted, are processes that, under their broadest reasonable interpretation, cover performance of the limitations in the mind but for the recitation of high-level computer processing techniques. That is, other than reciting “machine learning inference,” “artificial intelligence-based visual recognition,” “AI techniques,” and “few-shot Named Entity recognition techniques,” nothing in the claim elements precludes the steps from practically being performed in the mind. If a claim limitation, under its broadest reasonable interpretation, covers performance of the limitation in the mind but for the recitation of generic computer components, then it falls within the “Mental Processes” grouping of abstract ideas. Accordingly, the dependent claims recite an abstract idea.
This judicial exception is not integrated into a practical application. The additional elements are recited at a high level of generality such that they amount to no more than mere instructions to apply the exception using generic computer components. Accordingly, the additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea.
The claims do not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to integration of the abstract idea into a practical application, the additional elements amount to no more than mere instructions to apply the exception using generic computer components. Mere instructions to apply an exception using a generic computer component cannot provide an inventive concept. Accordingly, the claims are not patent eligible.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or non-obviousness.
Claims 1, 3-11, 14-16, 19, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Rajan (US 2012/0005686 A1; published Jan. 5, 2012) in view of Byron (US 2015/0178853 A1; published Jun. 25, 2015).
Regarding claim 1, Rajan discloses [a] method implemented on a computer system executing instructions for analyzing and annotating documents, the method comprising: 
accessing a document set that contains a plurality of documents; (see paragraph 8 (set of documents))
automatically identifying chunks within individual documents in the document set and also automatically identifying the semantic roles played by the chunks in a transaction described by the individual documents; (see paragraphs 8 (machine learning model learns the characteristics (features) of a set of documents that are correlated), 15 (web page is decomposed into a set of functional segments, takes each page segment and assigns a functional label of a type), 23 (boiler plate segment, e.g., privacy policy statement, disclaimer, copyright, or other legal disclosures, containing elements that appear on a majority of the pages on a particular site))
wherein both identifying the chunks and identifying their semantic roles are (a) based on the content, geometric layout and contexts in the individual document; and (b) based on patterns of content, geometric layout and contexts across the documents in the document set; and (see paragraphs 15 (functional labels provide information about the role of a segment on the page), 19 (main content is usually located in the middle of the page and can be recognized by HTML tags that specify a long and fat text box containing a large fraction of the total number of text sentences and the use of a variety of font types), 20 (user generated content includes content such as user comments, forum posts, posts on boards, product reviews, is obtained from users interacting with the web page rather than provided by the initial author, is often found at the bottom of the page and is recognized by HTML tags that specify repeated element but with different content))
annotating documents in the document set based on analysis of the identified chunks from documents within the document set, wherein the annotations include locations of the identified chunks and the semantic roles played by the identified chunks at those locations (see paragraph 15 (a functional label is assigned to each segment, the functional label (e.g., main content, user-generated content) providing information about the role of a segment on the page)). 
Rajan does not expressly disclose identifying the semantic roles played by the chunks in a transaction described by the individual documents. However, Byron teaches analyzing a corpus of unstructured documents to identify one or more asset transfer flow relationships between entities and outputting the results of the analysis. Paragraph 5. Such monetary flow relations can be thought of as a specialized kind of semantic role restrictions. Paragraph 22. After generating a seed list of related tuples that exhibit an asset transfer flow relation, one or more documents of one or more corpora may be analyzed to identify sentences or portions of content (e.g., tables, titles, metadata, etc.) containing the arguments of the tuples. Paragraph 33. Clusters of generalization from instance to type may be generated to formulate selectional restrictions for the type of entity that can participate in each relationship. Paragraph 38. Thus, containment relations and semantic role restrictions are populated from relations extracted from unstructured data. Paragraph 39. The semantic role restrictions may have the form <predicate>, <financial entity>, <payer>, and <payee>. Paragraph 40. These restrictions may be used to extract these types of part-whole relations and semantic role and the asset transfer flow model may be used to process documents or other unstructured data in a corpus to analyze the documents to identify candidate answers in a QA system, e.g., information related to school expenditures. Paragraphs 45-46. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Rajan to incorporate the teachings of Byron to analyze and identify semantic roles related to an asset transfer flow in a corpus of web documents, at least because doing so would enable generating knowledge resources for subsequent analysis. Byron, paragraph 18.
Claims 19 and 20 are apparatus and computer-readable medium (CRM) claims corresponding to claim 1 and are similarly rejected.

Regarding claim 3, Rajan, in view of Byron, discloses the invention of claim 1 as discussed above. Rajan does not particularly disclose wherein automatically identifying chunks within individual documents in the document set is further (c) based on identifying counterpart chunks in different documents in the document set, wherein counterpart chunks are different chunks in different documents that play a same semantic role in different documents. However, Byron teaches analyzing a corpus of unstructured documents to identify one or more asset transfer flow relationships between entities and outputting the results of the analysis. Paragraph 5. Such monetary flow relations can be thought of as a specialized kind of semantic role restrictions (e.g., property owners pay property tax to municipalities, airlines pay gate fees to airports, individuals pay ticket fees to airlines, etc.). Paragraph 22. The lexical indications of asset transfer relations between entities are heterogeneous across the set of relations. Paragraph 26. The words used to express financial or asset transfers between particular payers/payees change throughout the graph or model, although the underlying quality of the relation (the fact that it represents an asset transfer) is homogeneous throughout the graph or model. Id. After generating a seed list of related tuples that exhibit an asset transfer flow relation, one or more documents of one or more corpora may be analyzed to identify sentences or portions of content (e.g., tables, titles, metadata, etc.) containing the arguments of the tuples. Paragraph 33. From the resources gathered as a part of the mining of the portions of content of the documents of the corpus, clusters of wordings used for particular payment types may be generated, e.g., a government authority may “levy payments” for gate fees or “collect” a tax for fuel use, etc. Paragraph 37.
Clusters of generalization from instance to type may be generated to formulate selectional restrictions for the type of entity that can participate in each relationship. Paragraph 38. Thus, containment relations and semantic role restrictions are populated from relations extracted from unstructured data. Paragraph 39. The semantic role restrictions may have the form <predicate>, <financial entity>, <payer>, and <payee>. Paragraph 40. These restrictions may be used to extract these types of part-whole relations and semantic role and the asset transfer flow model may be used to process documents or other unstructured data in a corpus to analyze the documents to identify candidate answers in a QA system, e.g., information related to school expenditures. Paragraphs 45-46. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Rajan to incorporate the teachings of Byron to analyze and identify semantic roles related to an asset transfer flow in a corpus of web documents, at least because doing so would enable generating knowledge resources for subsequent analysis. Byron, paragraph 18.

Regarding claim 4, Rajan, in view of Byron, discloses the invention of claim 3 as discussed above. Rajan teaches that a machine learning module for each segment type is generated by analyzing all of the training data segments that have been classified with the same segment type; features that are relevant to the presentation of the segments on the screen are extracted into a feature vector and correlated with the functional category; a classifier takes a segment and a set of segment features as input and outputs a probability that the segment with these features should be classified as the particular type. Paragraph 32. Yet, Rajan does not particularly disclose wherein identifying counterpart chunks in different documents comprises: identifying content that is different in different documents but occurs within substantially similar contexts within the different documents. However, Byron teaches analyzing a corpus of unstructured documents to identify one or more asset transfer flow relationships between entities and outputting the results of the analysis. Paragraph 5. Such monetary flow relations can be thought of as a specialized kind of semantic role restrictions (e.g., property owners pay property tax to municipalities, airlines pay gate fees to airports, individuals pay ticket fees to airlines, etc.). Paragraph 22. The lexical indications of asset transfer relations between entities are heterogeneous across the set of relations. Paragraph 26. The words used to express financial or asset transfers between particular payers/payees change throughout the graph or model, although the underlying quality of the relation (the fact that it represents an asset transfer) is homogeneous throughout the graph or model. Id. After generating a seed list of related tuples that exhibit an asset transfer flow relation, one or more documents of one or more corpora may be analyzed to identify sentences or portions of content (e.g., tables, titles, metadata, etc.) containing the arguments of the tuples. Paragraph 33. Clusters of generalization from instance to type may be generated to formulate selectional restrictions for the type of entity that can participate in each relationship. Paragraph 38. Thus, containment relations and semantic role restrictions are populated from relations extracted from unstructured data. Paragraph 39. The semantic role restrictions may have the form <predicate>, <financial entity>, <payer>, and <payee>. Paragraph 40. These restrictions may be used to extract these types of part-whole relations and semantic role and the asset transfer flow model may be used to process documents or other unstructured data in a corpus to analyze the documents to identify candidate answers in a QA system, e.g., information related to school expenditures. Paragraphs 45-46. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Rajan to incorporate the teachings of Byron to analyze and identify semantic roles related to an asset transfer flow in a corpus of web documents, at least because doing so would enable generating knowledge resources for subsequent analysis. Byron, paragraph 18.

Regarding claim 5, Rajan, in view of Byron, discloses the invention of claim 3 as discussed above. Rajan does not specifically disclose wherein identifying counterpart chunks in different documents comprises: identifying content that is substantially the same in different documents. However, Byron teaches analyzing a corpus of unstructured documents to identify one or more asset transfer flow relationships between entities and outputting the results of the analysis. Paragraph 5. Such monetary flow relations can be thought of as a specialized kind of semantic role restrictions (e.g., property owners pay property tax to municipalities, airlines pay gate fees to airports, individuals pay ticket fees to airlines, etc.). Paragraph 22. The lexical indications of asset transfer relations between entities are heterogeneous across the set of relations. Paragraph 26. The words used to express financial or asset transfers between particular payers/payees change throughout the graph or model, although the underlying quality of the relation (the fact that it represents an asset transfer) is homogeneous throughout the graph or model. Id. After generating a seed list of related tuples that exhibit an asset transfer flow relation, one or more documents of one or more corpora may be analyzed to identify sentences or portions of content (e.g., tables, titles, metadata, etc.) containing the arguments of the tuples. Paragraph 33. Clusters of generalization from instance to type may be generated to formulate selectional restrictions for the type of entity that can participate in each relationship. Paragraph 38. Thus, containment relations and semantic role restrictions are populated from relations extracted from unstructured data. Paragraph 39. The semantic role restrictions may have the form <predicate>, <financial entity>, <payer>, and <payee>. Paragraph 40.
These restrictions may be used to extract these types of part-whole relations and semantic role and the asset transfer flow model may be used to process documents or other unstructured data in a corpus to analyze the documents to identify candidate answers in a QA system, e.g., information related to school expenditures. Paragraphs 45-46. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Rajan to incorporate the teachings of Byron to analyze and identify semantic roles related to an asset transfer flow in a corpus of web documents, at least because doing so would enable generating knowledge resources for subsequent analysis. Byron, paragraph 18.

Regarding claim 6, Rajan, in view of Byron, discloses the invention of claim 1 as discussed above. Rajan further discloses annotating some of the identified chunks with metadata describing the chunk, wherein identifying counterpart chunks in different documents is based on similarity of the metadata (see paragraphs 25 (machine learning mechanism uses training data to learn the correlation between web page features), 26 (machine learning mechanism is used to create classifier modules that recognize the features associated with a particular category label), 29 (the presentation features in the HTML elements and attributes associated with each segment are extracted and stored as metadata associated with each segment), 30 (each segment is associated with metadata that includes a feature vector that describes features of the segment)).

Regarding claim 7, Rajan, in view of Byron, discloses the invention of claim 1 as discussed above. Rajan further discloses wherein identifying chunks based on patterns across the documents in the document set comprises: 
identifying, in an individual document, a chunk that is commonly occurring in the documents of the document set but does not occur in the individual document (see paragraphs 30-35 (a segment of a web page is classified to belong (or not to belong) to a particular segment type based on a machine learning model trained on a set of web pages annotated by human editors)).

Regarding claim 8, Rajan, in view of Byron, discloses the invention of claim 1 as discussed above. Rajan further discloses wherein the identified chunks comprise: field chunks that contain content within the documents suitable for use as fields in document templates; and structural chunks that contain content comprising structures within the geometric layout of the documents (see FIG. 4, paragraph 29 (elements and attributes define both content type and presentation layout, initially partitioned into segments, the presentation features in the HTML elements and attributes associated with each segment are extracted and stored as metadata associated with each segment; Table 1 – location of segment as displayed on the screen, height and length of text box, user input forms, boiler plate, etc.)).

Regarding claim 9, Rajan, in view of Byron, discloses the invention of claim 8 as discussed above. Rajan further discloses wherein some of the field chunks are hierarchical and contain other chunks as sub-chunks (see FIG. 4; paragraph 29 (elements and attributes define both content type and presentation layout, initially partitioned into segments, the presentation features in the HTML elements and attributes associated with each segment are extracted and stored as metadata associated with each segment; Table 1 - location of segment as displayed on the screen, height and length of text box, user input forms, boiler plate and advertisement phrases, element type(s) (site navigation)); see also paragraph 33 (segmentation constructs a DOM tree from the web page and works to group each of the nodes in the DOM tree into a segment; location-based segmentation defines regions of a web page such as top, middle, bottom, left, and right and uses presentation information in HTML tag attributes to group together portions of the web page)).

Regarding claim 10, Rajan, in view of Byron, discloses the invention of claim 1 as discussed above. Rajan does not specifically disclose wherein some of the identified chunks contain content that is descriptive of semantic roles played by other chunks. However, Byron teaches analyzing a corpus of unstructured documents to identify one or more asset transfer flow relationships between entities and outputting the results of the analysis. Paragraph 5. Such monetary flow relations can be thought of as a specialized kind of semantic role restrictions (e.g., property owners pay property tax to municipalities, airlines pay gate fees to airports, individuals pay ticket fees to airlines, etc.). Paragraph 22. The lexical indications of asset transfer relations between entities are heterogeneous across the set of relations. Paragraph 26. The words used to express financial or asset transfers between particular payers/payees change throughout the graph or model, although the underlying quality of the relation (the fact that it represents an asset transfer) is homogeneous throughout the graph or model. Id. After generating a seed list of related tuples that exhibit an asset transfer flow relation, one or more documents of one or more corpora may be analyzed to identify sentences or portions of content (e.g., tables, titles, metadata, etc.) containing the arguments of the tuples. Paragraph 33. From the resources gathered as a part of the mining of the portions of content of the documents of the corpus, clusters of wordings used for particular payment types may be generated, e.g., a government authority may “levy payments” for gate fees or “collect” a tax for fuel use, etc. Paragraph 37. Clusters of generalization from instance to type may be generated to formulate selectional restrictions for the type of entity that can participate in each relationship. Paragraph 38.
Thus, containment relations and semantic role restrictions are populated from relations extracted from unstructured data. Paragraph 39. The semantic role restrictions may have the form <predicate>, <financial entity>, <payer>, and <payee>. Paragraph 40. These restrictions may be used to extract these types of part-whole relations and semantic role and the asset transfer flow model may be used to process documents or other unstructured data in a corpus to analyze the documents to identify candidate answers in a QA system, e.g., information related to school expenditures. Paragraphs 45-46. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Rajan to incorporate the teachings of Byron to analyze and identify semantic roles related to an asset transfer flow in a corpus of web documents, at least because doing so would enable generating knowledge resources for subsequent analysis. Byron, paragraph 18.

Regarding claim 11, Rajan, in view of Byron, discloses the invention of claim 1 as discussed above. Rajan further discloses wherein the annotations further comprise datatypes of the identified chunks (see Table 1 (segment features of a particular segment role include the presence of user input forms such as search boxes, radio buttons, submit buttons, and height and length of text boxes)). 

Regarding claim 14, Rajan, in view of Byron, discloses the invention of claim 1 as discussed above. Rajan further discloses wherein identifying chunks based on geometric layout comprises: identifying spatial boundaries of structural chunks using artificial intelligence-based visual recognition of geometric patterns of the layout (see FIG. 2, Table 1, paragraph 29 (examples of features that can be extracted and later correlated with functional categories, also included in the table are some heuristics for establishing a correlation between feature metadata and a functional category), paragraph 31 (main content is in the middle of the page and describes what Riley’s Place is. The Links to Donor’s Web Sites, classified as content pointers 240, are in a smaller font on the right margin of the page, and a boiler plate on the bottom left), paragraph 33 (location based segmentation defines regions; vision based segmentation breaks the web page down into segments based on organizing HTML tags)).

Regarding claim 15, Rajan, in view of Byron, discloses the invention of claim 1 as discussed above. Rajan further discloses wherein identifying chunks based on geometric layout comprises: identifying structural chunks based on geometric layout of non-text structural features, wherein the non-text structural features comprise at least one of a figure, a table, a sidebar, a footnote, and a page header or footer (see Table 1 (features include the presence of boiler plate segments)).

Regarding claim 16, Rajan, in view of Byron, discloses the invention of claim 1 as discussed above. Rajan further discloses wherein identifying chunks based on content comprises: identifying chunks using AI techniques for topic estimation (see paragraph 25).

Claim 2 is rejected under 35 U.S.C. 103 as being unpatentable over Rajan and Byron as applied to claim 1 above, and further in view of Musgrove (US 2006/0235870 A1; published Oct. 19, 2006).
Regarding claim 2, Rajan, in view of Byron, discloses the invention of claim 1 as discussed above. Rajan teaches a document analysis framework comprising a machine learning mechanism that uses training data to learn the correlation between web page features, i.e., pre-assembled collections of web pages. See, e.g., paragraph 25. Rajan does not specifically disclose assembling the document set by clustering documents into the document set based on similarity of content and/or geometric layout. However, Musgrove, paragraph 39, teaches a taxonomy interlinking system that includes a clustering module used to group, i.e., classify, a plurality of documents into clusters based on how they relate to one another, for example, using semantic resemblance. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Rajan to incorporate the teachings of Musgrove to cluster the received set of documents. Doing so would enable the document set to serve as nodes for allowing interlinking of the clusters together (Musgrove, paragraph 39).

Claim 13 is rejected under 35 U.S.C. 103 as being unpatentable over Rajan and Byron as applied to claim 1 above, and further in view of Dejean (US 2011/0276874 A1; published Nov. 10, 2011).
Regarding claim 13, Rajan, in view of Byron, discloses the invention of claim 1 as discussed above. Rajan does not disclose wherein identifying chunks based on geometric layout comprises: identifying spatial boundaries of structural chunks using machine learning inference trained on tiles of page images. However, Dejean, paragraph 3, teaches using geometric page analysis to recognize the different elements of a page as they are laid out on a document image (i.e., layout objects) based on exploiting the geometric or layout features. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Rajan to incorporate the teachings of Dejean to infer the boundaries of the tables of images in the document. Doing so would enable reducing the amount of noise data present within a received document.

Claim 17 is rejected under 35 U.S.C. 103 as being unpatentable over Rajan and Byron as applied to claim 1 above, and further in view of Fritzler, “Few-shot classification in named entity recognition task,” SAC ’19 (published Apr. 2019).
Regarding claim 17, Rajan, in view of Byron, discloses the invention of claim 1 as discussed above. Although Rajan teaches using a machine learning mechanism that uses training data to learn the correlation between web page features and a conclusion about the relevance or importance of a web page segment with those features, Rajan does not expressly disclose wherein identifying chunks based on content comprises: using few-shot Named Entity recognition techniques to identify chunks within the set of documents. However, Fritzler teaches using few-shot NER to label entities. See Introduction. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Rajan to apply few-shot NER to the learning of segment features and their classification. Doing so would enable identification of instances with an extremely small number of labeled examples.

Claim 18 is rejected under 35 U.S.C. 103 as being unpatentable over Rajan and Byron as applied to claim 1 above, and further in view of Brugger, A DTD Extension for Document Structure Recognition, Lecture Notes in Computer Science book series (LNCS, volume 1375), May 22, 2006, pages 1-12.
Regarding claim 18, Rajan, in view of Byron, discloses the invention of claim 1 as discussed above. Rajan does not disclose receiving user corrections for incorrectly identified chunks; and improving the step of automatically identifying chunks in response to the user corrections. However, Brugger, section 3 (page 4), teaches a machine learning approach to building a document model where the model is generated interactively by the user such that if the recognition on a new document fails, corrections are manually performed and the corrected tree is passed to the learning algorithm. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Rajan to incorporate the teachings of Brugger to update the model generation based on user corrections. Doing so would enable the user to update the model interactively. 

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to SHAHID K KHAN whose telephone number is (571)270-0419. The examiner can normally be reached M-F, 9-5 EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Stephen Hong, can be reached at (571)272-4124. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/SHAHID K KHAN/Examiner, Art Unit 2178