DETAILED ACTION
This communication is in response to the application filed on 8/5/20 in which claims 1-20 were presented for examination.

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Information Disclosure Statement
The information disclosure statements (IDS) submitted on 12/1/20 and 1/21/21 are in compliance with the provisions of 37 CFR 1.97.  Accordingly, the information disclosure statements are being considered by the examiner.

Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.


The following is a quotation of 35 U.S.C. 112 (pre-AIA ), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.


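Claim 7 is rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or, for applications subject to pre-AIA 35 U.S.C. 112, the applicant) regards as the invention.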
Claim 7 recites the limitation “identifying, in an individual document, a chunk that is commonly occurring in the documents of the document set but does not appear to occur in the individual document.” The term “appear to occur” is indefinite at least because it is unclear whether the chunk does or does not occur in the claimed individual document. Appropriate correction is required.

Claim Rejections - 35 USC § 102
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.


Claims 1, 6, 7, 10-12 and 15-17 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Leal et al. (U.S. Pub. No. 2018/0300315) (“Leal”).

Regarding claim 1, Leal discloses [a] method implemented on a computer system executing instructions for analyzing and annotating documents, the method comprising: 
accessing a document set that contains a plurality of documents; (Leal, paragraph 74)
automatically identifying chunks within individual documents in the document set (a) based on the content, layout and contexts in the individual document; (Leal, paragraphs 72-84, teaches performing tokenization, chunking, and contextual model generation on the documents, and analyzing lexical patterns in the documents that are statistically relevant to the document text (a lexical pattern comprises a linguistic expression including tokens as well as formatting and morphological variations)) and (b) based on patterns of content, layout and contexts across the documents in the document set; and (Leal, paragraphs 72-84, teaches utilizing latent semantic indexing to identify a statistically relevant pattern, receiving a set of natural language texts, and generating one or more relationship patterns between word forms within the set of texts)
annotating documents in the document set based on analysis of the identified chunks from documents within the document set (Leal, paragraph 121, teaches preprocessing a corpus of documents into a structure having various fields representing aspects of the document; the fields include one or more tags associated with the document). 
Claims 19 and 20 are apparatus and CRM claims corresponding to claim 1 and are similarly rejected.

Regarding claim 6, Leal discloses the invention of claim 1 as discussed above. Leal further discloses annotating some of the identified chunks with metadata describing the chunk, wherein identifying counterpart chunks in different documents is based on similarity of the metadata (Leal, paragraph 19, teaches comparing tags between documents; Leal, paragraph 22, teaches finding correlated tags; Leal, paragraph 116, teaches utilizing tags to match expressions between documents).

Regarding claim 7, Leal discloses the invention of claim 1 as discussed above. Leal further discloses wherein identifying chunks based on patterns across the documents in the document set comprises: 
identifying, in an individual document, a chunk that is commonly occurring in the documents of the document set but does not appear to occur in the individual document (Leal, paragraph 77, teaches performing latent semantic indexing to identify statistically relevant lexical patterns across the document set (latent semantic indexing involves analyzing relationships between a set of documents and the terms they contain by producing a set of concepts related to the documents and terms)).

Regarding claim 10, Leal discloses the invention of claim 1 as discussed above. Leal further discloses wherein some of the identified chunks contain content that is descriptive of semantic roles played by other chunks (Leal, paragraph 77, teaches that the identified lexical 

Regarding claim 11, Leal discloses the invention of claim 1 as discussed above. Leal further discloses annotating some of the identified chunks with a datatype of the chunk and a semantic role of the chunk (Leal, paragraph 121, teaches converting the documents into a structured format, for example, a document is preprocessed into a structure having various fields representing aspects of the documents; these fields include the generated tags associated with the document). 

Regarding claim 12, Leal discloses the invention of claim 1 as discussed above. Leal further discloses wherein identifying chunks based on layout comprises: 
grouping line-oriented text into structural chunks, wherein the grouping is based on word shapes, first and last tokens, formatting characteristics, and/or punctuation (Leal, paragraph 77, teaches that the identified lexical patterns are based on identifying linguistic expressions including tokens such as verbs, adjectives, nouns, adverbs, and combinations of these, as well as formatting (bold, caps, etc.) and morphological (verb tenses, plural and singulars, etc.) variations).

Regarding claim 15, Leal discloses the invention of claim 1 as discussed above. Leal further discloses wherein identifying chunks based on layout comprises: 
identifying structural chunks based on layout of non-text structural features, wherein the non-text structural features comprise at least one of a figure, a table, a sidebar, a footnote, and a page header or footer (Leal, paragraphs 37, 38, teaches removing tables or images appearing within a document).

Regarding claim 16, Leal discloses the invention of claim 1 as discussed above. Leal further discloses wherein identifying chunks based on content comprises: identifying chunks using AI techniques for topic estimation (Leal, paragraph 17, teaches a topic modeling approach to detect lexico-statistic patterns of abstract topics in the text; Leal, paragraph 47, further teaches a semantic topic model created using machine learning algorithms).

Regarding claim 17, Leal discloses the invention of claim 1 as discussed above. Leal further discloses wherein identifying chunks based on content comprises: using few-shot Named Entity recognition techniques to identify chunks within the set of documents (Leal, FIG. 4, paragraphs 81, 87, 90, teaches a dictionary approach to identify new tags for a document).

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or non-obviousness.
Claims 2-5 are rejected under 35 U.S.C. 103 as being unpatentable over Leal as applied to claim 1 above, and further in view of Musgrove (U.S. Pub. No. 2006/0235870) (“Musgrove”).

Regarding claim 2, Leal discloses the invention of claim 1 as discussed above. Leal, paragraph 74, teaches receiving a set of documents. Yet, Leal does not disclose assembling the document set by clustering documents into the document set based on similarity of content and/or layout. However, Musgrove, paragraph 39, teaches a taxonomy interlinking system that includes a clustering module used to group, i.e., classify, a plurality of documents into clusters. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Leal to incorporate the teachings of Musgrove to cluster the received set of documents. Doing so would enable the document set to serve as nodes for allowing interlinking of the clusters together (Musgrove, paragraph 39).

Regarding claim 3, Leal discloses the invention of claim 1 as discussed above. Leal, paragraphs 72-84, teaches performing tokenization, chunking, and contextual model generation on the documents. Yet, Leal does not disclose wherein automatically identifying chunks within individual documents in the document set is further (c) based on identifying semantic roles within the individual document; and (d) based on identifying counterpart chunks in different documents in the document set, wherein counterpart chunks play a same semantic role in different documents. However, Musgrove, paragraph 36, teaches a taxonomy interlinking system that allows the node names and the texts of the electronic documents to be analyzed based on how the words are used in context by extracting and comparing a vector of semantic features, including relations of nouns to verbs as variously an actor, object, instrument, or other semantic role, and differentiate a particular word’s pattern of occurrences in the electronic documents. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Leal to incorporate the teachings of Musgrove to analyze the semantic role of the lexical strings in the received set of documents. Doing so would enable analyzing the documents based on the context, rather than merely analyzing the text based on definitions of the words (Musgrove, paragraph 36).

Regarding claim 4, Leal, in view of Musgrove, discloses the invention of claim 3 as discussed above. Leal, paragraphs 72-84, teaches performing tokenization, chunking, and contextual model generation on the documents, analyzes lexical patterns in the documents that are statistically relevant to the document text (lexical pattern comprises a linguistic expression including tokens as well as formatting and morphological variations). Yet, Leal does not particularly disclose wherein identifying counterpart chunks in different documents comprises: identifying content that is different in different documents but occurs within substantially similar contexts within the different documents. However, Musgrove, paragraph 36, teaches a taxonomy interlinking system that allows the node names and the texts of the electronic documents to be analyzed based on how the words are used in context by extracting and comparing a vector of semantic features, including relations of nouns to verbs as variously an actor, object, instrument, or other semantic role, and differentiate a particular word’s pattern of occurrences in the electronic documents. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Leal to incorporate the teachings of Musgrove to analyze the semantic role of the lexical strings in the received set of documents. Doing so would enable analyzing the documents based on the context, rather than merely analyzing the text based on definitions of the words (Musgrove, paragraph 36).

Regarding claim 5, Leal, in view of Musgrove, discloses the invention of claim 3 as discussed above. Leal further discloses wherein identifying counterpart chunks in different documents comprises: identifying content that is substantially the same in different documents (Leal, paragraphs 72-84, teaches performing tokenization, chunking, and contextual model generation on the documents, and analyzing lexical patterns in the documents that are statistically relevant to the document text (a lexical pattern comprises a linguistic expression including tokens as well as formatting and morphological variations)).

Claims 8 and 9 are rejected under 35 U.S.C. 103 as being unpatentable over Leal as applied to claim 1 above, and further in view of Carrier et al. (U.S. Pub. No. 2016/0070693) (“Carrier”).

Regarding claim 8, Leal discloses the invention of claim 1 as discussed above. Leal further discloses wherein the identified chunks comprise: structural chunks that contain content comprising structures within the layout of the documents (Leal, paragraphs 37, 38, teaches identifying tables or images by parsing the formatting content of the document). 
Leal does not disclose wherein the identified chunks comprise: field chunks that contain content within the documents suitable for use as fields in document templates. However, Carrier, paragraphs 8, 9, teaches applying natural language processing to unstructured data within a target form to identify elements of a form structure. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Leal to incorporate the teachings of Carrier to apply natural language processing to the documents to identify form elements. Doing so would enable detecting form criteria without relying upon headers (Carrier, paragraph 7).

Regarding claim 9, Leal, in view of Carrier, discloses the invention of claim 8 as discussed above. Leal does not disclose wherein some of the field chunks are hierarchical and contain other chunks as sub-chunks. However, Carrier, paragraphs 8, 9, teaches applying natural language processing to unstructured data within a target form to identify elements of a hierarchical form structure. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Leal to incorporate the teachings of Carrier to apply natural language processing to the documents to identify form elements. Doing so would enable detecting form criteria without relying upon headers (Carrier, paragraph 7).

Claims 13 and 14 are rejected under 35 U.S.C. 103 as being unpatentable over Leal as applied to claim 1 above, and further in view of Dejean (U.S. Pub. No. 2011/0276874) (“Dejean”).

Regarding claim 13, Leal discloses the invention of claim 1 as discussed above. Leal, paragraphs 37, 38, teaches removing tables or images appearing within a document. Yet, Leal does not disclose wherein identifying chunks based on layout comprises: identifying spatial boundaries of structural chunks using machine learning inference trained on tiles of page images. However, Dejean, paragraph 3, teaches using geometric page analysis to recognize the different elements of a page as they are laid out on a document image (i.e., layout objects) based on exploiting the geometric or layout features. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Leal to incorporate the teachings of Dejean to infer the boundaries of the tables or images in the document. Doing so would enable reducing the amount of noise data present within a received document (Leal, paragraph 37).

Regarding claim 14, Leal discloses the invention of claim 1 as discussed above. Leal, paragraphs 37, 38, teaches removing tables or images appearing within a document. Yet, Leal does not disclose wherein identifying chunks based on layout comprises: identifying spatial boundaries of structural chunks using artificial intelligence-based visual recognition of geometric patterns of the layout. However, Dejean, paragraph 3, teaches using geometric page analysis to recognize the different elements of a page as they are laid out on a document image (i.e., layout objects) based on exploiting the geometric or layout features. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Leal to incorporate the teachings of Dejean to infer the boundaries of the tables or images in the document. Doing so would enable reducing the amount of noise data present within a received document (Leal, paragraph 37).

Claim 18 is rejected under 35 U.S.C. 103 as being unpatentable over Leal as applied to claim 1 above, and further in view of Brugger, R. et al., A DTD Extension for Document Structure (“Brugger”).

Regarding claim 18, Leal discloses the invention of claim 1 as discussed above. Leal, paragraphs 72-84, teaches performing tokenization, chunking, and contextual model generation on the documents. Yet, Leal does not disclose receiving user corrections for incorrectly identified chunks; and improving the step of automatically identifying chunks in response to the user corrections. However, Brugger, section 3 (page 4), teaches a machine learning approach to building a document model where the model is generated interactively by the user such that if the recognition on a new document fails, corrections are manually performed and the corrected tree is passed to the learning algorithm. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Leal to incorporate the teachings of Brugger to update the model generation based on user corrections. Doing so would enable the user to update the model interactively (Brugger). 

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure: Zuev (see PTO892) teaches cross-language text clustering; Oro (see PTO892) teaches object extraction from presentation-oriented documents using a semantic and spatial approach; Mansfield (see PTO892) teaches a method for efficient cluster analysis; Clar (see PTO892) teaches a method to attribute metadata to preexisting documents; Matsumoto (see PTO892); Young (see PTO892) teaches a method of constructing a document type definition from a set of structured documents; Knudson (see PTO892) teaches a method for annotating and linking electronic documents; Dulam (see PTO892) teaches identifying homogenous clusters; Dawson (see PTO892) teaches a method for generating a document representation; Viola (see PTO892) teaches grammatical parsing of document visual structures.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to SHAHID K KHAN whose telephone number is (571)270-0419. The examiner can normally be reached M-F, 9-5 EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Stephen Hong, can be reached at (571)272-4124. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access 






/SHAHID K KHAN/Examiner, Art Unit 2178