DETAILED ACTION

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.


The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.


Claims 1-22 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention.

	Claims 1-22 recite variations on the concept of “segment pairs”.  These variations include “pairs of character segments”, “identified segment pairs”, “pair of segments”, and “identified pair of character segments”.  It appears these are all intended to refer to the same items, but the inconsistent usage throughout the set of claims makes it unclear if differences are intended and what treatment in claim interpretation should be given.  The examiner recommends that a standardized phrase be selected and the claims amended to consistently use it throughout the claims for clarity.

	Independent claims 1 and 12 recite a preamble with an intended use of “for classifying semi-structured documents”.  The preamble has no effect on the rest of the claim and is given no weight in claim interpretation.  The remaining claims each depend, directly or indirectly, on claim 1 or claim 12 and are rejected for the same reason.
First, the documents in the preamble are never referenced again, because the first line of the body of each claim recites “accessing a plurality of documents” and does not refer back to the documents previously introduced.
Second, the documents in the preamble have the limitation “semi-structured” applied to them.  As discussed in the first point, these documents are never referred to elsewhere in the claim.  Therefore, none of the claims (except claims 10 and 21, which explicitly recite HTML) is interpreted as being limited to only semi-structured documents.
When reading the preamble in the context of the entire claim, the recitation “classifying semi-structured documents” is not limiting because the body of the claim describes a complete invention and the language recited solely in the preamble does not provide any distinct definition of any of the claimed invention’s limitations. Thus, the preamble of the claim(s) is not considered a limitation and is of no significance to claim construction. See Pitney Bowes, Inc. v. Hewlett-Packard Co., 182 F.3d 1298, 1305, 51 USPQ2d 1161, 1165 (Fed. Cir. 1999). See MPEP § 2111.02.

	Claims 4 and 15 are rejected because it is unclear which “machine-learning model” is being stored, since the claims never recite creating one.  Even if a model were implicitly created by the training step, the claims do not draw any connection between the ML information extractor and the model.

	Claims 7 and 18 are rejected because it is unclear how two segments can overlap.  This would only be feasible if the description provided in Specification paragraph [0020] were recited in the claim.  Although the claims are interpreted in light of the specification, limitations from the specification are not read into the claims.  See In re Van Geuns, 988 F.2d 1181, 26 USPQ2d 1057 (Fed. Cir. 1993).

Claim Rejections - 35 USC § 102
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.


Claims 1-22 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Privault et al. (US 2010/0312725 A1, hereinafter “Privault”).

	Regarding claim 1, Privault teaches 
A method for classifying semi-structured documents, the method comprising: 
accessing a plurality of documents [Privault, ¶ 0043, set of documents]; 
identifying in each of the plurality of documents pairs of character segments, and generating a map of unique identified segment pairs across all documents in the plurality of documents [Privault, ¶ 0064, multiword expressions evaluated as vectors]; 
generating for each of the plurality of documents a respective feature vector based on one or more unique pair of segments in the map that are also identified in the document [Privault, ¶¶ 0063 & 0064, representation based on word or multiword expressions]; and 
clustering the plurality of documents into a plurality of clusters, using the feature vectors [Privault, ¶ 0061].
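
For illustration only, the four steps recited in claim 1 can be sketched as follows. This is a minimal sketch under stated assumptions and is not drawn from the application or from Privault: each document is assumed to be pre-split into lines of character segments, a "pair" is assumed to be two adjacent segments on one line, the feature vector is assumed to be binary, and a greedy cosine-similarity grouping stands in for whatever clustering technique is actually used. All function names are hypothetical.

```python
from math import sqrt

def identify_segment_pairs(doc):
    """Return the set of adjacent segment pairs in one document.
    Assumed input: a list of lines, each line a list of segment strings."""
    pairs = set()
    for line in doc:
        for left, right in zip(line, line[1:]):
            pairs.add((left, right))
    return pairs

def build_pair_map(docs):
    """Map every unique segment pair seen across all documents to an index."""
    unique = sorted(set().union(*(identify_segment_pairs(d) for d in docs)))
    return {pair: idx for idx, pair in enumerate(unique)}

def feature_vector(doc, pair_map):
    """Binary vector: 1 where a mapped pair also occurs in this document."""
    vec = [0] * len(pair_map)
    for pair in identify_segment_pairs(doc):
        if pair in pair_map:
            vec[pair_map[pair]] = 1
    return vec

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def cluster(vectors, threshold=0.4):
    """Stand-in clustering (an assumption): a vector joins the first cluster
    whose first member is at least `threshold` similar, else starts a new one."""
    clusters = []
    for i, vec in enumerate(vectors):
        for members in clusters:
            if cosine(vectors[members[0]], vec) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters

# Hypothetical sample data: two invoice-like documents and one letter.
docs = [
    [["Invoice", "No:", "1001"], ["Total", "Due:", "50.00"]],
    [["Invoice", "No:", "1002"], ["Total", "Due:", "75.00"]],
    [["Dear", "Customer"], ["Thank", "you"]],
]
pair_map = build_pair_map(docs)
vectors = [feature_vector(d, pair_map) for d in docs]
clusters = cluster(vectors)
```

In this sample the two invoice-like documents share the pairs ("Invoice", "No:") and ("Total", "Due:") and fall into one cluster, while the letter shares no pairs with them and forms its own.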

Regarding claim 2, Privault teaches the method of claim 1, further comprising, prior to the generating step: 
computing for each unique segment pair in the map a normalized document frequency [Privault, ¶ 0063, word frequency vector representation]; and 
removing from the map segment pairs having a frequency less than a specified threshold [Privault, ¶ 0062, responsive vs. non-responsive classification based on a similarity probability value having a difference less than a threshold value].
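
For illustration only, the two additional steps recited in claim 2 can be sketched as follows. The claim does not define "normalized document frequency"; this sketch assumes it means the number of documents containing a pair divided by the total number of documents. The function names and sample pairs are hypothetical and are not drawn from the application or from Privault.

```python
from collections import Counter

def normalized_document_frequency(doc_pairs):
    """doc_pairs: one set of segment pairs per document.
    Returns pair -> fraction of documents that contain the pair (assumed definition)."""
    total = len(doc_pairs)
    if total == 0:
        return {}
    counts = Counter(pair for pairs in doc_pairs for pair in pairs)
    return {pair: count / total for pair, count in counts.items()}

def prune_map(doc_pairs, min_frequency):
    """Build the map while dropping pairs whose frequency is below the threshold."""
    freq = normalized_document_frequency(doc_pairs)
    kept = sorted(pair for pair, f in freq.items() if f >= min_frequency)
    return {pair: idx for idx, pair in enumerate(kept)}

# Hypothetical sample: four documents, three distinct pairs.
doc_pairs = [
    {("Invoice", "No:"), ("Total", "Due:"), ("Ship", "To:")},
    {("Invoice", "No:"), ("Total", "Due:")},
    {("Invoice", "No:")},
    {("Invoice", "No:")},
]
freq = normalized_document_frequency(doc_pairs)
pruned = prune_map(doc_pairs, min_frequency=0.5)
```

With a threshold of 0.5, the pair that appears in only one of the four documents is removed before feature vectors are generated.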

Regarding claim 3, Privault teaches the method of claim 1, further comprising: 
associating a respective layout template for each cluster in the plurality of clusters, the layout template for a particular cluster being based on [Privault, ¶ 0089, layout similarity used to organize document clusters]: 
one or more segment pairs corresponding to feature vectors associated with that particular cluster [Privault, ¶ 0089, layout similarity used to organize document clusters]; and 
respective layout information of each of the one or more segment pairs [Privault, ¶ 0089, layout similarity used to organize document clusters].

Regarding claim 4, Privault teaches the method of claim 3, further comprising, for each cluster in the plurality of clusters: 
training a respective machine-learning (ML) information extractor to extract document information using the respective layout template associated with the cluster [Privault, ¶ 0050, labels and ¶ 0051, classifier model using support vector machines]; and 
storing a respective ML model [Privault, ¶ 0051, memory stores classifier model].

Regarding claim 5, Privault teaches the method of claim 3, further comprising: 
selecting a document from the plurality of documents [Privault, ¶ 0051, classifier model using support vector machines]; 
determining a cluster in the plurality of clusters to which the document belongs [Privault, ¶ 0051, classifier model using support vector machines]; and 
using a machine-learning (ML) information extractor, trained using the respective layout template associated with the determined cluster, to extract document information [Privault, ¶ 0051, classifier model using support vector machines].

Regarding claim 6, Privault teaches the method of claim 1, wherein for a document in the plurality of documents, the step of identifying in the document pairs of character segments is performed for each page in the document [Privault, ¶ 0063, all documents evaluated].

Regarding claim 7, Privault teaches the method of claim 1, wherein an identified pair of character segments comprises two segments that overlap horizontally or vertically [Privault, ¶ 0064, multiword].

Regarding claim 8, Privault teaches the method of claim 7, wherein an identified pair of character segments comprises two segments that are separated by a hop comprising a specified number of characters [Privault, ¶ 0064, multiword expressions can be separated by a space (hop) character].
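
For illustration only, one possible reading of the overlap and hop language of claims 7 and 8 can be sketched as follows. This reading is an assumption and is not taken from Specification paragraph [0020] or from Privault: each segment is assumed to carry a bounding box, two segments are assumed to "overlap horizontally" when their x-extents intersect and "overlap vertically" when their y-extents intersect, and a hop is assumed to be a gap measured in fixed-width characters. The `Segment` type and all function names are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Segment:
    text: str
    x: float       # left edge
    y: float       # top edge
    width: float
    height: float

def intervals_overlap(a_start, a_end, b_start, b_end):
    """True when two open intervals share any interior point."""
    return a_start < b_end and b_start < a_end

def overlap_horizontally(a, b):
    """x-extents intersect, e.g. one segment sits above the other."""
    return intervals_overlap(a.x, a.x + a.width, b.x, b.x + b.width)

def overlap_vertically(a, b):
    """y-extents intersect, e.g. two segments sit on the same text line."""
    return intervals_overlap(a.y, a.y + a.height, b.y, b.y + b.height)

def within_hop(a, b, hop, char_width):
    """b starts no more than `hop` character widths to the right of a's end."""
    gap = b.x - (a.x + a.width)
    return 0 <= gap <= hop * char_width

# Hypothetical layout: a label, a value on the same line, and a segment below.
label = Segment("Total:", x=10, y=100, width=60, height=12)
value = Segment("50.00", x=80, y=100, width=50, height=12)
below = Segment("Due", x=10, y=120, width=30, height=12)
```

Under these assumptions, `label` and `value` overlap vertically and are separated by a one-character hop (with `char_width=10`), while `label` and `below` overlap horizontally.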

Regarding claim 9, Privault teaches the method of claim 1, wherein a document in the plurality of documents is obtained via file transfer, email, web access, or scanning of a physical document [Privault, ¶ 0044, email, HTML files, OCR document processing].

Regarding claim 10, Privault teaches the method of claim 1, 
wherein a document in the plurality of documents comprises a hyper-text markup language (HTML) document [Privault, ¶ 0044, HTML documents], the identification of pairs of character segments in the HTML document comprising: 
identifying HTML tags representing textual information by parsing the HTML document using a script executable in a headless mode [Privault, ¶ 0050, automatic class determination from HTML and metadata information]; and 
accessing location and size information of the HTML tags [Privault, ¶ 0050, automatic class determination from HTML and metadata information].

Regarding claim 11, Privault teaches the method of claim 1, wherein the plurality of documents comprises a plurality of invoices [Privault, ¶ 0044, records and accounts].

Claims 12-22 recite limitations corresponding to claims 1-11, respectively, and are rejected for the same reasons discussed above.

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to Scott A. Waldron whose telephone number is (571)272-5898. The examiner can normally be reached Monday - Friday 9:00 am - 5:00 pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Neveen Abel-Jalil, can be reached on (571)270-0474. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.



/Scott A. Waldron/
Primary Examiner, Art Unit 2152
08/24/2022