DETAILED ACTION
Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection.  Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114.  Applicant's submission filed on 11/28/2022 has been entered.
	Claims 1-20 are pending.
	Claims 1, 10 and 19 are independent claims.

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-3, 5-12 and 14-18 are rejected under 35 U.S.C. 103 as being unpatentable over Graf et al. (hereinafter Graf, U.S. Patent Application Publication No. 2006/0242180 A1, filed 07/23/2004, published 10/26/2006) in view of Yang, Jaeyoung et al. (hereinafter Yang, “An Interface Agent for Wrapper-Based Information Extraction,” (2005), Springer, pp. 291-302), and in further view of Dezon K. Finch (hereinafter Finch, “TagLine: Information Extraction for Semi-Structured Text Elements in Medical Progress Notes,” (2012), 251 total pages), and in further view of Sigursson et al. (hereinafter Sigursson, “Heritrix User Manual,” © 2004, Internet Archive, 57 pages), and in further view of Zhu et al. (hereinafter Zhu, U.S. Patent No. 8,042,112 B1, filed 06/30/2004, issued 10/18/2011), and in further view of Liu et al. (hereinafter Liu, “Extracting Patient Demographics and Personal Medical Information from Online Health Forums,” AMIA ... Annual Symposium proceedings. AMIA Symposium 2014 (2014): 1825-34).
Regarding independent claim 1, Graf teaches:
A computer-implemented method for extracting data from a plurality of data sources (at least Abstract → Graf teaches a process, system and workflow for extracting and warehousing data from semi-structured documents in any language), comprising:
Notes: a “data source” can be a website. The Specification only at p. 3, [0021] mentions the phrase “base site” of a “data source”, but does not provide an explanation of what it is. The Examiner speculates that the phrase “base site” refers to, at least in the case of a website/webpage, the website/webpage itself? Please explain!

Further, if the “data sources” in the invention are all web-based (e.g., webpage(s) or website(s)), as is at least suggested by the currently recited claim language (e.g. “base site of a data source”), then the Examiner strongly suggests changing instances of the phrase “data source(s)” to webpage(s) and/or website(s)!

Further, the term “site(s)” appears to refer to sites (e.g. websites) of data source(s), but may also refer to “site(s)” within a data source where information (e.g. demographical) may be found. That is, the position(s) or location(s) within the data source(s). Please clarify!

While the term “path(s)” appears to describe navigational paths (e.g. directory path(s), Xpath, etc.) to information (e.g., demographical information), the term “path(s)” may also refer to the path(s) in a decision tree navigating from one node to another node. Which is it?

Further, the Broadest Reasonable Interpretation (BRI) of the limitation below only requires that the decision tree specify a single (one) path.

generating a decision tree for each of the plurality of data sources, wherein the decision tree specifies one or more paths from a base site of a data source of the plurality of data sources to respective sites of the data source (at least p. 1, [0004]; p. 4, [0064]; p. 5, [0078]-[0079], [0081]; pp. 10-11, [0118]-[0120]; pp. 10-12, [0124]-[0148]; p. 13, [0167]-[0169]; p. 14, [0172]; Figures 3, 5-6, 8, 34, 36, 40 → Graf teaches the building (generation) of “text mining term models” that may be constructed for documents either automatically or by building highly specific decision trees using a wizard (e.g., see p. 4, [0064])).
Graf fails to teach that the data sources are explicitly web-based.
However, Yang teaches data extraction from web pages (at least Abstract) utilizing decision trees (see at least pp. 298-301).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have combined the teachings of Yang with those of Graf as both inventions are related to the extraction of data from documents utilizing decision trees. Adding the teaching of Yang to Graf provides Graf with an efficient method of extracting information from web pages.
Graf and Yang fail to explicitly teach:
generating, based on the decision tree, a list of tasks corresponding each of the plurality of data sources, wherein each task includes instructions for how to extract demographic information corresponding to a respective site corresponding to the one of the one or more paths.
However, Finch teaches:
generating, based on the decision tree, a list of tasks corresponding each of the plurality of data sources, wherein each task includes instructions for how to extract demographic information corresponding to a respective site corresponding to the one of the one or more paths (at least p. viii, Abstract; pp. 11-13, 1. Introduction; pp. 15-34, 2. The Domain; pp. 59-62, Section 4; pp. 81-101; pp. 116-117, Section 8.1 → Finch describes a system (TagLine™) for the extraction of medical information from Electronic Health Records (EHR) of Veterans Administration (VA) patients (pp. 15-34). The EHR is semi-structured in nature (like web pages). The medical information is extracted through the induction of decision trees from the labeling of individual lines (see pp. 59-62, Section 4) of each EHR for each patient. A decision tree is induced (trained) on the labeled EHRs, and its rules are then used to predict and extract the medical information (see pp. 81-101). Specifically, Finch defines a set of document element labels and definitions (see pp. 204-210, Appendix 7) used to label elements contained in medical information. These labels include those for demographic information (see labels Date, Phone Number, Signature, Address Block, Street Number and Name, Location at Address, etc.)).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have combined the teachings of Finch with those of Graf and Yang as all three inventions are related to the extraction of data from documents utilizing decision trees. Adding the teaching of Finch to Graf and Yang provides Graf and Yang with an efficient method of extracting medical-type information including demographical information associated with patients.
Note: It is unclear to the Examiner what concept is being recited in the limitation below. Please clarify! Further, the Examiner suggests that adding the content within brackets may help.

Graf, Yang and Finch fail to explicitly teach:
Note: check limitation for typo (bold, underlined).

generating a user interface to be presented on a display, wherein the user interface indicates a the list of tasks to be performed for each of the plurality of data sources and a status for the list of tasks.
However, Sigursson teaches:
generating a user interface to be presented on a display, wherein the user interface indicates a the list of tasks to be performed for each of the plurality of data sources and a status for the list of tasks (at least pp. 32-42, Chapters 7-8 → Sigursson teaches the “Heritrix” web crawler, which includes a web console (via the Console Tab) that is generated so as to start and monitor a crawl (bottom, p. 32). The web console includes a Crawler Status Box (see 7.1.1) and a Job Status Box (see 7.1.2), which provide a user with information about the jobs (e.g. tasks) being performed for each crawl).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have combined the teachings of Sigursson with those of Graf, Yang and Finch as all of these inventions are related to the extraction of data from documents. Adding the teaching of Sigursson to Graf, Yang and Finch provides the combination with a user interface that keeps the user up-to-date as to the progress and status of a crawling operation.
Graf, Yang, Finch and Sigursson fail to explicitly teach:
Note: if the “data sources” in the invention are all web-based (e.g., webpage(s) or website(s)), as is at least suggested by the currently recited claim language (e.g. “base site of a data source”), then the Examiner strongly suggests changing instances of the phrase “data source(s)” to webpage(s) and/or website(s) and instances of “site” to “website”!

navigating within the corresponding data source from the base website to the respective site as specified by the specified path.
However, Zhu teaches:
navigating within the corresponding data source from the base website to the respective site as specified by the specified path (at least Abstract; col. 3, line 21 through col. 4, line 47; Figures 1B, 2 → Zhu generally teaches a search engine crawler(s) 208 (aka “robots”, “bots”) having document identifiers (e.g., URLs) corresponding to documents on a network. The crawling process navigates to each assigned URL and is managed by schedulers. Documents referenced by the document identifiers are then retrieved and delivered to content processing servers 210 which perform several tasks. The crawled information is then stored in various log files).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have combined the teachings of Zhu with those of Graf, Yang, Finch and Sigursson as all of these inventions are related to the extraction of data from documents. Adding the teaching of Zhu to Graf, Yang, Finch and Sigursson provides the combination with methods where the order by which URLs are crawled is determined by the importance of the document referenced by the URL.
Graf, Yang, Finch, Sigursson and Zhu fail to explicitly teach:
parsing demographic information from the respective site into separate categories;
storing the parsed demographic information in separate databases based on the separate categories.
However, Liu teaches:
parsing demographic information from the respective site into separate categories; storing the parsed demographic information in separate databases based on the separate categories (at least p. 1825, Abstract; pp. 1826-1828, Methods; pp. 1830-1833, Results → Liu describes an integrated biomedical Natural Language Processing (NLP) pipeline that automatically extracts a comprehensive set of patient demographics and medical information from on-line health forums. The pipeline can be adopted to construct structured personal health profiles from unstructured user-contributed content on eHealth social media sites. Further, Liu teaches determining sentence classes including demographic information (DMO and its subclasses, see Table 1). The output from the NLP pipeline is a structured XML-formatted patient health profile(s)).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have combined the teachings of Liu with those of Graf, Yang, Finch, Sigursson and Zhu as all these inventions are related to the extraction of data from documents. Adding the teaching of Liu to Graf, Yang, Finch, Sigursson and Zhu provides the combination with NLP methods by which patient demographic and other information may be located, extracted and stored.

Regarding dependent claim 2, Graf, Yang, Finch and Sigursson fail to explicitly teach:
the navigating comprises iteratively accessing the respective site for a predetermined number of attempts when the corresponding data source or respective site is initially inaccessible.
However, Zhu teaches:
the navigating comprises iteratively accessing the respective site for a predetermined number of attempts when the corresponding data source or respective site is initially inaccessible (at least col. 6, lines 15-43; col. 8, line 45 through col. 9, line 29; Figures 4B, 5, 7-8 → Zhu teaches a history log file 218 containing a set of history log records 432. Each history log record 432 contains fields that keep a record of the success/failure of a crawl attempt. Further, Zhu teaches a hash map 504 that maps a URL fingerprint (FP) to a corresponding URL record 508 in a URL status file 506. The URL records 508 include a page importance score, a prior crawl status, and a segment ID. The prior crawl status can include multiple fields, including an error field and an unreachable field. The error field records information associated with a download error (e.g. HTTP Error 4xx) which may indicate that a web page does not exist, or that access is not authorized, or some other error. The error field indicates the number of consecutive times an attempt to download the URL resulted in an error. The unreachable field records information associated with a URL being unreachable (e.g., because the host server is busy). For example, the unreachable field can include the number of consecutive times the URL was unreachable in previous crawls. The segment ID identifies the crawl segment associated with the URL FP at the time that the document download operation was performed or attempted).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have combined the teachings of Zhu with those of Graf, Yang, Finch and Sigursson as all of these inventions are related to the extraction of data from documents. Adding the teaching of Zhu to Graf, Yang, Finch and Sigursson provides the combination with efficient crawling methods that keep track of errors encountered when attempting to crawl a given document.
Regarding dependent claim 3, Graf, Yang, Finch and Sigursson fail to explicitly teach:
receiving an error notification when the corresponding data source or the respective site is inaccessible after completing the predetermined number of attempts.
However, Zhu teaches:
receiving an error notification when the corresponding data source or the respective site is inaccessible after completing the predetermined number of attempts (at least col. 6, lines 15-43; col. 8, line 45 through col. 9, line 29; Figures 4B, 5, 7-8 → Zhu teaches a history log file 218 containing a set of history log records 432. Each history log record 432 contains fields that keep a record of the success/failure of a crawl attempt. Further, Zhu teaches a hash map 504 that maps a URL fingerprint (FP) to a corresponding URL record 508 in a URL status file 506. The URL records 508 include a page importance score, a prior crawl status, and a segment ID. The prior crawl status can include multiple fields, including an error field and an unreachable field. The error field records information associated with a download error (e.g. HTTP Error 4xx) which may indicate that a web page does not exist, or that access is not authorized, or some other error. The error field indicates the number of consecutive times an attempt to download the URL resulted in an error. The unreachable field records information associated with a URL being unreachable (e.g., because the host server is busy). For example, the unreachable field can include the number of consecutive times the URL was unreachable in previous crawls. The segment ID identifies the crawl segment associated with the URL FP at the time that the document download operation was performed or attempted).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have combined the teachings of Zhu with those of Graf, Yang, Finch and Sigursson as all of these inventions are related to the extraction of data from documents. Adding the teaching of Zhu to Graf, Yang, Finch and Sigursson provides the combination with efficient crawling methods that keep track of errors encountered when attempting to crawl a given document.

Regarding dependent claim 5, Graf, Yang, Finch and Sigursson fail to explicitly teach:
managing a plurality of data extractors performing tasks on each of the plurality of data sources.
However, Zhu teaches:
managing a plurality of data extractors performing tasks on each of the plurality of data sources (at least col. 3, line 24 through col. 5, line 6; col. 6, line 57 through col. 7, line 62; Figures 1A-B, 2-3, 4C, 8 → Zhu teaches that a URL server 206 requests URLs from the URL managers 204, 304 (responsible for managing the distribution of URLs to URL server 306) to provide the URL server 206 with URLs obtained from data structure 100. The URL server 206 then distributes URLs from the URL managers 204 to crawlers 208 (hereinafter also called “robots” or “bots”) to be crawled).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have combined the teachings of Zhu with those of Graf, Yang, Finch and Sigursson as all of these inventions are related to the extraction of data from documents. Adding the teaching of Zhu to Graf, Yang, Finch and Sigursson provides the combination with efficient crawling methods that manage a number of crawlers that are each tasked with crawling a list of URLs referencing sites from which to extract information.

Regarding dependent claim 6, Graf, Yang, Finch and Sigursson fail to explicitly teach:
managing the plurality of data extractors comprises managing a maximum number of data extractors performing tasks on each of the plurality of data sources.
However, Zhu teaches:
managing the plurality of data extractors comprises managing a maximum number of data extractors performing tasks on each of the plurality of data sources (at least col. 3, line 24 through col. 5, line 6; col. 6, line 57 through col. 7, line 62; Figures 1A-B, 2-3, 4C, 8 → Zhu teaches that a URL server 206 requests URLs from the URL managers 204, 304 (responsible for managing the distribution of URLs to URL server 306) to provide the URL server 206 with URLs obtained from data structure 100. The URL server 206 then distributes URLs from the URL managers 204 to crawlers 208 (hereinafter also called “robots” or “bots”) to be crawled).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have combined the teachings of Zhu with those of Graf, Yang, Finch and Sigursson as all of these inventions are related to the extraction of data from documents. Adding the teaching of Zhu to Graf, Yang, Finch and Sigursson provides the combination with efficient crawling methods that manage a number of crawlers that are each tasked with crawling a list of URLs referencing sites from which to extract information.

Regarding dependent claim 7, Graf, Yang, Finch and Sigursson fail to explicitly teach:
when the maximum number of data extractors for a first data source of the plurality of data sources is reached, the method further comprises assigning tasks of a second data source of the plurality of data sources having a same priority level as the first data source.
However, Zhu teaches:
when the maximum number of data extractors for a first data source of the plurality of data sources is reached, the method further comprises assigning tasks of a second data source of the plurality of data sources having a same priority level as the first data source (at least col. 3, line 24 through col. 5, line 6; col. 6, line 57 through col. 7, line 62; Figures 1A-B, 2-3, 4C, 8 → Zhu teaches that a URL server 206 requests URLs from the URL managers 204, 304 (responsible for managing the distribution of URLs to URL server 306) to provide the URL server 206 with URLs obtained from data structure 100. The URL server 206 then distributes URLs from the URL managers 204 to crawlers 208 (hereinafter also called “robots” or “bots”) to be crawled).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have combined the teachings of Zhu with those of Graf, Yang, Finch and Sigursson as all of these inventions are related to the extraction of data from documents. Adding the teaching of Zhu to Graf, Yang, Finch and Sigursson provides the combination with efficient crawling methods that manage a number of crawlers that are each tasked with crawling a list of URLs referencing sites from which to extract information.

Regarding dependent claim 8, Graf, Yang, Finch and Sigursson fail to explicitly teach:
when the maximum number of data extractors for a first data source is reached, the method further comprises assigning tasks of a second data source of the plurality of data sources having a different priority level as the first data source.
However, Zhu teaches:
when the maximum number of data extractors for a first data source is reached, the method further comprises assigning tasks of a second data source of the plurality of data sources having a different priority level as the first data source (at least col. 3, line 24 through col. 5, line 6; col. 6, line 57 through col. 7, line 62; Figures 1A-B, 2-3, 4C, 8 → Zhu teaches that a URL server 206 requests URLs from the URL managers 204, 304 (responsible for managing the distribution of URLs to URL server 306) to provide the URL server 206 with URLs obtained from data structure 100. The URL server 206 then distributes URLs from the URL managers 204 to crawlers 208 (hereinafter also called “robots” or “bots”) to be crawled).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have combined the teachings of Zhu with those of Graf, Yang, Finch and Sigursson as all of these inventions are related to the extraction of data from documents. Adding the teaching of Zhu to Graf, Yang, Finch and Sigursson provides the combination with efficient crawling methods that manage a number of crawlers that are each tasked with crawling a list of URLs referencing sites from which to extract information.

Regarding dependent claim 9, Graf, Yang, Finch and Sigursson fail to explicitly teach:
managing the number of data extractors comprises periodically adjusting the number of data extractors performing tasks on the corresponding data source.
However, Zhu teaches:
managing the number of data extractors comprises periodically adjusting the number of data extractors performing tasks on the corresponding data source (at least col. 3, line 24 through col. 5, line 6; col. 6, line 57 through col. 7, line 62; Figures 1A-B, 2-3, 4C, 8 → Zhu teaches that a URL server 206 requests URLs from the URL managers 204, 304 (responsible for managing the distribution of URLs to URL server 306) to provide the URL server 206 with URLs obtained from data structure 100. The URL server 206 then distributes URLs from the URL managers 204 to crawlers 208 (hereinafter also called “robots” or “bots”) to be crawled).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have combined the teachings of Zhu with those of Graf, Yang, Finch and Sigursson as all of these inventions are related to the extraction of data from documents. Adding the teaching of Zhu to Graf, Yang, Finch and Sigursson provides the combination with efficient crawling methods that manage a number of crawlers that are each tasked with crawling a list of URLs referencing sites from which to extract information.

Regarding claims 10-12 and 14-18, claims 10-12 and 14-18 merely recite a non-transitory program storage device for storing the method of claims 1-3 and 5-9, respectively. Thus, Graf in view of Yang, Finch, Sigursson, Zhu and Liu teaches every limitation of claims 10-12 and 14-18, and provides proper motivation, as indicated in the rejections of claims 1-3 and 5-9.

Claims 4, 13 and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Graf in view of Yang, and in further view of Finch, and in further view of Sigursson, and in further view of Zhu, and in further view of Liu, and in further view of Gupta et al. (hereinafter Gupta, U.S. Patent Application Publication No. 2005/0229151 A1, filed 06/08/2005, published 10/13/2005).
Regarding dependent claim 4, Graf, Yang, Finch, Sigursson, Zhu and Liu fail to explicitly teach:
the user interface further indicates a color code indicator of a priority level of each of the plurality of data sources.
However, Gupta teaches:
the user interface further indicates a color code indicator of a priority level of each of the plurality of data sources (at least Abstract; p. 9, [0099]; Figure 13 → Gupta teaches a project management system that includes a project plan containing a plurality of tasks. Task priorities are calculated and assigned. Further, Gupta utilizes color to indicate the priority of a task: high-priority tasks are indicated in red, while low-priority tasks are indicated in green).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have combined the teachings of Gupta with those of Graf, Yang, Finch, Sigursson, Zhu and Liu as all of these inventions are related to the extraction of data from documents. Adding the teaching of Gupta provides the combination with an easy-to-understand visualization of task or job priority levels, allowing a user to easily determine the current priority of each task or job in a project.

Regarding claim 13, claim 13 merely recites a non-transitory program storage device for storing the method of claim 4. Thus, Graf in view of Yang, Finch, Sigursson, Zhu, Liu and Gupta teaches every limitation of claim 13, and provides proper motivation, as indicated in the rejection of claim 4.

Regarding claim 20, claim 20 merely recites a system to execute the method of claim 4. Thus, Graf in view of Yang, Finch, Sigursson, Zhu, Liu and Gupta teaches every limitation of claim 20, and provides proper motivation, as indicated in the rejection of claim 4.

Claim 19 is rejected under 35 U.S.C. 103 as being unpatentable over Graf in view of Yang, and in further view of Finch, and in further view of Zhu, and in further view of Sigursson, and in further view of Liu.
Regarding independent claim 19, Graf teaches:
A system comprising:
a first computing device comprising: a first memory; and a first processor communicatively coupled to the first memory and configured to:
Notes: a “data source” can be a website. The Specification only at p. 3, [0021] mentions the phrase “base site” of a “data source”, but does not provide an explanation of what it is. The Examiner speculates that the phrase “base site” refers to, at least in the case of a website/webpage, the website/webpage itself? Please explain!

Further, the term “site(s)” appears to refer to sites (e.g. websites) of data source(s), but may also refer to “site(s)” within a data source where information (e.g. demographical) may be found. That is, the position(s) or location(s) within the data source(s). Please clarify!

Further, if the “data sources” in the invention are all web-based (e.g., webpage(s) or website(s)), as is at least suggested by the currently recited claim language (e.g. “base site of a data source”), then the Examiner strongly suggests changing instances of the phrase “data source(s)” to webpage(s) and/or website(s)!

While the term “path(s)” appears to describe navigational paths (e.g. directory path(s), Xpath, etc.) to information (e.g., demographical information), the term “path(s)” may also refer to the path(s) in a decision tree navigating from one node to another node. Which is it?

Further, the Broadest Reasonable Interpretation (BRI) of the limitation below only requires that the decision tree specify a single (one) path.

generate a decision tree for each of a plurality of data sources, wherein the decision tree comprises one or more paths to respective sites of each of the plurality of data sources (at least p. 1, [0004]; p. 4, [0064]; p. 5, [0078]-[0079], [0081]; pp. 10-11, [0118]-[0120]; pp. 10-12, [0124]-[0148]; p. 13, [0167]-[0169]; p. 14, [0172]; Figures 3, 5-6, 8, 34, 36, 40 → Graf teaches the building (generation) of “text mining term models” that may be constructed for documents either automatically or by building highly specific decision trees using a wizard (see [0064])).
Graf fails to teach that the data sources are explicitly web-based.
However, Yang teaches data extraction from web pages (at least Abstract) utilizing decision trees (see at least pp. 298-301).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have combined the teachings of Yang with those of Graf as both inventions are related to the extraction of data from documents utilizing decision trees. Adding the teaching of Yang to Graf provides Graf with an efficient method of extracting information from web pages.
Graf and Yang fail to explicitly teach:
generate a list of tasks for each of the plurality of data sources based on the decision tree, wherein each task includes instructions for how to extract demographic information corresponding to a site corresponding to the respective one of the one or more paths and comprises instructions for extracting demographic information from the respective site;
However, Finch teaches:
generate a list of tasks for each of the plurality of data sources based on the decision tree, wherein each task includes instructions for how to extract demographic information corresponding to a site corresponding to the respective one of the one or more paths and comprises instructions for extracting demographic information from the respective site (at least p. viii, Abstract; pp. 11-13, 1. Introduction; pp. 15-34, 2. The Domain; pp. 59-62, Section 4; pp. 81-101; pp. 116-117, Section 8.1 → Finch describes a system (TagLine™) for the extraction of medical information from Electronic Health Records (EHR) of Veterans Administration (VA) patients (pp. 15-34). The EHR is semi-structured in nature. The medical information is extracted through the induction of decision trees from the labeling of individual lines (see pp. 59-62, Section 4) of each EHR for each patient. A decision tree is induced (trained) on the labeled EHRs, and its rules are then used to predict and extract the medical information (see pp. 81-101). Specifically, Finch defines a set of document element labels and definitions (see pp. 204-210, Appendix 7) used to label elements contained in medical information. These labels include those for demographic information (see labels Date, Phone Number, Signature, Address Block, Street Number and Name, Location at Address, etc.)).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have combined the teachings of Finch with those of Graf and Yang as all three inventions are related to the extraction of data from documents utilizing decision trees. Adding the teaching of Finch to Graf and Yang provides the combination with an efficient method of extracting medical-type information including demographical information associated with patients.
Graf, Yang and Finch fail to explicitly teach:
assign a task from the list of tasks to a second computing device based on a priority level of the corresponding data source; and transmit the assigned task to a corresponding second computing device;
However, Zhu teaches:
assign a task from the list of tasks to a second computing device based on a priority level of the corresponding data source; and transmit the assigned task to a corresponding second computing device (at least Abstract; col. 10, line 28 through col. 11, line 42; Figures 1A-B, 2-3, 8 → Zhu teaches a search engine crawler that includes a distributed set of URL schedulers that are associated with one or more segments of document identifiers (e.g., URLs) corresponding to documents on a network (e.g., WWW). Further, Zhu teaches that after deleting URLs (using various methods of filtering) from the URL list 726 (of URLs to be crawled), the page importance scores for the remaining URLs (to be crawled) are used to compute 808 priority scores using a priority score function 730. In this way, Zhu organizes the list of URLs to be crawled by importance of the page to be crawled. Further, Zhu teaches a distributed computing system utilizing a number of different servers and crawlers).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have combined the teachings of Zhu with those of Graf, Yang and Finch as all of these inventions are related to the extraction of data from documents. Adding the teaching of Zhu to Graf, Yang and Finch provides the combination with methods where the order by which URLs are crawled is determined by the importance of the document referenced by the URL.
Graf, Yang, Finch and Zhu fail to explicitly teach:
generate a user interface to be presented on a display, wherein the user interface indicates the list of tasks to be performed for each of the plurality of data source and a status for the list of tasks.
However, Sigursson teaches:
generate a user interface to be presented on a display, wherein the user interface indicates the list of tasks to be performed for each of the plurality of data source and a status for the list of tasks (at least pp. 32-42, Chapters 7-8 → Sigursson teaches the “Heritrix” web crawler, which includes a web console (via the Console Tab) that is generated so as to start and monitor a crawl (bottom, p. 32). The web console includes a Crawler Status Box (see 7.1.1) and a Job Status Box (see 7.1.2), which provide a user with information about the jobs (e.g., tasks) being performed for each crawl).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have combined the teachings of Sigursson with those of Graf, Yang, Finch and Zhu as all of these inventions are related to the extraction of data from documents. Adding the teaching of Sigursson to Graf, Yang, Finch and Zhu provides the combination with a user interface that keeps the user up-to-date as to the progress and status of a crawling operation.
Graf further teaches:
the second computing device comprising: a second memory; and a second processor communicatively coupled to the second memory and configured to (at least p. 3, [0053]; Figure 27 → Graf describes a client/server system):
Graf, Yang and Finch fail to explicitly teach:
execute the assigned task to navigate the corresponding data source to the respective site.
However, Zhu teaches:
execute the assigned task to navigate the corresponding data source to the respective site (at least Abstract; col. 3, line 21 through col. 4, line 47; Figures 1B, 2 → Zhu generally teaches a search engine crawler(s) 208 (aka “robots”, “bots”) having document identifiers (e.g., URLs) corresponding to documents on a network. The crawling process navigates to each assigned URL and is managed by schedulers. Documents referenced by the document identifiers are then retrieved and delivered to content processing servers 210 which perform several tasks. The crawled information is then stored in various log files).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have combined the teachings of Zhu with those of Graf, Yang and Finch as all of these inventions are related to the extraction of data from documents. Adding the teaching of Zhu to Graf, Yang and Finch provides the combination with methods where the order by which URLs are crawled is determined by the importance of the document referenced by the URL.
Graf, Yang, Finch, Zhu and Sigursson fail to explicitly teach:
and extract the demographic information from the respective site based on the assigned task; and transmit the extracted demographic information to the first computing device, wherein upon receipt of the extracted demographic information, parse the extracted demographic information into separate categories; and store the parsed demographic information in separate databases based on the separate categories.
However, Liu teaches:
and extract the demographic information from the respective site based on the assigned task; and transmit the extracted demographic information to the first computing device, wherein upon receipt of the extracted demographic information, parse the extracted demographic information into separate categories; and store the parsed demographic information in separate databases based on the separate categories (at least p. 1825, Abstract; pp. 1826-1828, Methods; pp. 1830-1833, Results → Liu describes an integrated biomedical Natural Language Processing (NLP) pipeline that automatically extracts a comprehensive set of patient demographics and medical information from on-line health forums. The pipeline can be adopted to construct structured personal health profiles from unstructured user-contributed content on eHealth social media sites. Further, Liu teaches determining sentence classes including demographic information (DMO and its subclasses, see Table 1). The output from the NLP pipeline is a structured XML-formatted patient health profile(s)).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have combined the teachings of Liu with those of Graf, Yang, Finch, Zhu and Sigursson as all these inventions are related to the extraction of data from documents. Adding the teaching of Liu to Graf, Yang, Finch, Zhu and Sigursson provides the combination with NLP methods by which patient demographic and other information may be located, extracted and stored.

Response to Arguments
Regarding the previous rejection of independent claim 1 (and similarly independent claim 10), Applicant has amended each of these claims as indicated for claim 1 below:

1.	A computer-implemented method for extracting data from a plurality of data sources, comprising:
generating a decision tree for each of the plurality of data sources, wherein the decision tree specifies one or more paths from a base website of a data source of the plurality of data sources to respective other sites of the data source;
generating, based on the decision tree, a list of tasks corresponding each of the plurality of data sources, wherein each task includes instructions for how to extract demographic information corresponding to a respective site corresponding to the one of the one or more paths;
generating a user interface to be presented on a display, wherein the user interface indicates a the list of tasks to be performed for each of the plurality of data sources and a status for the list of tasks
navigating within the corresponding data source from the base website to the respective site as specified by a specified path;
parsing demographic information from the respective site into separate categories; and
storing the parsed demographic information in separate databases based on the separate categories.

Applicant argues that the prior art of Graf, Yang, Finch, Zhu, and Liu fails to teach:

... generating a user interface to be presented on a display, wherein the user interface indicates a the list of tasks to be performed for each of the plurality of data sources and a status for the list of tasks;
navigating within the corresponding data source from the base website to the respective site as specified by a specified path;

[underlining indicates added text; deletions for clarity].

Support for the amendments is found in at least Applicant’s specification paragraph [0048] (referencing FIGS. 1-2), reproduced below:

[0048] In some embodiments, controller 220 may also generate a user interface presented on a display 230. For example, the user interface may indicate a color code indicator of the priority level of a data source 105, the number of tasks for each data source 105, an identification number of data source 105, the number of data extractors 210 performing tasks on each data source 105, a progress indicator of the tasks for each data source 105 (e.g., a percentage of jobs completed, whether data extractors 210 have started or completed the tasks, etc.), and an overall status of the tasks (e.g., "none," "executing," "initialized," "completed," etc.). Using the user interface, an administrator may pause one or more data extractors 210 performing tasks on data source 105 and/or change the priority level of a data source 105. In some embodiments, the user interface may be updated in predetermined intervals, e.g., every 15 minutes, every hour, etc. 

The Examiner has added the prior art of Sigursson et al. Sigursson describes the web crawler application Heritrix, which, among other features, provides a browser-based user interface (Web Console) that displays the status of jobs (see Chapters 7 and 8).

Further regarding claims 4, 13, and 20, the claims are amended, as exemplified by claim 4, to include the limitation of

wherein the user interface further indicates a color code indicator of a priority level of each of the plurality of data sources.

The support for the amendment is found in at least Applicant’s specification paragraph [0048], as filed.
	The Examiner has added the prior art of Gupta et al., which teaches a user interface that uses color coding to indicate task priorities.
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to James H Blackwell whose telephone number is (571)272-4089. The examiner can normally be reached M-F 05:30AM - 01:30PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Cesar Paula can be reached on 571-272-4128. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/James H. Blackwell/
12/13/2022

/CESAR B PAULA/Supervisory Patent Examiner, Art Unit 2177