DETAILED ACTION
	This Office Action is in response to an original application filed 10/30/2019.
	Claims 1-20 are pending.
	Claims 1, 10 and 19 are independent claims.

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Specification
The disclosure is objected to because of the following informalities:
Page 8, [0041]: “scouter 105” should be “scouter 205”.

Appropriate correction is required.

Claim Objections
Claim 6 is objected to because of the following informalities:  claim 6 should be amended as indicated below.

Appropriate correction is required.

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-20 are rejected under 35 U.S.C. 103 as being unpatentable over Graf et al. (hereinafter Graf, U.S. Patent Application Publication No. 2006/0242180 A1, filed 07/23/2004, published 10/26/2006) in view of Uzun et al. (hereinafter Uzun, “An Effective and Efficient Web Content Extractor for Optimizing the Crawling Process,” Softw. Pract. Exper. 2014; 44:1181–1199), and in further view of Zhu et al. (hereinafter Zhu, U.S. Patent No. 8,042,112 B1, filed 06/30/2004, issued 10/18/2011), and in further view of Liu et al. (hereinafter Liu, “Extracting Patient Demographics and Personal Medical Information from Online Health Forums,” AMIA ... Annual Symposium proceedings. AMIA Symposium 2014 (2014): 1825-34).
Regarding independent claim 1, Graf teaches:
A computer-implemented method for extracting data from a plurality of data sources (at least Abstract → Graf teaches a process, system and workflow for extracting and warehousing data from semi-structured documents in any language), comprising:
generating a decision tree for each of the plurality of data sources, wherein the decision tree specifies one or more paths from a base site of a data source of the plurality of data sources to respective sites of the data source (at least p. 1, [0004]; p. 4, [0064]; p. 5, [0078]-[0079], [0081]; pp. 10-11, [0118]-[0120]; pp. 10-12, [0124]-[0148]; p. 13, [0167]-[0169]; p. 14, [0172]; Figures 3, 5-6, 8, 34, 36, 40 → Graf teaches the building (generation) of “text mining term models” that may be constructed for documents either automatically or by building highly specific decision trees using a wizard (see [0064]));
Graf fails to explicitly teach:
generating, based on the decision tree, a list of tasks corresponding each of the plurality of data sources, wherein each task corresponds to a respective one of the one or more paths.
However, Uzun teaches:
generating, based on the decision tree, a list of tasks corresponding each of the plurality of data sources, wherein each task corresponds to a respective one of the one or more paths (at least pp. 1181-1184, Summary, 3.1 Blocks; Section 5; Figure 1 → Uzun teaches a focused web crawler (iCrawler; intelligent crawler) that performs Web content extraction that includes learning which HTML tags refer to which blocks (e.g., menus, links, main texts, headlines, summaries, additional necessaries, and unnecessary texts) from Web pages using a decision tree learning algorithm. The decision tree automates pattern extraction by discovering patterns as rules. These rules are then used to extract the desired information from Web pages).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have combined the teachings of Uzun with those of Graf as both inventions are related to the extraction of data from documents. Adding the teaching of Uzun to Graf provides Graf with an efficient method of web crawling and data extraction that only extracts certain types of blocks from web documents.
Note: the Examiner suggests adding the content within brackets.

Graf and Uzun fail to explicitly teach:
based on a priority level of the corresponding data source, selecting a task from the list of tasks such that the [selected] task is less likely to be selected when another task corresponding to the task’s data source has been selected recently.
However, Zhu teaches:
based on a priority level of the corresponding data source, selecting a task from the list of tasks such that the [selected] task is less likely to be selected when another task corresponding to the task’s data source has been selected recently (at least Abstract; col. 10, line 28 through col. 11, line 42; Figures 1A-B, 2-3, 8 → Zhu teaches a search engine crawler that includes a distributed set of URL schedulers that are associated with one or more segments of document identifiers (e.g., URLs) corresponding to documents on a network (e.g., WWW). Further, Zhu teaches that after deleting URLs (using various methods of filtering) from the URL list 726 (of URLs to be crawled), the page importance scores for the remaining URLs (to be crawled) are used to compute 808 priority scores using a priority score function 730. In this way, Zhu organizes the list of URLs to be crawled by importance of the page to be crawled).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have combined the teachings of Zhu with those of Graf and Uzun as all three inventions are related to the extraction of data from documents. Adding the teaching of Zhu to Graf and Uzun provides the combination with methods where the order by which URLs are crawled is determined by the importance of the document referenced by the URL.
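For illustration only, the selection limitation discussed above can be sketched as follows. This is a minimal sketch under stated assumptions and is not taken from Zhu or from the application: the names `Task`, `new_history`, `selection_weight` and `select_task`, and the weighting formula itself (priority divided by one plus the number of recent selections from the same data source), are hypothetical.

```python
import random
from collections import Counter, deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    source: str      # data source the task belongs to
    path: str        # path from the base site to the respective site
    priority: float  # priority level of the data source (higher = sooner)


def new_history(window=10):
    """Sliding window holding the sources of the most recent selections."""
    return deque(maxlen=window)


def selection_weight(task, recent_sources):
    """Hypothetical weighting: priority discounted by recent picks of the same source."""
    recent_hits = Counter(recent_sources)[task.source]
    return task.priority / (1 + recent_hits)


def select_task(tasks, recent_sources, rng=random):
    """Pick one task at random, in proportion to its selection weight."""
    weights = [selection_weight(t, recent_sources) for t in tasks]
    chosen = rng.choices(tasks, weights=weights, k=1)[0]
    recent_sources.append(chosen.source)  # remember the pick for later calls
    return chosen
```

Under this assumed weighting, a task whose data source was selected twice within the window carries one third of the weight of an equal-priority task from a source that was not selected recently.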
Graf and Uzun fail to explicitly teach:
navigating within the corresponding data source from the base site to the respective site as specified by the specified path.
However, Zhu teaches:
navigating within the corresponding data source from the base site to the respective site as specified by the specified path (at least Abstract; col. 3, line 21 through col. 4, line 47; Figures 1B, 2 → Zhu generally teaches a search engine crawler(s) 208 (aka “robots”, “bots”) having document identifiers (e.g., URLs) corresponding to documents on a network. The crawling process navigates to each assigned URL and is managed by schedulers. Documents referenced by the document identifiers are then retrieved and delivered to content processing servers 210 which perform several tasks. The crawled information is then stored in various log files).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have combined the teachings of Zhu with those of Graf and Uzun as all three inventions are related to the extraction of data from documents. Adding the teaching of Zhu to Graf and Uzun provides the combination with methods where the order by which URLs are crawled is determined by the importance of the document referenced by the URL.
Graf, Uzun and Zhu fail to explicitly teach:
parsing demographic information from the respective site into separate categories; storing the parsed demographic information in separate databases based on the separate categories.
However, Liu teaches:
parsing demographic information from the respective site into separate categories; storing the parsed demographic information in separate databases based on the separate categories (at least p. 1825, Abstract; pp. 1826-1828, Methods; pp. 1830-1833, Results → Liu describes an integrated biomedical Natural Language Processing (NLP) pipeline that automatically extracts a comprehensive set of patient demographics and medical information from on-line health forums. The pipeline can be adopted to construct structured personal health profiles from unstructured user-contributed content on eHealth social media sites. Further, Liu teaches determining sentence classes including demographic information (DMO and its subclasses, see Table 1). The output from the NLP is a structured XML-formatted patient health profile(s)).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have combined the teachings of Liu with those of Graf, Uzun and Zhu as all these inventions are related to the extraction of data from documents. Adding the teaching of Liu to Graf, Uzun and Zhu provides the combination with NLP methods by which patient demographic and other information may be located, extracted and stored.

Regarding dependent claim 2, Graf and Uzun fail to explicitly teach:
the navigating comprises iteratively accessing the respective site for a predetermined number of attempts when the corresponding data source or respective site is initially inaccessible.
However, Zhu teaches:
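the navigating comprises iteratively accessing the respective site for a predetermined number of attempts when the corresponding data source or respective site is initially inaccessible (at least col. 6, lines 15-43; col. 8, line 45 through col. 9, line 29; Figures 4B, 5, 7-8 → Zhu teaches a history log file 218 containing a set of history log records 432. Each history log record 432 contains fields that keep a record of the success/failure of a crawl attempt. Further, Zhu teaches a hash map 504 that maps a URL fingerprint (FP) to a corresponding URL record 508 in a URL status file 506. The URL records 508 include a page importance score, a prior crawl status, and a segment ID. The prior crawl status can include multiple fields, including an error field and an unreachable field. The error field records information associated with a download error (e.g. HTTP Error 4xx) which may indicate that a web page does not exist, or that access is not authorized, or some other error. The error field indicates the number of consecutive times an attempt to download the URL resulted in an error. The unreachable field records information associated with a URL being unreachable (e.g., because the host server is busy). For example, the unreachable field can include the number of consecutive times the URL was unreachable in previous crawls. The segment ID identifies the crawl segment associated with the URL FP at the time that the document download operation was performed or attempted).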
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have combined the teachings of Zhu with those of Graf and Uzun as all three of these inventions are related to the extraction of data from documents. Adding the teaching of Zhu to Graf and Uzun provides the combination with efficient crawling methods that keep track of errors encountered when attempting to crawl a given document.

Regarding dependent claim 3, Graf and Uzun fail to explicitly teach:
receiving an error notification when the corresponding data source or the respective site is inaccessible after completing the predetermined number of attempts.
However, Zhu teaches:
receiving an error notification when the corresponding data source or the respective site is inaccessible after completing the predetermined number of attempts (at least col. 6, lines 15-43; col. 8, line 45 through col. 9, line 29; Figures 4B, 5, 7-8 → Zhu teaches a history log file 218 containing a set of history log records 432. Each history log record 432 contains fields that keep a record of the success/failure of a crawl attempt. Further, Zhu teaches a hash map 504 that maps a URL fingerprint (FP) to a corresponding URL record 508 in a URL status file 506. The URL records 508 include a page importance score, a prior crawl status, and a segment ID. The prior crawl status can include multiple fields, including an error field and an unreachable field. The error field records information associated with a download error (e.g. HTTP Error 4xx) which may indicate that a web page does not exist, or that access is not authorized, or some other error. The error field indicates the number of consecutive times an attempt to download the URL resulted in an error. The unreachable field records information associated with a URL being unreachable (e.g., because the host server is busy). For example, the unreachable field can include the number of consecutive times the URL was unreachable in previous crawls. The segment ID identifies the crawl segment associated with the URL FP at the time that the document download operation was performed or attempted).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have combined the teachings of Zhu with those of Graf and Uzun as all three of these inventions are related to the extraction of data from documents. Adding the teaching of Zhu to Graf and Uzun provides the combination with efficient crawling methods that keep track of errors encountered when attempting to crawl a given document.
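For illustration only, the limitations of claims 2 and 3 (a predetermined number of access attempts, followed by an error notification) can be sketched as follows. This is a minimal sketch under stated assumptions and does not reproduce Zhu's log records: the names `SiteUnreachableError` and `access_with_retries`, the caller-supplied `fetch` function, and the default of three attempts are hypothetical.

```python
class SiteUnreachableError(Exception):
    """Error notification raised once every allowed attempt has failed."""


def access_with_retries(fetch, url, max_attempts=3):
    """Call fetch(url) up to max_attempts times; raise if all attempts fail."""
    consecutive_errors = 0  # count of failed attempts in a row
    last_error = None
    for _ in range(max_attempts):
        try:
            return fetch(url)
        except OSError as exc:  # e.g. connection refused or timed out
            consecutive_errors += 1
            last_error = exc
    raise SiteUnreachableError(
        f"{url} unreachable after {consecutive_errors} attempts"
    ) from last_error
```

A caller would catch `SiteUnreachableError` and treat it as the notification that the site remained inaccessible after the allotted attempts.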

Regarding dependent claim 4, Graf and Uzun fail to explicitly teach:
selecting the task further comprises assigning the task from a randomly selected data source from among data sources having a same priority level.
However, Zhu teaches:
selecting the task further comprises assigning the task from a randomly selected data source from among data sources having a same priority level (at least col. 10, line 28 through col. 11, line 42; Figure 8 → Zhu teaches a URL scheduling process that schedules the crawling of URLs. After computing 808 priority scores for the URLs, the URLs are sorted 810 by priority score and the top N sorted URLs are selected 812 as candidates to be crawled).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have combined the teachings of Zhu with those of Graf and Uzun as all three of these inventions are related to the extraction of data from documents. Adding the teaching of Zhu to Graf and Uzun provides the combination with efficient crawling methods that determine in what order URLs to sites are to be crawled.
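For illustration only, the score, sort and top-N sequence described above for Zhu can be sketched as follows. This is a minimal sketch under stated assumptions: `priority_score` is a hypothetical stand-in and is not Zhu's priority score function 730, and the names `select_candidates`, `url_importance` and `boost` are likewise hypothetical.

```python
def priority_score(page_importance, boost=1.0):
    """Hypothetical stand-in for a priority score function."""
    return page_importance * boost


def select_candidates(url_importance, n):
    """Return the n URLs with the highest priority scores, best first."""
    scored = [(priority_score(imp), url) for url, imp in url_importance.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # sort by score only
    return [url for _, url in scored[:n]]
```

Note that a sort of this kind orders equal-priority URLs deterministically; it does not by itself select at random among data sources having the same priority level.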

Regarding dependent claim 5, Graf and Uzun fail to explicitly teach:
managing a plurality of data extractors performing tasks on each of the plurality of data sources.
However, Zhu teaches:
managing a plurality of data extractors performing tasks on each of the plurality of data sources (at least col. 3, line 24 through col. 5, line 6; col. 6, line 57 through col. 7, line 62; Figures 1A-B, 2-3, 4C, 8 → Zhu teaches that a URL server 206 requests URLs from the URL managers 204, 304 (responsible for managing the distribution of URLs to URL server 306) to provide the URL server 206 with URLs obtained from data structure 100. The URL server 206 then distributes URLs from the URL managers 204 to crawlers 208 (hereinafter also called “robots” or “bots”) to be crawled).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have combined the teachings of Zhu with those of Graf and Uzun as all three of these inventions are related to the extraction of data from documents. Adding the teaching of Zhu to Graf and Uzun provides the combination with efficient crawling methods that manage a number of crawlers that are each tasked with crawling a list of URLs referencing sites from which to extract information.

Regarding dependent claim 6, Graf and Uzun fail to explicitly teach:
Note: please correct the typographical error in this claim.

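managing the plurality of data extractors comprises managing a maximum number of data extractors performing tasks on each of the plurality of data sources.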
However, Zhu teaches:
managing the plurality of data extractors comprises managing a maximum number of data extractors performing tasks on each of the plurality of data sources (at least col. 3, line 24 through col. 5, line 6; col. 6, line 57 through col. 7, line 62; Figures 1A-B, 2-3, 4C, 8 → Zhu teaches that a URL server 206 requests URLs from the URL managers 204, 304 (responsible for managing the distribution of URLs to URL server 306) to provide the URL server 206 with URLs obtained from data structure 100. The URL server 206 then distributes URLs from the URL managers 204 to crawlers 208 (hereinafter also called “robots” or “bots”) to be crawled).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have combined the teachings of Zhu with those of Graf and Uzun as all three of these inventions are related to the extraction of data from documents. Adding the teaching of Zhu to Graf and Uzun provides the combination with efficient crawling methods that manage a number of crawlers that are each tasked with crawling a list of URLs referencing sites from which to extract information.

Regarding dependent claim 7, Graf and Uzun fail to explicitly teach:
when the maximum number of data extractors for a first data source of the plurality of data sources is reached, the method further comprises assigning tasks of a second data source of the plurality of data sources having a same priority level as the first data source.
However, Zhu teaches:
when the maximum number of data extractors for a first data source of the plurality of data sources is reached, the method further comprises assigning tasks of a second data source of the plurality of data sources having a same priority level as the first data source (at least col. 3, line 24 through col. 5, line 6; col. 6, line 57 through col. 7, line 62; Figures 1A-B, 2-3, 4C, 8 → Zhu teaches that a URL server 206 requests URLs from the URL managers 204, 304 (responsible for managing the distribution of URLs to URL server 306) to provide the URL server 206 with URLs obtained from data structure 100. The URL server 206 then distributes URLs from the URL managers 204 to crawlers 208 (hereinafter also called “robots” or “bots”) to be crawled).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have combined the teachings of Zhu with those of Graf and Uzun as all three of these inventions are related to the extraction of data from documents. Adding the teaching of Zhu to Graf and Uzun provides the combination with efficient crawling methods that manage a number of crawlers that are each tasked with crawling a list of URLs referencing sites from which to extract information.

Regarding dependent claim 8, Graf and Uzun fail to explicitly teach:
when the maximum number of data extractors for a first data source is reached, the method further comprises assigning tasks of a second data source of the plurality of data sources having a different priority level as the first data source.
However, Zhu teaches:
when the maximum number of data extractors for a first data source is reached, the method further comprises assigning tasks of a second data source of the plurality of data sources having a different priority level as the first data source (at least col. 3, line 24 through col. 5, line 6; col. 6, line 57 through col. 7, line 62; Figures 1A-B, 2-3, 4C, 8 → Zhu teaches that a URL server 206 requests URLs from the URL managers 204, 304 (responsible for managing the distribution of URLs to URL server 306) to provide the URL server 206 with URLs obtained from data structure 100. The URL server 206 then distributes URLs from the URL managers 204 to crawlers 208 (hereinafter also called “robots” or “bots”) to be crawled).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have combined the teachings of Zhu with those of Graf and Uzun as all three of these inventions are related to the extraction of data from documents. Adding the teaching of Zhu to Graf and Uzun provides the combination with efficient crawling methods that manage a number of crawlers that are each tasked with crawling a list of URLs referencing sites from which to extract information.

Regarding dependent claim 9, Graf and Uzun fail to explicitly teach:
managing the number of data extractors comprises periodically adjusting the number of data extractors performing tasks on the corresponding data source.
However, Zhu teaches:
managing the number of data extractors comprises periodically adjusting the number of data extractors performing tasks on the corresponding data source (at least col. 3, line 24 through col. 5, line 6; col. 6, line 57 through col. 7, line 62; Figures 1A-B, 2-3, 4C, 8 → Zhu teaches that a URL server 206 requests URLs from the URL managers 204, 304 (responsible for managing the distribution of URLs to URL server 306) to provide the URL server 206 with URLs obtained from data structure 100. The URL server 206 then distributes URLs from the URL managers 204 to crawlers 208 (hereinafter also called “robots” or “bots”) to be crawled).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have combined the teachings of Zhu with those of Graf and Uzun as all three of these inventions are related to the extraction of data from documents. Adding the teaching of Zhu to Graf and Uzun provides the combination with efficient crawling methods that manage a number of crawlers that are each tasked with crawling a list of URLs referencing sites from which to extract information.

Regarding claims 10-18, claims 10-18 merely recite a non-transitory program storage device for storing the method of claims 1-9, respectively. Thus, Graf in view of Uzun, Zhu and Liu teaches every limitation of claims 10-18, and provides proper motivation, as indicated in the rejections of claims 1-9.

Regarding independent claim 19, Graf teaches:
A system comprising:
a first computing device comprising: a first memory; and a first processor communicatively coupled to the first memory and configured to:
generate a decision tree for each of a plurality of data sources, wherein the decision tree comprises one or more paths to respective sites of each of the plurality of data sources (at least p. 1, [0004]; p. 4, [0064]; p. 5, [0078]-[0079], [0081]; pp. 10-11, [0118]-[0120]; pp. 10-12, [0124]-[0148]; p. 13, [0167]-[0169]; p. 14, [0172]; Figures 3, 5-6, 8, 34, 36, 40 → Graf teaches the building (generation) of “text mining term models” that may be constructed for documents either automatically or by building highly specific decision trees using a wizard (see [0064]));
Graf fails to explicitly teach:
generate a list of tasks for each of the plurality of data sources based on the decision tree, wherein each task corresponds to a respective one of the one or more paths and comprises instructions for extracting demographic information from the respective site
However, Uzun teaches:
generate a list of tasks for each of the plurality of data sources based on the decision tree, wherein each task corresponds to a respective one of the one or more paths and comprises instructions for extracting demographic information from the respective site (at least pp. 1181-1184, Summary, 3.1 Blocks; Section 5; Figure 1 → Uzun teaches a focused web crawler (iCrawler; intelligent crawler) that performs Web content extraction that includes learning which HTML tags refer to which blocks (e.g., menus, links, main texts, headlines, summaries, additional necessaries, and unnecessary texts) from Web pages using a decision tree learning algorithm. The decision tree automates pattern extraction by discovering patterns as rules. These rules are then used to extract the desired information from Web pages).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have combined the teachings of Uzun with those of Graf as both inventions are related to the extraction of data from documents. Adding the teaching of Uzun to Graf provides Graf with an efficient method of web crawling and data extraction that only extracts certain types of blocks from web documents.
Graf and Uzun fail to explicitly teach:
assign a task from the list of tasks to a second computing device based on a priority level of the corresponding data source; and transmit the assigned task to a corresponding second computing device;
However, Zhu teaches:
assign a task from the list of tasks to a second computing device based on a priority level of the corresponding data source; and transmit the assigned task to a corresponding second computing device (at least Abstract; col. 10, line 28 through col. 11, line 42; Figures 1A-B, 2-3, 8 → Zhu teaches a search engine crawler that includes a distributed set of URL schedulers that are associated with one or more segments of document identifiers (e.g., URLs) corresponding to documents on a network (e.g., WWW). Further, Zhu teaches that after deleting URLs (using various methods of filtering) from the URL list 726 (of URLs to be crawled), the page importance scores for the remaining URLs (to be crawled) are used to compute 808 priority scores using a priority score function 730. Zhu organizes the list of URLs to be crawled by importance of the page to be crawled. Further, Zhu teaches a distributed computing system utilizing a number of different servers and crawlers).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have combined the teachings of Zhu with those of Graf and Uzun as all three inventions are related to the extraction of data from documents. Adding the teaching of Zhu to Graf and Uzun provides the combination with methods where the order by which URLs are crawled is determined by the importance of the document referenced by the URL.
Graf further teaches:
the second computing device comprising: a second memory; and a second processor communicatively coupled to the second memory and configured to (at least p. 3, [0053]; Figure 27 → Graf describes a client/server system):
Graf and Uzun fail to explicitly teach:
execute the assigned task to navigate the corresponding data source to the respective site.
However, Zhu teaches:
execute the assigned task to navigate the corresponding data source to the respective site (at least Abstract; col. 3, line 21 through col. 4, line 47; Figures 1B, 2 → Zhu generally teaches a crawler(s) 208 (aka “robots”, “bots”) having document identifiers (e.g., URLs) corresponding to documents on a network. The crawling process navigates to each assigned URL and is managed by schedulers. Documents referenced by the document identifiers are then retrieved and delivered to content processing servers 210 which perform several tasks. The crawled information is then stored in various log files).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have combined the teachings of Zhu with those of Graf and Uzun as all three inventions are related to the extraction of data from documents. Adding the teaching of Zhu to Graf and Uzun provides the combination with methods where the order by which URLs are crawled is determined by the importance of the document referenced by the URL.
Graf, Uzun and Zhu fail to explicitly teach:
and extract the demographic information from the respective site based on the assigned task; and transmit the extracted demographic information to the first computing device, wherein upon receipt of the extracted demographic information, parse the extracted demographic information into separate categories; and store the parsed demographic information in separate databases based on the separate categories.
However, Liu teaches:
and extract the demographic information from the respective site based on the assigned task; and transmit the extracted demographic information to the first computing device, wherein upon receipt of the extracted demographic information, parse the extracted demographic information into separate categories; and store the parsed demographic information in separate databases based on the separate categories (at least p. 1825, Abstract; pp. 1826-1828, Methods; pp. 1830-1833, Results → Liu describes an integrated biomedical Natural Language Processing (NLP) pipeline that automatically extracts a comprehensive set of patient demographics and medical information from on-line health forums. The pipeline can be adopted to construct structured personal health profiles from unstructured user-contributed content on eHealth social media sites. Further, Liu teaches determining sentence classes including demographic information (DMO and its subclasses, see Table 1). The output from the NLP is a structured XML-formatted patient health profile(s)).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have combined the teachings of Liu with those of Graf, Uzun and Zhu as all these inventions are related to the extraction of data from documents. Adding the teaching of Liu to Graf, Uzun and Zhu provides the combination with NLP methods by which patient demographic and other information may be located, extracted and stored.

Regarding claim 20, claim 20 merely recites a system to execute the method of claim 4. Thus, Graf, Uzun and Zhu teach every limitation of claim 20, and provide proper motivation, as indicated in the rejection of claim 4.

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to James H Blackwell whose telephone number is (571)272-4089. The examiner can normally be reached M-F 04:30AM - 12:30PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Cesar Paula, can be reached at 571-272-4128. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format.




/James H. Blackwell/
02/24/2022

/CESAR B PAULA/Supervisory Patent Examiner, Art Unit 2177