DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Arguments
Applicant's arguments filed 12/13/2021 have been fully considered but they are not persuasive. The amended claim 1 teaches the following crawl flow:
(a) Receive a URI;
(b) Retrieve source information (i.e., pool of data elements) of the URI of (a);
(c) Determine if a data element of (b) is relevant;
(d) Analyze the relevant data element of (c) to determine its importance factor;
(e) Assign a chronological score to the relevant data element of (d); and
(f) Crawl the relevant data element of (e) based on the chronological score.
According to the instant specification [0036]-[0037], the source information associated with a URI includes a pool of data elements from the user-viewable hypertext document, e.g., CSS, content of web page such as text, a hyperlink, or metadata of data element. Since none of these data elements are directly found in the URI of step (a), Examiner interprets step (b) to include first downloading (i.e., crawling) the page associated with the URI, from which these data elements are extracted (i.e., retrieved). It should be noted that the data elements crawled in step (f) are not the same as the URI of step (a). Instead, they are hyperlink data elements found in the page.
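For illustration only, the crawl flow (a)-(f) as interpreted above may be sketched in Python as follows. This is a minimal sketch and not an implementation from the instant specification or from Najork. The in-memory FAKE_WEB table, the helper names, the stylesheet-based relevance test, and the "news" importance rule are all assumptions introduced for the example.

```python
import heapq

# Hypothetical in-memory "web": each URI maps to the hyperlinks found in its page.
FAKE_WEB = {
    "http://example.com/": [
        "http://example.com/news",
        "http://example.com/logo.css",
        "http://example.com/about",
    ],
    "http://example.com/news": [],
    "http://example.com/about": [],
}


def retrieve(uri):
    """Step (b): download the page and return its pool of data elements."""
    return FAKE_WEB.get(uri, [])


def is_relevant(element):
    """Step (c): hypothetical qualifier condition (stylesheets are skipped)."""
    return not element.endswith(".css")


def importance(element):
    """Step (d): hypothetical importance factor (news pages rank higher)."""
    return 2 if "news" in element else 1


def crawl_flow(uri):
    pool = retrieve(uri)                            # steps (a) and (b)
    relevant = [e for e in pool if is_relevant(e)]  # step (c)
    # Steps (d) and (e): the negated importance serves as the score, so the
    # most important element is popped first; the index breaks ties.
    scored = [(-importance(e), i, e) for i, e in enumerate(relevant)]
    heapq.heapify(scored)
    order = []
    while scored:                                   # step (f)
        _, _, element = heapq.heappop(scored)
        retrieve(element)                           # crawl the hyperlink element
        order.append(element)
    return order
```

In this sketch the elements crawled in step (f) are hyperlinks extracted from the page of step (b), not the URI received in step (a).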
Applicant argues that Najork does not teach steps (d)-(f). To the contrary, Najork teaches each of the steps (a)-(f) as follows:
(a) Receive a URI;
Najork teaches a web crawler system. When a new URL (i.e., URI) is discovered (i.e., received), it is added to a priority queue based on a predetermined policy (6:43-46).
(b) Retrieve source information (i.e., pool of data elements) of the URI of (a);
Najork downloads (i.e., retrieves) the page (i.e., source information) associated with the top URL of the priority queues (i.e., URI of step (a)) (8:66-9:2). The page is processed by various modules to extract information, including URLs contained in the page (1:48-64).
(c) Determine if a data element of (b) is relevant;
Najork extracts and stores information about the page of step (b), such as information of a specific type in the page. Najork also determines if the page has been indexed, if the page has changed by more than a threshold amount (1:48-64), or if a URL contained in the page has been visited (8:10-15). If such a qualifier condition is satisfied (i.e., relevance factor), the corresponding data element should be crawled further (i.e., relevant) (1:48-64).
(d) Analyze the relevant data element of (c) to determine its importance factor;
When the identified relevant data element of step (c) is a URL, Najork adds it to one of the priority queues (1:50-53). Najork prioritizes the to-be-crawled URLs to maximize the perceived accuracy or quality (i.e., importance factor) of the pages from crawling, by preferring pages from web servers known for high quality content (3:1-9), or pages whose content is known to change rapidly such as news sites (3:13-15).
(e) Assign a chronological score to the relevant data element of (d); and
Najork maintains a parallel set of priority queues of to-be-crawled URLs, each associated with a distinct priority level (4:3-11). After downloading a page, Najork extracts relevant URLs (i.e., relevant data elements) from the page and adds them to the priority queues (1:48-64). A newly discovered relevant URL is assigned a priority level (i.e., chronological score) based on properties (i.e., importance factor of step (d)) of the URL or the page in which the URL was found (4:18-22).
(f) Crawl the relevant data element of (e) based on the chronological score.
Najork repeatedly selects the top URL from the priority queues in the priority order from high to low (i.e., chronological score of step (e)), and downloads (i.e., crawls) and processes the associated page (2:57-62).
Applicant further states (pp. 10) that Najork does not teach steps (a)-(f) sequentially or as an ordered combination.
Najork’s pipeline (fig. 5) involves 3 phases in order: crawl (#200-#202), extract (#204-#212), and store (#214 & #220). The crawl phase starts with selecting the top URL (i.e., step (a)) from the priority queues, followed by downloading the associated page (i.e., step (b)). The extract phase starts with extracting a URL from the page, followed by determining if it needs to be crawled further (i.e., step (c)). If so, its crawl priority is determined (i.e., steps (d)-(e)). The store phase adds the prioritized URLs to the priority queues (i.e., step (f)). In other words, even though Najork does not explicitly number the steps sequentially, every step depends logically on the one before it. Thus, Najork teaches the crawl flow of claim 1 in the same order.
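The logical dependency between the three phases may be illustrated with the following Python sketch. It is a simplified illustration under stated assumptions, not code from Najork: the PAGES link graph, the priority_of policy, and the use of a single heap in place of multiple priority queues are all hypothetical.

```python
import heapq
from itertools import count

# Hypothetical link graph standing in for downloaded pages.
PAGES = {
    "a": ["b", "c", "b"],
    "b": ["a", "d"],
    "c": [],
    "d": [],
}


def priority_of(url):
    """Hypothetical policy: "d" is treated as high quality (0 = highest)."""
    return 0 if url == "d" else 1


def run_pipeline(seed):
    tie = count()  # insertion counter keeps equal-priority URLs in FIFO order
    frontier = [(priority_of(seed), next(tie), seed)]
    seen = {seed}
    downloaded = []
    while frontier:
        # Crawl phase: select the top URL and download its page.
        _, _, url = heapq.heappop(frontier)
        links = PAGES.get(url, [])
        downloaded.append(url)
        # Extract phase: each extracted URL is checked before it is prioritized.
        for link in links:
            if link in seen:
                continue  # already visited or queued, so not crawled further
            seen.add(link)
            prio = priority_of(link)
            # Store phase: the prioritized URL is added to the frontier.
            heapq.heappush(frontier, (prio, next(tie), link))
    return downloaded
```

Each phase consumes the output of the previous one: a URL cannot be prioritized before it is extracted, and it cannot be extracted before its parent page is downloaded.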
Applicant further states (pp. 11) that Najork does not limit crawling to the subset of relevant data elements. Instead, Najork assigns priority to all data elements. This is not correct.
Najork selects the top URL from the priority queues, and downloads the associated page. Najork extracts every URL from the page, and determines if it needs to be crawled further (i.e., relevant), e.g., if it has been visited (8:10-15). In other words, only a subset of URLs from the page will be crawled further, thus reducing processing cost.
Najork then assigns priority to those relevant URLs to maximize the perceived accuracy or quality of the pages from crawling, by preferring pages from web servers known for high quality content (3:1-9), or pages whose content is known to change rapidly such as news sites (3:13-15). The purpose of prioritization before crawling is to ensure the overall quality of downloaded pages for downstream applications by delaying the crawling of low quality pages (2:63-3:12).
Applicant further states (pp. 11) that Najork determines relevancy and prioritization of URLs in two separate steps. This is correct, and they correspond respectively to steps (c) and (d)-(e) in the crawl flow of claim 1.
Applicant further states (pp. 12) that, according to the claimed subject matter, “the corresponding to the URL are downloaded first and then a priority is assigned to the URLs”. Examiner interprets this quoted statement to mean that the page associated with the URI of step (a) is downloaded first (i.e., step (b)), and then a priority is assigned to each relevant URL found in the page (i.e., step (d)). In other words, the newly discovered URLs are assigned a priority before being crawled/downloaded further (i.e., step (f)), which is the same ordering as Najork.
Applicant gives an example (pp. 12) of 1000 data elements, of which only 15 are relevant. According to the crawl flow (a)-(f) of claim 1, these 1000 data elements cannot be found directly from the URI of step (a). Instead, they have to be retrieved from the page associated with the URI, meaning that the page has to be downloaded first, from which the 1000 data elements are extracted (i.e., step (b)) and evaluated to identify the 15 relevant data elements (i.e., step (c)), without actually downloading the 1000 data elements. The 15 relevant data elements are then prioritized and added to the priority queues. Thus Najork does not consume more processing power than claim 1.
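The arithmetic of Applicant's example can be checked with a short Python sketch. The element URLs and the is_relevant rule below are hypothetical placeholders chosen only so that 15 of 1000 elements qualify; the point shown is that relevance is evaluated on extracted elements without downloading them.

```python
def count_downloads(elements, is_relevant):
    """Return (total downloads, number of relevant elements).

    One download fetches the page of step (a)/(b). Relevance is evaluated on
    the extracted elements themselves, so only relevant ones are downloaded.
    """
    page_downloads = 1
    relevant = [e for e in elements if is_relevant(e)]
    return page_downloads + len(relevant), len(relevant)


# Hypothetical pool of 1000 data elements extracted from one page.
elements = [f"http://example.com/item/{i}" for i in range(1000)]


def is_relevant(element):
    """Hypothetical qualifier condition: only items 0-14 are relevant."""
    return int(element.rsplit("/", 1)[1]) < 15


total_downloads, relevant_count = count_downloads(elements, is_relevant)
```

Under these assumptions, relevant_count is 15 and total_downloads is 16, i.e., the remaining 985 elements are never downloaded.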
Applicant further states (pp. 13) that Najork contradicts the claim 1 ordering of downloading followed by determining relevancy followed by assigning chronology. As explained previously, both claim 1 and Najork teach the same ordering without contradiction.
In summary, Najork teaches the argued limitations of independent claims 1, 10 and 16.

Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless – (a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.


Claims 1, 3-5, 8-10, 12 and 15-16 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Najork et al. US patent 6,351,755 [herein “Najork”].
Claim 1 recites “A system for aggregating data elements from a pool of data elements associated with at least one Uniform Resource Identifier using a crawler, wherein the system includes a computer system for executing data processing tasks, wherein the system comprises: a data processing arrangement comprising a communication interface for accessing a wide area computer network and the crawler, wherein the crawler is configured to: (a) receive the at least one Uniform Resource Identifier;”
Najork teaches a web crawler system. When a new URL (i.e., Uniform Resource Identifier or URI) is discovered (i.e., received), it is added to one of the priority queues based on a predetermined policy (6:43-46).
Claim 1 further recites “(b) retrieve source information associated with the at least one Uniform Resource Identifier, wherein the source information includes the pool of data elements;”
Najork downloads (i.e., retrieves) the page (i.e., source information) associated with the top URL of the priority queues (i.e., URI of step (a)) (8:66-9:2). The page is processed by various modules to extract information (i.e., pool of data elements), including URLs contained in the page (1:48-64).
Claim 1 further recites “(c) determine at least one relevant data element from the pool of data elements, wherein determining the at least one relevant data element includes: identifying at least one attribute associated with each data element in the pool of the data elements, wherein the at least one attribute associated with each data element includes any one of: a type associated with each data element and/or a feature associated with each data element, wherein the type describes a category to which the data element belongs, and the feature describes a characteristic of the data element;”
Najork extracts (i.e., identifies) and stores information (i.e., attributes) about the page of step (b). Examples of various data collected about the page include its MIME type (i.e., category) and size (i.e., characteristic), date/time and duration of the download, date/time of last modification and expiration, etc. (1:65-2:4).
Claim 1 further recites “evaluating any one of: the type associated with each data element and/or the feature associated with each data element, based on predefined qualifier conditions to generate at least one evaluated attribute associated with each data element, wherein the predefined qualifier conditions include at least one relevant type and at least one relevant feature associated with each data element;”
In Najork, a processing module might look for information of a specific type (i.e., relevant type) in the page. A processing module might also determine if the page has been indexed (i.e., evaluated attribute), or if the page has changed by more than a threshold amount (i.e., relevant feature) (1:48-64), or if a URL contained in the page has been visited (8:10-15).
Claim 1 further recites “determining a relevance factor for each data element based on the generated at least one evaluated attribute associated with each data element, wherein the relevance factor for each data element refers to a condition that determines a relation of the data element, wherein the relation means either relevant or irrelevant; and using the relevance factor to determine the at least one relevant data element from the pool of data elements;”
In Najork, a processing module might look for information of a specific type in the page. A processing module might also determine if the page has been indexed, or if the page has changed by more than a threshold amount, or if a URL contained in the page has been visited. If such a qualifier condition is satisfied, the corresponding data element is considered relevant (i.e., relevance factor) (1:48-64). In particular, when a relevant data element is a URL, it should be crawled further.
Claim 1 further recites “(d) analyze the determined at least one relevant data element to determine an importance factor associated therewith, wherein the importance factor relates to an importance of each relevant data element of the at least one Uniform Resource Identifier;”
When the identified relevant data element of step (c) is a URL, Najork adds it to one of the priority queues (1:50-53). Najork prioritizes the to-be-crawled URLs to maximize the perceived accuracy or quality (i.e., importance factor) of the pages from crawling, by preferring pages from web servers known for high quality content (3:1-9), or pages whose content is known to change rapidly such as news sites (3:13-15).
Claim 1 further recites “(e) assign a chronological score to each of the at least one relevant data element based on the determined importance factor thereof, wherein the chronological score refers to a numerical value that is used to arrange the at least one relevant data element; and”.
Najork maintains (i.e., arranges) a parallel set of priority queues of to-be-crawled URLs, each associated with a distinct priority level (i.e., numerical value) (4:3-11). After downloading a page, Najork extracts relevant URLs (i.e., relevant data elements) from the page and adds them to the priority queues (1:48-64). A newly discovered URL is assigned a priority level (i.e., chronological score) based on properties (i.e., importance factor of step (d)) of the URL or the web page on which the URL was found (4:18-22).
Claim 1 further recites “(f) crawl each of the at least one relevant data element based on the assigned chronological score thereof;”
Najork repeatedly selects the top URL from the priority queues in the order of priority from high to low (i.e., chronological score of step (e)), and downloads (i.e., crawls) and processes the associated page (2:57-62).
Claim 1 further recites “a database arrangement communicably coupled to the data processing arrangement, wherein the database arrangement is configured to aggregate the data elements from the pool of data elements based on the assigned chronological score and the relevance factor associated with each data element.”
Najork provides a set of tools for storing an extensible set of data with each URL. These tools enable the processing modules to store a record of information associated with each download, each record being a set (i.e., pool) of name/value pairs (fig. 1, #139; 3:61-67). These records of information are added to a database of processed URLs, from which the download history can be processed (i.e., aggregated) offline (9:31-36), based on download time, last modify time, priority level (i.e., chronological score), etc.
Claims 10 and 16 are analogous to claim 1, and are similarly rejected.
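The priority-ordered selection mapped to steps (e) and (f) above can be illustrated with a small Python sketch of a set of queues, one per priority level. This is an illustrative sketch only: the PriorityQueues class, its method names, and the convention that level 0 is the highest priority are assumptions, not details taken from Najork.

```python
from collections import deque


class PriorityQueues:
    """A set of FIFO queues, one per distinct priority level."""

    def __init__(self, levels):
        # Assumption: index 0 is the highest priority level.
        self.queues = [deque() for _ in range(levels)]

    def add(self, url, level):
        """Assign a priority level to a URL by placing it in that queue."""
        self.queues[level].append(url)

    def pop_top(self):
        """Return the next URL from the highest-priority non-empty queue."""
        for queue in self.queues:
            if queue:
                return queue.popleft()
        return None


def drain(pq):
    """Select URLs in priority order, high to low, until all queues are empty."""
    order = []
    url = pq.pop_top()
    while url is not None:
        order.append(url)
        url = pq.pop_top()
    return order
```

For example, URLs added at levels 2, 0, 1 and 0 (in that order) are selected with both level-0 URLs first, then the level-1 URL, then the level-2 URL.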

Claim 3 recites “The system of claim 1, wherein the data processing arrangement is configured to generate an agent application.”
Najork uses multiple concurrent threads (i.e., agent applications) to process URLs in a set of priority queues respecting the priority order (4:3-11).

Claim 4 recites “The system of claim 3, wherein the agent application receives the at least one Uniform Resource Identifier.”
Najork uses multiple concurrent threads (i.e., agent applications) to process URLs in a set of priority queues respecting the priority order (4:3-11). Each thread takes the top URL in the priority queues and downloads (i.e., retrieves) the corresponding page (8:66-9:2).

Claim 5 recites “The system of claim 1, wherein the data element includes any one of: hyperlinks, documents, text, metadata associated with the one or more elements.”
In Najork, a processing module of the crawler might look for information (i.e., data elements) of a specific type in the downloaded page, e.g., a URL (i.e., hyperlink) (1:48-64).
Claim 12 is analogous to claim 5, and is similarly rejected.

Claim 8 recites “The system of claim 1, wherein the importance factor is determined based on web content associated with the at least one relevant data element.”
Najork’s crawler prioritizes the to-be-crawled URLs to maximize the perceived accuracy or quality (i.e., importance factor) of downloaded pages (i.e., data elements), by preferring pages (i.e., web content) from web servers with known high quality content (3:1-9), or pages whose content is known to change rapidly such as news sites (3:13-15).
Claim 15 is analogous to claim 8, and is similarly rejected.

Claim 9 recites “The system of claim 1, wherein the database arrangement includes a data storage unit, wherein the data storage unit is configured to aggregate the at least one relevant data element based on the assigned chronological score.”
Najork’s crawler includes a set of tools for storing an extensible set of data with each URL. These tools enable the processing modules to store a record of information associated with each download, each record being a set of name/value pairs (fig. 1, #139; 3:61-67). These records of information are added to a database (i.e., data storage unit) of processed URLs, from which the download history can be processed (i.e., aggregated) offline (9:31-36), based on download time, last modify time, priority level (i.e., chronological score), etc.
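As a hedged illustration of how records of name/value pairs could be aggregated offline by priority level, consider the following Python sketch. The record fields, the sample values, and the aggregate_by helper are hypothetical and are not taken from Najork.

```python
from collections import defaultdict

# Hypothetical download records, each a set of name/value pairs.
records = [
    {"url": "http://example.com/a", "priority": 0, "download_time": 3},
    {"url": "http://example.com/b", "priority": 1, "download_time": 1},
    {"url": "http://example.com/c", "priority": 0, "download_time": 2},
]


def aggregate_by(records, name):
    """Group record URLs by the value stored under the given name."""
    groups = defaultdict(list)
    for record in records:
        groups[record[name]].append(record["url"])
    return dict(groups)


by_priority = aggregate_by(records, "priority")
```

Here by_priority groups the two priority-0 URLs together and the priority-1 URL separately; the same helper could group by any other stored name, such as download time.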

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claim 2 is rejected under 35 U.S.C. 103 as being unpatentable over Najork as applied to claim 1 above, and further in view of Najork, US patent 7,139,747 [herein “Najork2”].
Claim 2 recites “The system of claim 1, wherein the crawler is implemented in a distributed architecture.”
Najork teaches claim 1 but does not disclose the distributed architecture of this claim; however, Najork2 achieves efficient crawling by distributing the URLs to be downloaded among a plurality of crawlers interconnected via a network (Najork2: 1:59-66).
Therefore, it would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to combine Najork with Najork2. One having ordinary skill in the art would have found motivation to incorporate the distributed architecture of Najork2 into Najork’s crawler system to greatly improve downloading and processing performance.
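One possible way to distribute URLs among a plurality of crawlers is sketched below in Python. The host-hash partitioning shown is an assumption chosen for illustration; this Office action does not state which assignment scheme Najork2 uses, and the function name assign_crawler is hypothetical.

```python
import hashlib
from urllib.parse import urlsplit


def assign_crawler(url, num_crawlers):
    """Map a URL to a crawler index by hashing its host name.

    Assumption: all URLs of one host go to the same crawler.
    """
    host = urlsplit(url).hostname or ""
    digest = hashlib.sha256(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_crawlers
```

With this scheme, two URLs on the same host are always assigned to the same crawler, so each crawler can keep its own record of visited URLs for the hosts it owns.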

Conclusion
THIS ACTION IS MADE FINAL. Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to SHELLY X. QIAN whose telephone number is (408)918-7599. The examiner can normally be reached Monday - Friday 8-5 PT.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Tony Mahmoudi, can be reached on (571)272-4078. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.







/SHELLY X QIAN/Examiner, Art Unit 2163



/ALEX GOFMAN/Primary Examiner, Art Unit 2163