Notice of Pre-AIA or AIA Status
	The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Status of Claims
Claims 11-26 are pending.
Claim 20 is amended. 
Claims 1-10 are canceled.


Response to Arguments:

Claim Rejections – 35 USC § 101:
Applicant argues that the claim limitation “for each of the plurality of identified URLs, conditionally processing the identified URL based on data associated with the identified URL” does not cover performance of the limitation in the mind, and clearly does not fall within the “Mental Processes” grouping of abstract ideas.  Examiner respectfully disagrees.
The limitation “for each of the plurality of identified URLs, conditionally processing the identified URL based on data associated with the identified URL”, as drafted, is a process that, under its broadest reasonable interpretation, covers performance of the limitation in the mind but for the recitation of generic computer components. That is, nothing in the claim element precludes the step from practically being performed in the mind. The term “processing” in the context of this claim encompasses the user evaluating identified data based on associated data, or the user selectively considering a piece of information among other pieces of information.
In addition, the claim recites the additional element “processor” to perform the “processing” step.  The “processor” in the step is recited at a high level of generality (i.e., as a generic processor performing a generic computer function of “processing” the identified URL based on data associated with the identified URL).  The additional element does not integrate the abstract idea into a practical application because it does not impose any meaningful limits on practicing the abstract idea.

Applicant argues that “More specifically, by conditionally processing identified URLs based on associated data, the processing of such URLs may be optimized, which may improve a performance of hardware performing such processing.  As such, the above claim language constitutes an improvement in functioning of a computer”.  Examiner respectfully disagrees.
	
While conditional processing of particular data sets may be beneficial for processing of these data sets, it has no effect 

Claim Rejections – 35 USC § 103:
	Applicant argues that the references do not teach “selecting one of plurality of buckets within a hash table to be reviewed”.  With respect to Applicant’s argument regarding the cited prior art’s disclosure of claim 11, the Examiner respectfully disagrees.  It appears Applicant is attacking the references individually; however, one cannot show nonobviousness by attacking references individually where the rejections are based on combinations of references.  See In re Keller, 642 F.2d 413, 208 USPQ 871 (CCPA 1981); In re Merck & Co., 800 F.2d 1091, 231 USPQ 375 (Fed. Cir. 1986).  As is discussed in the 103 rejection below, Zhu discloses “selecting one of plurality of buckets within a table to be reviewed” (see col. 6, lines 38-46). Additionally, Wong discloses “a plurality of buckets within a hash table” (see para. 0050; Fig. 7).  Thus, when Wong and Zhu are combined in the manner set forth by the Examiner in the rejection below, the combination discloses “selecting one of plurality of buckets within a hash table to be reviewed”. Accordingly, based on the foregoing, the combination of Wong/Zhu renders claim 11 unpatentable.

Applicant argues that the references do not teach “identifying a plurality of uniform resource locators (URLs) stored within the selected bucket of the hash table”.  With respect to Applicant’s argument regarding the cited prior art’s disclosure of claim 11, the Examiner respectfully disagrees.  It appears Applicant is attacking the references individually; however, one cannot show nonobviousness by attacking references individually where the rejections are based on combinations of references.  See In re Keller, 642 F.2d 413, 208 USPQ 871 (CCPA 1981); In re Merck & Co., 800 F.2d 1091, 231 USPQ 375 (Fed. Cir. 1986).  As is discussed in the 103 rejection below, Zhu discloses “identifying a plurality of uniform resource locators (URLs) stored within the selected bucket of the table” (see col. 5, lines 31-37). Additionally, Wong discloses “a plurality of buckets within a hash table” (see para. 0050; Fig. 7).  Thus, when Wong and Zhu are combined in the manner set forth by the Examiner in the rejection below, the combination discloses “identifying a plurality of uniform resource locators (URLs) stored within the selected bucket of the hash table”. Accordingly, based on the foregoing, the combination of Wong/Zhu renders claim 11 unpatentable.

Applicant’s argument with respect to amended claim 20 has been considered but is moot in view of the new ground(s) of rejection set forth below in Claim Rejections - 35 USC § 103.

As to claim 25, Applicant argues that Zhu does not teach “a time since a last crawl”.  Examiner respectfully disagrees.
"Time taken to download" 534 providing an indication of how long it took a robot 208 to download the web page associated with the corresponding URL in the last crawl reads on “a time since a last crawl”.  See col. 14, lines 18-43.

As to claim 25, Applicant argues that Wong-113 does not teach “a total number of pages crawled within the table, a total number of pages not crawled within the table”.  Examiner respectfully disagrees.
	The number of processed pages, representing a total number of web pages from the domain processed by the web crawling system, reads on “a total number of pages crawled within the table”; and the URLs that identify web pages that have not yet been downloaded by the web crawler read on “a total number of pages not crawled within the table”. See paragraphs [0107] and [0003].

Applicant argues that Papadimitriou does not teach “comparing the overall score to a threshold score, where the threshold score is an average overall score for all URLs within the table, and processing the URL in response to determining that the overall score exceeds the threshold score”.  Examiner respectfully disagrees.
The values of the URL dominance score that are greater than 0.5, indicating a relatively high level of URL dominance, read on “comparing the overall score to a threshold score, where the threshold score is an average overall score for all URLs within the table…processing the URL in response to determining that the overall score exceeds the threshold score”.  See paragraph [0083].



Claim Rejections – 35 USC § 101


35 U.S.C. 101 reads as follows: 

Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title. 


Claims 11-26 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. Independent claim 11 recites selecting one of a plurality of buckets within a hash table to be reviewed; identifying a plurality of uniform resource locators (URLs) stored within the selected bucket of the hash table; and for each of the plurality of identified URLs, conditionally processing the identified URL based on data associated with the identified URL.
The limitation of selecting one of a plurality of buckets within a hash table to be reviewed, as drafted, is a process that, under its broadest reasonable interpretation, covers performance of the limitation in the mind but for the recitation of generic computer components. That is, nothing in the claim element precludes the step from practically being performed in the mind. The term “selecting” in the context of this claim encompasses a user paying attention to one of multiple pieces of information in front of the user.
The limitation of identifying a plurality of uniform resource locators (URLs) stored within the selected bucket of the hash table, as drafted, is a process that, under its broadest reasonable interpretation, covers performance of the limitation in the mind but for the recitation of generic computer components. That is, nothing in the claim element precludes the step from practically being performed in the mind. The term “identifying” in the context of this claim encompasses a user looking at multiple pieces of information in a certain area in front of the user.
The limitation of, for each of the plurality of identified URLs, conditionally processing the identified URL based on data associated with the identified URL, as drafted, is a process that, under its broadest reasonable interpretation, covers performance of the limitation in the mind but for the recitation of generic computer components. That is, nothing in the claim element precludes the step from practically being performed in the mind. The term “processing” in the context of this claim encompasses the user evaluating identified data based on associated data, or the user selectively considering a piece of information among other pieces of information.
If a claim limitation, under its broadest reasonable interpretation, covers performance of the limitation in the mind but for the recitation of generic computer components, then it falls within the “Mental Processes” grouping of abstract ideas. Accordingly, the claim recites an abstract idea.
This judicial exception is not integrated into a practical application. In particular, the claim recites the additional element of a processor performing the selecting, identifying, and processing steps.  The processor is recited at a high level of generality (i.e., as a generic computing device performing a generic computer function of obtaining data) such that it amounts to no more than mere instructions to apply the exception using a generic computer component.  Accordingly, this additional 
The claims do not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to integration of the abstract idea into a practical application, the additional element of using a processor to perform the selecting, identifying, and processing steps amounts to no more than mere instructions to apply the exception using a generic computer component. Mere instructions to apply an exception using a generic computer component cannot provide an inventive concept.  The claim is not patent eligible.

Dependent claim 12 dependent on claim 11 includes all of the limitations of claim 11; therefore, the claim recites the same abstract idea as the independent claim 11.  Claim 12 recites wherein the plurality of identified URLs are each represented only once within the hash table.  The claims fail to add anything significantly more to the independent claim directed to a judicial exception.  Viewed as a whole, these additional claim elements do not provide meaningful limitations to integrate the abstract idea into a practical application such that the claim amounts to significantly more than the abstract idea itself.

Dependent claim 13 dependent on claim 11 includes all of the limitations of claim 11; therefore, the claim recites the same abstract idea as the independent claim 11.  Claim 13 recites wherein the data associated with the identified URL includes metadata stored in the hash table in association with a digest representing the identified URL.  The claims fail to add anything significantly more to the independent claim directed to a judicial exception.  Viewed as a whole, these additional claim elements do not provide meaningful limitations to integrate the abstract idea into a practical application such that the claim amounts to significantly more than the abstract idea itself.

Dependent claim 14 dependent on claim 11 includes all of the limitations of claim 11; therefore, the claim recites the same abstract idea as the independent claim 11.  Claim 14 recites analyzing the web page found at the identified URL, extracting textual data from the web page, adding new identified link URLs to the hash table; adding to a count of identified link URLs already existing within the hash table, saving extracted data to an associated digest in the selected bucket, updating metadata associated with the associated digest in the selected bucket.  These limitations are a process that, under its broadest reasonable interpretation, covers performance of the limitations 

Dependent claim 15 dependent on claim 11 includes all of the limitations of claim 11; therefore, the claim recites the same abstract idea as the independent claim 11.  Claim 15 recites wherein the data associated with the identified URL includes a page score for the identified URL that is saved in the hash table in association with the identified URL.  The claims fail to add anything significantly more to the independent claim directed to a judicial exception.  Viewed as a whole, these additional claim elements do not provide meaningful limitations to integrate the abstract idea into a practical application such 

Dependent claim 16 dependent on claim 11 includes all of the limitations of claim 11; therefore, the claim recites the same abstract idea as the independent claim 11.  Claim 16 recites wherein the one of plurality of buckets of the hash table contains a plurality of digests.  The claims fail to add anything significantly more to the independent claim directed to a judicial exception.  Viewed as a whole, these additional claim elements do not provide meaningful limitations to integrate the abstract idea into a practical application such that the claim amounts to significantly more than the abstract idea itself.

Dependent claim 17 dependent on claim 16 includes all of the limitations of claim 16; therefore, the claim recites the same abstract idea as claim 16.  Claim 17 recites wherein each of the plurality of digests represents a unique URL.  The claims fail to add anything significantly more to the independent claim directed to a judicial exception.  Viewed as a whole, these additional claim elements do not provide meaningful limitations to integrate the abstract idea into a practical application such that the claim amounts to significantly more than the abstract idea itself.

Dependent claim 18 dependent on claim 16 includes all of the limitations of claim 16; therefore, the claim recites the same abstract idea as claim 16.  Claim 18 recites wherein each of the plurality of digests is stored with data extracted from a web page.  The claims fail to add anything significantly more to the independent claim directed to a judicial exception.  Viewed as a whole, these additional claim elements do not provide meaningful limitations to integrate the abstract idea into a practical application such that the claim amounts to significantly more than the abstract idea itself.

Dependent claim 19 dependent on claim 11 includes all of the limitations of claim 11; therefore, the claim recites the same abstract idea as the independent claim 11.  Claim 19 recites individually determining, for each of the plurality of identified URLs, whether to process the identified URL, based on the data associated with the identified URL along with global information associated with a full set of known URLs.  These limitations are a process that, under its broadest reasonable interpretation, covers performance of the limitations in the mind. Nothing in the claim elements precludes the steps from practically being performed in the mind. If a claim limitation, under its broadest reasonable interpretation, covers performance 

Claim 20 is deemed analyzed and discussed with respect to claim 11.

Dependent claim 21 dependent on claim 11 includes all of the limitations of claim 11; therefore, the claim recites the same abstract idea as the independent claim 11.  Claim 21 recites each of the plurality of URLs is represented by a digest utilizing a cryptographic hash function, each of the plurality of buckets within the hash table includes a plurality of digests, digests are distributed randomly among the plurality of buckets within the hash table, each of the plurality of buckets represents an equal range within a digest space, and each digest is linked to data extracted from a web page found at a URL represented by the digest that is stored with the digest in the hash table.  The claims fail to add anything significantly more to the independent claim directed to a judicial exception.  Viewed as a whole, these additional claim elements do not provide meaningful limitations to integrate the abstract idea into a practical application such that the claim amounts to significantly more than the abstract idea itself.

Dependent claim 22 dependent on claim 11 includes all of the limitations of claim 11; therefore, the claim recites the same abstract idea as the independent claim 11.  Claim 22 recites wherein the data associated with the URL includes: a page score for the URL determined based on a link authority, a last crawl time, changes made to a web page associated with the URL since last crawl, a page change rate for the web page associated with the URL, a number of page errors within the web page associated with the URL, a number of attempted page crawls for the web page associated with the URL, and sub-crawl patterns for the web page associated with the URL.  The claims fail to add anything significantly more to the independent claim directed to a judicial exception.  Viewed as a whole, these additional claim elements do not provide meaningful limitations to integrate the abstract idea into a practical application such that the claim amounts to significantly more than the abstract idea itself.

Dependent claim 23 dependent on claim 11 includes all of the limitations of claim 11; therefore, the claim recites the same abstract idea as the independent claim 11.  Claim 23 recites wherein the identified URL is conditionally processed based on the data associated with the identified URL as well as global information associated with a full set of known URLs, the global information including: a time since a last crawl, a total number of web pages crawled within the hash table, a total number of web pages not crawled within the hash table, a percentage of a global crawl goal that is not currently met, and a histogram of past web page scores.  The claims fail to add anything significantly more to the independent claim directed to a judicial exception.  Viewed as a whole, these additional claim elements do not provide meaningful limitations to integrate the abstract idea into a practical application such that the claim amounts to significantly more than the abstract idea itself.

Dependent claim 24 dependent on claim 11 includes all of the limitations of claim 11; therefore, the claim recites the same abstract idea as the independent claim 11.  Claim 24 recites wherein the plurality of buckets within the hash table are prioritized, such that buckets that have been crawled a smaller number of times are prioritized over buckets that have been crawled a greater number of times.  The claims fail to add anything significantly more to the independent claim directed to a judicial exception.  Viewed as a whole, these additional claim elements do not provide meaningful limitations to integrate the abstract idea into a practical application such that the claim amounts to significantly more than the abstract idea itself.

Dependent claim 25 dependent on claim 11 includes all of the limitations of claim 11; therefore, the claim recites the same abstract idea as the independent claim 11.  Claim 25 recites determining an overall score for the identified URL, based on global information associated with a full set of known URLs in addition to metadata for the identified URL, wherein:  AUS1P023/AUS920150439US1- 5 -the global information includes a time since a last crawl, a total number of pages crawled within the hash table, a total number of pages not crawled within the hash table, and a percentage of a global crawl goal that is not currently met, and the metadata for the identified URL includes a page score for the URL based on a link authority for the URL, a last crawl time for the URL, and changes made to a page associated with the URL since a last crawl, comparing the overall score to a threshold score, where the threshold score is an average overall score for all URLs within the hash table, and processing the URL in response to determining that the overall score exceeds the threshold score.  

Dependent claim 26 dependent on claim 11 includes all of the limitations of claim 11; therefore, the claim recites the same abstract idea as the independent claim 11.  Claim 26 recites each bucket represents a range of URLs, and each URL is represented once within the hash table, each of the URLs is stored within the bucket as a digest, each of the URLs is converted into a sequence of bytes represented by the digest utilizing a hash function, the data associated with each of the URLs is linked to its associated digest and is stored with the digest within the hash table, the data associated with each of the URLs includes metadata including: a page score for the URL based on a link authority for the URL, a last crawl time for the URL, and changes made to a page associated with the URL since a last crawl, conditionally processing the identified URL includes: determining an overall score for the identified URL, based on global information associated with a full set of known URLs in addition to the metadata for the identified URL, the global information including a time since a last crawl, a total number of pages crawled within the hash table, a total number of pages not crawled within the hash table, and a percentage of a global crawl goal that is not currently met, comparing the overall score to a threshold score, where the threshold score is an average overall score for all URLs within the hash table, and processing the URL in response to determining that the overall score exceeds the threshold score.  These limitations are a process that, under its broadest reasonable interpretation, covers performance of the limitations in the mind. Nothing in the claim elements precludes the steps from practically being performed in the mind. If a claim limitation, under its broadest reasonable interpretation, covers performance of the limitation in the mind, then it falls within the “Mental Processes” grouping of abstract ideas. Accordingly, the claim 


Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 11-13, 15-19, 21, and 24 are rejected under 35 U.S.C. 103 as being unpatentable over US Pat No 8136025 by Zhu in view of US Pub No 2003/0093645 by Wong et al.

Regarding independent claim 11, Zhu teaches “A computer program product for determining whether to process a uniform resource locator (URL), the computer program product comprising a computer readable storage medium having program instructions embodied therewith, wherein the computer readable storage medium is not a transitory signal per se, the program instructions executable by a processor to cause the processor to perform a method comprising:”
“selecting one of plurality of buckets within a table to be reviewed” (Zhu: col. 6, lines 38-46)
(col. 6, lines 38-46: Step 302. In step 302 URL scheduler 202 determines which URLs will be crawled in each epoch, and stores that information in data structure 100. Controller 201 selects a segment 112 from base layer 102 (“selecting one of plurality of buckets within a table to be reviewed”) for crawling. The selected segment 112 is referred to herein as the "active segment." Typically, at the start of each epoch, controller 201 selects a different segment 112 from base layer 102 as the active segment so that, over the course of several epochs, all the segments 112 are selected for crawling in a round-robin style.)
	The selected segment 112 selected from the base layer 102 reads on “one of plurality of buckets within a table”.  The base layer 102 reads on “a table”.
“identifying a plurality of uniform resource locators (URLs) stored within the selected bucket of the table” (Zhu: col. 5, lines 31-37)
(col. 5, 31-37: Data structure for storing URLs. Referring to FIG. 1, a three-layer data structure 100 is illustrated. Base layer 102 (“the table”) of data structure 100 comprises a sequence of segments 112. In one embodiment, each segment 112 comprises more than two hundred million uniform resource locations (URLs) (“identifying a plurality of uniform resource locators (URLs) stored within the selected bucket”). Together, segments 112 represent a substantial percentage of the addressable URLs in the entire Internet.)
“for each of the plurality of identified URLs, conditionally processing the identified URL, based on data associated with the identified URL” (Zhu: col. 6, lines 38-46; col. 10, lines 5-12, lines 25-41)
(col. 5, lines 38-46: Periodically (e.g., daily) (“conditionally”) one of the segments 112 is deployed for crawling (“processing”) purposes, as described in more detail below. In addition to segments 112, there exists a daily crawl layer 104. In one embodiment, daily crawl layer 104 comprises more than fifty million URLs. Daily crawl layer 104 comprises the URLs that are to be crawled more frequently (“processing the identified URL”) than the URLs in segments 112. In addition, 
	The fact that all the URLs of the selected segment are crawled periodically reads on “for each of the plurality of identified URLs, conditionally processing the identified URL”.
(col. 10, lines 5-12: The storage of URLs in hash tables 600 on each server hosted by a URL manager 204 is advantageous because it provides a way of quickly accessing URL state information. For example, to obtain state information for a particular URL (“based on data associated with the identified URL”), all that is required is to look up the record having the hash value that corresponds to the hash of the URL. Such a lookup process is more efficient than searching through records of all the URLs held by all the URL managers 204 for a desired URL.
	The state information to be obtained for crawling URL reads on the “data associated with the identified URL”.)
Zhu does not explicitly teach, but Wong discloses, “a plurality of buckets within a hash table” ([0050]; Fig. 7)
([0050] FIG. 7 is a block diagram of a hash table, consistent with the present invention, with URL buckets (“a plurality of buckets”). In one embodiment, hash table 700 is comprised of multiple buckets (“a plurality of buckets within a hash table”) such as, for example, URL buckets 710, 720 and 730. Each URL bucket contains entries as described in connection with FIG. 6.)
Zhu and Wong are analogous art because they both are directed to the same field of crawling Uniform Resource Locators. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date to have combined the teachings of Wong with the method/system of Zhu in order to provide users with a means for receiving a plurality of objects from an origin server, computing a hash value based on source information about an object, and storing the object based on the hash value with other related objects as shown in (Wong: [0017]).
Zhu in view of Wong teaches, “selecting one of plurality of buckets within a hash table to be reviewed; identifying a plurality of uniform resource locators (URLs) stored within the selected bucket of the hash table”.
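For illustration only, the three recited steps of claim 11 may be sketched as follows. This is a minimal sketch under stated assumptions and is not a characterization of the actual implementation of Zhu, Wong, or the present application: the names `UrlRecord`, `select_bucket`, `identify_urls`, and `conditionally_process`, the `page_score` field, and the `min_score` condition are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class UrlRecord:
    url: str
    page_score: float  # hypothetical "data associated with the identified URL"


def select_bucket(buckets, index):
    """Select one of a plurality of buckets within the table to be reviewed."""
    return buckets[index % len(buckets)]


def identify_urls(bucket):
    """Identify the URL records stored within the selected bucket."""
    return list(bucket)


def conditionally_process(records, min_score):
    """Process each identified URL only if its associated data meets a condition.

    The condition (page_score >= min_score) is an assumption chosen for
    illustration; "processing" is represented by collecting the URL.
    """
    processed = []
    for record in records:
        if record.page_score >= min_score:
            processed.append(record.url)
    return processed


# A table of two buckets, each holding URL records (hypothetical data).
buckets = [
    [UrlRecord("http://example.com/a", 0.9), UrlRecord("http://example.com/b", 0.2)],
    [UrlRecord("http://example.com/c", 0.7)],
]

selected = select_bucket(buckets, 0)
identified = identify_urls(selected)
processed = conditionally_process(identified, min_score=0.5)
```

In this sketch only the first record of the selected bucket satisfies the assumed condition, so only that URL is processed.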

As to claim 12, Zhu in view of Wong teaches “wherein the plurality of identified URLs are each represented only once within the hash table” [Zhu: col. 6, lines 38-46; Wong: [0050]; Fig. 7].

As to claim 13, Zhu in view of Wong teaches “wherein the data associated with the identified URL includes metadata stored in the hash table in association with a digest representing the identified URL” [Zhu: col. 10, lines 14-24; Wong: [0050]; Fig. 7].

As to claim 15, Zhu in view of Wong teaches “wherein the data associated with the identified URL includes a page score for the identified URL that is saved in the hash table in association with the identified URL” [Zhu: col. 6, lines 47-57; Wong: [0050]; Fig. 7].

As to claim 16, Zhu in view of Wong teaches, “each of the plurality of buckets within the hash table includes a plurality of digests” (Zhu: col. 16, lines 5-11; Wong: [0050]; Fig. 7)

As to claim 17, Zhu in view of Wong teaches, “wherein each of the plurality of digests represents a unique URL” (Zhu: col. 7, lines 35-46; Wong: [0050]; Fig. 7)

As to claim 18, Zhu in view of Wong teaches, “wherein each of the plurality of digests is stored with data extracted from a web page” (Zhu: col. 14, lines 1-7; Wong: [0050]; Fig. 7)

As to claim 19, Zhu in view of Wong teaches “further comprising individually determining, for each of the plurality of identified URLs, whether to process the identified URL, based on the data associated with the identified URL along with global information associated with a full set of known URLs” [Zhu: col. 9, lines 37-64].
	
As to claim 21, Zhu in view of Wong teaches,
“each of the plurality of URLs is represented by a digest utilizing a cryptographic hash function” (Zhu: col. 7, lines 34-46 “A fingerprint is, for example, a 64-bit number (or a value of some other predetermined bit length) that is generated from the corresponding URL by first normalizing the URL text (for example, converting host names to lower case) and then passing the normalized URL through a fingerprinting function that is similar to a hash function with the exception that the fingerprint function guarantees that the fingerprints are well distributed across the entire space of possible numbers.”)
“each of the plurality of buckets within the hash table includes a plurality of digests” (Zhu: col. 16, lines 5-11 “The modulus of the modulus function may be preferably relatively prime with respect to the modulus of the function used to subdivide the layer 900 into segments, or the modulus function used to allocate the segment into partitions may be based on a different subset of the bits of the URL fingerprint than the function used to allocate”; Wong: [0050]; Fig. 7)
“digests are distributed randomly among the plurality of buckets within the hash table” (Zhu: col. 7, lines 30-35 “In cases where URL scheduler 202 determines that a URL should be placed in a segment 112 of base layer 102, an effort is made to ensure that the placement of the URL into a given segment 112 of base layer 102 is random (or pseudo-random), so that the URLs to be crawled are evenly distributed (or approximately evenly distributed) over the segments.”; Wong: [0050]; Fig. 7)
“each of the plurality of buckets represents an equal range within a digest space” (Zhu: col. 8, lines 61-67 “Typically, this partitioning is performed using a modulo function or similar function on the fingerprint values (or a portion of a fingerprint value) derived from each URL in the active segment and daily layers so as to partition these URLs into a set of approximately equal sets (partitions). Each of these sets is assigned to a different URL manager 204 of a plurality of URL managers 204.”)
“each digest is linked to data extracted from a web page found at a URL represented by the digest that is stored with the digest in the hash table” (Zhu: col. 14, lines 1-7 “Referring to FIG. 5B, an RTlog stores the documents 512 the content 512 of the document, the page rank 514 was assigned to the source URL of the document, the URL fingerprint 516 of the document. The record 510 may optionally include a list of URL fingerprints of duplicate documents having the same content.”; Wong: [0050]; Fig. 7)
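For illustration only, the limitations of claim 21 (a digest produced by a cryptographic hash function, buckets covering equal ranges of the digest space, and extracted data stored with the digest) may be sketched as follows. This is a sketch under stated assumptions, not the fingerprinting function of Zhu or the hash table of Wong: the choice of SHA-256, the bucket count `NUM_BUCKETS`, and the helper names `url_digest`, `bucket_index`, and `store` are hypothetical.

```python
import hashlib

NUM_BUCKETS = 16  # assumption: any positive bucket count would serve


def url_digest(url: str) -> bytes:
    """Represent a URL by a digest using a cryptographic hash (SHA-256 assumed)."""
    return hashlib.sha256(url.encode("utf-8")).digest()


def bucket_index(digest: bytes, num_buckets: int = NUM_BUCKETS) -> int:
    """Map a digest to a bucket so that each bucket covers an equal range
    of the digest space [0, 2**(8 * len(digest)))."""
    digest_space = 1 << (8 * len(digest))
    return int.from_bytes(digest, "big") * num_buckets // digest_space


# Each bucket maps digest -> data stored with that digest.
table = [dict() for _ in range(NUM_BUCKETS)]


def store(url: str, extracted_data: str) -> bytes:
    """Store data extracted from a page together with the digest of its URL."""
    digest = url_digest(url)
    table[bucket_index(digest)][digest] = {"url": url, "data": extracted_data}
    return digest


digest = store("http://example.com/", "extracted text")
store("http://example.com/", "updated text")  # same URL, same digest, same slot
```

Because the digest keys the entry, storing the same URL twice leaves a single entry in the table, and because a cryptographic hash distributes digests approximately uniformly, URLs are spread approximately evenly over the equal-range buckets.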

As to claim 24, Zhu in view of Wong teaches “wherein the plurality of buckets within the hash table are prioritized, such that buckets that have been crawled a smaller number of times are prioritized over buckets that have been crawled a greater number of times” [Zhu: col. 5, lines 38-46; col. 7, line 54-col. 8, line 3; col. 11, lines 1-29; Wong: [0050]; Fig. 7].

Claim 14 is rejected under 35 U.S.C. 103 as being unpatentable over US Pat No 8136025 by Zhu in view of US Pub No 2003/0093645 by Wong et al further in view of US Pub No 20110093533 by Kataria et al.

As to claim 14, Zhu-Wong teaches,
“analyzing the web page found at the identified URL” (Zhu: col. 5, lines 16-21)
(col. 5, lines 16-21: The present invention provides systems and methods for crawling and indexing web pages (“analyzing the web page”). Advantageously, these systems and methods reduce the latency between the time when a web page is posted or updated on the Internet and the time when a representation of the new or updated web page is indexed and made available to a search engine.)
“extracting textual data from the web page” (Zhu: col. 5, lines 57-65)
(col. 5, lines 57-65: The URLs in layers 102, 104, and 106 are all crawled (“extracting textual data from the web page”) by the same robots 208 (FIG. 2). However, the results of the crawl (“textual data”) are placed in indexes that correspond to layers 102, 104, and 106 as illustrated in FIG. 2 and described in more detail below. Layers 102, 104, and 106 are populated by a URL scheduler based on the historical (or expected) frequency of change of the content of the web pages at the URLs and a measure of URL importance, as described in more detail below.)
“saving extracted data to an associated digest in the selected bucket” (Zhu: col. 6, lines 38-46)
(col. 6, lines 38-46: Step 302. In step 302 URL scheduler 202 determines which URLs will be crawled in each epoch, and stores that information in data structure 100 (“saving extracted data”). Controller 201 selects a segment 112 from base layer 102 for crawling. The selected segment 112 is referred to herein as the "active segment." (“an associated digest in the selected bucket”) Typically, at the start of each epoch, controller 201 selects a different segment 112 from base layer 102 as the active segment so that, over the course of several epochs, all the segments 112 are selected for crawling in a round-robin style.)
“updating metadata associated with the associated digest in the selected bucket” (Zhu: col. 8, lines 4-21)
(The limitation “metadata” is disclosed as “a page score” in [0062] of the US published specification.
col. 8, lines 4-21: In embodiments where a crawl score is computed, URL scheduler 202 determines which URLs will be crawled on the Internet during the epoch by computing a crawl score for each URL. Those URLs that receive a high crawl score (e.g., above a predefined threshold) are passed on to the next stage (URL managers 204) whereas those URLs that receive a low crawl score (e.g., below the predefined threshold) are not passed on to the next stage during the given epoch. There are many different factors that can be used to compute a crawl score (“metadata”) including the current location of the URL (active segment 112, daily segment 104 or real-time segment 106), URL page rank, and URL crawl history. URL crawl history is obtained from URL history logs 218. Although many possible crawl scores are possible, in one embodiment the crawl score is computed as: crawl score = [page rank]^2 * (change frequency) * (time since last crawl) (“updating metadata associated with the associated digest in the selected bucket”).)
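For illustration only, the crawl score formula quoted from Zhu and its use as a threshold filter can be sketched as follows. The record type `UrlRecord`, the choice of days as the time unit, and the function names are assumptions made here; Zhu does not specify units or data structures in the cited passage.

```python
from dataclasses import dataclass


@dataclass
class UrlRecord:
    url: str
    page_rank: float
    change_frequency: float       # e.g. fraction of crawls that found a change
    days_since_last_crawl: float  # time unit is an assumption


def crawl_score(rec: UrlRecord) -> float:
    """crawl score = [page rank]^2 * (change frequency) * (time since last crawl)."""
    return rec.page_rank ** 2 * rec.change_frequency * rec.days_since_last_crawl


def select_for_crawl(records, threshold: float):
    """Keep only URLs whose crawl score is above the predefined threshold."""
    return [r for r in records if crawl_score(r) > threshold]


records = [
    UrlRecord("http://example.com/a", page_rank=2.0, change_frequency=0.5,
              days_since_last_crawl=3.0),
    UrlRecord("http://example.com/b", page_rank=1.0, change_frequency=0.5,
              days_since_last_crawl=3.0),
]
selected = select_for_crawl(records, threshold=5.0)
```

With these hypothetical inputs the first record scores 6.0 and is passed on, while the second scores 1.5 and is held back for the epoch.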
Zhu-Wong does not explicitly teach “adding new identified link URLs to the hash table; adding to a count of identified link URLs already existing within the hash table”.
Kataria teaches “adding new identified link URLs to the table; adding to a count of identified link URLs already existing within the table” ([0028] [0029]).
([0028] The URL information reader 220 reads and processes the URL information in the URL information pipe 210 and generates a URL information data structure 230. The URL information data structure 230 can be a hash table. The hash table can be limited by a maximum number of URLs (e.g., 100,000 URLs) or a maximum memory size (e.g., 300 MB of disk space).
[0029] For each unique URL in the URL information pipe 210, the URL information reader 220 can create an entry in the URL information data structure 230 that includes, for example, the URL, a first time the URL was scanned by the module 120 (“adding new identified link URLs to the table”), and one or more counters. For multiple occurrences of a URL in the URL information pipe 210, the URL information reader 220 can increase a first counter that represents the number of times a resource identified by the URL (“adding to a count of identified link URLs already existing within the table”) was served ...)
Wong discloses, “a hash table” ([0050]; Fig. 7)
([0050] FIG. 7 is a block diagram of a hash table, consistent with the present invention, with URL buckets. In one embodiment, hash table 700 (“a hash table”) is comprised of multiple buckets such as, for example, URL buckets 710, 720 and 730. Each URL bucket contains entries as described in connection with FIG. 6.)
Zhu-Wong and Kataria are analogous art because they both are directed to the same field of crawling Uniform Resource Locators. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date to have combined the teachings of Kataria with the method/system of Zhu-Wong in order to provide users with a means for scanning network traffic between a server and one or more clients requesting resources from the server as shown in [Kataria: para 0006].

Claim 22 is rejected under 35 U.S.C. 103 as being unpatentable over US Pat No 8136025 by Zhu in view of US Pub No 2003/0093645 by Wong et al further in view of US Pub No 20130046584 by YU et al.

As to claim 22, Zhu-Wong teaches, 
“a page score for the URL determined based on…a last crawl time, changes made to a web page associated with the URL since last crawl, a page change rate for the web page associated with the URL” (Zhu: col. 7, line 54-col. 8, line 3; col. 6, line 60-col. 7, line 11)
(col. 7, line 54-col. 8, line 3: In some embodiments, it is not possible to crawl all the URLs in an active segment 112, daily layer 104, and real-time layer 106 during a given epoch. In one embodiment, this problem is addressed using two different approaches. In the first approach, a crawl score is computed for each URL (“a page score for the URL”) in active segment 112, daily layer 104, and real-time layer 106. Only those URLs that receive a high crawl score (e.g., above a threshold value) are passed on to the next stage (URL managers 204, FIG. 2). In the second approach, URL scheduler 202 refines an optimum crawl frequency for each such URL and passes the crawl frequency information on to URL managers 204. The crawl frequency information is then ultimately used by URL managers 204 to decide which URLs to crawl.
col. 6, line 60-col. 7, line 11: The mechanism by which URL scheduler 202 obtains URL change frequency data (“a page change rate for the web page associated with the URL”) is best understood by reviewing FIG. 2. When a URL is accessed by a robot 208, the information is passed through content filters 210. Content filters 210, among other things, determine whether a URL has changed and when a URL was last accessed by a robot 208. This information is placed in history logs 218, which are passed back to URL scheduler 202. By reviewing the log records for a particular URL, each of which indicates whether the content of a URL changed since the immediately previous time the URL was crawled (“a last crawl time”), the URL schedule 202 (or other module) can compute a URL change frequency. This technique is particularly useful for identifying URL's whose content (i.e., the content of the page at the URL) changes very infrequently (“changes made to a web page associated with the URL since last crawl”), or perhaps not at all. Furthermore, the computation of a URL change frequency can include using supplemental information about the URL. For instance, the URL ...)
“… a number of attempted page crawls for the web page associated with the URL, sub-patterns for the web page associated with the URL” (Zhu: col. 7, lines 13-21)
(col. 7, line 13-21: A query-independent score (also called a document score) is computed for each URL by URL page rankers 222. Page rankers 222 compute a page rank for a given URL by considering not only the number of URLs that reference a given URL (“a number of attempted page crawls for the web page associated with the URL”) but also the page rank of such referencing URLs. Page rank data can be obtained from URL managers 204. A more complete explanation of the computation of page rank is found in U.S. Pat. No. 6,285,999, which is hereby incorporated by reference as background information.)
Zhu-Wong does not explicitly teach “a page score for the URL determined based on a link authority, a number of page errors within the web page associated with the URL”.
YU teaches,
“a page score for the URL determined based on a link authority” ([0070])
([0070] The recommendations user interface may provide options for viewing recommendations, such as, all web pages, top web pages by ranking or number, or a summary report. As shown in page score (“a page score”) or by highest number of recommendations. For example, the user interface may provide an option to view the "Top 25 Pages to Focus On," which has been selected in the example illustrated in FIG. 15. The top 25 web pages may be displayed by URL and target keyword, for example. For each of the top 25 web pages, the interface may provide any number of recommendations, such as, total search volume, page authority, rank, number (#) of target keywords and number (#) of recommendations. The page authority may be determined using a numeric representation of the web page's global link authority (“based on a link authority”) which may be based on a 100-point, logarithmic scale. For example, the page authority score may be a PAGE AUTHORITY score available from SEOMoz or a CITATION FLOW score available from MajesticSEO.)
“…a number of page errors within the web page associated with the URL” ([0059])
([0059] Performance reporting may also generate a score which is based on a series of metrics (“a page score”), such as estimated traffic to the site, the number of errors on the page (“a number of page errors”), the number of backlinks to the page, and the like or some combination thereof.)
Zhu-Wong and YU are analogous art because they both are directed to the same field of crawling Uniform Resource Locators. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date to have combined the teachings of YU with the method/system of Zhu-Wong in order to provide users with a means for optimizing search results for an entity by categorizing a plurality of web pages into a plurality of page types as shown in [YU: para 0009].
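For illustration only, the computation of a URL change frequency from history log records, as described in the passage quoted from Zhu at col. 6, line 60-col. 7, line 11, can be sketched as follows. Zhu does not give a formula in the cited text; treating the change frequency as the fraction of logged crawls on which the content had changed is an assumption made here, as is the name `change_frequency`.

```python
def change_frequency(changed_flags) -> float:
    """Fraction of logged crawls on which the content had changed.

    Each flag stands for one history log record and indicates whether the
    content changed since the immediately previous crawl (an assumed encoding).
    """
    flags = list(changed_flags)
    if not flags:
        return 0.0  # no crawl history yet; returning 0.0 is an assumption
    return sum(1 for changed in flags if changed) / len(flags)


# Hypothetical history for one URL: changed on two of four crawls.
history = [True, False, True, False]
frequency = change_frequency(history)
```

A URL whose flags are all false yields 0.0, which matches the case Zhu describes of content that changes very infrequently or not at all.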

Claim 23 is rejected under 35 U.S.C. 103 as being unpatentable over US Pat No 8136025 by Zhu in view of US Pub No 2003/0093645 by Wong et al (hereinafter “Wong-645”) further in view of US Pub No 20080104113 by Wong et al (hereinafter “Wong-113”).

As to claim 23, Zhu-Wong-645 teaches, “wherein the identified URL is conditionally processed based on the data associated with the identified URL as well as global information associated with a full set of known URLs”
“the global information including: a time since a last crawl” (Zhu: col. 14, lines 18-43)
(col. 14, lines 18-43: Referring to FIG. 5C, a history log 218 comprises a record 520 for each URL 522 that has been crawled by a robot 208. As illustrated in FIG. 5C, there are a wide range of possible fields that can be included in each record 520. One field is crawl status 524. Crawl status 524 indicates whether the corresponding URL 522 has been successfully crawled. Other field is the content checksum 526, also known as the content fingerprint. When pages have identical content, they will also have the same content fingerprint 526. URL scheduler 202 can compare these content fingerprint with a previous content fingerprint obtained for the corresponding URL (identified by URL fingerprint 522 in the history log record 520) on a previous crawl to ascertain whether the web page has changed since the last crawl. Similarly, URL scheduler 202 can use link checksum 530 to determine whether any of the outbound links on the web page associated with the corresponding URL 522 have changed since the last crawl. Source 532 provides an indication of whether robot 208 accessed the URL using the Internet or an internal repository of URLs. "Time taken to download" 534 provides an indication of how long it took a robot 208 to download the web page associated with the corresponding URL in the last crawl (“a time since a last crawl”). Error condition 536 records any errors that were encountered by a robot 208 during the crawl. An example of an error condition is "HTTP 404", which indicates that the web page does not exist.)
“a percentage of a global crawl goal that is not currently met” (Zhu: col. 12, lines 27-55)
(col. 12, lines 27-55: Step 308. In step 308, a plurality of robots 208 crawl URLs that are provided to the robots 208 by URL server 206. In some embodiments, robots 208 use a calling process that requires domain name system (DNS) resolution. DNS resolution is the process by which host names (URLs) are resolved into their Internet Protocol (IP) addresses using a database that provides a mapping between host names (URLs) and IP addresses. In some embodiments, enhancements to known DNS resolution schemes are provided in order to prevent DNS resolution from becoming a bottleneck to the web crawling process, in which hundreds of millions of URLs must be resolved in a matter of hours. One of these enhancements is the use of a dedicated local database 250 (FIG. 2) that stores the IP addresses for URLs that have been crawled by system 200 in the past, which reduces the system's reliance on DNS servers on the Internet. This allows URLs that have been previously crawled by system 200 to be pre-resolved with respect to DNS resolution. The use of a local DNS resolution database 250 enables a high percentage of the system's DNS resolution operations to be handled locally, at very high speed. Only those URLs that are not represented on local DNS database 250 (e.g., because they have not been previously crawled) (“a percentage of a global crawl goal that is not currently met”) are resolved using conventional DNS resources of the Internet. As a result, the IP addresses of URLs are readily accessible when they are needed by a robot 208. Also, the system presents a much lower load on the DNS servers that would otherwise be needed to perform DNS resolution on every URL to be crawled.)
Zhu-Wong-645 does not explicitly teach, “a total number of web pages crawled within the hash table; a total number of web pages not crawled within the hash table; a histogram of past web page scores”.
Wong-113 teaches,
“a total number of web pages crawled” ([0107])
([0107] As described previously, web crawling process 300 may be suitably formatted to generate a plurality of scores for the URL, where each of the scores indicates a different measure related to whether the URL corresponds to a desired web page type, such as a commercial product or a subcategory of commercial products. Again, any number of scoring metrics may be employed by an embodiment of a web crawler system. In this example, process 300 generates a domain density score (task 312) in response to the domain of the URL being analyzed. As explained above in connection with the domain density metric, the number of processed pages represents a total number of web pages from the domain processed by the web crawling system (“a total number of web pages crawled”). In addition, task 312 may be performed to calculate the domain density score from the ratio using an appropriate algorithm or formula.)
“a total number of web pages not crawled” ([0003])
([0003] Techniques and technologies described herein are applicable for use in connection with a web crawler application. The web crawler application is controlled to download web pages in a targeted and prioritized manner that focuses on at least one designated category or type of web page. The web crawler application employs or is influenced by a suitably configured URL scoring module that makes crawling and indexing Internet documents more efficient, thus enabling such indexing to be performed with less computation and hardware. The URL scoring module achieves higher efficiency by predicting the location of target documents and directing the web crawler application towards these documents. The URL scoring module generates different scores (using different techniques or metrics) for URLs that identify web pages that have not yet been downloaded by the web crawler (“a total number of web pages not crawled”). An overall score or downloading priority is calculated for each URL using at least some of the individual scores for the respective URL. The web crawler application downloads URLs in an order that is influenced by the overall scores.)
“a histogram of past web page scores” ([0030] [0079])
([0030] For this example, URL scoring module 208 is suitably configured to generate a plurality of scores for a new URL, where each individual score is related to a different scoring or ranking metric. These metrics include, without limitation: a domain density metric that results in a domain density score for the URL; an anchor text metric that results in an anchor text score for the URL; a URL string score metric that results in a URL string score for the URL; a link proximity metric that results in a link proximity score for the URL; and a category need metric that results in a category need score for the URL--the category need metric may indicate a predicted category for the web page corresponding to the URL. URL scoring module 208 calculates a downloading priority (an overall score for the URL) from at least some of the individual scores. The downloading priority may, for example, be a simple numerical score. In one embodiment, URL scoring module 208 calculates the downloading priority in response to all of the individual scores by processing the individual scores (“a histogram of past web page scores”) with a suitable algorithm or function. Thereafter, URL scoring module 208 provides the new URLs, along with their respective downloading priorities, to web crawler core module 202, which then downloads the web pages corresponding to the new URLs in an order that is determined by the downloading priorities. This prioritization forces web crawler system 200 to concentrate on web pages of the desired type, category, subcategory, genre, or the like.
[0079] Table 1 shows how a URL score is modulated based on the current page classification and the outgoing link classification. The following terms are utilized in Table 1:...)
Wong-645 discloses, “a hash table” ([0050]; Fig. 7)
([0050] FIG. 7 is a block diagram of a hash table, consistent with the present invention, with URL buckets. In one embodiment, hash table 700 (“a hash table”) is comprised of multiple buckets such as, for example, URL buckets 710, 720 and 730. Each URL bucket contains entries as described in connection with FIG. 6.)
Zhu-Wong-645 and Wong-113 are analogous art because they both are directed to the same field of crawling Uniform Resource Locators. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date to have combined the teachings of Wong-113 with the method/system of Zhu-Wong-645 in order to provide users with a means for calculating overall score or downloading priority for each URL as shown in [Wong-113: para 0003].
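For illustration only, the content checksum comparison quoted from Zhu at col. 14, lines 18-43 (comparing a current content fingerprint with the one obtained on a previous crawl to ascertain whether the page has changed) can be sketched as follows. The use of SHA-256 as the checksum and the names `content_checksum` and `changed_since_last_crawl` are assumptions introduced here; Zhu does not name a checksum function in the cited text.

```python
import hashlib


def content_checksum(content: bytes) -> str:
    """Checksum of page content; identical content gives an identical checksum."""
    return hashlib.sha256(content).hexdigest()


def changed_since_last_crawl(previous_checksum, content: bytes) -> bool:
    """Compare the current checksum with the one stored from the previous crawl."""
    if previous_checksum is None:
        # No earlier record; treating a first crawl as "changed" is an assumption.
        return True
    return content_checksum(content) != previous_checksum


stored = content_checksum(b"<html>version 1</html>")
same = changed_since_last_crawl(stored, b"<html>version 1</html>")
different = changed_since_last_crawl(stored, b"<html>version 2</html>")
```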

Claims 20, 25-26 are rejected under 35 U.S.C. 103 as being unpatentable over US Pat No 8136025 by Zhu in view of US Pub No 2003/0093645 by Wong et al (hereinafter “Wong-645”) further in view of US Pub No 20080104113 by Wong et al (hereinafter “Wong-113”), further in view of US 20120158693 by Papadimitriou et al.

Regarding independent claim 20, Zhu teaches “A computer-implemented method, comprising:”
“selecting one of plurality of buckets within a table to be reviewed” (Zhu: col. 6, lines 38-46)
(col. 6, lines 38-46: Step 302. In step 302 URL scheduler 202 determines which URLs will be crawled in each epoch, and stores that information in data structure 100. Controller 201 selects a segment 112 from base layer 102 (“selecting one of plurality of buckets within a table to be reviewed”) for crawling. The selected segment 112 is referred to herein as the "active segment." Typically, at the start of each epoch, controller 201 selects a different segment 112 from base layer 102 as the active segment so that, over the course of several epochs, all the segments 112 are selected for crawling in a round-robin style.
	The selected segment 112 reads on “one of plurality of buckets within a table”.  The base layer 102 reads on “a table”.)
“identifying a plurality of uniform resource locators (URLs) stored within the selected bucket of the table” (Zhu: col. 5, lines 31-37)
(col. 5, lines 31-37: Data structure for storing URLs. Referring to FIG. 1, a three-layer data structure 100 is illustrated. Base layer 102 (“the table”) of data structure 100 comprises a sequence of segments 112. In one embodiment, each segment 112 comprises more than two hundred million uniform resource locations (URLs) (“identifying a plurality of uniform resource locators (URLs) stored within the selected bucket”). Together, segments 112 represent a substantial percentage of the addressable URLs in the entire Internet.)
“for each of the plurality of identified URLs, processing the identified URL” (Zhu: col. 6, lines 38-46; col. 10, lines 5-12, lines 25-41)
(col. 5, lines 38-46: Periodically (e.g., daily) one of the segments 112 is deployed for crawling (“processing”) purposes, as described in more detail below. In addition to segments 112, there exists a daily crawl layer 104. In one embodiment, daily crawl layer 104 comprises more than fifty million URLs. Daily crawl layer 104 comprises the URLs that are to be crawled more frequently (“processing the identified URL”) than the URLs in segments 112. In addition, daily crawl layer 104 comprises high priority URLs that are discovered by system 200 during a current epoch.
	The fact that all the URLs of the selected segment are crawled periodically reads on “for each of the plurality of identified URLs, processing the identified URL”.)
Zhu does not explicitly teach; however, Wong-645 discloses, “a plurality of buckets within a hash table” ([0050]; Fig. 7)
([0050] FIG. 7 is a block diagram of a hash table, consistent with the present invention, with URL buckets (“a plurality of buckets”). In one embodiment, hash table 700 is comprised of multiple buckets (“a plurality of buckets within a hash table”) such as, for example, URL buckets 710, 720 and 730. Each URL bucket contains entries as described in connection with FIG. 6.)
Zhu and Wong-645 are analogous art because they both are directed to the same field of crawling Uniform Resource Locators. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date to have combined the teachings of Wong-645 with the method/system of Zhu in order to provide users with a means for receiving a plurality of objects from an origin server, computing a hash value based on source information about an object, and storing ... (Wong-645: [0017]).
Zhu in view of Wong-645 teaches, “selecting one of plurality of buckets within a hash table to be reviewed; identifying a plurality of uniform resource locators (URLs) stored within the selected bucket of the hash table”.
Zhu-Wong-645 does not explicitly teach; however, Wong-113 teaches, “where the overall score for the identified URL is determined utilizing … global information associated with all URLs” ([0090])
([0090] The overall score or downloading priority for an outlinked URL is calculated (“the overall score for the identified URL is determined”) in response to at least some of the individual metric scores described above (“utilizing … global information associated with all URLs”). In practice, an overall score can be generated using any combination of the metrics described above (and, in some embodiments, in addition to other suitable metrics). For this example, the overall score is calculated in response to the domain density score, the anchor text score, the URL string score, and the category need score, and the overall score is also influenced by the link proximity score. In one embodiment, a "combined" score is generated from the domain density, anchor text, URL string, and ...)
Zhu teaches “a page score for the identified URL” (Zhu: col. 7, line 54-col. 8, line 3)
(col. 7, line 54-col. 8, line 3: In some embodiments, it is not possible to crawl all the URLs in an active segment 112, daily layer 104, and real-time layer 106 during a given epoch. In one embodiment, this problem is addressed using two different approaches. In the first approach, a crawl score is computed for each URL (“a page score for the identified URL”) in active segment 112, daily layer 104, and real-time layer 106. Only those URLs that receive a high crawl score (e.g., above a threshold value) are passed on to the next stage (URL managers 204, FIG. 2). In the second approach, URL scheduler 202 refines an optimum crawl frequency for each such URL and passes the crawl frequency information on to URL managers 204. The crawl frequency information is then ultimately used by URL managers 204 to decide which URLs to crawl. These two approaches are not mutually exclusive and a combined methodology for prioritizing the URLs to crawl (based on both the crawl score and the optimum crawl frequency) may be used.)
Zhu-Wong-645 and Wong-113 are analogous art because they both are directed to the same field of crawling Uniform Resource Locators. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date to have combined the teachings of Wong-113 with the method/system of Zhu-Wong-645 in order to provide users with a means for calculating overall score or downloading priority for each URL using at least some of the individual scores for the respective URL as shown in [Wong-113: para 0003].
Zhu-Wong-645-Wong-113 does not explicitly teach; however, Papadimitriou discloses,
“determining that an overall score for the identified URL meets or exceeds a threshold score” ([0083])
([0083] Thus, for a given topic associated with the query cluster of extended seed query string 418, a metric or score associated with a dominance of a single URL is calculated. In one example, percentile scores of D=0.9 (e.g., high rank score) and C=0.6 (e.g., average click score) may be determined for steps 802 and 804, and a domain importance score of 0.73 may be determined for step 806, indicating a relatively high level of URL dominance. Values of the URL dominance score that are less than 0.5 may indicate a relatively low level of URL dominance, while values of the URL dominance score that are greater than 0.5 may indicate a relatively high level of URL dominance (“determining that an overall score for the identified URL meets or exceeds a threshold score”) (1.0 indicates greatest URL dominance, 0.0 indicates no URL dominance). In other ...)
Zhu-Wong-645-Wong-113 and Papadimitriou are analogous art because they both are directed to the same field of processing Uniform Resource Locators. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date to have combined the teachings of Papadimitriou with the method/system of Zhu-Wong-645-Wong-113 in order to provide users with a means for generating recommendations of topics that lack or are unassociated with a dominant URL, and providing a set of relevant keywords and/or recommendations for titles or snippets that may be used to create a web page for the recommended topics as shown in [Papadimitriou: para 0011].
	Zhu-Wong-645-Wong-113 in view of Papadimitriou teaches “for each of the plurality of identified URLs, processing the identified URL in response to determining that an overall score for the identified URL meets or exceeds a threshold score, where the overall score for the identified URL is determined utilizing a page score for the identified URL and global information associated with all URLs within the hash table”.
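For illustration only, the combination mapped above for claim 20 (selecting a bucket in round-robin fashion, identifying the URLs stored in it, and processing only those whose overall score meets or exceeds a threshold) can be sketched as follows. None of the references gives this code; the names `select_bucket` and `process_bucket`, the caller-supplied `overall_score` function, and the sample scores are assumptions introduced here.

```python
def select_bucket(buckets, epoch: int):
    """Round-robin selection: a different bucket becomes active each epoch."""
    return buckets[epoch % len(buckets)]


def process_bucket(urls, overall_score, threshold: float):
    """Return the URLs whose overall score meets or exceeds the threshold."""
    return [url for url in urls if overall_score(url) >= threshold]


# Hypothetical table of three buckets and hypothetical per-URL overall scores.
buckets = [
    ["http://example.com/a", "http://example.com/b"],
    ["http://example.com/c"],
    ["http://example.com/d", "http://example.com/e"],
]
scores = {
    "http://example.com/a": 0.9, "http://example.com/b": 0.2,
    "http://example.com/c": 0.5, "http://example.com/d": 0.4,
    "http://example.com/e": 0.7,
}

active = select_bucket(buckets, epoch=3)            # wraps back to the first bucket
to_process = process_bucket(active, scores.get, threshold=0.5)
```

How the overall score combines a page score with global information is left to the caller in this sketch, since the cited passages describe several possible metrics without fixing one formula.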

As to claim 25, Zhu-Wong-645 teaches “wherein conditionally processing the identified URL includes:”
“wherein: the global information includes a time since a last crawl, and a percentage of a global crawl goal that is not currently met” (Zhu: col. 14, lines 18-43; col. 12, lines 27-55)
(col. 14, lines 18-43: Referring to FIG. 5C, a history log 218 comprises a record 520 for each URL 522 that has been crawled by a robot 208. As illustrated in FIG. 5C, there are a wide range of possible fields that can be included in each record 520. One field is crawl status 524. Crawl status 524 indicates whether the corresponding URL 522 has been successfully crawled. Other field is the content checksum 526, also known as the content fingerprint. When pages have identical content, they will also have the same content fingerprint 526. URL scheduler 202 can compare these content fingerprint with a previous content fingerprint obtained for the corresponding URL (identified by URL fingerprint 522 in the history log record 520) on a previous crawl to ascertain whether the web page has changed since the last crawl. Similarly, URL scheduler 202 can use link checksum 530 to determine whether any of the outbound links on the web page associated with the corresponding URL 522 have changed since the last crawl. Source 532 provides an indication of whether robot 208 accessed the URL using the Internet or an internal repository of URLs. "Time taken to download" 534 provides an indication of how long it took a robot 208 to download the web page associated with the corresponding URL in the last crawl (“a time since a last crawl”). Error condition 536 records any errors that were encountered by a robot 208 during the crawl. An example of an error condition is "HTTP 404", which indicates that the web page does not exist.
col. 12, lines 27-55: Step 308. In step 308, a plurality of robots 208 crawl URLs that are provided to the robots 208 by URL server 206. In some embodiments, robots 208 use a calling process that requires domain name system (DNS) resolution. DNS resolution is the process by which host names (URLs) are resolved into their Internet Protocol (IP) addresses using a database that provides a mapping between host names (URLs) and IP addresses. In some embodiments, enhancements to known DNS resolution schemes are provided in order to prevent DNS resolution from becoming a bottleneck to the web crawling process, in which hundreds of millions of URLs must be resolved in a matter of hours. One of these enhancements is the use of a dedicated local database 250 (FIG. 2) that stores the IP addresses for URLs that have been crawled by system 200 in the past, which reduces the system's reliance on DNS servers on the Internet. This allows URLs that have been previously crawled by system 200 to be pre-resolved with respect to DNS resolution. The use of a local DNS resolution database 250 enables a high percentage of the system's DNS resolution operations to be handled locally, at very high speed. Only those URLs that are not represented on local DNS database 250 (e.g., because they have not been previously crawled) (“a percentage of a global crawl goal that is not currently met”) are resolved using conventional DNS resources of the Internet. As a result, the IP addresses of URLs are readily accessible when they are needed by a robot 208. Also, the system presents a much lower load on the DNS servers that would otherwise be needed to perform DNS resolution on every URL to be crawled.)
“the metadata for the identified URL includes a page score for the URL based on a link authority for the URL, a last crawl time for the URL, and changes made to a page associated with the URL since a last crawl” (Zhu: col. 7, line 54-col. 8, line 3; col. 6, line 60-col. 7, line 11)
(col. 7, line 54-col. 8, line 3: In some embodiments, it is not possible to crawl all the URLs in an active segment 112, daily layer 104, and real-time layer 106 during a given epoch. In one embodiment, this problem is addressed using two different approaches. In the first approach, a crawl score is computed for each URL (“a page score for the URL”) in active segment 112, daily layer 104, and real-time layer 106. Only those URLs that receive a high crawl score (e.g., above a threshold value) are passed on to the next stage (URL managers 204, FIG. 2). In the second approach, URL scheduler 202 refines an optimum crawl 
col. 6, line 60-col. 7, line 11: The mechanism by which URL scheduler 202 obtains URL change frequency data is best understood by reviewing FIG. 2. When a URL is accessed by a robot 208, the information is passed through content filters 210. Content filters 210, among other things, determine whether a URL has changed and when a URL was last accessed by a robot 208. This information is placed in history logs 218, which are passed back to URL scheduler 202. By reviewing the log records for a particular URL, each of which indicates whether the content of a URL changed since the immediately previous time the URL was crawled (“a last crawl time”), the URL scheduler 202 (or other module) can compute a URL change frequency. This technique is particularly useful for identifying URL's whose content (i.e., the content of the page at the URL) changes very infrequently (“changes made to a web page associated with the URL since last crawl”), or perhaps not at all. Furthermore, the computation of a URL change frequency can include using supplemental information about the URL. For instance, the URL scheduler 202 may maintain or access information about web sites (i.e., URL's) whose content is known to change quickly.
col. 21, lines 48-52: Documents are generally assigned DocIDs in the order in which their respective URLs are crawled, so this results in earlier crawled documents (which may have been scheduled to be crawled earlier due to their authority or importance) (“a link authority for the URL”) having lower DocIDs.)
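For illustration only, the two mechanisms quoted above from Zhu (passing on only URLs whose crawl score is above a threshold, and computing a change frequency from history log records) may be sketched as follows. The function names, the sample data, and the use of a simple fraction of changed crawls as the change frequency are assumptions of this sketch; Zhu does not commit to a particular formula in the cited passages.

```python
def change_frequency(changed_flags):
    """Fraction of logged crawls on which the content had changed (assumed estimator)."""
    if not changed_flags:
        return 0.0
    return sum(1 for flag in changed_flags if flag) / len(changed_flags)


def urls_above_threshold(crawl_scores, threshold):
    """Keep only URLs whose crawl score is above the threshold value."""
    return [url for url, score in crawl_scores.items() if score > threshold]


# Hypothetical history: one flag per prior crawl, True if the content had changed.
print(change_frequency([True, False, False, True]))
print(urls_above_threshold({"a.example": 0.9, "b.example": 0.2}, 0.5))
```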
Zhu-Wong-645 does not explicitly teach the following; however, Wong-113 teaches,
“determining an overall score for the identified URL, based on global information associated with a full set of known URLs in addition to metadata for the identified URL” ([0090])
([0090] The overall score or downloading priority for an outlinked URL is calculated (“determining an overall score for the identified URL”) in response to at least some of the individual metric scores described above (“on global information associated with a full set of known URLs in addition to metadata for the identified URL”). In practice, an overall score can be generated using any combination of the metrics described above (and, in some embodiments, in addition to other suitable metrics). For this example, the overall score is calculated in response to the domain density score, the anchor text score, the URL string score, and the category need score, and the overall score is also influenced by the link proximity score. In one embodiment, a "combined" score is generated from the domain density, anchor text, URL string, and category need scores, and that combined score is adjusted using the link proximity score to obtain the downloading priority.)
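For illustration only, the combination described in Wong-113 [0090] (a combined score generated from four metric scores and then adjusted by the link proximity score) may be sketched as follows. Wong-113 [0090] does not state the combining formula; the unweighted mean and the multiplicative adjustment used here are assumptions of this sketch, as are the function and parameter names.

```python
def overall_score(domain_density, anchor_text, url_string, category_need, link_proximity):
    """Downloading priority for an outlinked URL (assumed formula, see lead-in)."""
    # Assumption: the "combined" score is the unweighted mean of the four metrics.
    combined = (domain_density + anchor_text + url_string + category_need) / 4.0
    # Assumption: the link proximity score scales the combined score.
    return combined * link_proximity


print(overall_score(0.5, 0.5, 0.5, 0.5, 0.5))
```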
“the global information includes… a total number of pages crawled within the table, a total number of pages not crawled within the table” ([0107] [0003])
([0107] As described previously, web crawling process 300 may be suitably formatted to generate a plurality of scores for the URL, where each of the scores indicates a different measure related to whether the URL corresponds to a desired web page type, such as a commercial product or a subcategory of commercial products. Again, any number of scoring metrics may be employed by an embodiment of a web crawler system. In this example, process 300 generates a domain density score (task 312) in response to the domain of the URL being analyzed. As explained above in connection with the domain density metric, task 312 may be performed to obtain a ratio of a number of indexed pages to a number of processed pages (where the number of indexed pages represents a number of web pages from the domain having the desired web page type, and where the number of processed pages represents a total number of web pages from the domain processed by the web crawling system (“a total number of web pages crawled”)). In addition, task 312 may be performed to calculate the domain density score from the ratio using an appropriate algorithm or formula.
[0003] Techniques and technologies described herein are applicable for use in connection with a web crawler application. The web crawler application is controlled to download web pages in a targeted and prioritized manner that focuses on at least one designated category or type of web page. The web crawler application employs or is influenced by a suitably configured URL scoring module that makes crawling and indexing Internet documents more efficient, thus enabling such indexing to be performed with less computation and hardware. The URL scoring module achieves higher efficiency by predicting the location of target documents and directing the web crawler application towards these documents. The URL scoring module generates different scores (using different techniques or metrics) for URLs that identify web pages that have not yet been downloaded by the web crawler (“a total number of web pages not crawled”). An overall score or downloading priority is calculated for each URL using at least some of the individual scores for the respective URL. The web crawler application downloads URLs in an order that is influenced by the overall scores.)
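For illustration only, the ratio described in Wong-113 [0107] (number of indexed pages to number of processed pages for a domain) may be sketched as follows. The function name and the handling of a domain with no processed pages are assumptions of this sketch; the further step of deriving the domain density score from the ratio is left out because the cited paragraph does not give the formula.

```python
def domain_density_ratio(indexed_pages: int, processed_pages: int) -> float:
    """Ratio of indexed pages (desired page type) to all processed pages for a domain."""
    if processed_pages == 0:
        # Assumption: a domain with no processed pages is given a ratio of 0.0.
        return 0.0
    return indexed_pages / processed_pages


print(domain_density_ratio(25, 100))
```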
Wong-645 discloses, “a hash table” ([0050]; Fig. 7)
([0050] FIG. 7 is a block diagram of a hash table, consistent with the present invention, with URL buckets. In one embodiment, hash table 700 (“a hash table”) is comprised of multiple buckets such as, for example, URL buckets 710, 720 and 730. Each URL bucket contains entries as described in connection with FIG. 6.)
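For illustration only, a hash table made up of URL buckets, of the general kind referred to in Wong-645 [0050], may be sketched as follows. The class name, the number of buckets, the use of SHA-256 to produce a digest, and the rule for selecting a bucket are all assumptions of this sketch and are not taken from Wong-645.

```python
import hashlib


class UrlHashTable:
    """Hash table of URL buckets; each URL is stored once, keyed by its digest."""

    def __init__(self, num_buckets: int = 4):
        self.buckets = [dict() for _ in range(num_buckets)]

    @staticmethod
    def digest(url: str) -> bytes:
        # Assumption: SHA-256 converts the URL into a fixed-length byte sequence.
        return hashlib.sha256(url.encode("utf-8")).digest()

    def _bucket(self, d: bytes) -> dict:
        # Assumption: the first four digest bytes select the bucket.
        return self.buckets[int.from_bytes(d[:4], "big") % len(self.buckets)]

    def put(self, url: str, data: dict) -> None:
        d = self.digest(url)
        self._bucket(d)[d] = data  # data is stored together with the digest

    def get(self, url: str):
        d = self.digest(url)
        return self._bucket(d).get(d)

    def __len__(self) -> int:
        return sum(len(bucket) for bucket in self.buckets)


table = UrlHashTable()
table.put("http://example.com/", {"crawled": False})
table.put("http://example.com/", {"crawled": True})  # same URL, same entry
print(len(table), table.get("http://example.com/"))
```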
Zhu-Wong-645 and Wong-113 are analogous art because they both are directed to the same field of crawling Uniform Resource Locators. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date to have combined the teachings of Wong-113 with the method/system of Zhu-Wong-645 in order to provide users with a means for calculating an overall score or downloading priority for each URL using at least some of the individual scores for the respective URL as shown in [Wong-113: para 0003].
Zhu-Wong-645 in view of Wong-113 teaches “wherein: the global information includes a time since a last crawl, a total number of pages crawled within the hash table, a total number of pages not crawled within the hash table, and a percentage of a global crawl goal that is not currently met”.
Zhu-Wong-645-Wong-113 does not explicitly teach the following; however, Papadimitriou teaches,
“comparing the overall score to a threshold score, where the threshold score is an average overall score for all URLs within the table, and processing the URL in response to determining that the overall score exceeds the threshold score” ([0083])
([0083] Thus, for a given topic associated with the query cluster of extended seed query string 418, a metric or score associated with a dominance of a single URL is calculated. In one example, percentile scores of D=0.9 (e.g., high rank score) and C=0.6 (e.g., average click score) may be determined for steps 802 and 804, and a domain importance score of 0.73 may be determined for step 806, indicating a relatively high level of URL dominance. Values of the URL dominance score that are less than 0.5 may indicate a relatively low level of URL dominance, while values of the URL dominance score that are greater than 0.5 may indicate a relatively high level of URL dominance (“the threshold score is an average overall score… processing the URL in response to determining that the overall score exceeds the threshold score”) (1.0 indicates greatest URL dominance, 0.0 indicates no URL dominance). In other embodiments, a URL dominance importance score may be determined in other ways, using the same information or different information.)
Wong-645 discloses, “a hash table” ([0050]; Fig. 7)
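([0050] FIG. 7 is a block diagram of a hash table, consistent with the present invention, with URL buckets. In one embodiment, hash table 700 (“a hash table”) is comprised of multiple buckets such as, for example, URL buckets 710, 720 and 730. Each URL bucket contains entries as described in connection with FIG. 6.)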
Zhu-Wong-645-Wong-113 and Papadimitriou are analogous art because they both are directed to the same field of crawling Uniform Resource Locators. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date to have combined the teachings of Papadimitriou with the method/system of Zhu-Wong-645-Wong-113 in order to provide users with a means for generating recommendations of topics that lack or are unassociated with a dominant URL, and providing a set of relevant keywords and/or recommendations for titles or snippets that may be used to create a web page for the recommended topics as shown in [Papadimitriou: para 0011].
Zhu-Wong-645-Wong-113 in view of Papadimitriou teaches “comparing the overall score to a threshold score, where the threshold score is an average overall score for all URLs within the hash table, and processing the URL in response to determining that the overall score exceeds the threshold score”.
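For illustration only, the recited comparison (a threshold equal to the average overall score for all URLs in the table, with a URL processed only when its overall score exceeds that threshold) may be sketched as follows. The function name and the sample scores are hypothetical and are not taken from the claims or from the cited references.

```python
def urls_to_process(overall_scores):
    """Return URLs whose overall score exceeds the average overall score of all URLs."""
    if not overall_scores:
        return []
    threshold = sum(overall_scores.values()) / len(overall_scores)
    return [url for url, score in overall_scores.items() if score > threshold]


# Hypothetical overall scores; their average is 3.0.
scores = {"a.example": 1.0, "b.example": 2.0, "c.example": 6.0}
print(urls_to_process(scores))
```

A URL whose score equals the average is not processed in this sketch, because the recited condition is that the overall score exceeds the threshold.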

As to claim 26, Zhu teaches “wherein:”
“each bucket represents a range of URLs, and each URL is represented once within the hash table, each of the URLs is stored within the bucket as a digest, each of the URLs is converted into a sequence of bytes represented by the digest utilizing a hash function, the data associated with each of the URLs is linked to its associated digest and is stored with the digest within the hash table” (Zhu: col. 7, lines 30-53; col. 9, line 65-col. 10, line 41; col. 22, line 59-col. 23, line 47; Wong-645: [0050]; Fig. 7)
“the data associated with each of the URLs includes metadata including: a page score for the URL based on a link authority for the URL, a last crawl time for the URL, and changes made to a page associated with the URL since a last crawl” (col. 7, line 54-col. 8, line 3; col. 6, line 60-col. 7, line 11)
(col. 7, line 54-col. 8, line 3: In some embodiments, it is not possible to crawl all the URLs in an active segment 112, daily layer 104, and real-time layer 106 during a given epoch. In one embodiment, this problem is addressed using two different approaches. In the first approach, a crawl score is computed for each URL (“a page score for the URL”) in active segment 112, daily layer 104, and real-time layer 106. Only those URLs that receive a high crawl score (e.g., above a threshold value) are passed on to the next stage (URL managers 204, FIG. 2). In the second approach, URL scheduler 202 refines an optimum crawl frequency for each such URL and passes the crawl frequency information on to URL managers 204. The crawl frequency information is then ultimately used by URL managers 204 to 
col. 6, line 60-col. 7, line 11: The mechanism by which URL scheduler 202 obtains URL change frequency data is best understood by reviewing FIG. 2. When a URL is accessed by a robot 208, the information is passed through content filters 210. Content filters 210, among other things, determine whether a URL has changed and when a URL was last accessed by a robot 208. This information is placed in history logs 218, which are passed back to URL scheduler 202. By reviewing the log records for a particular URL, each of which indicates whether the content of a URL changed since the immediately previous time the URL was crawled (“a last crawl time”), the URL scheduler 202 (or other module) can compute a URL change frequency. This technique is particularly useful for identifying URL's whose content (i.e., the content of the page at the URL) changes very infrequently (“changes made to a web page associated with the URL since last crawl”), or perhaps not at all. Furthermore, the computation of a URL change frequency can include using supplemental information about the URL. For instance, the URL scheduler 202 may maintain or access information about web sites (i.e., URL's) whose content is known to change quickly.
col. 21, lines 48-52: Documents are generally assigned DocIDs in the order in which their respective URLs are crawled, so this results in earlier crawled documents (which may have been scheduled to be crawled earlier due to their authority or importance) (“a link authority for the URL”) having lower DocIDs.)
“the global information includes a time since a last crawl, and a percentage of a global crawl goal that is not currently met” (col. 14, lines 18-43; col. 12, lines 27-55)
(col. 14, lines 18-43: Referring to FIG. 5C, a history log 218 comprises a record 520 for each URL 522 that has been crawled by a robot 208. As illustrated in FIG. 5C, there are a wide range of possible fields that can be included in each record 520. One field is crawl status 524. Crawl status 524 indicates whether the corresponding URL 522 has been successfully crawled. Another field is the content checksum 526, also known as the content fingerprint. When pages have identical content, they will also have the same content fingerprint 526. URL scheduler 202 can compare this content fingerprint with a previous content fingerprint obtained for the corresponding URL (identified by URL fingerprint 522 in the history log record 520) on a previous crawl to ascertain whether the web page has changed since the last crawl. Similarly, URL scheduler 202 can use link checksum 530 to determine whether any of the outbound links on the web page associated with the corresponding URL 522 have changed since the last crawl. Source 532 provides an indication of whether robot 208 accessed the URL using the Internet or an internal repository of URLs. "Time taken to download" 534 provides an indication of how long it took a robot 208 to download the web page associated with the corresponding URL in the last crawl (“a time since a last crawl”). Error condition 536 records any errors that were encountered by a robot 208 during the crawl. An example of an error condition is "HTTP 404", which indicates that the web page does not exist.
col. 12, lines 27-55: Step 308. In step 308, a plurality of robots 208 crawl URLs that are provided to the robots 208 by URL server 206. In some embodiments, robots 208 use a calling process that requires domain name system (DNS) resolution. DNS resolution is the process by which host names (URLs) are resolved into their Internet Protocol (IP) addresses using a database that provides a mapping between host names (URLs) and IP addresses. In some embodiments, enhancements to known DNS resolution schemes are provided in order to prevent DNS resolution from becoming a bottleneck to the web crawling process, in which hundreds of millions of URLs must be resolved in a matter of hours. One of these enhancements is the use of a dedicated local database 250 (FIG. 2) that stores the IP addresses for URLs that have been crawled by system 200 in the past, which reduces the system's reliance on DNS servers on the Internet. This allows URLs that have been previously crawled by system 200 to be pre-resolved with respect to DNS resolution. The use of a local DNS resolution database 250 enables a high percentage of the system's DNS resolution operations to be handled locally, at very high speed. Only those URLs that are not represented on local DNS database 250 (e.g., because they have not been previously crawled) (“a percentage of a global crawl goal that is not currently met”) are resolved using conventional DNS resources of the Internet. As a result, the IP addresses of URLs are readily accessible when they are needed by a robot 208. Also, the system presents a much lower load on the DNS servers that would otherwise be needed to perform DNS resolution on every URL to be crawled.)
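For illustration only, the local-first DNS resolution described in the cited passage of Zhu (hosts found in a local database are resolved locally, and only the remaining hosts are resolved using conventional DNS resources) may be sketched as follows. The function names, the in-memory dictionary standing in for local database 250, the stand-in external resolver, and the step of saving a newly resolved address are assumptions of this sketch.

```python
def resolve(host, local_db, external_lookup):
    """Resolve a host from the local database first; fall back to an external lookup."""
    ip = local_db.get(host)
    if ip is None:
        # Host not represented locally: use the conventional (external) resolver.
        ip = external_lookup(host)
        # Assumption of this sketch: keep the answer for later crawls.
        local_db[host] = ip
    return ip


external_calls = []


def fake_external_lookup(host):
    """Stand-in for a real DNS query so that the sketch runs without network access."""
    external_calls.append(host)
    return "203.0.113.7"


local_db = {"example.com": "192.0.2.10"}  # hypothetical previously crawled host
print(resolve("example.com", local_db, fake_external_lookup))
print(resolve("example.org", local_db, fake_external_lookup))
print(external_calls)
```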
Zhu-Wong-645 does not explicitly teach the following; however, Wong-113 teaches,
“determining an overall score for the identified URL, based on global information associated with a full set of known URLs in addition to the metadata for the identified URL” ([0090])
([0090] The overall score or downloading priority for an outlinked URL is calculated (“determining an overall score for the identified URL”) in response to at least some of the individual metric scores described above (“on global information associated with a full set of known URLs in addition to metadata for the identified URL”). In practice, an overall score can be generated using any combination of the metrics described above (and, in some embodiments, in addition to other suitable metrics). For this example, the overall score is calculated in response to the domain density score, the anchor text score, the URL string score, and the category need score, and the overall score is also influenced by the link proximity score. In one embodiment, a "combined" score is generated from the domain density, anchor text, URL string, and category need scores, and that combined score is adjusted using the link proximity score to obtain the downloading priority.)
“the global information including… a total number of pages crawled within the hash table, a total number of pages not crawled within the hash table” ([0107] [0003])
([0107] As described previously, web crawling process 300 may be suitably formatted to generate a plurality of scores for the URL, where each of the scores indicates a different measure related to whether the URL corresponds to a desired web page type, such as a commercial product or a subcategory of commercial products. Again, any number of scoring metrics may be employed by an embodiment of a web crawler system. In this example, process 300 generates a domain density score (task 312) in response to the domain of the URL being analyzed. As explained above in connection with the domain density metric, task 312 may be performed to obtain a ratio of a number of indexed pages to a number of processed pages (where the number of indexed pages represents a number of web pages from the domain having the desired web page type, and where the number of processed pages represents a total number of web pages from the domain processed by the web crawling system (“a total number of web pages crawled”)). In addition, task 312 may be performed to calculate the domain density score from the ratio using an appropriate algorithm or formula.
[0003] Techniques and technologies described herein are applicable for use in connection with a web crawler application. The web crawler application is controlled to download web pages in a targeted and prioritized manner that focuses on at least one designated category or type of web page. The web crawler application employs or is influenced by a suitably configured URL scoring module that makes crawling and indexing Internet documents more efficient, thus enabling such indexing to be performed with less computation and hardware. The URL scoring module achieves higher efficiency by predicting the location of target documents and directing the web crawler application towards these documents. The URL scoring module generates different scores (using different techniques or metrics) for URLs that identify web pages that have not yet been downloaded by the web crawler (“a total number of web pages not crawled”). An overall score or downloading priority is calculated for each URL using at least some of the individual scores for the respective URL. The web crawler application downloads URLs in an order that is influenced by the overall scores.)
Zhu-Wong-645 and Wong-113 are analogous art because they both are directed to the same field of crawling Uniform Resource Locators. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date to have combined the teachings of Wong-113 with the method/system of Zhu-Wong-645 in order to provide users with a means for calculating an overall score or downloading priority for each URL using at least some of the individual scores for the respective URL as shown in [Wong-113: para 0003].
Wong-645 discloses, “a hash table” ([0050]; Fig. 7)
([0050] FIG. 7 is a block diagram of a hash table, consistent with the present invention, with URL buckets. In one embodiment, hash table 700 (“a hash table”) is comprised of multiple buckets such as, for example, URL buckets 710, 720 and 730. Each URL bucket contains entries as described in connection with FIG. 6.)
Zhu-Wong-645 in view of Wong-113 teaches “determining an overall score for the identified URL, based on global information associated with a full set of known URLs in addition to the metadata for the identified URL, the global information including a time since a last crawl, a total number of pages crawled within the hash table, a total number of pages not crawled within the hash table, and a percentage of a global crawl goal that is not currently met”.
Zhu-Wong-645-Wong-113 does not explicitly teach the following; however, Papadimitriou teaches,
“comparing the overall score to a threshold score, where the threshold score is an average overall score for all URLs within the table, and processing the URL in response to determining that the overall score exceeds the threshold score” ([0083])
([0083] Thus, for a given topic associated with the query cluster of extended seed query string 418, a metric or score associated with a dominance of a single URL is calculated. In one example, percentile scores of D=0.9 (e.g., high rank score) and C=0.6 (e.g., average click score) may be determined for steps 802 and 804, and a domain importance score of 0.73 may be determined for step 806, indicating a relatively high level of URL dominance. Values of the URL dominance score that are less than 0.5 may indicate a relatively low level of URL dominance, while values of the URL dominance score that are greater than 0.5 may indicate a relatively high level of URL dominance (“the threshold score is an average overall score… processing the URL in response to determining that the overall score exceeds the threshold score”) (1.0 indicates greatest URL dominance, 0.0 indicates no URL dominance). In other embodiments, a URL dominance importance score may be determined in other ways, using the same information or different information.)
Zhu-Wong-645-Wong-113 and Papadimitriou are analogous art because they both are directed to the same field of crawling Uniform Resource Locators. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date to have combined the teachings of Papadimitriou with the method/system of Zhu-Wong-645-Wong-113 in order to provide users with a means for generating recommendations of topics that lack or are unassociated with a dominant URL, and providing a set of relevant keywords and/or recommendations for titles or snippets that may be used to create a web page for the recommended topics as shown in [Papadimitriou: para 0011].
Zhu-Wong-645-Wong-113 in view of Papadimitriou teaches, “comparing the overall score to a threshold score, where the threshold score is an average overall score for all URLs within the hash table, and processing the URL in response to determining that the overall score exceeds the threshold score”.


Conclusion
	THIS ACTION IS MADE FINAL.  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  

	Any inquiry concerning this communication or earlier communications from the examiner should be directed to BAO G TRAN whose telephone number is (571)270-3493.  The examiner can normally be reached on Mon-Fri 6:30-3:00.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Boris Gorney can be reached on (571)270-5626.  The fax phone number for the 
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.


/BORIS GORNEY/Supervisory Patent Examiner, Art Unit 2158



/BAO G TRAN/Patent Examiner of Art Unit 2158