Record Details

Focused crawling: a new approach to topic-specific Web resource discovery

DSpace at IIT Bombay

 
 
Field Value
 
Title Focused crawling: a new approach to topic-specific Web resource discovery
 
Creator CHAKRABARTI, SOUMEN
BERG, MARTIN VAN DEN
DOM, BYRON
 
Subject data reduction
data structures
hypertext systems
search engines
 
Description The rapid growth of the World-Wide Web poses unprecedented scaling challenges for general-purpose crawlers and search engines. In this paper we describe a new hypertext resource discovery system called a Focused Crawler. The goal of a focused crawler is to selectively seek out pages that are relevant to a pre-defined set of topics. The topics are specified not using keywords, but using exemplary documents. Rather than collecting and indexing all accessible Web documents to be able to answer all possible ad-hoc queries, a focused crawler analyzes its crawl boundary to find the links that are likely to be most relevant for the crawl, and avoids irrelevant regions of the Web. This leads to significant savings in hardware and network resources, and helps keep the crawl more up-to-date.
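The idea of ranking the crawl boundary by likely relevance can be illustrated with a small sketch. Everything here is an assumption made for illustration rather than something taken from the paper: the in-memory `TOY_WEB`, the word-overlap `relevance` score standing in for a trained classifier, the `threshold` and `budget` parameters, and the rule that a link inherits the score of the page it was found on.

```python
import heapq

# Hypothetical toy web: page id -> (page text, outgoing links). Not real data.
TOY_WEB = {
    "seed": ("cycling news and bike races", ["bike1", "cars1"]),
    "bike1": ("bike repair and cycling routes", ["bike2"]),
    "bike2": ("mountain bike cycling trails", []),
    "cars1": ("car dealers and engines", ["cars2"]),
    "cars2": ("engine tuning for cars", []),
}


def build_vocabulary(example_docs):
    """Collect the words of the example documents that define the topic."""
    vocab = set()
    for doc in example_docs:
        vocab.update(doc.lower().split())
    return vocab


def relevance(text, vocab):
    """Fraction of words on the page that occur in the topic vocabulary."""
    words = text.lower().split()
    if not words:
        return 0.0
    return sum(w in vocab for w in words) / len(words)


def focused_crawl(web, seeds, vocab, threshold=0.3, budget=10):
    """Visit pages in order of the relevance of the page that linked to them."""
    # heapq is a min-heap, so priorities are stored negated.
    frontier = [(-1.0, url) for url in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    visited = []
    while frontier and len(visited) < budget:
        _, url = heapq.heappop(frontier)
        text, links = web[url]
        score = relevance(text, vocab)
        visited.append((url, score))
        if score < threshold:
            continue  # prune: links on irrelevant pages are not followed
        for link in links:
            if link in web and link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-score, link))
    return visited


vocab = build_vocabulary(["cycling bike trails", "bike races"])
for url, score in focused_crawl(TOY_WEB, ["seed"], vocab):
    print(url, round(score, 2))
```

In this toy run the page `cars2` is never fetched, because the only page linking to it scores below the threshold. An unfocused breadth-first crawl from the same seed would have visited all five pages.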

To achieve such goal-directed crawling, we designed two hypertext mining programs that guide our crawler: a classifier that evaluates the relevance of a hypertext document with respect to the focus topics, and a distiller that identifies hypertext nodes that are great access points to many relevant pages within a few links. We report on extensive focused-crawling experiments using several topics at different levels of specificity. Focused crawling acquires relevant pages steadily while standard crawling quickly loses its way, even though they are started from the same root set. Focused crawling is robust against large perturbations in the starting set of URLs. It discovers largely overlapping sets of resources in spite of these perturbations. It is also capable of exploring out and discovering valuable resources that are dozens of links away from the start set, while carefully pruning the millions of pages that may lie within this same radius. Our anecdotes suggest that focused crawling is very effective for building high-quality collections of Web documents on specific topics, using modest desktop hardware.
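As a rough illustration of what a distiller might compute, the sketch below scores pages as access points using a relevance-weighted hub/authority iteration in the style of HITS. This record does not state which algorithm the distiller uses, so the choice of HITS-style scoring, the function name `hub_scores`, the toy link graph, and the relevance values are all assumptions.

```python
def hub_scores(links, relevance, iterations=20):
    """Score pages as access points to relevant pages (HITS-style sketch).

    links: dict mapping a page to the list of pages it links to.
    relevance: dict mapping a page to a relevance score in [0, 1].
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # A page gains authority from hubs linking to it, scaled by its relevance.
        auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for t in targets:
                auth[t] += hub[src] * relevance.get(t, 0.0)
        # A page is a good hub if it links to pages with high authority.
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        # Normalize so that scores stay bounded across iterations.
        for scores in (auth, hub):
            total = sum(scores.values()) or 1.0
            for p in scores:
                scores[p] /= total
    return hub


# Hypothetical link graph and relevance scores, for illustration only.
toy_links = {"index": ["a", "b", "c"], "a": ["b"], "stray": ["x"]}
toy_relevance = {"index": 0.5, "a": 0.9, "b": 0.8, "c": 0.7, "stray": 0.1, "x": 0.0}

scores = hub_scores(toy_links, toy_relevance)
best = max(scores, key=scores.get)
print(best)
```

Here `index` receives the highest hub score because it links to several relevant pages, while `stray`, which links only to an irrelevant page, scores zero.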
 
Publisher Elsevier
 
Date 2009-05-08T02:37:19Z
2011-12-08T06:58:32Z
2011-12-26T13:01:53Z
2011-12-27T05:47:34Z
1999
 
Type Article
 
Identifier Computer Networks 31(11-16), 1623-1640
1389-1286
10.1016/S1389-1286(99)00052-3
http://hdl.handle.net/10054/1305
http://dspace.library.iitb.ac.in/xmlui/handle/10054/1305
 
Language en