Novel approaches to crawling important pages early

Authors
Alam, Md. Hijbul; Ha, JongWoo; Lee, SangKeun
Issue Date
Dec-2012
Publisher
SPRINGER LONDON LTD
Keywords
Web crawler; Crawl ordering; PageRank; Fractional PageRank
Citation
KNOWLEDGE AND INFORMATION SYSTEMS, v.33, no.3, pp.707 - 734
Indexed
SCIE
SCOPUS
Journal Title
KNOWLEDGE AND INFORMATION SYSTEMS
Volume
33
Number
3
Start Page
707
End Page
734
URI
https://scholar.korea.ac.kr/handle/2021.sw.korea/106789
DOI
10.1007/s10115-012-0535-4
ISSN
0219-1377
Abstract
Web crawlers are essential to many Web applications, such as Web search engines, Web archives, and Web directories, which maintain Web pages in their local repositories. In this paper, we study the problem of crawl scheduling that biases crawl ordering toward important pages. We propose a set of crawling algorithms for effective and efficient crawl ordering that prioritize important pages, using the well-known PageRank as the importance metric. To score URLs, the proposed algorithms utilize various features, including partial link structure, inter-host links, page titles, and topic relevance. We conduct a large-scale experiment using publicly available data sets to examine the effect of each feature on crawl ordering and to evaluate the performance of many algorithms. The experimental results verify the efficacy of our schemes. In particular, compared with the representative RankMass crawler, the FPR-title-host algorithm reduces running time by up to a factor of three while improving effectiveness by 5% in cumulative PageRank.
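As a rough illustration of what biasing crawl ordering toward important pages can mean, the sketch below keeps a priority-queue frontier and always fetches the highest-scoring unvisited URL. It is not the paper's FPR-title-host algorithm or the RankMass crawler: the function name `crawl_order`, the toy `graph`, the `damping` value, and the rule of splitting a crawled page's score evenly among its outlinks are all assumptions made for this example, and only the partial link structure seen so far is used as a feature.

```python
import heapq


def crawl_order(graph, seeds, damping=0.85):
    """Return URLs in the order a greedy, score-driven crawler would fetch them.

    graph   -- dict mapping URL -> list of outlinked URLs (toy stand-in for the Web)
    seeds   -- list of starting URLs
    damping -- fraction of a page's score passed on to its outlinks (assumed value)
    """
    # Seeds share the initial score equally.
    score = {url: 1.0 / len(seeds) for url in seeds}
    # heapq is a min-heap, so scores are stored negated to pop the largest first.
    frontier = [(-score[url], url) for url in seeds]
    heapq.heapify(frontier)

    crawled = []
    seen = set()
    while frontier:
        _, url = heapq.heappop(frontier)
        if url in seen:
            # Outdated heap entry: the URL was already fetched via a higher score.
            continue
        seen.add(url)
        crawled.append(url)

        outlinks = graph.get(url, [])
        if not outlinks:
            continue
        # Pass a damped, equal share of this page's score to each outlink.
        share = damping * score[url] / len(outlinks)
        for target in outlinks:
            if target in seen:
                continue
            score[target] = score.get(target, 0.0) + share
            heapq.heappush(frontier, (-score[target], target))
    return crawled


# Toy link graph (hypothetical): "e" is linked from both "b" and "c".
toy_graph = {
    "a": ["b", "c", "d"],
    "b": ["e"],
    "c": ["e"],
    "d": ["f"],
    "e": [],
    "f": [],
}
order = crawl_order(toy_graph, ["a"])
```

On this toy graph, `order` is `["a", "b", "c", "e", "d", "f"]`: page `e` accumulates score from two already-crawled pages and is fetched before `d`, whereas a plain breadth-first crawl would fetch `d` first.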