Novel approaches to crawling important pages early
DC Field | Value | Language |
---|---|---|
dc.contributor.author | Alam, Md. Hijbul | - |
dc.contributor.author | Ha, JongWoo | - |
dc.contributor.author | Lee, SangKeun | - |
dc.date.accessioned | 2021-09-06T12:31:48Z | - |
dc.date.available | 2021-09-06T12:31:48Z | - |
dc.date.created | 2021-06-14 | - |
dc.date.issued | 2012-12 | - |
dc.identifier.issn | 0219-1377 | - |
dc.identifier.uri | https://scholar.korea.ac.kr/handle/2021.sw.korea/106789 | - |
dc.description.abstract | Web crawlers are essential to many Web applications, such as Web search engines, Web archives, and Web directories, which maintain Web pages in their local repositories. In this paper, we study the problem of crawl scheduling that biases crawl ordering toward important pages. We propose a set of crawling algorithms for effective and efficient crawl ordering that prioritize important pages, using the well-known PageRank as the importance metric. To score URLs, the proposed algorithms utilize various features, including partial link structure, inter-host links, page titles, and topic relevance. We conduct a large-scale experiment using publicly available data sets to examine the effect of each feature on crawl ordering and to evaluate the performance of many algorithms. The experimental results verify the efficacy of our schemes. In particular, compared with the representative RankMass crawler, the FPR-title-host algorithm reduces running time by up to a factor of three while improving cumulative PageRank by 5%. | - |
dc.language | English | - |
dc.language.iso | en | - |
dc.publisher | SPRINGER LONDON LTD | - |
dc.title | Novel approaches to crawling important pages early | - |
dc.type | Article | - |
dc.contributor.affiliatedAuthor | Lee, SangKeun | - |
dc.identifier.doi | 10.1007/s10115-012-0535-4 | - |
dc.identifier.scopusid | 2-s2.0-84869092092 | - |
dc.identifier.wosid | 000310871700009 | - |
dc.identifier.bibliographicCitation | KNOWLEDGE AND INFORMATION SYSTEMS, v.33, no.3, pp.707 - 734 | - |
dc.relation.isPartOf | KNOWLEDGE AND INFORMATION SYSTEMS | - |
dc.citation.title | KNOWLEDGE AND INFORMATION SYSTEMS | - |
dc.citation.volume | 33 | - |
dc.citation.number | 3 | - |
dc.citation.startPage | 707 | - |
dc.citation.endPage | 734 | - |
dc.type.rims | ART | - |
dc.type.docType | Article | - |
dc.description.journalClass | 1 | - |
dc.description.journalRegisteredClass | scie | - |
dc.description.journalRegisteredClass | scopus | - |
dc.relation.journalResearchArea | Computer Science | - |
dc.relation.journalWebOfScienceCategory | Computer Science, Artificial Intelligence | - |
dc.relation.journalWebOfScienceCategory | Computer Science, Information Systems | - |
dc.subject.keywordAuthor | Web crawler | - |
dc.subject.keywordAuthor | Crawl ordering | - |
dc.subject.keywordAuthor | PageRank | - |
dc.subject.keywordAuthor | Fractional PageRank | - |