An approach to deep web crawling by sampling

Jianguo Lu*, Yan Wang, Jie Liang, Jessica Chen, Jiming LIU

*Corresponding author for this work

Research output: Chapter in book/report/conference proceeding › Conference proceeding › peer-review

36 Citations (Scopus)

Abstract

Crawling the deep web is the process of collecting data from search interfaces by issuing queries. With the wide availability of programmable interfaces offered through web services, deep web crawling has found a wide variety of applications. One of the major challenges in crawling the deep web is selecting the queries so that most of the data can be retrieved at a low cost. We propose a general method for this query selection problem. To minimize the number of duplicates retrieved, we reduce the problem of selecting an optimal set of queries from a sample of the data source to the well-known set-covering problem and adopt a classical algorithm to solve it. To verify that the queries selected from a sample also produce good results for the entire data source, we carried out a set of experiments on large corpora including Wikipedia and Reuters. We show empirically that our sampling-based method is effective: 1) the queries selected from samples can harvest most of the data in the original database; 2) queries with a low overlapping rate in the samples also yield a low overlapping rate in the original database; and 3) neither the sample nor the pool of terms from which the queries are selected needs to be very large.
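The reduction to set covering described in the abstract can be illustrated with a minimal sketch. The abstract does not name the classical algorithm it adopts, so the greedy set-cover heuristic used here is an assumption, as are the function names `greedy_query_selection` and `overlapping_rate`, the toy sample, and the definition of overlapping rate as total documents retrieved divided by distinct documents retrieved.

```python
from typing import Dict, List, Set


def greedy_query_selection(term_to_docs: Dict[str, Set[int]]) -> List[str]:
    """Greedy set cover (assumed heuristic): pick terms until every
    document in the sample is matched by at least one selected term."""
    uncovered: Set[int] = set().union(*term_to_docs.values())
    selected: List[str] = []
    while uncovered:
        # Take the term matching the most still-uncovered documents.
        # Iterating over sorted terms makes tie-breaking alphabetical.
        best = max(sorted(term_to_docs),
                   key=lambda t: len(term_to_docs[t] & uncovered))
        selected.append(best)
        uncovered -= term_to_docs[best]
    return selected


def overlapping_rate(queries: List[str],
                     term_to_docs: Dict[str, Set[int]]) -> float:
    """Assumed definition: documents retrieved (duplicates counted)
    divided by distinct documents retrieved. 1.0 means no duplicates."""
    retrieved = sum(len(term_to_docs[q]) for q in queries)
    unique = len(set().union(*(term_to_docs[q] for q in queries)))
    return retrieved / unique if unique else 0.0


# Hypothetical sample: each term maps to the ids of the sample
# documents that contain it.
sample = {
    "web": {1, 2, 3, 4},
    "crawl": {3, 4, 5},
    "query": {4, 5, 6},
    "data": {1, 6},
}

selected = greedy_query_selection(sample)
rate = overlapping_rate(selected, sample)
print(selected, round(rate, 3))
```

In this toy sample the two selected terms cover all six documents while retrieving only one of them twice. In the setting the abstract describes, the terms selected on the sample would then be issued as queries against the full data source.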

Original language: English
Title of host publication: Proceedings - 2008 IEEE/WIC/ACM International Conference on Web Intelligence, WI 2008
Pages: 718-724
Number of pages: 7
Publication status: Published - 2008
Event: 2008 IEEE/WIC/ACM International Conference on Web Intelligence, WI 2008 - Sydney, NSW, Australia
Duration: 9 Dec 2008 → 12 Dec 2008

Publication series

Name: Proceedings - 2008 IEEE/WIC/ACM International Conference on Web Intelligence, WI 2008

Conference

Conference: 2008 IEEE/WIC/ACM International Conference on Web Intelligence, WI 2008
Country/Territory: Australia
City: Sydney, NSW
Period: 9/12/08 → 12/12/08

Scopus Subject Areas

  • Computer Networks and Communications
  • Computer Science Applications
  • Electrical and Electronic Engineering
