Selecting queries from sample to crawl deep web data sources

Yan Wang*, Jianguo Lu, Jie Liang, Jessica Chen, Jiming LIU

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

8 Citations (Scopus)

Abstract

This paper studies the problem of selecting queries to efficiently crawl a deep web data source using a set of sample documents. Crawling the deep web is the process of collecting data from search interfaces by issuing queries. One of the major challenges is selecting queries so that most of the data can be retrieved at a low cost. We propose to learn a set of queries from a sample of the data source. To verify that queries selected from a sample also produce good results for the entire data source, we carried out a set of experiments on large corpora including Gov2, newsgroups, Wikipedia and Reuters. We show that our sampling-based method is effective by demonstrating empirically that 1) the queries selected from samples can harvest most of the data in the original database; 2) queries with a low overlapping rate in samples also have a low overlapping rate in the original database; and 3) neither the sample nor the pool of terms from which the queries are selected needs to be very large. Compared with other query selection methods, our method obtains the queries by analyzing a small set of sample documents, instead of learning the next best query incrementally from all the documents matched by previous queries.
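The abstract does not spell out the selection procedure, so the following is only a minimal sketch of the general idea, assuming a greedy set-covering heuristic over the sample. The function names `select_queries` and `overlapping_rate`, the representation of the sample as a mapping from document id to its set of terms, and the definition of overlapping rate as total documents returned divided by unique documents returned are all assumptions made for illustration, not the authors' implementation.

```python
from typing import Dict, List, Set


def select_queries(sample: Dict[str, Set[str]], coverage_target: float = 1.0) -> List[str]:
    """Greedily pick query terms until the target share of sample docs is covered.

    Hypothetical helper: a plain greedy set-cover heuristic, not the paper's algorithm.
    """
    # Invert the sample: term -> ids of the sample documents that contain it.
    postings: Dict[str, Set[str]] = {}
    for doc_id, terms in sample.items():
        for term in terms:
            postings.setdefault(term, set()).add(doc_id)

    needed = coverage_target * len(sample)
    covered: Set[str] = set()
    chosen: List[str] = []
    while len(covered) < needed and postings:
        # Prefer terms whose results are mostly new documents (low overlap),
        # breaking ties by the number of new documents, then alphabetically.
        best = max(
            sorted(postings),
            key=lambda t: (len(postings[t] - covered) / len(postings[t]),
                           len(postings[t] - covered)),
        )
        if not postings[best] - covered:
            break  # no remaining term adds a new document
        chosen.append(best)
        covered |= postings[best]
    return chosen


def overlapping_rate(sample: Dict[str, Set[str]], queries: List[str]) -> float:
    """Total documents returned by all queries divided by unique documents returned."""
    hits = [{d for d, terms in sample.items() if q in terms} for q in queries]
    unique = set().union(*hits)
    return sum(len(h) for h in hits) / len(unique) if unique else 0.0


if __name__ == "__main__":
    # Toy sample, invented for illustration only.
    toy_sample = {
        "d1": {"web", "crawl"},
        "d2": {"web", "query"},
        "d3": {"query", "sample"},
        "d4": {"sample", "crawl"},
    }
    queries = select_queries(toy_sample)
    print(queries, overlapping_rate(toy_sample, queries))
```

In this sketch the queries are chosen once from the sample and would then be issued against the full data source, which mirrors the contrast the abstract draws with methods that pick the next query incrementally from previously retrieved documents.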

Original language: English
Pages (from-to): 75-88
Number of pages: 14
Journal: Web Intelligence and Agent Systems
Volume: 10
Issue number: 1
Publication status: Published - 2012

Scopus Subject Areas

  • Software
  • Computer Networks and Communications
  • Artificial Intelligence

User-Defined Keywords

  • crawling
  • deep web
  • hidden web
  • invisible web
  • query selection
  • sampling
  • set covering
  • web service
