TY - CONF
T1 - An approach to deep web crawling by sampling
AU - Lu, Jianguo
AU - Wang, Yan
AU - Liang, Jie
AU - Chen, Jessica
AU - Liu, Jiming
N1 - Copyright 2012 Elsevier B.V., All rights reserved.
PY - 2008
Y1 - 2008
N2 - Crawling the deep web is the process of collecting data from search interfaces by issuing queries. With the wide availability of programmable interfaces exposed as web services, deep web crawling has found a wide variety of applications. One of the major challenges in crawling the deep web is selecting the queries so that most of the data can be retrieved at a low cost. We propose a general method in this regard. To minimize the number of duplicates retrieved, we reduce the problem of selecting an optimal set of queries from a sample of the data source to the well-known set-covering problem and adopt a classical algorithm to solve it. To verify that queries selected from a sample also produce good results for the entire data source, we carried out a set of experiments on large corpora including Wikipedia and Reuters. We show that our sampling-based method is effective by empirically demonstrating that 1) the queries selected from samples can harvest most of the data in the original database; 2) queries with a low overlapping rate in samples also result in a low overlapping rate in the original database; and 3) neither the size of the sample nor the size of the term pool from which the queries are selected needs to be very large.
AB - Crawling the deep web is the process of collecting data from search interfaces by issuing queries. With the wide availability of programmable interfaces exposed as web services, deep web crawling has found a wide variety of applications. One of the major challenges in crawling the deep web is selecting the queries so that most of the data can be retrieved at a low cost. We propose a general method in this regard. To minimize the number of duplicates retrieved, we reduce the problem of selecting an optimal set of queries from a sample of the data source to the well-known set-covering problem and adopt a classical algorithm to solve it. To verify that queries selected from a sample also produce good results for the entire data source, we carried out a set of experiments on large corpora including Wikipedia and Reuters. We show that our sampling-based method is effective by empirically demonstrating that 1) the queries selected from samples can harvest most of the data in the original database; 2) queries with a low overlapping rate in samples also result in a low overlapping rate in the original database; and 3) neither the size of the sample nor the size of the term pool from which the queries are selected needs to be very large.
UR - http://www.scopus.com/inward/record.url?scp=62949239222&partnerID=8YFLogxK
U2 - 10.1109/WIIAT.2008.392
DO - 10.1109/WIIAT.2008.392
M3 - Conference proceeding
AN - SCOPUS:62949239222
SN - 9780769534961
T3 - Proceedings - 2008 IEEE/WIC/ACM International Conference on Web Intelligence, WI 2008
SP - 718
EP - 724
BT - Proceedings - 2008 IEEE/WIC/ACM International Conference on Web Intelligence, WI 2008
T2 - 2008 IEEE/WIC/ACM International Conference on Web Intelligence, WI 2008
Y2 - 9 December 2008 through 12 December 2008
ER -