TY - GEN
T1 - Query planning for searching inter-dependent deep-web databases
AU - Wang, Fan
AU - Agrawal, Gagan
AU - Jin, Ruoming
PY - 2008
Y1 - 2008
N2 - Increasingly, many data sources appear as online databases, hidden behind query forms, thus forming what is referred to as the deep web. It is desirable to have systems that can provide a high-level and simple interface for users to query such data sources, and can automate data retrieval from the deep web. However, such systems need to address the following challenges. First, in most cases, no single database can provide all desired data, and therefore, multiple different databases need to be queried for a given user query. Second, due to the dependencies present between the deep-web databases, certain databases must be queried before others. Third, some database may not be available at certain times because of network or hardware problems, and therefore, the query planning should be capable of dealing with unavailable databases and generating alternative plans when the optimal one is not feasible. This paper considers query planning in the context of a deep-web integration system. We have developed a dynamic query planner to generate an efficient query order based on the database dependencies. Our query planner is able to select the top K query plans. We also develop cost models suitable for query planning for deep web mining. Our implementation and evaluation has been made in the context of a bioinformatics system, SNPMiner. We have compared our algorithm with a naive algorithm and the optimal algorithm. We show that for the 30 queries we used, our algorithm outperformed the naive algorithm and obtained very similar results as the optimal algorithm. Our experiments also show the scalability of our system with respect to the number of data sources involved and the number of query terms.
AB - Increasingly, many data sources appear as online databases, hidden behind query forms, thus forming what is referred to as the deep web. It is desirable to have systems that can provide a high-level and simple interface for users to query such data sources, and can automate data retrieval from the deep web. However, such systems need to address the following challenges. First, in most cases, no single database can provide all desired data, and therefore, multiple different databases need to be queried for a given user query. Second, due to the dependencies present between the deep-web databases, certain databases must be queried before others. Third, some database may not be available at certain times because of network or hardware problems, and therefore, the query planning should be capable of dealing with unavailable databases and generating alternative plans when the optimal one is not feasible. This paper considers query planning in the context of a deep-web integration system. We have developed a dynamic query planner to generate an efficient query order based on the database dependencies. Our query planner is able to select the top K query plans. We also develop cost models suitable for query planning for deep web mining. Our implementation and evaluation has been made in the context of a bioinformatics system, SNPMiner. We have compared our algorithm with a naive algorithm and the optimal algorithm. We show that for the 30 queries we used, our algorithm outperformed the naive algorithm and obtained very similar results as the optimal algorithm. Our experiments also show the scalability of our system with respect to the number of data sources involved and the number of query terms.
UR - http://www.scopus.com/inward/record.url?scp=49049100280&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=49049100280&partnerID=8YFLogxK
U2 - 10.1007/978-3-540-69497-7_5
DO - 10.1007/978-3-540-69497-7_5
M3 - Conference contribution
AN - SCOPUS:49049100280
SN - 3540694765
SN - 9783540694762
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 24
EP - 41
BT - Scientific and Statistical Database Management - 20th International Conference, SSDBM 2008, Proceedings
T2 - 20th International Conference on Scientific and Statistical Database Management, SSDBM 2008
Y2 - 9 July 2008 through 11 July 2008
ER -