Differential analysis on deep web data sources

Tantan Liu, Fan Wang, Jiedan Zhu, Gagan Agrawal

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

4 Scopus citations

Abstract

The growing use of the Internet in everyday life has been creating new challenges and opportunities to use data mining techniques. A relatively new trend on the Internet is the deep web. As a large number of deep web data sources tend to provide similar data, an important problem is to perform offline analysis to understand the differences in the data available from different sources. This paper introduces data mining methods to extract a high-level summary of the differences in data provided by different deep web data sources. We consider patterns of values with respect to the same entity and formulate a new data mining problem, which we refer to as differential rule mining. We have developed an algorithm for mining such rules. Our method includes a pruning method to summarize the identified differential rules. For efficiency, a hash table is used to accelerate the pruning process. We show the effectiveness, efficiency, and utility of our methods by analyzing data across four travel-related websites.
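The abstract does not spell out the rule format or the pruning criterion, so the following is only an illustrative sketch of the general idea, not the authors' algorithm. It assumes a hypothetical schema in which each record describes one entity as reported by two sources (`price_a`, `price_b`), treats a "differential rule" as a set of attribute conditions under which one source is consistently lower than the other, and prunes a rule when a more general rule with the same direction already exists. The function names, thresholds, and sample data are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def mine_differential_rules(records, attrs, min_support=1, min_conf=1.0):
    """Return {condition: (direction, support, confidence)}.

    Assumed schema: each record holds the attribute values of one entity
    plus 'price_a' and 'price_b', the value reported by two sources.
    """
    # condition -> [matching records, records where A is lower, where B is lower]
    stats = defaultdict(lambda: [0, 0, 0])
    for rec in records:
        items = [(a, rec[a]) for a in attrs]
        for k in range(1, len(items) + 1):
            for cond in combinations(items, k):
                entry = stats[frozenset(cond)]
                entry[0] += 1
                if rec["price_a"] < rec["price_b"]:
                    entry[1] += 1
                elif rec["price_b"] < rec["price_a"]:
                    entry[2] += 1

    rules = {}
    for cond, (n, a_lower, b_lower) in stats.items():
        if n < min_support:
            continue
        if a_lower / n >= min_conf:
            rules[cond] = ("a_lower", n, a_lower / n)
        elif b_lower / n >= min_conf:
            rules[cond] = ("b_lower", n, b_lower / n)
    return rules


def prune_rules(rules):
    """Drop a rule if a more general rule (a proper subset of its
    conditions) has the same direction. Each check is a hash-table
    lookup in the `rules` dict."""
    kept = {}
    for cond, (direction, n, conf) in rules.items():
        redundant = any(
            rules.get(frozenset(sub), (None,))[0] == direction
            for k in range(1, len(cond))
            for sub in combinations(cond, k)
        )
        if not redundant:
            kept[cond] = (direction, n, conf)
    return kept


# Hypothetical sample: the same four flights as listed by sources A and B.
records = [
    {"airline": "X", "stops": 0, "price_a": 100, "price_b": 120},
    {"airline": "X", "stops": 1, "price_a": 90, "price_b": 110},
    {"airline": "Y", "stops": 0, "price_a": 200, "price_b": 150},
    {"airline": "Y", "stops": 1, "price_a": 210, "price_b": 180},
]
rules = mine_differential_rules(records, ["airline", "stops"])
kept = prune_rules(rules)
```

In this sample, the rules conditioned on both airline and stops are subsumed by the airline-only rules, so only the two airline-level rules survive pruning. Keying the dictionary on a `frozenset` of conditions is one simple way to make the subset lookups constant-time on average.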

Original language: English (US)
Title of host publication: Proceedings - 10th IEEE International Conference on Data Mining Workshops, ICDMW 2010
Pages: 33-40
Number of pages: 8
State: Published - 2010
Externally published: Yes
Event: 10th IEEE International Conference on Data Mining Workshops, ICDMW 2010 - Sydney, NSW, Australia
Duration: Dec 14 2010 → Dec 17 2010

Publication series

Name: Proceedings - IEEE International Conference on Data Mining, ICDM
ISSN (Print): 1550-4786

Conference

Conference: 10th IEEE International Conference on Data Mining Workshops, ICDMW 2010
Country/Territory: Australia
City: Sydney, NSW
Period: 12/14/10 → 12/17/10

ASJC Scopus subject areas

  • General Engineering
