Stratification based hierarchical clustering over a deep web data source

Tantan Liu, Gagan Agrawal

Research output: Chapter in Book/Report/Conference proceedingConference contribution

2 Scopus citations

Abstract

This paper focuses on the problem of clustering data from a hidden or a deep web data source. A key characteristics of deep web data sources is that data can only be accessed through the limited query interface they support. Because the underlying data set cannot be accessed directly, data mining must be performed based on sampling of the datasets. The samples, in turn, can only be obtained by querying the deep web databases with specific inputs. Unlike existing sampling based methods, sampling costs, and not the computation or memory costs, are the dominant consideration in designing the technique for sampling. We have developed a new methodology for addressing the clustering problem on the deep web. Our work includes three new ideas, which are a method for stratifying a deep web data source, an algorithm for hierarchical clustering based on stratified sampling, and a two phase technique for sampling, which includes a representative sampling in the first phase, and sampling focusing on the boundary points between the clusters in the second phase. We have evaluated our approach using two synthetic and one real data set. Our experiments show that each of the three ideas we have introduced leads to significant improvements in accuracy and efficiency of clustering a hidden data source. Specifically, we improve the accuracy of the clusters obtained (measured by average distance to centers) by up to 20% over the existing approach. Compared in another way, our method can achieve the same accuracy with up to 25% fewer samples, thus reducing the sampling cost.

Original languageEnglish (US)
Title of host publicationProceedings of the 12th SIAM International Conference on Data Mining, SDM 2012
PublisherSociety for Industrial and Applied Mathematics Publications
Pages70-81
Number of pages12
ISBN (Print)9781611972320
DOIs
StatePublished - 2012
Externally publishedYes
Event12th SIAM International Conference on Data Mining, SDM 2012 - Anaheim, CA, United States
Duration: Apr 26 2012Apr 28 2012

Publication series

NameProceedings of the 12th SIAM International Conference on Data Mining, SDM 2012

Conference

Conference12th SIAM International Conference on Data Mining, SDM 2012
Country/TerritoryUnited States
CityAnaheim, CA
Period4/26/124/28/12

ASJC Scopus subject areas

  • Computer Science Applications

Fingerprint

Dive into the research topics of 'Stratification based hierarchical clustering over a deep web data source'. Together they form a unique fingerprint.

Cite this