Stratification based hierarchical clustering over a deep web data source

Tantan Liu; Gagan Agrawal

doi:10.1137/1.9781611972825.7

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

2 Scopus citations

Abstract

This paper focuses on the problem of clustering data from a hidden or deep web data source. A key characteristic of deep web data sources is that data can only be accessed through the limited query interface they support. Because the underlying data set cannot be accessed directly, data mining must be performed based on sampling of the data sets. The samples, in turn, can only be obtained by querying the deep web databases with specific inputs. Unlike in existing sampling-based methods, sampling costs, and not computation or memory costs, are the dominant consideration in designing the sampling technique. We have developed a new methodology for addressing the clustering problem on the deep web. Our work includes three new ideas: a method for stratifying a deep web data source; an algorithm for hierarchical clustering based on stratified sampling; and a two-phase sampling technique, which uses representative sampling in the first phase and sampling focused on the boundary points between clusters in the second phase. We have evaluated our approach using two synthetic data sets and one real data set. Our experiments show that each of the three ideas leads to significant improvements in the accuracy and efficiency of clustering a hidden data source. Specifically, we improve the accuracy of the clusters obtained (measured by average distance to centers) by up to 20% over the existing approach. Put another way, our method can achieve the same accuracy with up to 25% fewer samples, thus reducing the sampling cost.
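The abstract names stratified, two-phase sampling and the "average distance to centers" metric but gives no algorithmic detail. The following is a minimal Python sketch of the general idea under stated assumptions, not the authors' algorithm: the hidden source is a small synthetic 2-D data set, `query` is a hypothetical stand-in for the query interface, phase 1 uses equal allocation across strata, phase 2 spends the remaining budget on the single stratum whose samples look most like boundary points, and the clustering is a naive centroid-linkage agglomerative procedure. All function names and parameters are illustrative.

```python
import math
import random


def make_hidden_source(seed=0):
    """Hypothetical hidden source: 2-D points keyed by a query input value (the stratum)."""
    rng = random.Random(seed)
    means = {"a": (0.0, 0.0), "b": (5.0, 0.0), "c": (2.5, 4.0)}
    return {s: [(rng.gauss(mx, 1.0), rng.gauss(my, 1.0)) for _ in range(200)]
            for s, (mx, my) in means.items()}


def query(source, stratum, n, rng):
    """Stand-in for the query interface: draw n records from one stratum."""
    return rng.sample(source[stratum], n)


def centroid(points):
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def agglomerate(points, k):
    """Naive centroid-linkage agglomerative clustering; returns k cluster centers."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        cents = [centroid(c) for c in clusters]
        pairs = ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters)))
        i, j = min(pairs, key=lambda ab: math.dist(cents[ab[0]], cents[ab[1]]))
        merged = clusters.pop(j)  # j > i, so index i is still valid after the pop
        clusters[i].extend(merged)
    return [centroid(c) for c in clusters]


def avg_distance_to_centers(points, centers):
    """Accuracy metric named in the abstract: mean distance to the nearest center."""
    return sum(min(math.dist(p, c) for c in centers) for p in points) / len(points)


def boundary_score(p, centers):
    """Close to 1 when p is nearly equidistant from its two nearest centers."""
    d = sorted(math.dist(p, c) for c in centers)
    return d[0] / d[1] if d[1] > 0 else 0.0


def two_phase_sample(source, k, budget, phase1_share=0.5, seed=1):
    rng = random.Random(seed)
    strata = sorted(source)
    # Phase 1 (assumption: equal allocation, since stratum sizes are treated as unknown).
    per_stratum = int(budget * phase1_share) // len(strata)
    samples = {s: query(source, s, per_stratum, rng) for s in strata}
    used = per_stratum * len(strata)
    centers = agglomerate([p for pts in samples.values() for p in pts], k)
    # Phase 2 (assumption: all remaining budget goes to the most boundary-heavy stratum).
    score = {s: sum(boundary_score(p, centers) for p in samples[s]) / len(samples[s])
             for s in strata}
    target = max(strata, key=score.get)
    samples[target] = samples[target] + query(source, target, budget - used, rng)
    all_points = [p for pts in samples.values() for p in pts]
    return all_points, agglomerate(all_points, k)


if __name__ == "__main__":
    points, centers = two_phase_sample(make_hidden_source(), k=3, budget=60)
    print(len(points), "samples; average distance to centers:",
          round(avg_distance_to_centers(points, centers), 3))
```

The sampling budget, rather than the clustering cost, is the quantity the sketch tracks, which mirrors the cost model the abstract describes. Other allocation rules for either phase would fit the same structure.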

Original language	English (US)
Title of host publication	Proceedings of the 12th SIAM International Conference on Data Mining, SDM 2012
Publisher	Society for Industrial and Applied Mathematics Publications
Pages	70-81
Number of pages	12
ISBN (Print)	9781611972320
DOIs	https://doi.org/10.1137/1.9781611972825.7
State	Published - 2012
Externally published	Yes
Event	12th SIAM International Conference on Data Mining, SDM 2012 - Anaheim, CA, United States
Duration	Apr 26 2012 → Apr 28 2012

Publication series

Name	Proceedings of the 12th SIAM International Conference on Data Mining, SDM 2012

Conference

Conference	12th SIAM International Conference on Data Mining, SDM 2012
Country/Territory	United States
City	Anaheim, CA
Period	4/26/12 → 4/28/12

ASJC Scopus subject areas

Computer Science Applications

Access to Document

10.1137/1.9781611972825.7

Cite this

Liu, T., & Agrawal, G. (2012). Stratification based hierarchical clustering over a deep web data source. In Proceedings of the 12th SIAM International Conference on Data Mining, SDM 2012 (pp. 70-81). (Proceedings of the 12th SIAM International Conference on Data Mining, SDM 2012). Society for Industrial and Applied Mathematics Publications. https://doi.org/10.1137/1.9781611972825.7

Stratification based hierarchical clustering over a deep web data source. / Liu, Tantan; Agrawal, Gagan.
Proceedings of the 12th SIAM International Conference on Data Mining, SDM 2012. Society for Industrial and Applied Mathematics Publications, 2012. p. 70-81 (Proceedings of the 12th SIAM International Conference on Data Mining, SDM 2012).

Liu, T & Agrawal, G 2012, Stratification based hierarchical clustering over a deep web data source. in Proceedings of the 12th SIAM International Conference on Data Mining, SDM 2012. Proceedings of the 12th SIAM International Conference on Data Mining, SDM 2012, Society for Industrial and Applied Mathematics Publications, pp. 70-81, 12th SIAM International Conference on Data Mining, SDM 2012, Anaheim, CA, United States, 4/26/12. https://doi.org/10.1137/1.9781611972825.7

@inproceedings{9037a2b9330f43f9bb12a833d957afd9,

title = "Stratification based hierarchical clustering over a deep web data source",

abstract = "This paper focuses on the problem of clustering data from a hidden or deep web data source. A key characteristic of deep web data sources is that data can only be accessed through the limited query interface they support. Because the underlying data set cannot be accessed directly, data mining must be performed based on sampling of the data sets. The samples, in turn, can only be obtained by querying the deep web databases with specific inputs. Unlike in existing sampling-based methods, sampling costs, and not computation or memory costs, are the dominant consideration in designing the sampling technique. We have developed a new methodology for addressing the clustering problem on the deep web. Our work includes three new ideas: a method for stratifying a deep web data source; an algorithm for hierarchical clustering based on stratified sampling; and a two-phase sampling technique, which uses representative sampling in the first phase and sampling focused on the boundary points between clusters in the second phase. We have evaluated our approach using two synthetic data sets and one real data set. Our experiments show that each of the three ideas leads to significant improvements in the accuracy and efficiency of clustering a hidden data source. Specifically, we improve the accuracy of the clusters obtained (measured by average distance to centers) by up to 20% over the existing approach. Put another way, our method can achieve the same accuracy with up to 25% fewer samples, thus reducing the sampling cost.",

author = "Tantan Liu and Gagan Agrawal",

year = "2012",

doi = "10.1137/1.9781611972825.7",

language = "English (US)",

isbn = "9781611972320",

series = "Proceedings of the 12th SIAM International Conference on Data Mining, SDM 2012",

publisher = "Society for Industrial and Applied Mathematics Publications",

pages = "70--81",

booktitle = "Proceedings of the 12th SIAM International Conference on Data Mining, SDM 2012",

address = "United States",

note = "12th SIAM International Conference on Data Mining, SDM 2012 ; Conference date: 26-04-2012 Through 28-04-2012",

}

TY - GEN

T1 - Stratification based hierarchical clustering over a deep web data source

AU - Liu, Tantan

AU - Agrawal, Gagan

PY - 2012

Y1 - 2012

AB - This paper focuses on the problem of clustering data from a hidden or deep web data source. A key characteristic of deep web data sources is that data can only be accessed through the limited query interface they support. Because the underlying data set cannot be accessed directly, data mining must be performed based on sampling of the data sets. The samples, in turn, can only be obtained by querying the deep web databases with specific inputs. Unlike in existing sampling-based methods, sampling costs, and not computation or memory costs, are the dominant consideration in designing the sampling technique. We have developed a new methodology for addressing the clustering problem on the deep web. Our work includes three new ideas: a method for stratifying a deep web data source; an algorithm for hierarchical clustering based on stratified sampling; and a two-phase sampling technique, which uses representative sampling in the first phase and sampling focused on the boundary points between clusters in the second phase. We have evaluated our approach using two synthetic data sets and one real data set. Our experiments show that each of the three ideas leads to significant improvements in the accuracy and efficiency of clustering a hidden data source. Specifically, we improve the accuracy of the clusters obtained (measured by average distance to centers) by up to 20% over the existing approach. Put another way, our method can achieve the same accuracy with up to 25% fewer samples, thus reducing the sampling cost.

UR - http://www.scopus.com/inward/record.url?scp=84880221716&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84880221716&partnerID=8YFLogxK

U2 - 10.1137/1.9781611972825.7

DO - 10.1137/1.9781611972825.7

M3 - Conference contribution

AN - SCOPUS:84880221716

SN - 9781611972320

T3 - Proceedings of the 12th SIAM International Conference on Data Mining, SDM 2012

SP - 70

EP - 81

BT - Proceedings of the 12th SIAM International Conference on Data Mining, SDM 2012

PB - Society for Industrial and Applied Mathematics Publications

T2 - 12th SIAM International Conference on Data Mining, SDM 2012

Y2 - 26 April 2012 through 28 April 2012

ER -
