TY - GEN
T1 - Stratification based hierarchical clustering over a deep web data source
AU - Liu, Tantan
AU - Agrawal, Gagan
PY - 2012
Y1 - 2012
N2 - This paper focuses on the problem of clustering data from a hidden or deep web data source. A key characteristic of deep web data sources is that data can only be accessed through the limited query interface they support. Because the underlying data set cannot be accessed directly, data mining must be performed on samples of the data set. The samples, in turn, can only be obtained by querying the deep web database with specific inputs. Unlike in existing sampling-based methods, sampling costs, rather than computation or memory costs, are the dominant consideration in designing the sampling technique. We have developed a new methodology for addressing the clustering problem on the deep web. Our work includes three new ideas: a method for stratifying a deep web data source, an algorithm for hierarchical clustering based on stratified sampling, and a two-phase sampling technique, which performs representative sampling in the first phase and focuses sampling on the boundary points between clusters in the second phase. We have evaluated our approach using two synthetic data sets and one real data set. Our experiments show that each of the three ideas we have introduced leads to significant improvements in the accuracy and efficiency of clustering a hidden data source. Specifically, we improve the accuracy of the clusters obtained (measured by average distance to centers) by up to 20% over the existing approach. Viewed another way, our method can achieve the same accuracy with up to 25% fewer samples, thus reducing the sampling cost.
AB - This paper focuses on the problem of clustering data from a hidden or deep web data source. A key characteristic of deep web data sources is that data can only be accessed through the limited query interface they support. Because the underlying data set cannot be accessed directly, data mining must be performed on samples of the data set. The samples, in turn, can only be obtained by querying the deep web database with specific inputs. Unlike in existing sampling-based methods, sampling costs, rather than computation or memory costs, are the dominant consideration in designing the sampling technique. We have developed a new methodology for addressing the clustering problem on the deep web. Our work includes three new ideas: a method for stratifying a deep web data source, an algorithm for hierarchical clustering based on stratified sampling, and a two-phase sampling technique, which performs representative sampling in the first phase and focuses sampling on the boundary points between clusters in the second phase. We have evaluated our approach using two synthetic data sets and one real data set. Our experiments show that each of the three ideas we have introduced leads to significant improvements in the accuracy and efficiency of clustering a hidden data source. Specifically, we improve the accuracy of the clusters obtained (measured by average distance to centers) by up to 20% over the existing approach. Viewed another way, our method can achieve the same accuracy with up to 25% fewer samples, thus reducing the sampling cost.
UR - http://www.scopus.com/inward/record.url?scp=84880221716&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84880221716&partnerID=8YFLogxK
U2 - 10.1137/1.9781611972825.7
DO - 10.1137/1.9781611972825.7
M3 - Conference contribution
AN - SCOPUS:84880221716
SN - 9781611972320
T3 - Proceedings of the 12th SIAM International Conference on Data Mining, SDM 2012
SP - 70
EP - 81
BT - Proceedings of the 12th SIAM International Conference on Data Mining, SDM 2012
PB - Society for Industrial and Applied Mathematics Publications
T2 - 12th SIAM International Conference on Data Mining, SDM 2012
Y2 - 26 April 2012 through 28 April 2012
ER -