A modified hyperplane clustering algorithm allows for efficient and accurate clustering of extremely large datasets

Ashok Kumar Sharma, Robert Podolsky, Jieping Zhao, Richard A McIndoe

Research output: Contribution to journal › Article

14 Citations (Scopus)

Abstract

Motivation: As the number of publicly available microarray experiments increases, the ability to analyze extremely large datasets across multiple experiments becomes critical. There is a requirement to develop algorithms which are fast and can cluster extremely large datasets without affecting the cluster quality. Clustering is an unsupervised exploratory technique applied to microarray data to find similar data structures or expression patterns. Because of the high input/output costs involved and large distance matrices calculated, most of the agglomerative clustering algorithms fail on large datasets (30 000+ genes/200+ arrays). In this article, we propose a new two-stage algorithm which partitions the high-dimensional space associated with microarray data using hyperplanes. The first stage is based on the Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH) algorithm, with the second stage being a conventional k-means clustering technique. This algorithm has been implemented in a software tool (HPCluster) designed to cluster gene expression data. We compared the clustering results using the two-stage hyperplane algorithm with the conventional k-means algorithm from other available programs. Because the first stage traverses the data in a single scan, performance and speed increase substantially. The data reduction accomplished in the first stage of the algorithm reduces the memory requirements, allowing us to cluster 44 460 genes without failure, and significantly decreases the time to complete when compared with popular k-means programs. The software was written in C# (.NET 1.1).
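The two-stage idea in the abstract (a single-scan data reduction followed by conventional k-means) can be illustrated with a minimal Python sketch. This is not the HPCluster implementation, which the abstract says was written in C#. It assumes a simplified first stage that keeps a flat list of cluster summaries (count and linear sum) instead of the tree that BIRCH builds, and it does not model the hyperplane partitioning. The function names `reduce_single_pass`, `weighted_kmeans`, and `two_stage_cluster`, and the `threshold` parameter, are hypothetical.

```python
import random
from math import dist


def reduce_single_pass(points, threshold):
    """Stage 1 (simplified, assumed): scan the data once and fold each point
    into the nearest summary within `threshold`, else start a new summary."""
    summaries = []  # each entry is [count, linear_sum_vector]
    for p in points:
        best, best_d = None, threshold
        for s in summaries:
            centroid = [v / s[0] for v in s[1]]
            d = dist(p, centroid)
            if d <= best_d:
                best, best_d = s, d
        if best is None:
            summaries.append([1, list(p)])
        else:
            best[0] += 1
            best[1] = [a + b for a, b in zip(best[1], p)]
    # Return (centroid, count) pairs; only these are passed to stage 2.
    return [([v / n for v in s], n) for n, s in summaries]


def weighted_kmeans(summaries, k, iters=50, seed=0):
    """Stage 2: Lloyd-style k-means on the summary centroids, with each
    centroid weighted by the number of points it represents."""
    if k > len(summaries):
        raise ValueError("k exceeds the number of stage-1 summaries")
    rng = random.Random(seed)
    centers = [list(c) for c, _ in rng.sample(summaries, k)]
    dim = len(centers[0])
    for _ in range(iters):
        sums = [[0.0] * dim for _ in range(k)]
        weights = [0] * k
        for c, n in summaries:
            j = min(range(k), key=lambda i: dist(c, centers[i]))
            weights[j] += n
            for axis, v in enumerate(c):
                sums[j][axis] += n * v
        # An empty cluster keeps its previous center.
        new_centers = [
            [v / weights[j] for v in sums[j]] if weights[j] else centers[j]
            for j in range(k)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers


def two_stage_cluster(points, k, threshold):
    """Run the reduction, then cluster the reduced representation."""
    summaries = reduce_single_pass(points, threshold)
    return summaries, weighted_kmeans(summaries, k)


if __name__ == "__main__":
    data = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
            (10.0, 10.0), (10.1, 10.0), (10.0, 10.1)]
    summaries, centers = two_stage_cluster(data, k=2, threshold=1.0)
    print(len(summaries), "summaries;", "centers:", sorted(centers))
```

In this sketch the memory saving comes from stage 2 seeing only the summaries rather than every point, so no full distance matrix over the original data is ever built. The choice of `threshold` controls how aggressive the reduction is.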

Original language: English (US)
Pages (from-to): 1152-1157
Number of pages: 6
Journal: Bioinformatics
Volume: 25
Issue number: 9
DOI: 10.1093/bioinformatics/btp123
State: Published - May 7 2009

Fingerprint

Clustering algorithms
Large Data Sets
Hyperplane
Clustering Algorithm
Cluster Analysis
Clustering
Microarrays
Microarray Data
Gene
Genes
Distance Matrix
Software
K-means Algorithm
Data Reduction
K-means Clustering
Requirements
K-means
Gene Expression Data
Software Tools
Microarray

ASJC Scopus subject areas

  • Statistics and Probability
  • Biochemistry
  • Molecular Biology
  • Computer Science Applications
  • Computational Theory and Mathematics
  • Computational Mathematics

Cite this

A modified hyperplane clustering algorithm allows for efficient and accurate clustering of extremely large datasets. / Sharma, Ashok Kumar; Podolsky, Robert; Zhao, Jieping; McIndoe, Richard A.

In: Bioinformatics, Vol. 25, No. 9, 07.05.2009, p. 1152-1157.

@article{f82b6dc14d3946b3b1af8f07e139828c,
title = "A modified hyperplane clustering algorithm allows for efficient and accurate clustering of extremely large datasets",
abstract = "Motivation: As the number of publicly available microarray experiments increases, the ability to analyze extremely large datasets across multiple experiments becomes critical. There is a requirement to develop algorithms which are fast and can cluster extremely large datasets without affecting the cluster quality. Clustering is an unsupervised exploratory technique applied to microarray data to find similar data structures or expression patterns. Because of the high input/output costs involved and large distance matrices calculated, most of the agglomerative clustering algorithms fail on large datasets (30 000+ genes/200+ arrays). In this article, we propose a new two-stage algorithm which partitions the high-dimensional space associated with microarray data using hyperplanes. The first stage is based on the Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH) algorithm, with the second stage being a conventional k-means clustering technique. This algorithm has been implemented in a software tool (HPCluster) designed to cluster gene expression data. We compared the clustering results using the two-stage hyperplane algorithm with the conventional k-means algorithm from other available programs. Because the first stage traverses the data in a single scan, performance and speed increase substantially. The data reduction accomplished in the first stage of the algorithm reduces the memory requirements, allowing us to cluster 44 460 genes without failure, and significantly decreases the time to complete when compared with popular k-means programs. The software was written in C# (.NET 1.1).",
author = "Sharma, {Ashok Kumar} and Robert Podolsky and Jieping Zhao and McIndoe, {Richard A}",
year = "2009",
month = "5",
day = "7",
doi = "10.1093/bioinformatics/btp123",
language = "English (US)",
volume = "25",
pages = "1152--1157",
journal = "Bioinformatics",
issn = "1367-4803",
publisher = "Oxford University Press",
number = "9",

}

TY - JOUR

T1 - A modified hyperplane clustering algorithm allows for efficient and accurate clustering of extremely large datasets

AU - Sharma, Ashok Kumar

AU - Podolsky, Robert

AU - Zhao, Jieping

AU - McIndoe, Richard A

PY - 2009/5/7

Y1 - 2009/5/7

AB - Motivation: As the number of publicly available microarray experiments increases, the ability to analyze extremely large datasets across multiple experiments becomes critical. There is a requirement to develop algorithms which are fast and can cluster extremely large datasets without affecting the cluster quality. Clustering is an unsupervised exploratory technique applied to microarray data to find similar data structures or expression patterns. Because of the high input/output costs involved and large distance matrices calculated, most of the agglomerative clustering algorithms fail on large datasets (30 000+ genes/200+ arrays). In this article, we propose a new two-stage algorithm which partitions the high-dimensional space associated with microarray data using hyperplanes. The first stage is based on the Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH) algorithm, with the second stage being a conventional k-means clustering technique. This algorithm has been implemented in a software tool (HPCluster) designed to cluster gene expression data. We compared the clustering results using the two-stage hyperplane algorithm with the conventional k-means algorithm from other available programs. Because the first stage traverses the data in a single scan, performance and speed increase substantially. The data reduction accomplished in the first stage of the algorithm reduces the memory requirements, allowing us to cluster 44 460 genes without failure, and significantly decreases the time to complete when compared with popular k-means programs. The software was written in C# (.NET 1.1).

UR - http://www.scopus.com/inward/record.url?scp=65449124221&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=65449124221&partnerID=8YFLogxK

U2 - 10.1093/bioinformatics/btp123

DO - 10.1093/bioinformatics/btp123

M3 - Article

C2 - 19261720

AN - SCOPUS:65449124221

VL - 25

SP - 1152

EP - 1157

JO - Bioinformatics

JF - Bioinformatics

SN - 1367-4803

IS - 9

ER -