ParaKMeans: Implementation of a parallelized K-means algorithm suitable for general laboratory use

Piotr Kraj; Ashok Sharma; Nikhil Garge; Robert Podolsky; Richard A. McIndoe

doi:10.1186/1471-2105-9-200

Research output: Contribution to journal › Article › peer-review

24 Scopus citations

Abstract

Background: During the last decade, the use of microarrays to assess the transcriptome of many biological systems has generated an enormous amount of data. A common technique used to organize and analyze microarray data is to perform cluster analysis. While many clustering algorithms have been developed, they all suffer a significant decrease in computational performance as the size of the dataset being analyzed becomes very large. For example, clustering 10000 genes from an experiment containing 200 microarrays can be quite time consuming and challenging on a desktop PC. One solution to the scalability problem of clustering algorithms is to distribute or parallelize the algorithm across multiple computers.

Results: The software described in this paper is a high performance multithreaded application that implements a parallelized version of the K-means Clustering algorithm. Most parallel processing applications are not accessible to the general public and require specialized software libraries (e.g. MPI) and specialized hardware configurations. The parallel nature of the application comes from the use of a web service to perform the distance calculations and cluster assignments. Here we show our parallel implementation provides significant performance gains over a wide range of datasets using as little as seven nodes. The software was written in C# and was designed in a modular fashion to provide both deployment flexibility as well as flexibility in the user interface.

Conclusion: ParaKMeans was designed to provide the general scientific community with an easy and manageable client-server application that can be installed on a wide variety of Windows operating systems.
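
The abstract attributes the parallelism to handing the distance calculations and cluster assignments to separate workers. The Python sketch below illustrates that division of labor only; it is not the ParaKMeans code, which the abstract describes as a C# application backed by a web service. A local thread pool stands in for the compute nodes, and the names parallel_kmeans, assign_chunk, split and n_workers are hypothetical, chosen for this example.

```python
from concurrent.futures import ThreadPoolExecutor


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign_chunk(chunk, centroids):
    """Worker task: label each row of `chunk` with the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda k: squared_distance(row, centroids[k]))
        for row in chunk
    ]


def split(rows, n_parts):
    """Split `rows` into at most `n_parts` contiguous chunks, preserving order."""
    size = max(1, -(-len(rows) // n_parts))  # ceiling division
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def parallel_kmeans(rows, centroids, n_workers=4, max_iter=100):
    """Lloyd's K-means with the assignment step fanned out to a worker pool.

    Assumption: threads stand in for the remote nodes described in the abstract.
    """
    centroids = [list(c) for c in centroids]
    chunks = split(rows, n_workers)
    labels = []
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        for _ in range(max_iter):
            # Each worker computes distances and assignments for its own chunk.
            parts = pool.map(assign_chunk, chunks, [centroids] * len(chunks))
            new_labels = [label for part in parts for label in part]
            if new_labels == labels:
                break  # assignments are stable, so the centroids will not move
            labels = new_labels
            # The coordinator recomputes each centroid as the mean of its members.
            for k in range(len(centroids)):
                members = [r for r, l in zip(rows, labels) if l == k]
                if members:  # an empty cluster keeps its previous centroid
                    centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Toy data: four "genes" measured on two "arrays", with two obvious groups.
profiles = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
labels, centers = parallel_kmeans(profiles, [[0.0, 0.0], [10.0, 10.0]], n_workers=2)
print(labels)   # [0, 0, 1, 1]
print(centers)  # [[0.0, 0.5], [10.0, 10.5]]
```

In standard CPython, threads do not speed up this kind of CPU-bound arithmetic, so the sketch shows the structure of the computation rather than a performance gain. Getting a real speedup would mean sending each chunk to a separate process or machine, which is the role the abstract assigns to its web-service nodes.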

Original language	English (US)
Article number	200
Journal	BMC Bioinformatics
Volume	9
DOIs	https://doi.org/10.1186/1471-2105-9-200
State	Published - Apr 16 2008

ASJC Scopus subject areas

Structural Biology
Biochemistry
Molecular Biology
Computer Science Applications
Applied Mathematics

Cite this

@article{f37dc4570c4b48c8891049e702d9b023,
  title = "ParaKMeans: Implementation of a parallelized K-means algorithm suitable for general laboratory use",
  abstract = "Background: During the last decade, the use of microarrays to assess the transcriptome of many biological systems has generated an enormous amount of data. A common technique used to organize and analyze microarray data is to perform cluster analysis. While many clustering algorithms have been developed, they all suffer a significant decrease in computational performance as the size of the dataset being analyzed becomes very large. For example, clustering 10000 genes from an experiment containing 200 microarrays can be quite time consuming and challenging on a desktop PC. One solution to the scalability problem of clustering algorithms is to distribute or parallelize the algorithm across multiple computers. Results: The software described in this paper is a high performance multithreaded application that implements a parallelized version of the K-means Clustering algorithm. Most parallel processing applications are not accessible to the general public and require specialized software libraries (e.g. MPI) and specialized hardware configurations. The parallel nature of the application comes from the use of a web service to perform the distance calculations and cluster assignments. Here we show our parallel implementation provides significant performance gains over a wide range of datasets using as little as seven nodes. The software was written in C# and was designed in a modular fashion to provide both deployment flexibility as well as flexibility in the user interface. Conclusion: ParaKMeans was designed to provide the general scientific community with an easy and manageable client-server application that can be installed on a wide variety of Windows operating systems.",
  author = "Piotr Kraj and Ashok Sharma and Nikhil Garge and Robert Podolsky and McIndoe, {Richard A.}",
  note = "Funding Information: This work is supported by grant U01DK60966-01 from the National Institute of Diabetes Digestive and Kidney Diseases to RAM. We also thank Dr. Jin-Xiong She for the use of his microarray data to test the program.",
  year = "2008",
  month = apr,
  day = "16",
  doi = "10.1186/1471-2105-9-200",
  language = "English (US)",
  volume = "9",
  journal = "BMC Bioinformatics",
  issn = "1471-2105",
  publisher = "BioMed Central",
}

TY - JOUR
T1 - ParaKMeans
T2 - Implementation of a parallelized K-means algorithm suitable for general laboratory use
AU - Kraj, Piotr
AU - Sharma, Ashok
AU - Garge, Nikhil
AU - Podolsky, Robert
AU - McIndoe, Richard A.
N1 - Funding Information: This work is supported by grant U01DK60966-01 from the National Institute of Diabetes Digestive and Kidney Diseases to RAM. We also thank Dr. Jin-Xiong She for the use of his microarray data to test the program.
PY - 2008/4/16
Y1 - 2008/4/16
AB - Background: During the last decade, the use of microarrays to assess the transcriptome of many biological systems has generated an enormous amount of data. A common technique used to organize and analyze microarray data is to perform cluster analysis. While many clustering algorithms have been developed, they all suffer a significant decrease in computational performance as the size of the dataset being analyzed becomes very large. For example, clustering 10000 genes from an experiment containing 200 microarrays can be quite time consuming and challenging on a desktop PC. One solution to the scalability problem of clustering algorithms is to distribute or parallelize the algorithm across multiple computers. Results: The software described in this paper is a high performance multithreaded application that implements a parallelized version of the K-means Clustering algorithm. Most parallel processing applications are not accessible to the general public and require specialized software libraries (e.g. MPI) and specialized hardware configurations. The parallel nature of the application comes from the use of a web service to perform the distance calculations and cluster assignments. Here we show our parallel implementation provides significant performance gains over a wide range of datasets using as little as seven nodes. The software was written in C# and was designed in a modular fashion to provide both deployment flexibility as well as flexibility in the user interface. Conclusion: ParaKMeans was designed to provide the general scientific community with an easy and manageable client-server application that can be installed on a wide variety of Windows operating systems.
UR - http://www.scopus.com/inward/record.url?scp=42949144951&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=42949144951&partnerID=8YFLogxK
U2 - 10.1186/1471-2105-9-200
DO - 10.1186/1471-2105-9-200
M3 - Article
C2 - 18416829
AN - SCOPUS:42949144951
SN - 1471-2105
VL - 9
JO - BMC Bioinformatics
JF - BMC Bioinformatics
M1 - 200
ER -
