GrammR: Graphical representation and modeling of count data with application in metagenomics

Research output: Contribution to journal, Article

4 Citations (Scopus)

Abstract

Motivation: Microbiota compositions have great implications in human health, such as obesity and other conditions. As such, it is of great importance to cluster samples or taxa to visualize and discover community substructures. Graphical representation of metagenomic count data relies on two aspects, measure of dissimilarity between samples/taxa and algorithm used to estimate coordinates to study microbiota communities. UniFrac is a dissimilarity measure commonly used in metagenomic research, but it requires a phylogenetic tree. Principal coordinate analysis (PCoA) is a popular algorithm for estimating two-dimensional (2D) coordinates for graphical representation, although alternative and higher-dimensional representations may reveal underlying community substructures invisible in 2D representations.

Results: We adapt a new measure of dissimilarity, penalized Kendall's τ-distance, which does not depend on a phylogenetic tree, and hence more readily applicable to a wider class of problems. Further, we propose to use metric multidimensional scaling (MDS) as an alternative to PCoA for graphical representation. We then devise a novel procedure for determining the number of clusters in conjunction with PAM (mPAM). We show superior performances with higher-dimensional representations. We further demonstrate the utility of mPAM for accurate clustering analysis, especially with higher-dimensional MDS models. Applications to two human microbiota datasets illustrate greater insights into the subcommunity structure with a higher-dimensional analysis.
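To make the tree-free dissimilarity idea in the abstract concrete, here is a minimal Python sketch of the plain (unpenalized) Kendall τ-distance between two samples' taxon counts. It is not the penalized measure proposed in the paper, whose penalty term is not described in this record. The function names `kendall_tau_distance` and `distance_matrix`, the toy counts, and the choice to treat tied pairs as concordant are all assumptions made for illustration.

```python
from itertools import combinations


def kendall_tau_distance(x, y):
    """Fraction of taxon pairs that two samples rank in opposite order.

    Plain Kendall tau distance, NOT the paper's penalized variant.
    Assumption: a pair tied in either sample counts as concordant.
    """
    if len(x) != len(y):
        raise ValueError("both samples must cover the same taxa")
    n = len(x)
    if n < 2:
        return 0.0
    # A pair (i, j) is discordant when the two samples order it oppositely.
    discordant = sum(
        1
        for i, j in combinations(range(n), 2)
        if (x[i] - x[j]) * (y[i] - y[j]) < 0
    )
    return discordant / (n * (n - 1) / 2)


def distance_matrix(samples):
    """Symmetric matrix of pairwise distances between count vectors."""
    k = len(samples)
    matrix = [[0.0] * k for _ in range(k)]
    for a, b in combinations(range(k), 2):
        d = kendall_tau_distance(samples[a], samples[b])
        matrix[a][b] = d
        matrix[b][a] = d
    return matrix


if __name__ == "__main__":
    # Hypothetical counts: three samples observed over four taxa.
    counts = [
        [10, 5, 0, 3],
        [8, 6, 1, 2],
        [0, 2, 9, 7],
    ]
    for row in distance_matrix(counts):
        print(row)
```

A matrix like this needs only the counts themselves, with no phylogenetic tree, and could then be passed to an MDS or PAM routine of the reader's choice.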

Original language: English (US)
Pages (from-to): 1648-1654
Number of pages: 7
Journal: Bioinformatics
Volume: 31
Issue number: 10
DOI: 10.1093/bioinformatics/btv032
State: Published - Jan 1 2015
Externally published: Yes

Fingerprint

Graphical Modeling
Metagenomics
Count Data
Microbiota
Graphical Representation
High-dimensional
Partitioning Around Medoids (PAM)
Phylogenetic Tree
Dissimilarity
Substructure
Health
Scaling
Dissimilarity Measure
Obesity
Clustering Analysis
Cluster Analysis
Alternatives
Dimensional Analysis
Number of Clusters
Chemical analysis

ASJC Scopus subject areas

  • Statistics and Probability
  • Biochemistry
  • Molecular Biology
  • Computer Science Applications
  • Computational Theory and Mathematics
  • Computational Mathematics

Cite this

GrammR: Graphical representation and modeling of count data with application in metagenomics. / Ayyala, Deepak Nag; Lin, Shili.

In: Bioinformatics, Vol. 31, No. 10, 01.01.2015, p. 1648-1654.

@article{4086fc5463194a819bd1541e67b04262,
title = "GrammR: Graphical representation and modeling of count data with application in metagenomics",
abstract = "Motivation: Microbiota compositions have great implications in human health, such as obesity and other conditions. As such, it is of great importance to cluster samples or taxa to visualize and discover community substructures. Graphical representation of metagenomic count data relies on two aspects, measure of dissimilarity between samples/taxa and algorithm used to estimate coordinates to study microbiota communities. UniFrac is a dissimilarity measure commonly used in metagenomic research, but it requires a phylogenetic tree. Principal coordinate analysis (PCoA) is a popular algorithm for estimating two-dimensional (2D) coordinates for graphical representation, although alternative and higher-dimensional representations may reveal underlying community substructures invisible in 2D representations. Results: We adapt a new measure of dissimilarity, penalized Kendall's τ-distance, which does not depend on a phylogenetic tree, and hence more readily applicable to a wider class of problems. Further, we propose to use metric multidimensional scaling (MDS) as an alternative to PCoA for graphical representation. We then devise a novel procedure for determining the number of clusters in conjunction with PAM (mPAM). We show superior performances with higher-dimensional representations. We further demonstrate the utility of mPAM for accurate clustering analysis, especially with higher-dimensional MDS models. Applications to two human microbiota datasets illustrate greater insights into the subcommunity structure with a higher-dimensional analysis.",
author = "Ayyala, {Deepak Nag} and Lin, Shili",
year = "2015",
month = "1",
day = "1",
doi = "10.1093/bioinformatics/btv032",
language = "English (US)",
volume = "31",
pages = "1648--1654",
journal = "Bioinformatics",
issn = "1367-4803",
publisher = "Oxford University Press",
number = "10",
}

TY - JOUR
T1 - GrammR
T2 - Graphical representation and modeling of count data with application in metagenomics
AU - Ayyala, Deepak Nag
AU - Lin, Shili
PY - 2015/1/1
Y1 - 2015/1/1
AB - Motivation: Microbiota compositions have great implications in human health, such as obesity and other conditions. As such, it is of great importance to cluster samples or taxa to visualize and discover community substructures. Graphical representation of metagenomic count data relies on two aspects, measure of dissimilarity between samples/taxa and algorithm used to estimate coordinates to study microbiota communities. UniFrac is a dissimilarity measure commonly used in metagenomic research, but it requires a phylogenetic tree. Principal coordinate analysis (PCoA) is a popular algorithm for estimating two-dimensional (2D) coordinates for graphical representation, although alternative and higher-dimensional representations may reveal underlying community substructures invisible in 2D representations. Results: We adapt a new measure of dissimilarity, penalized Kendall's τ-distance, which does not depend on a phylogenetic tree, and hence more readily applicable to a wider class of problems. Further, we propose to use metric multidimensional scaling (MDS) as an alternative to PCoA for graphical representation. We then devise a novel procedure for determining the number of clusters in conjunction with PAM (mPAM). We show superior performances with higher-dimensional representations. We further demonstrate the utility of mPAM for accurate clustering analysis, especially with higher-dimensional MDS models. Applications to two human microbiota datasets illustrate greater insights into the subcommunity structure with a higher-dimensional analysis.

UR - http://www.scopus.com/inward/record.url?scp=84929612384&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84929612384&partnerID=8YFLogxK
U2 - 10.1093/bioinformatics/btv032
DO - 10.1093/bioinformatics/btv032
M3 - Article
C2 - 25609792
AN - SCOPUS:84929612384
VL - 31
SP - 1648
EP - 1654
JO - Bioinformatics
JF - Bioinformatics
SN - 1367-4803
IS - 10
ER -