GrammR: Graphical representation and modeling of count data with application in metagenomics

Deepak Nag Ayyala; Shili Lin

doi:10.1093/bioinformatics/btv032

GrammR: Graphical representation and modeling of count data with application in metagenomics

Deepak Nag Ayyala, Shili Lin

Research output: Contribution to journal › Article › peer-review

10 Scopus citations

Abstract

Motivation: Microbiota compositions have great implications in human health, such as obesity and other conditions. As such, it is of great importance to cluster samples or taxa to visualize and discover community substructures. Graphical representation of metagenomic count data relies on two aspects, measure of dissimilarity between samples/taxa and algorithm used to estimate coordinates to study microbiota communities. UniFrac is a dissimilarity measure commonly used in metagenomic research, but it requires a phylogenetic tree. Principal coordinate analysis (PCoA) is a popular algorithm for estimating two-dimensional (2D) coordinates for graphical representation, although alternative and higher-dimensional representations may reveal underlying community substructures invisible in 2D representations. Results: We adapt a new measure of dissimilarity, penalized Kendall's τ-distance, which does not depend on a phylogenetic tree, and hence more readily applicable to a wider class of problems. Further, we propose to use metric multidimensional scaling (MDS) as an alternative to PCoA for graphical representation. We then devise a novel procedure for determining the number of clusters in conjunction with PAM (mPAM). We show superior performances with higher-dimensional representations. We further demonstrate the utility of mPAM for accurate clustering analysis, especially with higher-dimensional MDS models. Applications to two human microbiota datasets illustrate greater insights into the subcommunity structure with a higher-dimensional analysis.

Original language	English (US)
Pages (from-to)	1648-1654
Number of pages	7
Journal	Bioinformatics
Volume	31
Issue number	10
DOIs	https://doi.org/10.1093/bioinformatics/btv032
State	Published - May 15 2015
Externally published	Yes

ASJC Scopus subject areas

Statistics and Probability
Biochemistry
Molecular Biology
Computer Science Applications
Computational Theory and Mathematics
Computational Mathematics

UN SDGs

This output contributes to the following UN Sustainable Development Goals (SDGs)

Access to Document

10.1093/bioinformatics/btv032

Cite this

@article{4086fc5463194a819bd1541e67b04262,

title = "GrammR: Graphical representation and modeling of count data with application in metagenomics",

abstract = "Motivation: Microbiota compositions have great implications in human health, such as obesity and other conditions. As such, it is of great importance to cluster samples or taxa to visualize and discover community substructures. Graphical representation of metagenomic count data relies on two aspects, measure of dissimilarity between samples/taxa and algorithm used to estimate coordinates to study microbiota communities. UniFrac is a dissimilarity measure commonly used in metagenomic research, but it requires a phylogenetic tree. Principal coordinate analysis (PCoA) is a popular algorithm for estimating two-dimensional (2D) coordinates for graphical representation, although alternative and higher-dimensional representations may reveal underlying community substructures invisible in 2D representations. Results: We adapt a new measure of dissimilarity, penalized Kendall's τ-distance, which does not depend on a phylogenetic tree, and hence more readily applicable to a wider class of problems. Further, we propose to use metric multidimensional scaling (MDS) as an alternative to PCoA for graphical representation. We then devise a novel procedure for determining the number of clusters in conjunction with PAM (mPAM). We show superior performances with higher-dimensional representations. We further demonstrate the utility of mPAM for accurate clustering analysis, especially with higher-dimensional MDS models. Applications to two human microbiota datasets illustrate greater insights into the subcommunity structure with a higher-dimensional analysis.",

author = "Ayyala, {Deepak Nag} and Shili Lin",

note = "Publisher Copyright: {\textcopyright} The Author 2015. Published by Oxford University Press.",

year = "2015",

month = may,

day = "15",

doi = "10.1093/bioinformatics/btv032",

language = "English (US)",

volume = "31",

pages = "1648--1654",

journal = "Bioinformatics",

issn = "1367-4803",

publisher = "Oxford University Press",

number = "10",

}

TY - JOUR

T1 - GrammR

T2 - Graphical representation and modeling of count data with application in metagenomics

AU - Ayyala, Deepak Nag

AU - Lin, Shili

N1 - Publisher Copyright: © The Author 2015. Published by Oxford University Press.

PY - 2015/5/15

Y1 - 2015/5/15

N2 - Motivation: Microbiota compositions have great implications in human health, such as obesity and other conditions. As such, it is of great importance to cluster samples or taxa to visualize and discover community substructures. Graphical representation of metagenomic count data relies on two aspects, measure of dissimilarity between samples/taxa and algorithm used to estimate coordinates to study microbiota communities. UniFrac is a dissimilarity measure commonly used in metagenomic research, but it requires a phylogenetic tree. Principal coordinate analysis (PCoA) is a popular algorithm for estimating two-dimensional (2D) coordinates for graphical representation, although alternative and higher-dimensional representations may reveal underlying community substructures invisible in 2D representations. Results: We adapt a new measure of dissimilarity, penalized Kendall's τ-distance, which does not depend on a phylogenetic tree, and hence more readily applicable to a wider class of problems. Further, we propose to use metric multidimensional scaling (MDS) as an alternative to PCoA for graphical representation. We then devise a novel procedure for determining the number of clusters in conjunction with PAM (mPAM). We show superior performances with higher-dimensional representations. We further demonstrate the utility of mPAM for accurate clustering analysis, especially with higher-dimensional MDS models. Applications to two human microbiota datasets illustrate greater insights into the subcommunity structure with a higher-dimensional analysis.

AB - Motivation: Microbiota compositions have great implications in human health, such as obesity and other conditions. As such, it is of great importance to cluster samples or taxa to visualize and discover community substructures. Graphical representation of metagenomic count data relies on two aspects, measure of dissimilarity between samples/taxa and algorithm used to estimate coordinates to study microbiota communities. UniFrac is a dissimilarity measure commonly used in metagenomic research, but it requires a phylogenetic tree. Principal coordinate analysis (PCoA) is a popular algorithm for estimating two-dimensional (2D) coordinates for graphical representation, although alternative and higher-dimensional representations may reveal underlying community substructures invisible in 2D representations. Results: We adapt a new measure of dissimilarity, penalized Kendall's τ-distance, which does not depend on a phylogenetic tree, and hence more readily applicable to a wider class of problems. Further, we propose to use metric multidimensional scaling (MDS) as an alternative to PCoA for graphical representation. We then devise a novel procedure for determining the number of clusters in conjunction with PAM (mPAM). We show superior performances with higher-dimensional representations. We further demonstrate the utility of mPAM for accurate clustering analysis, especially with higher-dimensional MDS models. Applications to two human microbiota datasets illustrate greater insights into the subcommunity structure with a higher-dimensional analysis.

UR - http://www.scopus.com/inward/record.url?scp=84929612384&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84929612384&partnerID=8YFLogxK

U2 - 10.1093/bioinformatics/btv032

DO - 10.1093/bioinformatics/btv032

M3 - Article

C2 - 25609792

AN - SCOPUS:84929612384

SN - 1367-4803

VL - 31

SP - 1648

EP - 1654

JO - Bioinformatics

JF - Bioinformatics

IS - 10

ER -

GrammR: Graphical representation and modeling of count data with application in metagenomics

Abstract

ASJC Scopus subject areas

UN SDGs

Access to Document

Other files and links

Fingerprint

Cite this