RAPSearch: A fast protein similarity search tool for short reads

Yuzhen Ye, Jeong Hyeon Choi, Haixu Tang

Research output: Contribution to journal › Article

68 Citations (Scopus)

Abstract

Background: Next Generation Sequencing (NGS) is producing enormous corpuses of short DNA reads, affecting emerging fields such as metagenomics. Protein similarity search, a key step in annotating protein-coding genes in these short reads and identifying their biological functions, faces daunting challenges because of the sheer size of short-read datasets.

Results: We developed RAPSearch, a fast protein similarity search tool that uses a reduced amino acid alphabet and a suffix array to detect seeds of flexible length. For the short reads we tested (translated in 6 frames), RAPSearch achieved a ~20-90-fold speedup over BLASTX. RAPSearch missed only a small fraction (~1.3-3.2%) of BLASTX similarity hits, but it also discovered additional homologous proteins (~0.3-2.1%) that BLASTX missed. By contrast, BLAT, a tool that is even slightly faster than RAPSearch, showed a significant loss of sensitivity compared with RAPSearch and BLAST.

Conclusions: RAPSearch is implemented as open-source software and is accessible at http://omics.informatics.indiana.edu/mg/RAPSearch. It enables faster protein similarity search. The application of RAPSearch in metagenomics has also been demonstrated.
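The seed-detection idea named in the Results can be illustrated with a minimal sketch. It is not RAPSearch's implementation: the 10-class grouping in `GROUPS`, the helper names, and the `min_len` threshold are all hypothetical, and the suffix array is built naively for readability. The sketch maps proteins to a reduced alphabet, indexes the reduced database sequence with a suffix array, and reports the longest prefix of a reduced query that occurs in the database, so the seed length is flexible and not fixed in advance.

```python
# Hypothetical grouping of the 20 amino acids into 10 classes; the abstract
# does not specify the reduced alphabet that RAPSearch actually uses.
GROUPS = ["LVIM", "C", "A", "G", "ST", "P", "FYW", "EDNQ", "KR", "H"]
REDUCE = {aa: group[0] for group in GROUPS for aa in group}


def reduce_seq(protein):
    """Map each residue to the representative letter of its class."""
    return "".join(REDUCE.get(aa, "X") for aa in protein)


def build_suffix_array(text):
    """Naive suffix array: start positions sorted by suffix (fine for a demo)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_prefix(text, sa, pattern):
    """Binary-search the suffix array for a suffix starting with `pattern`."""
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + len(pattern)] < pattern:
            lo = mid + 1
        else:
            hi = mid
    if lo < len(sa) and text.startswith(pattern, sa[lo]):
        return sa[lo]
    return None


def longest_seed(text, sa, query, min_len=3):
    """Return (database_position, length) of the longest query prefix found
    in `text`, or None if no prefix of at least `min_len` letters matches."""
    for length in range(len(query), min_len - 1, -1):
        pos = find_prefix(text, sa, query[:length])
        if pos is not None:
            return pos, length
    return None


db = reduce_seq("MKVLAAGIV")          # reduced database sequence
sa = build_suffix_array(db)
seed = longest_seed(db, sa, reduce_seq("MRILAAG"))
```

In this toy example the query differs from the database sequence at two residues (R for K, I for V), yet the two are identical in the reduced alphabet, so a single long seed is found where an exact-match seed on the full alphabet would be interrupted.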

Original language: English (US)
Article number: 159
Journal: BMC Bioinformatics
Volume: 12
DOIs: 10.1186/1471-2105-12-159
State: Published - May 15 2011
Externally published: Yes

Fingerprint

Similarity Search
Proteins
Molecular Sequence Annotation
Suffix Array
Metagenomics
Informatics
Open Source Software
Hits
Sequencing
Annotation
Seed
Amino Acids
Speedup
DNA
Software
Coding
Genes

Keywords

  • Metagenomics
  • Reduced amino acid alphabet
  • Short reads similarity search
  • Suffix array

ASJC Scopus subject areas

  • Structural Biology
  • Biochemistry
  • Molecular Biology
  • Computer Science Applications
  • Applied Mathematics

Cite this

RAPSearch: A fast protein similarity search tool for short reads. / Ye, Yuzhen; Choi, Jeong Hyeon; Tang, Haixu.

In: BMC Bioinformatics, Vol. 12, 159, 15.05.2011.

@article{d0f98a87a4e24ef9978d06aac1d98f7a,
title = "RAPSearch: A fast protein similarity search tool for short reads",
abstract = "Background: Next Generation Sequencing (NGS) is producing enormous corpuses of short DNA reads, affecting emerging fields such as metagenomics. Protein similarity search, a key step in annotating protein-coding genes in these short reads and identifying their biological functions, faces daunting challenges because of the sheer size of short-read datasets. Results: We developed RAPSearch, a fast protein similarity search tool that uses a reduced amino acid alphabet and a suffix array to detect seeds of flexible length. For the short reads we tested (translated in 6 frames), RAPSearch achieved a ~20-90-fold speedup over BLASTX. RAPSearch missed only a small fraction (~1.3-3.2{\%}) of BLASTX similarity hits, but it also discovered additional homologous proteins (~0.3-2.1{\%}) that BLASTX missed. By contrast, BLAT, a tool that is even slightly faster than RAPSearch, showed a significant loss of sensitivity compared with RAPSearch and BLAST. Conclusions: RAPSearch is implemented as open-source software and is accessible at http://omics.informatics.indiana.edu/mg/RAPSearch. It enables faster protein similarity search. The application of RAPSearch in metagenomics has also been demonstrated.",
keywords = "Metagenomics, Reduced amino acid alphabet, Short reads similarity search, Suffix array",
author = "Yuzhen Ye and Choi, {Jeong Hyeon} and Haixu Tang",
year = "2011",
month = "5",
day = "15",
doi = "10.1186/1471-2105-12-159",
language = "English (US)",
volume = "12",
journal = "BMC Bioinformatics",
issn = "1471-2105",
publisher = "BioMed Central",

}

TY - JOUR

T1 - RAPSearch

T2 - A fast protein similarity search tool for short reads

AU - Ye, Yuzhen

AU - Choi, Jeong Hyeon

AU - Tang, Haixu

PY - 2011/5/15

Y1 - 2011/5/15

N2 - Background: Next Generation Sequencing (NGS) is producing enormous corpuses of short DNA reads, affecting emerging fields such as metagenomics. Protein similarity search, a key step in annotating protein-coding genes in these short reads and identifying their biological functions, faces daunting challenges because of the sheer size of short-read datasets. Results: We developed RAPSearch, a fast protein similarity search tool that uses a reduced amino acid alphabet and a suffix array to detect seeds of flexible length. For the short reads we tested (translated in 6 frames), RAPSearch achieved a ~20-90-fold speedup over BLASTX. RAPSearch missed only a small fraction (~1.3-3.2%) of BLASTX similarity hits, but it also discovered additional homologous proteins (~0.3-2.1%) that BLASTX missed. By contrast, BLAT, a tool that is even slightly faster than RAPSearch, showed a significant loss of sensitivity compared with RAPSearch and BLAST. Conclusions: RAPSearch is implemented as open-source software and is accessible at http://omics.informatics.indiana.edu/mg/RAPSearch. It enables faster protein similarity search. The application of RAPSearch in metagenomics has also been demonstrated.

KW - Metagenomics

KW - Reduced amino acid alphabet

KW - Short reads similarity search

KW - Suffix array

UR - http://www.scopus.com/inward/record.url?scp=79955877212&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=79955877212&partnerID=8YFLogxK

U2 - 10.1186/1471-2105-12-159

DO - 10.1186/1471-2105-12-159

M3 - Article

C2 - 21575167

AN - SCOPUS:79955877212

VL - 12

JO - BMC Bioinformatics

JF - BMC Bioinformatics

SN - 1471-2105

M1 - 159

ER -