Scalable and Highly Accurate Methods for Metagenomics
This project is supported by the National Science
Foundation (NSF) through
award III:AF:Collaborative Research: 1513629.
This is a collaborative grant with the University of Maryland
at College Park (PI: Mihai Pop).
Dates: September 1, 2015 to August 31, 2019
University of Illinois at Urbana-Champaign
- PI: Tandy Warnow, Professor of Computer Science and of Bioengineering
- Co-PI: Bill Gropp, Professor of Computer Science and Interim Director of the National Center for Supercomputing Applications (NCSA)
- Erin Molloy, PhD student in Computer Science
- Nam-phuong Nguyen, postdoctoral researcher (now at UCSD)
- Michael Nute, PhD student in Statistics
University of Maryland
- PI: Mihai Pop,
Professor of Computer Science and Interim Director of UMIACS
- Jeremy Selengut, Associate Research Scientist at UMIACS
- Todd Treangen, Assistant Research Scientist at UMIACS
- Nidhi Shah, PhD student in Computer Science
Metagenomic studies of microbial communities can generate millions to billions of sequencing
reads. The assignment of accurate taxonomic labels to these sequences is a critical component
in many analyses, but is complicated by the fact that the majority of the organisms found
in environmental or host-associated communities cannot be easily cultured in a laboratory.
Even among the organisms that can be cultured, relatively few have been sequenced, even partially.
Thus, many commonly encountered organisms are largely absent from existing databases of known
genomes and genes. Assigning taxonomic labels to metagenomic sequences therefore requires extrapolating
the knowledge contained in sequence databases to previously unseen DNA strings. Simple similarity-based
approaches (e.g., picking the best database hit as the best guess at the taxonomic label)
have been shown to be insufficiently accurate, leading to the development of more sophisticated
methods. Further developments are necessary to handle the characteristics of emerging sequencing
technologies, such as high error rates with large numbers of insertions and deletions. To
date, metagenomic taxon identification methods have been evaluated with respect to their ability
to estimate the distribution of bacterial taxa (species, genera, families, etc.) within a
metagenomic sample. Yet, different scientific and clinical settings may require specific types
of analyses, and this one type of evaluation may not be the most appropriate for all settings.
For example, in a clinical setting the most important question may be whether a
specific pathogen is present, while in a scientific setting the most interesting question
may be whether an observed read comes from a previously unseen species.
New evaluation strategies must be developed that target the needs of
the application domain.
We will address the challenges outlined above as follows. First, we will develop a new framework
for integrating the formal definition of biological use-cases with evaluation datasets and
metrics in order to ensure the software being developed adequately addresses the needs of
the end-users. Second, we will develop new approaches for marker-based taxon identification
and abundance profiling that can leverage multiple sources of information (e.g., multiple
markers) as well as handle the high error rates of third-generation sequencing technologies.
These approaches will build upon our experience developing TIPP, a taxonomic profiling package
that we recently published and that outperforms the leading metagenomic taxonomic profiling software,
particularly for novel sequences and for longer, high-error sequences. Finally, we plan to
develop high-performance computing implementations of these methods in order to enable rapid
analysis of samples. Speed of analysis is particularly important in clinical settings, where
medical treatments may depend on how quickly the method can return an analysis. Speed
is also important in non-medical applications where faster analyses enable researchers to
perform deeper or broader analyses of microbial communities.
All the methods developed in the project will be made into open-source software that is freely
available to the scientific public. We will offer training activities each year, with funds
available to students and postdocs from around the country, and an outreach program to
minority-serving institutions and women's colleges. A summer REU program will also be offered at the
University of Maryland, College Park.
- N. Nguyen, T. Warnow, M. Pop, and B. White (2016). A perspective on 16S rRNA operational taxonomic unit clustering using sequence similarity. npj Biofilms and Microbiomes 2, doi:10.1038/npjbiofilms.2016.4.
- N. Nguyen, M. Nute, S. Mirarab, and T. Warnow (2016). HIPPI: Highly accurate protein family classification with ensembles of HMMs. BMC Genomics 17 (Suppl 10):765, special issue for RECOMB-CG.
- M. Nute and T. Warnow (2016). Scaling statistical multiple sequence alignment to large datasets. BMC Genomics 17 (Suppl 10):764, special issue for RECOMB-CG.
- T. Hansen, S. Mollerup, N. Nguyen, L. Vinner, N. White, M. Coghlan, D. Alquezar-Planas, T. Joshi, R. Jensen, H. Fridholm, K. Kjaransdottir, T. Mourier, T. Warnow, G. Belsham, T. Gilbert, L. Orlando, M. Bunce, E. Willerslev, L. Nielsen, and A. Hansen (2016). High diversity of picornaviruses in rats from different continents revealed by deep sequencing. Emerging Microbes & Infections 5, e90, doi:10.1038/emi.2016.90.
- B.M. Boyd, J.M. Allen, N. Nguyen, A.D. Sweet, T. Warnow, M.D. Shapiro, S.M. Villa, S.E. Bush, D.H. Clayton, and K.P. Johnson (2017). Phylogenomics using Target-restricted Assembly Resolves Intra-generic Relationships of Parasitic Lice (Phthiraptera: Columbicola). Systematic Biology, doi:10.1093/sysbio/syx027.
- TIPP: taxonomic identification using phylogeny-aware profiles. TIPP is available at the github page for SEPP, which also includes code for SEPP (phylogenetic placement), UPP (ultra-large alignment using ensembles of HMMs), and HIPPI (protein family identification for protein sequences), all methods that exploit the ensemble-of-HMMs technique developed initially for SEPP.
- HIPPI: gene binning for protein sequences, using ensembles of HMMs. (Available at the github page for SEPP; see above.)
- PASTA+BAli-Phy: the integration of BAli-Phy (a statistical method that co-estimates multiple sequence alignments and trees) within PASTA. The code is available on github and is the work of Mike Nute (see Nute and Warnow, BMC Genomics 2016). PASTA (an improvement to SATé, Liu et al. Science 2009), which co-estimates sequence alignments and trees, can analyze datasets with up to 1,000,000 sequences.
Conferences and Software Schools
For the full list of talks, see this page.
- November 9, 2015. UCSD Distinguished Lecture,
Department of Computer Science. (PPT)
- January 4, 2016. Pacific Symposium on Biocomputing,
Special Session on
- January 12, 2016. UC San Francisco.
- August 28-September 2, 2016. Using Ensembles of HMMs for Grand Challenges in Bioinformatics, as part of the Schloss Dagstuhl seminar "Next Generation Sequencing - Algorithms, and Software For Biomedical Applications".
- October 11-14, 2016. Scaling statistical multiple sequence alignment to large datasets.
- October 17, 2016. Georgia Tech, CSE Department Distinguished Lecture. Genome-scale estimation of the Tree of Life.
- November 2, 2016. Mid-Atlantic Microbiome Meeting (M3), at the University of Maryland.
- November 17, 2016.
- February 16-17, 2017. Second Workshop on Statistical and Algorithmic Challenges in Microbiome Data Analysis at The Broad Institute of MIT and Harvard, in Cambridge, MA.
- April 5-6, 2017. NeLLi: From New Lineages of Life to New Functions, at the DOE Joint Genome Institute (JGI).
Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.