New methods for multiple sequence alignment with improved accuracy and scalability
- Tandy Warnow (PI)
- Kodi Collins, REU student in Statistics (now PhD student at UCLA)
- Mike Nute, PhD student in Statistics (PhD expected May 2019)
- Ehsan Saleh, PhD student in Computer Science (rotation student)
Funding: U.S. National Science Foundation grant 1458652
Multiple sequence alignment (MSA) is one of the most basic bioinformatics steps, in which a set of molecular sequences (i.e., DNA, RNA, or amino acid sequences) is arranged in a matrix to identify corresponding positions. MSA calculation is a fundamental first step in many biological analyses. Because of its broad applicability and importance, many MSA methods have been developed and are in wide use today. Unfortunately, many real-world biological datasets have features (large size and fragmentary sequences, for example) that make accurate MSA calculation very difficult. Because poorly estimated alignments result in errors in downstream biological analyses, new MSA techniques are needed that can produce accurate alignments on difficult datasets. This project will develop MSA methods that have greatly improved accuracy and can analyze the large and heterogeneous sequence datasets being assembled in different biology projects nationally. The project also has a substantial outreach component to women's colleges and minority serving institutions, and summer software schools to train biologists in the use of the project software.
Multiple sequence alignment (MSA) and phylogeny estimation are two very basic bioinformatics problems, which sit at the intersection of machine learning, statistical estimation, and evolutionary and structural biology. MSA has particular importance in constructing evolutionary trees, understanding the function and structure of proteins, detecting interactions between proteins, and even genome assembly. Large-scale MSA and phylogeny estimation also require high performance computing and parallel algorithms, in order to provide adequate scalability. The team will develop new machine learning techniques to greatly improve MSA methods, and hence also phylogeny estimation, since it depends on accurate multiple sequence alignments. The core of this project is algorithm development, utilizing a variety of machine learning techniques (including Hidden Markov Models), statistical estimation methods (especially Bayesian MCMC and maximum likelihood), and novel algorithmic strategies, all focused on improving scalability and accuracy.
N.-P. Nguyen, M. Nute, S. Mirarab, and T. Warnow (2016). HIPPI: Highly accurate protein family classification with ensembles of HMMs. BMC Genomics 17 (Suppl 10):765, special issue for RECOMB-CG.
M. Nute and T. Warnow (2016). Scaling statistical multiple sequence alignment to large datasets. BMC Genomics 17 (Suppl 10):764, special issue for RECOMB-CG.
T. Warnow (2017). Computational Phylogenetics: An introduction to designing methods for phylogeny estimation. Cambridge University Press.
- PASTA (the improvement to SATé, Liu et al. Science 2009), which co-estimates
sequence alignments and trees and can analyze datasets with up to 1,000,000
sequences. PASTA was developed by two former students of mine (Siavash Mirarab
and Nam Nguyen) and now has contributions from Mike Nute (current student).
PASTA+BAli-Phy is available at this github site,
and is the work of Mike Nute (see Nute and Warnow, BMC Genomics 2016).
- UPP, a new technique for
multiple sequence alignment that can analyze datasets with up to 1,000,000
sequences and is highly robust to fragmentary sequences. UPP was developed
by Nam Nguyen and Siavash Mirarab (current and former students of mine).
- HIPPI: gene binning for protein sequences, using ensembles of HMMs.
(This is available on github at the website for UPP, see above)
Summer Symposia and Software Schools:
The grant will provide summer symposia and software schools to train researchers
(from students through faculty) in new multiple
sequence alignment methods, and other topics within phylogenomics.
- Summer 2015: 2015 Phylogenomics Symposium and Software School, May 18-19, 2015, at the University of Michigan in Ann Arbor, MI, as part of the Standalone Meeting of the Society of Systematic Biologists.
- Summer 2016: 2016 Phylogenomics Symposium and Software School, June 16-17, 2016, in Austin, Texas,
co-located with the Evolution 2016 meeting.
- Summer 2017: Advancing Genomic Biology through Novel Method Development,
June 5-6, 2017, at the Radcliffe Institute for Advanced Study.
This Exploratory Seminar was designed to discuss
three computational problems (phylogenomics, metagenomics,
and protein sequence analysis) where novel methods are needed to
advance discovery in the presence of large datasets; multiple sequence
alignment is key to each of the three problems that were addressed.
- Summer 2018: 2018 Phylogenomics Software Symposium,
Institut des Sciences de l'Evolution - Montpellier (ISEM), at the University
of Montpellier, August 17, 2018.
See http://tandy.cs.illinois.edu/talks.html for the full list of talks.
Talks specifically relevant to this project:
- January 12, 2016. UC San Francisco (Patsy Babbitt) (PDF)
- June 16, 2016. Phylogenomics Symposium, Advances in Multiple Sequence Alignment,
in Austin, TX (part of Evolution 2016)
- October 11, 2016.
RECOMB-CG, Montreal. Scaling statistical multiple sequence alignment to large datasets (PDF)
- May 30, 2017. Keynote talk at IPDPS (IEEE International Parallel and Distributed Processing Symposium), Orlando, FL.
Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.