Graph-Theoretic Algorithms to Improve Phylogenomic Analyses
NSF grant CCF-1535977,
We are developing new theoretical computer science and discrete algorithms for improving the estimation of large species and gene trees, and specifically enabling statistical methods to scale to ultra-large datasets.
- Tandy Warnow (overall PI)
- Chandra Chekuri (Co-PI at UIUC)
- Satish Rao (PI at Berkeley)
- Sarah Christensen (PhD student in Computer Science at UIUC)
- Erin Molloy (PhD student in Computer Science at UIUC)
- Pranjal Vachaspati (PhD student in Computer Science at UIUC)
- Richard Zhang (PhD student in Mathematics at Berkeley)
Funding: U.S. National Science Foundation grant CCF-1535977
(Algorithms in the Field).
Understanding the history of life on earth -
how species evolved from their common ancestor - is a major goal of biological research. These evolutionary trees are very hard to construct with high accuracy, because nearly all of the most accurate approaches require the solution to computationally hard optimization problems. Furthermore, research has shown that the evolutionary tree for a single gene can be different from the evolutionary tree for the species, and current methods do not provide adequate accuracy on genome-scale data. As a result, large evolutionary trees, covering big portions of "The Tree of Life", are very difficult to compute with high accuracy. This project will develop methods that can enable highly accurate species tree estimation. The key approach is the development of novel divide-and-conquer strategies, whereby a dataset is divided into overlapping subsets, species trees are constructed on the subsets, and then the subset species trees are merged together into a tree on the full dataset. These approaches will be combined with powerful statistical estimation methods, to potentially transform the capability of evolutionary biologists to analyze their data. This project will also provide open source software for the new methods that are developed, and provide training in the use of the software to biologists at national meetings. The project will also contribute to interdisciplinary training for two doctoral students, one at Illinois and one at Berkeley, and course materials for computational biology will be made available online.
P. Vachaspati and T. Warnow (2016). FastRFS: Fast and accurate Robinson-Foulds Supertrees using constrained exact optimization. Bioinformatics, special issue for RECOMB-CG.
S. Christensen, E. Molloy, P. Vachaspati, and T. Warnow.
OCTAL: Optimal Completion of Gene Trees in Polynomial Time. Algorithms for Molecular Biology, 13:6. https://almob.biomedcentral.com/articles/10.1186/s13015-018-0124-5
P. Vachaspati and T. Warnow. SIESTA: Enhancing searches for optimal supertrees and species trees. BMC Genomics, to appear. Special issue for selected papers from RECOMB-CG, 2017
M. Nute, E. Molloy, J. Chou, and T. Warnow. The Performance of Coalescent-Based Species Tree Estimation Methods under Models of Missing Data. BMC Genomics, to appear. Special issue for selected papers from RECOMB-CG, 2017.
P. Vachaspati and T. Warnow.
SVDquest: Improving SVDquartets species tree estimation using exact optimization within a constrained search space. Molecular Phylogenetics and Evolution 2018
M. Nute, E. Molloy, J. Chou, and T. Warnow.
The Performance of Coalescent-Based Species Tree Estimation Methods
under Models of Missing Data. BMC Genomics, to appear. Special issue for selected papers from RECOMB-CG, 2017.
E. Molloy and T. Warnow.
To Include or Not to Include: The Impact of Gene Filtering on Species Tree Estimation Methods.
Systematic Biology, Volume 67, Issue 2, 1 March 2018, pages 285-303, (first appeared in 2017)
Md S. Bayzid and T. Warnow.
Gene tree parsimony for incomplete gene trees: addressing true
biological loss. Algorithms for Molecular Biology, 13:1, special issue for selected papers from WABI 2017.
FastRFS, a fast supertree method.
Open source software at
FastRFS is a dynamic programming algorithm that finds an
exact solution to the Robinson-Foulds Supertree problem within
a constrained search space. This study shows that
FastRFS is very fast, and can analyze very large datasets with
thousands of taxa. In addition, FastRFS is more
accurate than major alternative supertree methods.
OCTAL, open source software available at
OCTAL (and its improved version,
OCTAL-II) is a method for gene tree correction,
given a reference species tree.
SIESTA, open source software available at
is a method for computing
statistics and consensus trees for
species tree methods (such as ASTRAL,
FastRFS, and SVDquest) that operate
using dynamic programming.
SVDquest, open source software available at
is a method for exactly solving the maximum
quartet compatibility problem
within a constrained search space.
Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.