CS 581 Final Project Suggestions

The textbook (Computational Phylogenetics: An Introduction to Designing Methods for Phylogeny Estimation) has several suggestions for projects, many of which would be good for this course. You may also have your own ideas for a project! The suggestions below are just to get you started with thinking about what you might do.

## Projects related to multiple sequence alignment

• Modify the sequence evolution simulator Indelible (Fletcher and Yang) so that it allows the root sequence to be specified. The code is open source and written in C++.
• Compare methods that can align two alignments, and evaluate for accuracy on simulated datasets. Examples of such methods: Opal, Muscle, and Prime.
• Design a method that can find outliers in a set of sequences (i.e., sequences that are not homologous to the remaining sequences), under the assumption that nearly all of the input sequences are homologous to each other.
• Test the accuracy of POY when it is run with gap penalties that are not affine.
• Examine ways of computing a consensus alignment on the output from PASTA, and evaluate for accuracy in comparison to the single alignment returned by PASTA. For example, compute the posterior decoding of a random sample of the alignments PASTA computes (after removing the first few alignments). To compute the posterior decoding, you can install BAli-Phy (Redelings and Suchard), using the directions on the website: http://bali-phy.org/README.html#installation. Once the software is installed, there is a folder of executable files called "bin", and within that folder there is a folder called "alignment-max".
• Examine ways to annotate the sites in a multiple sequence alignment for reliability, based on examining a set of alternate alignments obtained by running a basic alignment method using different strategies (e.g., different parameter settings).
• Find codes for computing a point estimate of an alignment given an arbitrary set of multiple sequence alignments (e.g., the posterior decoding algorithm in BAli-Phy, but also look at T-Coffee), and compare them for accuracy and/or scalability.
• Explore BAli-Phy to see if you can improve it. For example, determine if you can use it to score a given alignment/tree pair, to find a tree given a fixed alignment, or to find the best alignment on a fixed tree. Or see if you can improve it by giving it a good starting alignment/tree pair. Or see if it is missing any important substitution models, and if so modify it to enable them.
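
As a starting point for the site-reliability idea above, here is a minimal sketch in Python. It assumes every alignment is given as a list of equal-length strings over the same sequences in the same row order, with `-` as the gap character. The function names `columns` and `column_support` are hypothetical, and scoring a column by the fraction of alternate alignments that contain exactly the same column is only one of many possible reliability measures.

```python
from typing import List, Tuple


def columns(alignment: List[str]) -> List[Tuple[int, ...]]:
    """Encode each column as a tuple of residue indices, one per row.

    A gap is encoded as -1. Two alignments of the same sequences share a
    column exactly when the encoded tuples are equal.
    """
    counters = [0] * len(alignment)  # next residue index for each row
    cols = []
    for j in range(len(alignment[0])):
        col = []
        for i, row in enumerate(alignment):
            if row[j] == "-":
                col.append(-1)
            else:
                col.append(counters[i])
                counters[i] += 1
        cols.append(tuple(col))
    return cols


def column_support(reference: List[str],
                   alternates: List[List[str]]) -> List[float]:
    """For each reference column, return the fraction of alternate
    alignments that contain an identical column."""
    alt_sets = [set(columns(a)) for a in alternates]
    return [sum(c in s for s in alt_sets) / len(alt_sets)
            for c in columns(reference)]


# Toy example: two alternates, one identical to the reference.
reference = ["AC-G", "A-TG"]
alternates = [["AC-G", "A-TG"], ["ACG", "ATG"]]
print(column_support(reference, alternates))  # [1.0, 0.5, 0.5, 1.0]
```

The first and last columns appear in both alternates, while the two gapped columns appear in only one, so they receive lower support.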

## Projects related to supertree estimation or phylogenomic species tree estimation

• Develop a good method for weighted quartet tree amalgamation. Compare to Weighted Quartets MaxCut by Avni et al. (2014), DOI: 10.1093/sysbio/syu087.
• Evaluate Weighted Quartets MaxCut (Avni et al. 2014) as a supertree method on the SMIDgen datasets (Swenson et al.).
• Implement a parallel version of a good supertree method, such as FastRFS (Vachaspati and Warnow) or Quartets MaxCut (Snir and Rao).
• Quartet-based supertree methods have computational limitations in that they depend on computing all quartet trees. Evaluate variants that only require computing a subset of the quartet trees.
• Evaluate the impact of multiple sequence alignment error on SVDquartets (Chifman and Kubatko, as implemented in PAUP*).
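
To see why computing all quartet trees becomes a bottleneck, note that n taxa have C(n, 4) four-taxon subsets. The sketch below counts them and draws a uniform random subset of a requested size. The function name `sample_quartets` is hypothetical, and uniform sampling is only one possible strategy, not one taken from any of the methods named above.

```python
import math
import random
from typing import List, Tuple


def num_quartets(n: int) -> int:
    """Number of four-taxon subsets of n taxa."""
    return math.comb(n, 4)


def sample_quartets(taxa: List[str], k: int,
                    seed: int = 0) -> List[Tuple[str, ...]]:
    """Draw k distinct four-taxon subsets uniformly at random.

    Assumes the taxon names are distinct. If k exceeds the number of
    available subsets, all of them are returned.
    """
    rng = random.Random(seed)
    k = min(k, num_quartets(len(taxa)))
    chosen = set()
    while len(chosen) < k:
        chosen.add(tuple(sorted(rng.sample(taxa, 4))))
    return sorted(chosen)


print(num_quartets(100))  # 3921225
print(len(sample_quartets([f"t{i}" for i in range(100)], 1000)))  # 1000
```

Rejection sampling like this is only efficient when k is small relative to C(n, 4); for a large fraction of all quartets, enumerating with `itertools.combinations` and subsampling would be a better fit.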

## Projects related to maximum likelihood gene tree estimation

• Develop a parallel implementation of FastTree-2 (Price et al.).
• Test some leading maximum likelihood methods on large datasets (be careful - this will require a lot of computing time).
• Test a leading maximum likelihood method (e.g., RAxML) on simulated datasets that evolved with heterotachy, and compare to maximum parsimony. Note: I don't know if any simulator exists that evolves sequences with heterotachy; if not, then a simpler project is to create such a simulator.
• Modify some maximum likelihood method to take support values per site into account.
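
For the heterotachy simulator idea, the core building block could look like the sketch below, which evolves a DNA sequence down a single branch under the Jukes-Cantor model with a separate rate multiplier for each site. Heterotachy is modeled here, as an assumption, by letting the caller supply a different vector of site rates on different branches. The function name `evolve_branch` and this particular way of modeling heterotachy are illustrative choices, not an existing tool's design.

```python
import math
import random
from typing import List


def evolve_branch(seq: str, branch_length: float,
                  site_rates: List[float], rng: random.Random) -> str:
    """Evolve a DNA sequence along one branch under Jukes-Cantor.

    site_rates[i] scales the branch length at site i. Passing different
    rate vectors on different branches lets a site be fast in one part
    of the tree and slow in another.
    """
    out = []
    for base, rate in zip(seq, site_rates):
        # Jukes-Cantor probability that the site shows the same base.
        p_same = 0.25 + 0.75 * math.exp(-4.0 * rate * branch_length / 3.0)
        if rng.random() < p_same:
            out.append(base)
        else:
            out.append(rng.choice([b for b in "ACGT" if b != base]))
    return "".join(out)


rng = random.Random(1)
root = "ACGTACGTAC"
# Sites 0-4 are fast on the left branch and frozen on the right branch;
# sites 5-9 behave the opposite way.
left = evolve_branch(root, 0.5, [2.0] * 5 + [0.0] * 5, rng)
right = evolve_branch(root, 0.5, [0.0] * 5 + [2.0] * 5, rng)
print(left, right)
```

A full simulator would apply this function recursively from the root of a model tree to its leaves, choosing the per-branch rate vectors according to whatever heterotachy model is being studied.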

## Other

• Develop a method that can compute a tree from an incomplete dissimilarity matrix (i.e., a matrix in which some of the entries do not have values). Compare to the "NJ*" methods from Criscuolo and Gascuel (2008), https://doi.org/10.1186/1471-2105-9-166, on some simulated datasets.
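
One simple way to produce incomplete matrices for such a simulation study is to start from a complete dissimilarity matrix and delete a chosen fraction of its off-diagonal entries. The sketch below does this symmetrically, using `None` to mark a missing value. The function name `mask_entries` and the choice of deleting entries uniformly at random are assumptions made for illustration; real missing-data patterns may be far from uniform.

```python
import random
from typing import List, Optional

Matrix = List[List[Optional[float]]]


def mask_entries(matrix: Matrix, fraction: float, seed: int = 0) -> Matrix:
    """Return a copy of a symmetric matrix with a fraction of its
    off-diagonal pairs set to None (both [i][j] and [j][i])."""
    n = len(matrix)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    rng = random.Random(seed)
    k = round(fraction * len(pairs))
    masked = [row[:] for row in matrix]  # leave the input untouched
    for i, j in rng.sample(pairs, k):
        masked[i][j] = None
        masked[j][i] = None
    return masked


complete = [
    [0.0, 0.3, 0.5, 0.6],
    [0.3, 0.0, 0.4, 0.7],
    [0.5, 0.4, 0.0, 0.2],
    [0.6, 0.7, 0.2, 0.0],
]
incomplete = mask_entries(complete, 0.5)
missing = sum(v is None for row in incomplete for v in row)
print(missing)  # 6 (three pairs, each stored twice)
```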

## People who can help

• Mike Nute, PhD student in the Warnow lab. Mike can help with any project related to multiple sequence alignment. Contact him at nute2@illinois.edu.
• Sarah Christensen, T.A. for the class and PhD student in the Warnow lab. Sarah can help with projects related to phylogenomics and supertree methods. Contact Sarah at sac2@illinois.edu.
• Erin Molloy, PhD student in the Warnow lab. Erin can help with projects related to maximum likelihood gene tree estimation. Contact Erin at emolloy2@illinois.edu.
• Pranjal Vachaspati, PhD student in the Warnow lab. Pranjal can help with projects related to phylogenomics and supertree methods, and with methods for computing trees from incomplete distance matrices. Contact Pranjal at pr@nj.al.