CS/BioE 589AGB final projects and presentations
There are two types of final projects you can do:
a survey paper or a research paper.
If you do a survey paper, you will do this by yourself, but if you
do a research paper then you can do this with someone else.
When the final project involves two people, each person must
be equally involved in the project, and be able to answer questions
about the project.
You have a lot of freedom
in what you do, and can pick something on your own. However,
this document should help you think of
things you might want to do.
The class presentation you do (between March 31 and April 14)
needs to be closely connected to your final project. Therefore,
you should have a pretty clear plan for the final project
when you pick the paper for your class presentation.
While you are thinking about the general area for a final project,
look at this.
If you want to do a research project,
please come see me to discuss
the possible projects.
Datasets for studying methods can be obtained from the
old lab webpage, as well as from other projects.
Here are some simple examples of research projects:
Biological dataset analysis
The projects in this category involve
using a biological dataset (possibly one that has already
been published and studied) and analyzing it
using a number of different pipelines, in order
to understand how method choice impacts discovery.
For example, in the context of gene tree estimation,
you could vary the choice of multiple sequence
alignment method and phylogeny estimation method, and
you could also look at co-estimation methods (that estimate
alignments and trees together, such as BAli-Phy), or
even alignment-free estimation.
In the context of multi-locus species tree estimation,
you could vary the methods used to estimate species
trees (co-estimation of gene trees and species trees,
summary methods, or single site methods).
In each case,
look at how changing your input data impacts
the analysis. For example, how does
taxon sampling impact the result? How does removing
or masking noisy sites in alignments (using various techniques)
impact gene tree estimation?
For species tree estimation from multi-locus data,
how does deleting loci with poorly supported
gene trees, or collapsing low-support branches in
gene trees, impact the final tree in a species
tree estimation using summary methods?
Focus on understanding how the choice of method and the properties
of the data impact the final biological discoveries.
This particular project type is a natural outcome
of Problem Set 3 from
the midterm, but you'd probably explore more
variations in the estimation process, and you'd also
explore the impact of data modification on the analysis.
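One of the data modifications mentioned above, collapsing low-support branches in gene trees, can be sketched in a few lines. This is only a minimal illustration, not a recommended tool: it assumes a hypothetical in-memory tree format in which a leaf is a string and an internal node is a pair (support, list of children), and the threshold of 75 is an arbitrary example value.

```python
def collapse_low_support(node, threshold):
    """Return a copy of the tree with weakly supported internal branches contracted.

    Assumed (hypothetical) format: a leaf is a str, an internal node is
    (support, [children]). The root's own support value is ignored because
    the root has no branch above it.
    """
    if isinstance(node, str):  # leaf: nothing to collapse
        return node
    support, children = node
    new_children = []
    for child in children:
        child = collapse_low_support(child, threshold)
        if not isinstance(child, str) and child[0] < threshold:
            # Contract the branch above this child: its children
            # become children of the current node (a polytomy).
            new_children.extend(child[1])
        else:
            new_children.append(child)
    return (support, new_children)


example = (100, [(95, ["A", "B"]), (40, ["C", (80, ["D", "E"])])])
collapsed = collapse_low_support(example, 75)
```

In the example, the branch with support 40 is contracted, so C and the (D, E) clade attach directly to the root. The collapsed gene trees could then be given to a summary method and the resulting species tree compared with the one obtained from the original gene trees.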
You might also be interested in how the choice of gene
tree estimation procedure (which includes the alignment estimation
and tree estimation steps) impacts the detection of
selection. It has already been observed that
over-alignment (where sites contain non-homologous
nucleotides) can result in
the false positive detection of positive selection; however,
as far as I know,
the impact of under-alignment has not been studied.
And it is not clear how the phylogeny estimation step
impacts this question. Similarly, the impact of
gene tree estimation
or species tree estimation
procedure on other biological questions (e.g.,
predicting function or structure in proteins)
has not been very well investigated.
Explore the impact of taxon identification method
and dataset choice
on microbiome analysis.
For example, you could use
different taxonomic profiling methods
(e.g., KRAKEN and TIPP), different data (16S only,
or metagenomic data using whole genome shotgun sequences),
or different sequencing technologies (Illumina,
PacBio, or other
longer read sequencing technologies).
Some statistical methods
for multiple sequence alignment
(e.g., BAli-Phy) seem to perform very well on simulated data, but
we don't know how well they perform on biological datasets.
Take one of these methods (e.g., BAli-Phy, or perhaps
PAGAN or Prank, but there are others) and compare it to
leading alignment methods (MAFFT,
PASTA, etc.) on biological datasets with benchmark alignments
based on structure.
Many of these methods (e.g., BAli-Phy)
use MCMC and
are computationally intensive; hence, this should be
limited to small datasets (at most 25 sequences). You can
subsample from larger datasets with structural alignments
to produce these smaller datasets.
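The subsampling step can be sketched as follows. This is a minimal sketch under stated assumptions: the benchmark alignment is held in a hypothetical dictionary mapping sequence names to equal-length aligned rows, with "-" as the gap character. After choosing the sequences, it removes columns that contain only gaps, since those carry no information about the retained sequences.

```python
import random


def subsample_alignment(alignment, k, seed=0):
    """Pick k sequences at random and drop columns that become all-gap.

    `alignment` is assumed to map names to equal-length strings that use
    "-" for gaps (a hypothetical format chosen for this sketch).
    """
    rng = random.Random(seed)  # fixed seed so the subsample is reproducible
    names = rng.sample(sorted(alignment), k)
    rows = [alignment[n] for n in names]
    # Keep only columns where at least one chosen sequence has a residue.
    keep = [i for i in range(len(rows[0])) if any(r[i] != "-" for r in rows)]
    return {n: "".join(alignment[n][i] for i in keep) for n in names}


reference = {"a": "A-C-", "b": "A-G-", "c": "ATGC"}
small = subsample_alignment(reference, 2, seed=1)
```

Stripping the gaps from the rows of `small` gives the unaligned input for the methods being tested, while `small` itself serves as the reference alignment for that subset.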
Exploring methods on simulated data
With biological datasets, one rarely really knows
the true phylogeny or multiple sequence alignment,
and so evaluating the impact of method choice on
final biological discovery is complicated. If you
wish, therefore, you can use simulated datasets to
explore performance of methods.
Many papers provide links to published simulated
datasets that you can use.
Examples of questions you might address on
simulated data include:
- Evaluating the impact of missing data on phylogeny estimation.
For example, suppose you have the true alignment of a set of sequences
and you delete a large fraction of the sites in a single sequence x.
How does this impact phylogeny estimation? In particular, does
it impact the accuracy of the tree topology? Does it impact the
branch length estimation, in particular of the branch leading to the
leaf for x?
- Determining if alignment-free phylogeny estimation methods
can be as accurate as
good phylogeny estimation methods (e.g., two-phase methods).
If so, under what conditions?
- Determining how different MSA methods
and tree estimation methods
impact the estimation of parameters of the
model tree beyond the topology -- such as
branch lengths and the GTR matrix.
- In PNAS 2013,
Bouchard-Côté and Jordan
presented a new method for co-estimating multiple sequence
alignments and trees; however,
that method was not studied in comparison to
other co-estimation methods nor to
good two-phase methods for estimating trees from
unaligned sequences. Therefore, a comparison of
their method to good alternatives would help us
evaluate whether their method is competitive.
- Nearly all studies have explored accuracy under
sequence evolution models that only include
substitutions and simple indels; yet evolution also
involves more complex events, such as tandem duplications and rearrangements.
Hence we do not know how well methods perform under
more realistic models.
Find a sequence evolution simulator that
includes these more complicated processes, and
test methods for estimating gene trees (either
two-phase methods or co-estimation methods) from
unaligned sequences on data generated by this simulator.
- Phylogeny estimation often is based on a single
point estimate of the multiple sequence alignment, and
multiple sequence alignments impact the accuracy of the
phylogeny. See if you can find ways to improve the
estimation of the multiple sequence alignment by
combining MSAs. Some methods, such as
T-Coffee, are designed for this.
Explore their performance in terms of improving the MSA estimation,
and the impact that the new MSA has on the phylogeny.
This can be done on both biological and simulated data.
- Explore the impact of alignment masking
(as in GBLOCKS and similar methods)
on multiple sequence alignment methods like
Prank, Pagan, and UPP that have a tendency to under-align.
This can be done on both biological and simulated data.
- Explore the impact of including rogue taxa in
a dataset by adding random sequences to your input sequence dataset.
Concretely, suppose you have a set S of homologous
sequences, and you add a random sequence x to the dataset, creating
a larger set S'.
Construct a ML tree T on S and
then also construct an ML tree T' on S'.
Now delete x from T', so that it is a tree on S.
How similar are the two trees?
What happens if you have more than a single non-homologous sequence?
In other words, does the inclusion of random sequence data
impact the phylogeny estimation?
This is likely to depend on the methods you use, so
consider different ways of computing alignments (e.g., MAFFT,
and UPP) and trees (e.g., Neighbor Joining and ML).
- Can we detect non-homologous sequences in datasets?
Even if the inclusion of non-homologs does not impact
phylogeny estimation (and it might!), the inclusion of
non-homologs in a phylogeny is at a minimum misleading.
Can we detect these non-homologs and delete them?
Consider the use of phylogenies (finding
long branches) and multiple sequence
alignment methods (e.g., UPP) for this purpose.
- We would like to know if phylogeny estimation
methods (such as maximum
likelihood, neighbor joining, etc.)
are biased in terms of
topological shape. For example, some methods may
tend to make trees that are imbalanced, and perhaps others
will tend to make trees that are balanced. There are
methods for measuring how balanced a tree is, which
can be used to test methods for being biased.
Imagine you generate a model gene tree topology and calculate
the measure of balance. Then you simulate evolution down the
tree and estimate a gene tree from the sequences you simulate; this
estimated tree also has a measure of balance. By changing
how you compute gene trees (e.g., neighbor joining,
maximum likelihood, etc), you can assess whether the method is
biased towards some kind of topological shape.
Measures to consider include the COLLESS measure of tree
balance, and the beta-splitting model of Aldous.
Write code to compute the COLLESS measure of a
given rooted gene tree.
Write code to compute the beta parameter for a given
rooted gene tree.
- Explore the impact of the rate of evolution on being able
to estimate large trees. You should do a simulation study
with indels and substitutions, and then systematically scale the tree up and down,
and explore what happens with a poor alignment method, a good
alignment method, and the true alignment. See if there is a
"sweet spot", and characterize the empirical statistics of
the range in which the results are optimized.
- Find a simulator for a sequence evolution
model that is more complex than GTR, such as the General Markov
Model (which contains the GTR model).
Explore the accuracy of
tree estimation methods (e.g., maximum parsimony,
neighbor joining, and maximum likelihood
under simpler models) under this more
complex model.
In other words, simulate under
a more complex model, and then estimate under the simpler
model. (You can also approximate this by simulating
sequences under different GTR parameters but the same model
tree topology, and concatenating the alignments; it wouldn't
be the same way of exploring robustness, but it would be
getting at a similar question.)
- Test different tree estimation methods (such
as FastTree, RAxML, neighbor joining, and maximum parsimony) on
datasets with fragmentary sequences, to determine
whether the methods behave differently.
Things to evaluate: tree topology and branch length estimation.
- Evaluate the impact of correcting distances or
not correcting distances on phylogeny
estimation. Be sure to include datasets with different
rates of evolution.
- Find a simulator that evolves gene trees within
a species tree under a duplication
and loss scenario, and test methods for computing
species trees from gene trees on datasets you generate.
You can consider many types of
methods, as long
as they can handle multiple copies
of species inside each gene tree;
examples of such methods include MulRF, DupTree, and iGTP.
- Evaluate the impact of "missing data"
on species tree estimation methods, i.e.,
methods that combine estimated
gene trees into a species tree.
Here the missing data occur when
not all of the given gene trees contain
all the species.
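For the tree-shape project in the list above, the Colless measure of a rooted binary tree is the sum, over all internal nodes, of the absolute difference between the numbers of leaves in the two child subtrees. A minimal sketch follows, assuming a hypothetical nested-tuple tree format (a leaf is a string, an internal node is a pair of subtrees). The normalization by (n-1)(n-2)/2, the value attained by a caterpillar tree on n leaves, is one common convention and is included only as an option.

```python
def colless(tree):
    """Return (number_of_leaves, Colless index) for a rooted binary tree.

    Assumed (hypothetical) format: a leaf is a str, an internal node is a
    2-tuple (left_subtree, right_subtree).
    """
    if isinstance(tree, str):
        return 1, 0
    left, right = tree
    n_left, c_left = colless(left)
    n_right, c_right = colless(right)
    # This node contributes |n_left - n_right| to the index.
    return n_left + n_right, c_left + c_right + abs(n_left - n_right)


def normalized_colless(tree):
    """Colless index divided by its maximum (n-1)(n-2)/2, in [0, 1]."""
    n, c = colless(tree)
    maximum = (n - 1) * (n - 2) / 2
    return c / maximum if maximum else 0.0


balanced = (("A", "B"), ("C", "D"))
caterpillar = ((("A", "B"), "C"), "D")
```

Here `colless(balanced)` gives an index of 0 and `colless(caterpillar)` gives 3, the maximum for four leaves. Computing the measure for both the model tree and each estimated tree lets you compare the two distributions across methods.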
New method development
Multiple sequence alignment
- Imagine the following divide-and-conquer style of
multiple sequence alignment.
The input is a set S of unaligned sequences.
1. Divide into two parts (somehow!).
2. Align each part using your preferred MSA method.
3. Build a profile Hidden Markov Model on each of the two MSAs you computed.
4. Align the two profile HMMs.
Compare the result you get to what you would get
by using your preferred MSA method on the full dataset.
(Note, this is a very under-specified method - so you'd need
to explore the design space.)
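Comparing the alignment from the divide-and-conquer method with the alignment from the base method requires a concrete accuracy measure against a reference alignment. One common choice is based on homologous pairs: pairs of residues placed in the same column. The sketch below computes the fraction of reference pairs missing from the estimated alignment (SPFN) and the fraction of estimated pairs absent from the reference (SPFP). It assumes a hypothetical dictionary format (name to aligned row, "-" for gaps) and is practical only for small alignments, since it enumerates every pair explicitly.

```python
from itertools import combinations


def homologous_pairs(alignment):
    """Set of residue pairs that share a column.

    A residue is identified by (sequence name, index in the unaligned
    sequence), so the same residue matches across different alignments.
    """
    names = sorted(alignment)
    length = len(alignment[names[0]])
    position = {n: 0 for n in names}  # next unaligned index per sequence
    pairs = set()
    for col in range(length):
        present = []
        for n in names:
            if alignment[n][col] != "-":
                present.append((n, position[n]))
                position[n] += 1
        pairs.update(combinations(present, 2))
    return pairs


def sp_error(reference, estimated):
    """Return (SPFN, SPFP) of `estimated` with respect to `reference`."""
    ref = homologous_pairs(reference)
    est = homologous_pairs(estimated)
    spfn = len(ref - est) / len(ref) if ref else 0.0
    spfp = len(est - ref) / len(est) if est else 0.0
    return spfn, spfp


reference = {"s1": "AC-GT", "s2": "ACTGT"}
estimated = {"s1": "ACG-T", "s2": "ACTGT"}
spfn, spfp = sp_error(reference, estimated)
```

In this toy example one of the four reference pairs is lost and one of the four estimated pairs is wrong, so both error rates are 0.25.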
See if you can develop improved ways of combining multiple
sequence alignments to get a better (more accurate)
alignment. For inputs, you can use
methods like PASTA and SATé that produce many multiple sequence
alignments for a given input of unaligned sequences,
but you can also use any multiple sequence alignment method.
Compare your method to techniques like T-Coffee that
are designed for this.
Explore the performance in terms of improving MSA estimation,
and the impact that the new MSA has on the phylogeny.
This can be done on both biological datasets (that have
structural alignments) and simulated datasets.
Gene tree estimation
The estimation of very large trees (with more than 10,000
sequences) is almost always done through standard two-phase
methods: first align, then compute a ML tree on the alignment;
even PASTA and SATé compute ML trees on the
alignments they compute in each iteration.
Yet this approach may not be scalable to large
datasets. Can we improve this through divide-and-conquer?
Suppose we simplify the problem and assume we have
a multiple sequence alignment: can we develop a fast
method of computing trees from the alignment that is
as accurate (hopefully) as running FastTree-2 or RAxML
on the alignment?
Consider the following divide-and-conquer style of
tree estimation, given an input set of aligned sequences:
The input is a set S of sequences in an alignment.
1. Divide into two overlapping parts (somehow!).
2. Construct a tree on each part using your preferred method.
3. Merge the two trees into a supertree, using your preferred supertree method.
Note this is a very under-specified method, so you'd need to explore the design space.
The objective here is to have good accuracy on very large
datasets. Your exploration of this should examine the largest
datasets you can, but this is clearly going to be impacted by
the computational infrastructure you have available.
However, you should definitely
compare the tree you get to what you would get
using FastTree-2 on the alignment, since FastTree-2 is
a very fast and relatively accurate ML method.
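Comparing the divide-and-conquer tree with the FastTree-2 tree (or either of them with the model tree in a simulation) needs a topological distance. A standard choice is the Robinson-Foulds distance, the number of non-trivial bipartitions found in exactly one of the two trees. The sketch below is a minimal pure-Python version, assuming a hypothetical nested-tuple tree format (a leaf is a string, an internal node is a tuple of subtrees) and identical leaf sets. For very large trees an established phylogenetics library would be the more practical choice.

```python
def leaf_set(tree):
    """All leaf names below a node (a leaf is a str, an internal node a tuple)."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def bipartitions(tree):
    """Non-trivial bipartitions, each stored as the side that does NOT
    contain a fixed reference leaf, so the root position does not matter."""
    all_leaves = leaf_set(tree)
    ref = min(all_leaves)
    splits = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(visit(child) for child in node))
        side = all_leaves - below if ref in below else below
        # Skip trivial splits (a single leaf versus everything else).
        if 1 < len(side) < len(all_leaves) - 1:
            splits.add(side)
        return below

    visit(tree)
    return splits


def rf_distance(t1, t2):
    """Number of bipartitions present in exactly one of the two trees."""
    if leaf_set(t1) != leaf_set(t2):
        raise ValueError("trees must have the same leaf set")
    return len(bipartitions(t1) ^ bipartitions(t2))


t1 = (("A", "B"), ("C", ("D", "E")))
t2 = (("A", "C"), ("B", ("D", "E")))
t1_rerooted = ("A", ("B", ("C", ("D", "E"))))
```

In the example, `t1` and `t2` share the split separating D and E from the rest but differ in one split each, giving a distance of 2, while `t1` and `t1_rerooted` are the same unrooted tree and have distance 0.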
ASTRAL is designed for combining unrooted gene trees into
an unrooted species tree using
Modify ASTRAL to work with rooted gene trees, and test it.
ASTRID is a modification of NJst
that is faster and can handle missing data. However,
it is also different in that it uses FastME to compute
species trees instead of NJ.
See what happens if you replace FastME with other
distance-based methods (e.g., the distance-based method
See what happens if you modify ASTRID so that it
is based on a different way of calculating
the distance matrix.
The question here is whether we can use species trees (either
known or estimated) to improve the estimation of gene trees.
This is a general topic of great interest, but the techniques
depend on the causes for gene tree incongruence with the species
tree (e.g., duplication/loss scenarios, incomplete lineage sorting, etc.).
Find methods that "correct" gene trees using species trees,
and evaluate how well they work. (Consider true vs. estimated
species trees, and also gene trees that are estimated
with low to high error.)
- I hypothesize that maximum likelihood bootstrap gene trees are
less accurate estimates of the true gene tree
than the best maximum likelihood tree for a gene sequence
alignment. Run an experiment to test this. Visualize the
results with MDS.
Do the same thing with MrBayes, using the sample from the distribution
produced by MrBayes.
What percentage of the sample are closer to the true tree, the same distance,
or further? Does this depend on the model tree properties and sequence
- Compare visualization tools for large trees. What is each tool good for?
- Compare visualization tools for large multiple sequence
alignments. What is each tool good for?
- We would like to have visualization tools that
can compare two large trees, and identify places in the tree
where they are different. Do such tools exist? If so, find them
and evaluate how well they work.
Writing a good survey paper is not trivial. You will need
to understand the papers you are reading and have some insights into
the different contributions made by different papers.
The quality of your writing is very important, and you should
think of this as something that you would be willing to submit to
a journal in the form that you submit it for a grade. That means,
among other things, no typos, no grammatical mistakes, a
proper bibliography (with full bibliographical information), and
Also hand in a hard copy of the main papers you reference.
Be careful, of course, not to include any text from any
other paper, unless you put quotes around it and properly attribute it.
When you write a survey paper, you need to specifically
you are interested in, and why
it is interesting and important.
You should explain controversies (if any),
the leading approaches,
and the evidence in favor or against each approach.
You need, as always, to really be critical - not necessarily
just accepting what the authors say, but pointing out
limitations of their approach.
Examples of possible topics for a survey paper include:
- Ultra-fast methods for distance-based phylogeny estimation
- Alignment-free tree estimation methods
- Methods for detecting horizontal gene transfer
or constructing species trees in the presence of HGT
- Methods for estimating species trees from gene trees
when gene trees can differ due to incomplete lineage sorting
- Methods for estimating species trees from gene trees
when gene trees can differ due to duplication and loss
- Models of evolution that are more complex than GTR,
and so allow
(for example) for dependencies between sites
- Techniques for dating ancestral nodes
- Techniques for inferring ancestral sequences
- Genome-scale multiple alignment methods (taking rearrangements into account)
- Genome rearrangement phylogeny (taking rearrangements into account)
- Methods for detecting remote homology
- Methods for masking noisy sites in multiple sequence alignments
- Methods for combining information from a collection of
multiple sequence alignments