CS598/BioE598: Algorithmic Genomic Biology (2015)
Instructor: Tandy Warnow, Founder Professor of
Engineering
This is the description of the 2015 CS598/BioE598 AGB course.
The 2016 course description is located at
5982016.html.
Office hours: Mondays 10:0011:45AM in 3235 Siebel; however,
these will sometimes be moved due to travel or faculty candidate job talks.
No office hours Monday March 30, 2015 (replaced
by Monday, March 23, 2015, 1011:45 AM) or April 13, 2015 (replaced
by Tuesday April 14, 2015, 910:45 AM).
Extra office hour Thursday May 7, 2015, 10:3011:45 AM.
Course location:
TR 11:00 AM  12:20 PM, 1109 Siebel Center for Computer Science.
Course description:
The main focus of the course is on phylogeny (evolutionary tree) estimation,
but the course also covers
the related problems
of computing multiple sequence alignments,
genome assembly, and analyzing microbiomes.
Students will learn the mathematical and computational foundations in these areas, read the current literature, and do a team research project. The techniques involved include discrete algorithms, graph theory, simulations, and probabilistic analysis of algorithms. The course is appropriate for doctoral students in computer science, computer engineering, bioengineering, mathematics, and statistics; doctoral students in the biological sciences are also welcome, and will have different homework and exams.
Course textbook :
Introduction to Computational Phylogenetics.
This is a draft, and will evolve over the semester!
Prerequisites: No biology background is required, but students should have some mathematical maturity, and at least one undergraduate course in algorithm design, data structures, or probability theory.
Students from departments outside Engineering, CS, Math, and Statistics:
The course will be very mathematical, and being able to understand
proofs and design methods with provable guarantees is
the main point of this course. If you do not have the equivalent of
an undergraduate degree in math, statistics, computer science,
or some other engineering program, the course may not be
appropriate for you. However, if you are a PhD student doing
research in an area that relies strongly on phylogeny estimation,
then please see me to discuss whether this course is appropriate
for you.
Homework policy :
Homeworks are due in class by 11:10 AM; homeworks
handed in on time can receive full credit. Homeworks handed
in after the deadline but before 24 hours after the deadline
receive 80% of the grade. Homeworks handed in after 24 hours but
before 48 hours receive 60% of the grade. No homeworks will be
accepted after 48 hours past the deadline.
The single worst homework grade will be dropped.
You are encouraged to work with other students on the
homework, but if you do this, this is what you should do.
First, indicate on the homework who you worked with.
Second, do not look at the other homework solutions when you write
your own solutions; this includes not looking at
someone else's writeup of a critique of some literature.
Third, and more generally, must write your homework solutions
entirely on your own, using your own language. Please do not
under any circumstances copy homework solutions from anyone
else, or let anyone copy from you.
Academic integrity:
You are expected to abide by the university
academic integrity standards, which means (among other things)
that you should never copy anyone else's homework nor let anyone
copy your homework.
This is particularly important for your final project, especially
if you refer to the scientific literature in your project.
You must also never plagiarize, which
means (among other things) that any text that you
copy from another document must be properly
attributed (with quotation marks around the
copied material, and citation to the document from which you
have copied the material). Even paraphrasing can count
as plagiarism.
All violations of academic integrity standards will be
reported to the appropriate university offices. Serious
violations will result in a failing grade for the course.
Please see this page for a brief discussion of this issue, and
the real
academic integrity page.
Grading :
 Homeworks: 40% (one homework grade dropped)
 Quizzes: 10% (one quiz dropped)
 Final Exam: 30%
 Final Project: 15% (due 11:00 AM May 7, in hardcopy and also email).
 Class Presentation: 5%
Reading Assignments:
You are expected to do the assigned reading. This includes
assigned material from the textbook, but also the scientific
literature that I assign as reading.
Quizzes:
These will be short (10 minute) quizzes, meant to
confirm basic understanding of a technique or concept.
If you have done the reading and homeworks, they will
be straightforward.
The worst quiz grade will be dropped.
Final Exam: This will be a closed book,
comprehensive exam.
Please see this review document.
Final Project:
The course requires a final project of each student. You are strongly encouraged to do a research project, but you can also do a survey paper on some topic relevant to the course material. In both cases, your project should be a paper (of about 15 pages) in a format and style appropriate for submission to a journal. Research projects can involve two students, but survey papers must be done by yourself.
Grades on the final project depend upon the kind of project you do. For a research paper, your grade will be 30% writing, 40% scientific/algorithmic rigor, and 30% impact. If you do a survey paper, the grade will be 30% writing, 30% summary of the literature you discuss, and 40% commentary (i.e., insight, critical and thoughtful discussion of the issues that come up).
The final project counts for 15% of your course grade, and will be due
on May 7, 11 AM, in hardcopy and in email. You must give this to
me directly  in class or in my office hours.
See this page for a list of possible
final projects.
Class Presentation:
Pairs of students will present scientific papers
to the class. You will have
15 minutes for the presentation, with 10 minutes
for questions.
You will need to prepare a PPTX or PDF file for your
presentation one week before your scheduled
presentation, and will get feedback on the file from the
instructor before you give the presentation.
This counts for 5% of your final grade.
See list of papers that you have
each selected to present.
Course Schedule and Homework Assignments
 January 20, 2015. Introduction to course.
(PPTX)
(PDF)
 January 22, 2015.
Introduction to stochastic models of sequence evolution,
using the CavenderFarrisNeyman
(CFN)
model as an example.
Additive matrices and the Four Point Condition.
CFN distances.
Distancebased estimation under the CFN model, and why it is
statistically consistent.
The Naive Quartet Method (briefly).
Reading before class: Chapters 13 from textbook.
Also, The Hobgoblin of Phylogenetics, by David
M. Hillis, John P. Huelsenbeck, and David L. Swofford (Nature,
Vol. 369, 2 June 1994), one of the classic papers
in phylogenetics.
 January 27, 2015.
The Newick string representation of rooted trees.
Representation of rooted trees using
clades and triplet trees.
Constructing rooted trees from clades by constructing Hasse Diagrams.
Contructing rooted trees from rooted threeleaf trees, using Aho, Sagiv,
Szymanski, and Ullman.
Constructing unrooted trees from unrooted fourleaf trees using
the All Quartets Algorithm.
Reading before class: Chapter 4.14.2, 5.15.2
Homework 1:
Problems 2.1(2), 2.1(3), 2.1(4), 2.2(1), 2.2(4), 2.3(2), 2.6(1),
2.8(10),
3.3(8), and 3.3(11)
 January 29, 2015.
Algorithms for constructing unrooted trees from
compatible bipartitions and
additive distance matrices.
Maximum parsimony (MP) and
parsimonyinformative characters.
Why MP is not statistically consistent under the CFN model
(the Felsenstein Zone).
In class quiz based on homework.
Reading before class: Chapters 4.3, 4.4, 4.6, 6.1, 6.3, 7.17.9.
 February 3, 2015.
Compatible binary characters, and how to test for compatibility.
Maximum compatibility and equivalence to max clique.
Maximum parsimony, and how to compute the MP score of
a tree.
Finding optimal MP trees.
Consensus trees (strict consensus, majority consensus, and greedy).
Supertree methods and quartetbased tree estimation methods.
(PPT)
(PDF)
Reading before class: Chapter 6.56.6, 6.8.
Homework 2: Problems 2.8(12), 3.3(16), 3.3(17), 4.1(5), 4.1(6), 4.2(12), 6.3(1),
7.2(2),7.2(3), and 7.7(4)
 February 5, 2015.
Maximum likelihood, and how to compute the ML
score of a fixed tree. Finding optimal ML trees.
Bayesian estimation.
Stochastic models of molecular sequence evolution,
and statistical inference issues (identifiability,
statistical consistency, convergence rates, robustness).
Inclass megaquiz (equivalent of three quizzes) based on homework.
Reading before class: Chapter 9.19.13, and
the classic paper
"Cases in which parsimony and compatibility methods
will be positively misleading", by Joseph
Felsenstein, Systematic Zoology, Volume
27, No. 4 (1978), pp. 401410.

February 10, 2015.
Phylogenetic inference under stochastic models of sequence evolution:
what simulations tell us.
(PPTX)
(PDF)
Homework 3:
 Problems 6.4(2), 6.4(3), 6.6(3), 6.8(5),
9.3(1), 9.4(2), 9.8(1), and 9.10(1).

Find one paper that examines
accuracy of phylogeny estimation methods
on simulated data, and write 12 page
paper summarizing and discussing the paper.
Be prepared to discuss the paper and what you
think of it in class.

February 12, 2015.
Review of material to date.
In class quiz based on homework.

February 17, 2015.
Insertions, deletions, and pairwise sequence alignment. The NeedlemanWunsch algorithm
(see http://en.wikipedia.org/wiki/NeedlemanWunsch_algorithm).
Homework 4: find a published paper about quartetbased
tree estimation. Read it and critique it, and submit a
PDF paper (25 pages) with your critique. Make sure
to summarize the paper, and discuss whether you
agree with the conclusions, and what the contributions
of the paper are.
Remember  do not copy
any text from the paper without giving proper attribution!
(See the last page of the class presentation for February
3, 2015, for a list of suggested papers about quartetbased methods.)

February 19, 2015.
Multiple sequence alignment (Part 1):
Optimization problems.

February 24, 2015.
Multiple sequence alignment (Part 2):
MSA methods and their performance on data.
(PDF)
(PPTX)
Homework 5:

Read Chapter 10.110.3, 10.5.

Do problems
2.8(13),
10.1(1), 10.1(2), 10.2(1), 10.3(3), 10.3(5), 10.5(1),
and
10.5(2).

Read Anna Graybeal's paper about
taxon sampling at
this webpage.
Write a 35 page review, summarizing the
paper and discussing its relevance.
Identify some questions that are not addressed
in the paper (or insufficiently addressed), and
discuss how you might try to explore the question
in a subsequent simulation.

February 26, 2015.
Hidden Markov Models, and their use in multiple sequence
alignment (see http://www.cs.princeton.edu/~mona/Lecture/HMM1.pdf for
a very readable introduction to this material).
Supertree methods and their use in divideandconquer methods.
(PPTX)
(PDF)
Homework: Read Chapter 8.

March 3, 2015.
Largescale multiple sequence alignment methods using UPP (Ensembles
of Hidden Markov Models); guest lecture by Namphuong Nguyen (IGB).
Homework 6:
 Read Chapters 2.10, 5.3, 10.4, 10.610.8.

Read papers 84 (download it from
the library) and 95 from my publication list (available
here).
Think about how quartetbased methods can be used in
phylogeny estimation.
Write a paper (at least 3 pages) about
quartetbased methods for phylogeny
estimation, in which you discuss the
entire collection of papers you have read
about quartetbased methods.
Be prepared to discuss these issues in class

Read the following papers. (You will be asked to
write a paper about alignment methods later.)
 "Fast, scalable generation of highquality protein multiple
sequence alignments using Clustal Omega", by Sievers et al.,
DOI 10.1038/msb.2011.75  Published online 11.10.2011
Molecular Systems Biology (2011) 7: 539.
 "PASTA: Ultralarge multiple sequence alignment", by
Mirarab et al., RECOMB 2014,
(PDF)

March 5, 2015.
Phylogenomics (genomescale phylogeny estimation):
Inferring species trees in the presence
of Incomplete Lineage Sorting (Part 1).
(PDF)
(PPTX)
Discussion of
final projects.

Reading before class: Chapter 11.111.4

March 10, 2015.
Inferring species trees in the presence of Incomplete Lineage Sorting (Part 2)
(See presentation from March 5, 2015)
Homework 7:
 Page 55, from presentation (both problems)
 Problems 11.2(1)11.2(3), 11.3(4)
 Read the NJst paper,
"Estimating species trees from unrooted gene
trees", by L. Liu and L. Yu, Syst. Biol. 60(5):661667, 2011.

March 12, 2015.
Reticulate evolution (e.g., horizontal gene transfer).
See "Recovering the Treelike Trend of Evolution
Despite Extensive Lateral Genetic Transfer:
A Probabilistic Analysis" by
Sebastien Roch and Sagi Snir, 2013,
Journal of Computational Biology, Volume 20, Number
2, pages 93112.

March 17, 2015.
In class quiz.
Homework 8:
 There are many programs to do maximum likelihood
phylogenetic tree estimation under standard models (such as the
GTR model), and only a few that can do large datasets. But
a dataset can be
"large" in two ways: the number of sequences or the length of the
alignment. For datasets with
large numbers of sequences,
FastTree2 (Price et al., PLoS ONE, 5(3):e9490, 2010)
is very good. For datasets with long alignments,
ExaML (a variant of RAxML) developed by Stamatakis et al.
(IPDPS 2013 doi:10.1109/IPDPS.2013.70) is very good, largely
because it has good parallelization. But nothing does
well for datasets with both
dimensions being large  the number of sequences and length
of the alignment.
Imagine you have
a piece of code that will compute the GTR maximum likelihood
score of a given tree (so it will give you all the
parameters on the tree to optimize the likelihood), and
this is fast enough that you could use it periodically
(though not too often).
What algorithmic approaches would you try
to find good ML trees?
What would you do to test your method in comparison
to FastTree and RAxML?
 The more general problem is how to compute
trees from large multiple sequence alignments, with
tens of thousands (or more) of sequences.
One way to do this "quickly" is to compute
pairwise
distances
and then run neighbor joining; however,
even neighbor joining is somewhat slow (the best
implementation is quadratic or worse).
Do a literature search and find at least two
polynomial time methods that claim to be
faster than neighbor joining. Write up
a comparison of the papers describing the
methods, and what they show
about the accuracy of these methods
to other standard methods such as neighbor joining.
 (Due March 18, 2015) In preparation for your presentation (which takes
place in April), list two or three papers that you
would like to present, include hardcopies
of the papers, and write up one paragraph
about each paper. I will select one of these papers from
the set for you to present in class.

March 19, 2015.
Review of course to date.
Research discussion: What have
we learned? What needs to be done?
What do you want to do for your research project?
(Please see this list of papers
that you have each selected to present.)

March 31, 2015.
Introduction to genome assembly methods
(PPT)
(PDF)
Homework 9:
 Read this paper about deBruijn graphs and genome assembly.
Give an example of a DeBruijn graph, and show how you constructed it. (Do
not show one that you found in any paper you read.)

April 2, 2015.
Genome assembly methods, continued.
(PPTX)
(PDF)
Homework 10:
 Read
the paper by Mihai Pop at this
page about genome assembly challenges.
Write a two page summary about the paper.
 April 7, 2015.
Introduction to Historical linguistics (constructing phylogenetic trees and
networks from linguistic data)
(PDF)
(PPTX)
Homework 11:

Read the papers that Pranjal and Ashu will present, write a one paragraph
summary of each paper, and identify
one or two questions about each paper. Submit this by email  as I will
forward it to the students!

Answer the questions in
this
page. Note, you may need to do a literature search to
answer these questions. If you do, include a reference to the
source you used (e.g., published or unpublished
paper, wikipedia page) to answer the question.

Consider the following set of eight 3mers: AGT,AAA,ACT,AAC,CTT,GTA,TTT, and TAA.

Find the shortest common superstring for these 3mers.

Construct the digraph with 8 vertices, one for each 3mer (the Hamiltonian path approach). Find a Hamiltonian path in the graph. Is this path also Eulerian? Write the superstring corresponding to this Hamiltonian path.

Construct the digraph with one vertex for every 2mer prefix or suffix (the Eulerian path approach). Find an Eulerian path in the graph. Is this path also Hamiltonian? Write the superstring corresponding to the path.

Is there a string that has all possible 2digit numbers appearing exactly once, with digits taken from {0,1,2}? If so, find it. Explain your analysis.

Is there a string that has all possible 2digit numbers appearing exactly once, with digits taken from any finite alphabet? Explain your reasoning.

Find a shortest common superstring of the 8 3digit binary numbers (i.e., the set {000,001,010,011,100,101,110,111}.
 April 9, 2015.
Student presentations: Pranjal
Vachaspati (Jiang et al.,
SICOMP 2001) and Ashu Gupta (Avni et al., Systematic
Biology 2014).
Homework 12:
 Please see this review document.
 Read the papers for the following week.
For each paper, write a one paragraph summary of the paper, and
and write 12 questions for the student presenting the paper.
Submit this by email, as I will forward your questions to the
students!

Read Chapters 11 and 14 from the textbook.
 April 14, 2015.
Student presentations:
Shashank Yaduvashi (Chifman and Kubatko, Bioinformatics 2014) and
Jed Chou (Chifman
and Kubatko, arXiv 2014).
 April 16, 2015.
Student presentations:
Farzaneh Khajouei (Degnan and Rosenberg, TREE 2009)
and Yang Zhang (Salichos and Rokas, Nature 2013).
Homework 13: read the papers for the following week.
For each paper, write a one paragraph summary of the paper, and
and write 12 questions for the student presenting the paper.
Submit this by email, as I will forward your questions to the
students!
 April 21, 2015.
Student presentation:
Arjun Athreya (Price et al., PLoS ONE 2010).
 April 23, 2015.
Student presentations:
PeiChen Peng (Stamatakis et al., Bioinformatics 2005) and
Bryan Lunt (Siddharthan et al., PLoS Computational Biology
2005).
Homework 14: read the paper for the following week,
write a one paragraph summary of the paper, and
and write 12 questions for the student presenting the paper.
Submit this by email, as I will forward your questions to the
student.
 April 28, 2015.
Student presentation:
Jeremy Kemball, Joly et al., The American Naturalist, 2009.
 April 30, 2015.
Review Chapters 17 for exam;
please see this review document.
 May 4, 2015. 6 AM is deadline for submitting draft of
final project for feedback.
 May 5, 2015. Last day of class.
Review Chapters 811 for exam.
 May 7, 2015. Office hours 10:3011:45 AM.
(Final projects due in hardcopy in my office by 11:00 AM)
 May 8, 2015. Final exam (closed book),
710pm in Room 1109 of Siebel.