Homework for CS 581, Spring 2018
Homework policies
Due date:
All homework is due at 2 PM on the due date, via Moodle (unless
otherwise specified).
Late homeworks (up to 48 hours late) can be accepted for reduced
credit (see course webpage for details), except when otherwise
specified (e.g., see Homework 5).
Collaboration policy:
You are expected to write up the homework yourself, but you are
welcome to discuss the homework with other students in the class.
If you discuss the homework with other students, clearly specify this on
your homework.
Reading assignments:
Some homeworks include problems from the textbook, and
many involve
reading the textbook or published papers.
The class discussion depends on you doing the reading,
as I will not be teaching all the material.
Review questions:
The textbook has two types of questions: review questions and
homework problems.
In general, I will assign homework problems
rather than review questions (although
some homework assignments are exceptions).
We may discuss
the review questions in class, so
please look over the review questions as well.
Optional homework:
You are welcome to submit solutions to problems
from the textbook
that interest you.
These won't count towards the grade, but
I'll personally grade them and give you feedback.
I'm recommending some of them in the homework list.
Similarly, you are encouraged to read papers from the
literature (perhaps a paper cited in the textbook), and
write up a discussion of the paper.
If one of the papers really excites you, you might want
to present it in class!
(Everyone has to present a paper, so you could get a head
start on this, by finding a paper you want to present instead
of being assigned one.)
Disputing a grade:
Please come see me directly if you have questions
or concerns about how your homework was graded.
Grading policy:
Each weekly homework is worth 100 points,
and together the homeworks contribute 25% of the course grade.
The lowest homework grade is dropped.
Assignments
 Due Tuesday, January 23, 2018.
 Homework 1. Due Thursday, January 25, 2018.
 HW problems:
Chapter 1 problems 1, 5, and 10.
Chapter 2 problems 1, 5, and 21.
Chapter 3 problems 3 and 22.
 Optional (won't count towards grade):
Chapter 1, problem 11.
Chapter 3, problem 9.
 Homework 2. Due Thursday, February 1, 2018.

Read: Chapter 4

HW problems: Chapter 4, problems 1, 5, 6, and 17

Optional (won't count towards grade): Chapter 4, problem 18

Homework 3.
Due Thursday February 8, 2018.

Read: Chapter 8

HW problems: Chapter 8, problems 8, 9, 13, 14, and 17.

Apply the Sankoff algorithm for maximum parsimony on a fixed tree
(see Chapter 4.3) to the following input:
 Rooted tree T: ((a,e),(b,(c,d)))
 Character states: a=0, b=0, c=1, d=0, e=1
 The cost matrix M has M[0,0]=M[1,1]=0, M[0,1]=2, and M[1,0]=1.
(a) Letting r denote the root of T, compute Cost(r,0) and
Cost(r,1), and show all your calculations.
(b) What is the parsimony score of this input?
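For reference, the Sankoff recursion can be sketched in a few lines of Python. This is only an illustrative sketch, not a solution to the problem above: it runs on a different, made-up input, and the function name `sankoff`, the nested-tuple tree encoding, and the toy cost matrix are assumptions introduced here. It also assumes M[x][y] is the cost of changing from state x at a parent to state y at a child.

```python
import math

def sankoff(tree, states, M):
    """Return the list [Cost(v, 0), Cost(v, 1), ...] for the root v of `tree`.

    tree   : a leaf name (str) or a tuple of subtrees (hypothetical encoding)
    states : dict mapping each leaf name to its character state
    M      : cost matrix, M[x][y] = cost of a change from x (parent) to y (child)
    """
    k = len(M)
    if isinstance(tree, str):
        # A leaf costs 0 for its observed state and infinity for any other state.
        return [0 if s == states[tree] else math.inf for s in range(k)]
    child_costs = [sankoff(child, states, M) for child in tree]
    # For each state s at this node, each child independently picks its cheapest state t.
    return [
        sum(min(M[s][t] + costs[t] for t in range(k)) for costs in child_costs)
        for s in range(k)
    ]

# Toy input (not the homework input): tree ((x,y),z) with an asymmetric cost matrix.
toy_tree = (("x", "y"), "z")
toy_states = {"x": 0, "y": 1, "z": 1}
toy_M = [[0, 1], [3, 0]]
root_costs = sankoff(toy_tree, toy_states, toy_M)
print(root_costs, "parsimony score:", min(root_costs))
```

The parsimony score is the minimum over the root costs. In your write-up you are still expected to show the per-node calculations by hand.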
 Apply the brute force algorithm to compute
the probability of a site pattern to the following
input:
 CFN Model tree T has topology (a,(b,(c,d))).
The edge above the LCA of (c,d) has substitution
probability 0.4, and all other edges have substitution
probability 0.01.
 Character states: a=0, b=0, c=1, d=1.
(a) Compute the probability of this site pattern on this tree.
Show all your work.
(b) Is there another site pattern that has higher probability?
If so, find it; otherwise, explain why you think
there is no such site pattern.
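The brute force approach sums, over every assignment of states to the internal nodes, the probability of that full assignment. A minimal Python sketch follows. It uses a made-up three-leaf tree rather than the tree above, and it assumes the usual CFN setup: a uniform state distribution at the root, and each edge flipping the state with its substitution probability. The function name, the edge-list encoding, and the node names `r` and `u` are assumptions for illustration.

```python
from itertools import product

def site_pattern_probability(edges, leaf_states, root):
    """Probability of the leaf states under a CFN model tree, by brute force.

    edges       : list of (parent, child, p), p = substitution probability on that edge
    leaf_states : dict mapping each leaf name to 0 or 1
    root        : name of the root node
    """
    nodes = {root} | {child for _, child, _ in edges}
    internal = sorted(nodes - set(leaf_states))
    total = 0.0
    # Enumerate every 0/1 labeling of the internal nodes (including the root).
    for labels in product((0, 1), repeat=len(internal)):
        state = dict(leaf_states)
        state.update(zip(internal, labels))
        prob = 0.5  # assumed uniform distribution on the root state
        for parent, child, p in edges:
            prob *= p if state[parent] != state[child] else 1 - p
        total += prob
    return total

# Toy model tree (x,(y,z)) with hypothetical substitution probabilities.
toy_edges = [("r", "x", 0.1), ("r", "u", 0.2), ("u", "y", 0.1), ("u", "z", 0.1)]
print(site_pattern_probability(toy_edges, {"x": 0, "y": 0, "z": 0}, "r"))
```

On this toy tree the pattern x=0, y=0, z=0 has probability approximately 0.301. A useful sanity check is that the probabilities of all possible site patterns sum to 1.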

Optional (won't count towards grade): Chapter 8, problem 18

Due Tuesday, February 13, 2018.

Homework 4.
Due Thursday, February 15, 2018.
 Read Chapters 9.1-9.5.

HW problems: Chapter 9, problems 1, 3, and 4

Homework 5.
Due Thursday, February 22, 2018.

Read Chapters 9.6-9.16 and Appendix C.

Read the following papers. For each paper, write a brief summary (2 paragraphs, maximum) and pose two questions or critiques. Each of the two can be a request for clarification, a critique of the paper (its methodology or conclusions), or a suggestion for follow-up research.
Pay close attention to the methodology used to evaluate the methods that are presented.
Be prepared to discuss the papers in class.

"Who watches the watchmen?", by Iantorno et al. (2014), DOI: 10.1007/978-1-62703-646-7_4

"Phylogeny-Aware Gap Placement Prevents Errors in Sequence Alignment and Evolutionary Analysis", by Löytynoja and Goldman (2008), DOI: 10.1126/science.1158395

"Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega", by Sievers et al. (2011), DOI: 10.1038/msb.2011.75

"BAli-Phy: simultaneous Bayesian inference of alignment and phylogeny",
by Suchard and Redelings (2006), DOI: 10.1093/bioinformatics/btl175

"M-Coffee: combining multiple sequence alignment methods with T-Coffee", by Wallace et al. (2006), DOI: 10.1093/nar/gkl091

"Split-Inducing Indels in Phylogenomic Analysis", by Donath and Stadler (2011)
(PDF)
(retrieved from Semantic Scholar)

HW problems:
Chapter 9, problems 11 and 13.

Due Tuesday, February 27, 2018.

Read Chapters 6.1-6.2 and 10.1-10.4.

Homework 6.
Due Thursday, March 1, 2018.
 Read Chapter 10.5.
 Chapter 6, Review question 1.
 Chapter 6, Homework problems 4 and 8.
 Submit a 2-page critique of the paper by Donath
and Stadler (from the February 22, 2018 list of papers).
 Submit a 2-page critique of any other paper
from the February 22 list of papers.

Due Tuesday, March 6, 2018.

Due March 9 (Friday):

1-2 page proposal for your final project (noon, via Moodle).

Make an appointment to meet with me in person to discuss
your ideas
before you submit the proposal.

Meet with me again after I have given you feedback about
the proposal.

Homework 7.
Due Thursday, March 15, 2018.
 Chapters 7 and 10, all review questions.
 HW problems: Chapter 7, problem 2. Chapter 10, problems 1 and 2.
 Select two papers that are
closely related to your final project proposal, and
write a 3-5 page review of the papers where you discuss
what they agree on, what they differ on, and what they leave
open. Make sure to critique the two papers carefully.
Your review should have full citations and proper formatting, and
will be evaluated not only for content but also for writing
(i.e., correct all spelling, grammar, and punctuation).
 Due March 16 (Friday).

Approved final project proposal (by noon, via Moodle)

Homework 8.
Due March 22 (Thursday).

Read Chapters 5.1-5.11.

Do all review questions from Chapter 5.

Do problems 9, 12, 15, 16, and 18
from Chapter 5.
 Homework 9. Due May 1 (last day of class, late submission not allowed)
 For the clustering problem described below, develop an algorithm,
implement it, use it to analyze one or more datasets, and then
write up a report on the algorithm and what you observed on the data.
Please note that your code must be made available to Sarah Christensen so that she can test it on additional datasets. Hence, you are required to submit your report via Moodle and to send your commented code to Sarah Christensen by email.

The input will be a set of 1000 to 10,000 unaligned DNA sequences,
drawn from one of the simulated datasets (ROSE, Indelible, or RNASim)
studied in the SATé and PASTA papers.
Please see https://sites.google.com/eng.ucsd.edu/datasets/pastaupp for RNASim and Indelible, and
https://sites.google.com/eng.ucsd.edu/datasets/satei for ROSE datasets.

The output will be a collection of clusters (disjoint or non-disjoint,
depending on the purpose of the clustering).
Each cluster should have at least 100 sequences and at most
10% of the input sequences.
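A quick way to confirm that an output meets these size limits is a small check like the one below. This is only a sketch: the function name, its parameters, and the representation of clusters as lists of sequence identifiers are assumptions made here for illustration, not a required output format.

```python
def check_clusters(clusters, n_sequences, min_size=100, max_percent=10,
                   require_disjoint=False):
    """Return a list of human-readable problems (empty if the clustering is acceptable).

    clusters         : iterable of clusters, each an iterable of sequence identifiers
    n_sequences      : number of sequences in the input dataset
    require_disjoint : set to True when the clusters are meant for alignment
    """
    problems = []
    # Integer arithmetic avoids floating-point surprises at the boundary.
    max_size = n_sequences * max_percent // 100
    seen = set()
    for i, cluster in enumerate(clusters):
        members = set(cluster)
        if len(members) < min_size:
            problems.append(f"cluster {i} has {len(members)} < {min_size} sequences")
        if len(members) > max_size:
            problems.append(f"cluster {i} has {len(members)} > {max_size} sequences")
        if require_disjoint and seen & members:
            problems.append(f"cluster {i} overlaps an earlier cluster")
        seen |= members
    return problems
```

Note that for a 1000-sequence input the two limits coincide, so every cluster must contain exactly 100 sequences.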
 Constraints:
You are not allowed to compute a multiple sequence alignment
on the full dataset (although you are allowed to compute
alignments on small subsets, if you wish). You should
not use PASTA, SATé, or any other existing divide-and-conquer
approach in your algorithm.

You should run your code
on one replicate
of the ROSE 1000M1 datasets (from the SATé paper) and make
sure it runs on your own
laptop or in some other low memory environment.
You may also want to run your code on one replicate of the
Indelible 10,000M1 datasets (from the PASTA paper).
Report the running time and memory usage, and return the output of your code
(clustering of the dataset into subsets).
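If your implementation is in Python, one possible way to collect these numbers is sketched below using the standard library's `time.perf_counter` and `tracemalloc`. The wrapper name `measure` is an assumption introduced here. Note that `tracemalloc` only tracks memory allocated through Python, so memory used by external tools or some compiled extensions would need to be measured another way, and tracing itself slows the code down somewhat.

```python
import time
import tracemalloc

def measure(func, *args, **kwargs):
    """Run func(*args, **kwargs); return (result, seconds elapsed, peak traced bytes)."""
    tracemalloc.start()
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()  # returns (current, peak) in bytes
    tracemalloc.stop()
    return result, elapsed, peak

# Example with a stand-in workload; replace `sorted` with your clustering function.
result, seconds, peak_bytes = measure(sorted, [3, 1, 2])
print(f"{seconds:.6f} s, peak {peak_bytes / 1e6:.3f} MB")
```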

Note: there are two purposes for this clustering: multiple
sequence alignment (which needs disjoint clusters) and
tree estimation (which needs overlapping clusters).
So you might also want to follow through on your
algorithm design by seeing how well your clusters work for
a divide-and-conquer MSA estimation or a divide-and-conquer
tree estimation protocol.
But if you do this, it would be part of a final project,
not for this homework.