Comparative Sequence Analysis

Optional Bi188 Project

Two possible projects using FamilyRelations.

Due date: Monday, April 19th

The goal of these projects is to gain some understanding of how genomic DNA in general, and regulatory DNA in particular, has diverged between closely related species (such as mouse, human, and rat) and between more distantly related species (such as mouse and chick).

One effective way of finding regulatory regions in the genome is to use comparative sequence analysis to identify conserved non-coding regions near the gene or genes of interest. A computational toolset that can be used to do this is FamilyRelations/Cartwheel, a project under development here at Caltech.

There is a tutorial to help you learn how to use FamilyRelations and Cartwheel.

Using FR/Cartwheel, do one or both of the projects below. The Myf5 project is relatively easy; the Sox2 project involves more work. Note that your TA, Tracy, is away for the week of April 12th. Contact Titus Brown (titus@caltech.edu) with questions about this project or about any problems with FamilyRelations/Cartwheel.


Myf5

Myf5 is one of the earliest known myogenic regulators. Quite a bit is known about its transcriptional regulation; see e.g. this paper (PubMed ID: 11311165).

We have downloaded 200kb regions containing the myf5 gene from each of the human (Homo sapiens), rat (Rattus norvegicus), mouse (Mus musculus), and chick (Gallus gallus) genomes. Upload the regions to Cartwheel, find any match(es) to the mouse Myf5 protein, and use the pair view and triple view in FRII to answer some or all of the following questions:

  1. Are the regions found in the above paper conserved within the mammals?
  2. How about between chicken and mouse?
  3. How much of the DNA in the myf5 intergenic region (that is, in the region containing myf5 and its surrounding non-coding DNA, but no other genes) is conserved within the mammals?
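For question 3, one way to turn a set of conserved matches into a single number is to merge the matched intervals and divide the covered bases by the length of the region. The sketch below assumes you have already written down the conserved intervals by hand as (start, end) pairs; the function name and the interval format are illustrative assumptions, not something FRII or Cartwheel produces in this form.

```python
def conserved_fraction(intervals, region_length):
    """Return the fraction of a region covered by conserved intervals.

    intervals: iterable of (start, end) pairs, 0-based, end-exclusive.
    Overlapping or touching intervals are merged so no base is counted twice.
    """
    if region_length <= 0:
        raise ValueError("region_length must be positive")

    covered = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        # Clip each interval to the region boundaries.
        start, end = max(start, 0), min(end, region_length)
        if start >= end:
            continue
        if cur_end is None or start > cur_end:
            # Gap before this interval: close out the previous merged block.
            if cur_end is not None:
                covered += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlaps or touches the current block: extend it.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        covered += cur_end - cur_start
    return covered / region_length
```

Keep in mind that the answer depends on the threshold and window size used to call a match, so report those alongside the fraction.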

Sox2

Sox2 is an important early regulator of chick midline development. In this paper (PubMed ID: 12689590) on sox2 transcriptional regulation in chick, Kondoh's lab does an excellent job of first finding the regulatory regions that control sox2 expression, and then verifying that they are conserved.

Using the chick Sox2 protein sequence, extract the gene regions from the mouse, chick, rat, and human genomes (use e.g. Ensembl).

In addition to the questions raised above for the myf5 project, answer some or all of the following questions:

  1. Empirically, comparative sequence analysis works best when used on homologous genomic regions. Are the regions you picked out homologous? How do you know? Also, sox2 is a member of the sox gene family, which has many members in each of the above species; did you pick true orthologs or paralogs, and how do you know?
  2. Based on the results of a 3-way analysis done with human/mouse/rat, do you agree with the paper's statement that the regulatory regions it identifies are not clearly detected by comparative sequence analysis within the mammals?

A few tips for using FRII/Cartwheel:
  • seqcomp analyses will (intentionally) fail when done on 200kb or more of sequence. If you're unsure of what region(s) to compare, use blastn to identify a subregion surrounding the gene and then run seqcomp on that region.
  • seqcomp analyses of 100kb by 100kb may take 5-10 minutes per job. Cartwheel schedules jobs with a queueing system, so if you set up several jobs it may take an hour or two to finish all of them.
  • When doing seqcomp analyses on large sequences, start with a stringent threshold. I'd suggest a 70% threshold with a 50bp window. (seqcomp produces a lot of data. FRII should be able to handle it, but you will probably be bored well before it finishes drawing the results...)
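To make the "70% threshold with a 50bp window" setting concrete, here is a toy brute-force sketch of a fixed-width, ungapped window comparison. It is only an illustration of what such a threshold means, not seqcomp's actual implementation; the function name and parameters are assumptions, and it compares the forward strand only.

```python
def window_matches(seq_a, seq_b, window=50, threshold_pct=70):
    """Return (i, j, matches) for every window pair at or above the threshold.

    i and j are 0-based window start positions in seq_a and seq_b.
    Cost grows as len(seq_a) * len(seq_b) * window, so use short toy inputs.
    """
    seq_a, seq_b = seq_a.upper(), seq_b.upper()
    # Minimum number of identical positions, rounded up (35 for 70% of 50).
    needed = -(-threshold_pct * window // 100)

    hits = []
    for i in range(len(seq_a) - window + 1):
        win_a = seq_a[i:i + window]
        for j in range(len(seq_b) - window + 1):
            win_b = seq_b[j:j + window]
            matches = sum(x == y for x, y in zip(win_a, win_b))
            if matches >= needed:
                hits.append((i, j, matches))
    return hits
```

Every pair of windows is scored, which is why the amount of output (and the run time) grows so quickly with sequence length, and why a stringent threshold is a sensible starting point.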