Comparative Sequence Analysis & Motif Searching

Investigating the sox2 N-1c enhancer

The goal of this project is to give you a feel for the basic mechanics of comparative sequence analysis and position-weight matrices. It also illustrates some of the possibilities and difficulties of genomic searches.

One effective way of finding regulatory regions in the genome is to use comparative sequence analysis to identify conserved non-coding regions near the gene or genes of interest. A computational toolset that can be used to do this is FamilyRelations/Cartwheel, a project originally developed here at Caltech.
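
As a rough illustration of what a windowed pairwise comparison does, here is a minimal Python sketch. It is not the algorithm used by seqcomp, blastn, or FamilyRelations; the function name `windowed_matches`, its parameters, and the toy sequences are all made up for this example. It slides a fixed-size window along both sequences and reports every pair of windows whose identity meets a threshold.

```python
def windowed_matches(seq_a, seq_b, window=10, min_identity=0.9):
    """Return (pos_a, pos_b, identity) for every window pair at or above min_identity.

    Brute force, O(len_a * len_b * window); fine for toy inputs only.
    """
    hits = []
    for i in range(len(seq_a) - window + 1):
        win_a = seq_a[i:i + window]
        for j in range(len(seq_b) - window + 1):
            win_b = seq_b[j:j + window]
            same = sum(1 for x, y in zip(win_a, win_b) if x == y)
            identity = same / window
            if identity >= min_identity:
                hits.append((i, j, identity))
    return hits


# Toy sequences (invented): one shared 12-bp block inside unrelated flanks.
toy_a = "TTTTTTTT" + "ACGTGACCTGAT" + "GGGGGGGG"
toy_b = "CCCCCC" + "ACGTGACCTGAT" + "AAAAAAAAAA"

hits = windowed_matches(toy_a, toy_b, window=12, min_identity=1.0)
print(hits)  # [(8, 6, 1.0)]
```

Lowering `min_identity` or shrinking `window` makes the comparison more sensitive but also lets more chance matches through, which is the same trade-off you will see when adjusting thresholds in a real comparison.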

There is a tutorial to help you learn how to use FamilyRelations and Cartwheel, and you should also watch the video screencast on weight matrix searching with Cartwheel. I've written a review of computational techniques used to find regulatory elements; the paper is available online (PubMed ID: 18485306).

Using FR/Cartwheel, answer the questions below. Please feel free to contact Titus Brown for technical help.


Sox2 is an important early regulator of chick midline development. In this paper (PubMed ID: 12689590) on sox2 transcriptional regulation in chick, Kondoh's lab does an excellent job of first finding the regulatory regions that control sox2 expression, and then verifying that they are conserved. In this follow-up paper (PubMed ID: 16354715), Takemoto et al. continue that line of work to show the convergence of Wnt and FGF signals in the activation of the N-1c enhancer.

Using the sequences contained in this file, sox2-project-sequences.txt, do the following:

  1. Create a pairwise analysis group around the mouse and chick sox2 genomic sequences;

  2. Annotate the two genomic sequences with their respective gene (mRNA) sequences;

  3. Add a pairwise comparison with one or more of the following programs: blastn, seqcomp, LAGAN-VISTA, or blastz;

  4. Isolate and extract (copy/paste) the entire conserved region containing the N-1c element. Then add this sequence as a new sequence in the Cartwheel folder.

    (Tip: Use the FRII motif search (right-click menu) on the chick sequence to type in the first bit of the N-1c DNA sequence from Figure 2(A) of Takemoto et al.; this will help you locate the N-1c element.)

  5. Go into the Cartwheel folder 'motif search' section and add exact motifs matching each of the A, B, C, D, and E regions in Figure 2(A) in the Takemoto et al. paper. Search for those motifs in the entire genomic sequence as well as in the conserved sequence alone. Do you find anything? Try allowing one mismatch; how many extra motif matches do you find? Can you explain these results?

  6. Now build a Position-Weight Matrix out of the two sequences from the A and B regions, aligning them by hand. What threshold is the most specific in terms of minimizing matches outside of the conserved region?
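
To make the exact-versus-one-mismatch search in step 5 concrete, here is a minimal Python sketch of motif matching with a mismatch allowance. It is not Cartwheel's implementation: the function names, the choice to search both strands, and the toy motif and sequence are assumptions made for illustration, and none of the sequences come from the Takemoto et al. figure.

```python
def revcomp(seq):
    """Reverse complement of a DNA string (A, C, G, T, N only)."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[base] for base in reversed(seq.upper()))


def find_motif(seq, motif, max_mismatches=0):
    """Return (position, strand, mismatches) for every hit on either strand.

    Positions are 0-based and refer to the forward strand. A palindromic
    motif is reported once per strand.
    """
    seq, motif = seq.upper(), motif.upper()
    hits = []
    for strand, pattern in (("+", motif), ("-", revcomp(motif))):
        k = len(pattern)
        for i in range(len(seq) - k + 1):
            mismatches = sum(1 for a, b in zip(seq[i:i + k], pattern) if a != b)
            if mismatches <= max_mismatches:
                hits.append((i, strand, mismatches))
    return hits


# Toy data (invented): one forward copy, one reverse-complement copy,
# and one forward copy carrying a single substitution.
toy_seq = "GGCAAAGTTCTTTGAACATAGCC"
toy_motif = "CAAAG"

exact = find_motif(toy_seq, toy_motif)
relaxed = find_motif(toy_seq, toy_motif, max_mismatches=1)
print(exact)    # [(2, '+', 0), (9, '-', 0)]
print(relaxed)  # [(2, '+', 0), (16, '+', 1), (9, '-', 0)]
```

Even in this 23-bp toy sequence, allowing one mismatch adds a hit; comparing the two result lists over a full genomic sequence is a quick way to see how the number of matches grows.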

Specific questions:

  1. Is the N-1c enhancer sequence clearly discriminated from neighboring sequence by the comparative sequence analysis technique you used?

  2. Using any one of the five binding sites (A, B, C, D, and E from Figure 2(A) in the Takemoto et al. paper) as a simple motif in Cartwheel's motif search, are you able to pick out the N-1c enhancer from the chick genomic sequence? Does looking for all five sequences together change anything?

  3. Does allowing one mismatch in each site change the specificity or sensitivity of your motif searches at all? Explain why or why not.

  4. Can you use a Position-Weight Matrix approach to look for the shared binding site in A and B, and is there a threshold at which this approach is more specific (fewer false positives/other binding site matches) than simple motif matching?
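
For question 4, it may help to see how a position-weight matrix and its threshold interact. The following is a minimal Python sketch, not Cartwheel's scoring scheme: the log-odds formula with a pseudocount of 0.5 and a uniform background of 0.25, the function names, and the two toy sites are all assumptions chosen for illustration. It scans the forward strand only and expects sequences containing only A, C, G, and T.

```python
import math


def build_pwm(sites, pseudocount=0.5, background=0.25):
    """Build a log2-odds matrix from equal-length, hand-aligned sites."""
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    total = len(sites) + 4 * pseudocount
    pwm = []
    for pos in range(length):
        column = [site[pos].upper() for site in sites]
        pwm.append({
            base: math.log2((column.count(base) + pseudocount) / total / background)
            for base in "ACGT"
        })
    return pwm


def score(pwm, window):
    """Sum the per-position log-odds scores for one window."""
    return sum(col[base] for col, base in zip(pwm, window.upper()))


def scan(pwm, seq, threshold):
    """Return (position, score) for forward-strand windows scoring >= threshold."""
    n = len(pwm)
    hits = []
    for i in range(len(seq) - n + 1):
        s = score(pwm, seq[i:i + n])
        if s >= threshold:
            hits.append((i, round(s, 2)))
    return hits


# Toy data (invented): two aligned sites differing at one position,
# and a sequence holding both sites plus a one-substitution variant.
toy_sites = ["CTTTGAT", "CTTTGTT"]
toy_seq = "AACTTTGATGGCTTTGTTCCCTTAGATAA"

pwm = build_pwm(toy_sites)
strict = scan(pwm, toy_seq, threshold=8.0)
loose = scan(pwm, toy_seq, threshold=6.0)
print([pos for pos, _ in strict])  # [2, 11]
print([pos for pos, _ in loose])
```

With the strict threshold only the two windows identical to the training sites pass; the looser threshold also admits the variant at position 20. Sweeping the threshold and counting hits inside versus outside the conserved region is one way to approach the specificity question.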