Investigating the sox2 N-1c enhancer
The goal of this project is to give you a feel for the basic mechanics of
comparative sequence analysis, and an understanding of some of the
possibilities and difficulties of genomic searches.
One effective way of finding regulatory regions in the genome is to
use comparative sequence analysis to identify conserved non-coding
regions near the gene or genes of interest. A computational toolset that
can be used to do this is FamilyRelations/Cartwheel, a project under
development here at Caltech.
There is a tutorial to help you learn how to use FamilyRelations and
Cartwheel.
Using FamilyRelations (FR) and Cartwheel, answer the questions below.
Please feel free to contact Titus Brown (titus@caltech.edu) for
technical help.
Sox2
Sox2 is an important early regulator of chick midline development. In
this paper (PubMed ID: 12689590) on sox2 transcriptional regulation in
chick, Kondoh's lab does an excellent job of first finding the
regulatory regions that control sox2 expression, and then verifying
that they are conserved. In this follow-up paper (PubMed ID: 16354715),
Takemoto et al. continue that line of work, showing that Wnt and FGF
signals converge to activate the N-1c enhancer.
Using the sequences contained in this file, sox2-project-sequences.txt,
do the following:
- Create a pairwise analysis group around the mouse and chick sox2 genomic sequences.
- Annotate the two genomic sequences with their respective gene sequences.
- Add a pairwise comparison with one or more of the following
programs: blastn, seqcomp, LAGAN-VISTA, or blastz.
- Isolate and extract (copy/paste) the entire conserved region
containing the N-1c element. Then add this sequence as a new sequence
in the Cartwheel folder.
(Tip: In FRII, open the motif search (right-click menu) on the chick
sequence and enter the first few bases of the N-1c DNA sequence from
Figure 2(A) of Takemoto et al.; this will help you locate the N-1c
element.)
- Go into the 'motif search' section of the Cartwheel folder and add
exact motifs matching each of the A, B, C, D, and E regions in Figure
2(A) of the Takemoto et al. paper. Search for those motifs in the
entire genomic sequence as well as in the conserved sequence alone. Do
you find anything? Try allowing one mismatch; how many extra motif
matches do you find? Can you explain these results?
- Now build a position weight matrix (PWM) from the A and B sequences,
aligning them by hand. Which threshold is the most specific, in the
sense of minimizing matches outside the conserved region?
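To see what a mismatch-tolerant motif search does outside of Cartwheel, here is a minimal Python sketch. It is not Cartwheel's implementation; the function names `revcomp` and `find_motif` are made up for illustration, and the example sequence is a toy string, not the N-1c element. It slides the motif along the sequence, checks both strands, and counts mismatches at each position.

```python
def revcomp(seq):
    """Reverse-complement a DNA string (A, C, G, T, N only)."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[base] for base in reversed(seq.upper()))


def find_motif(seq, motif, max_mismatches=0):
    """Return sorted (start, strand, mismatches) tuples for every window
    of seq that differs from the motif at no more than max_mismatches
    positions. Starts are 0-based on the forward strand; '-' hits are
    windows that match the reverse complement of the motif."""
    seq = seq.upper()
    motif = motif.upper()
    hits = []
    for strand, pattern in (("+", motif), ("-", revcomp(motif))):
        for start in range(len(seq) - len(pattern) + 1):
            window = seq[start:start + len(pattern)]
            mismatches = sum(1 for a, b in zip(window, pattern) if a != b)
            if mismatches <= max_mismatches:
                hits.append((start, strand, mismatches))
    return sorted(hits)


# Toy data only: an invented 20-base sequence and 6-base motif.
toy_seq = "GGACGTGACCTTACCTGAGG"
toy_motif = "ACGTGA"
exact_hits = find_motif(toy_seq, toy_motif)
relaxed_hits = find_motif(toy_seq, toy_motif, max_mismatches=1)
print(exact_hits)
print(relaxed_hits)
```

Note that a motif equal to its own reverse complement would be reported once per strand at the same position. Comparing the number of hits with `max_mismatches=0` and `max_mismatches=1` on the full genomic sequence versus the conserved region alone is the same comparison the motif search step asks for.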
Specific questions:
- Is the N-1c enhancer sequence clearly discriminated from neighboring
sequence by the comparative sequence analysis technique you used?
- Taking any one of the five binding sites (A, B, C, D, and E from
Figure 2(A) in the Takemoto et al. paper) and using it as a simple
motif with Cartwheel's motif search, are you able to pick out the N-1c
enhancer from the chick genomic sequence? Does looking for all five
sequences together change anything?
- Does allowing one mismatch in each site change the specificity or
sensitivity of your motif searches at all? Explain why or why not.
- Can you use a position weight matrix (PWM) approach to look for
the shared binding site in A and B, and is there a threshold at which
this approach is more specific (fewer false positives or matches to
other binding sites) than simple motif matching?
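For the PWM question, the following Python sketch shows one common way to turn hand-aligned sites into a log-odds matrix and scan a sequence against a score threshold. It is an illustration under stated assumptions, not the method Cartwheel uses: the pseudocount of 1, the uniform 0.25 background, the forward-strand-only scan, and the names `build_pwm` and `scan` are all choices made here, and the two "sites" are invented placeholders rather than the A and B sequences from the paper.

```python
import math

BASES = "ACGT"


def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log2-odds PWM from equal-length, pre-aligned sites.
    Returns one dict per column mapping base -> score."""
    width = len(sites[0])
    if any(len(site) != width for site in sites):
        raise ValueError("all aligned sites must have the same length")
    pwm = []
    for col in range(width):
        counts = {base: pseudocount for base in BASES}
        for site in sites:
            counts[site[col].upper()] += 1
        total = sum(counts.values())
        pwm.append({base: math.log2(counts[base] / total / background)
                    for base in BASES})
    return pwm


def scan(seq, pwm, threshold):
    """Return (start, score) for each forward-strand window whose
    summed PWM score is at least threshold. Starts are 0-based."""
    seq = seq.upper()
    width = len(pwm)
    hits = []
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        if any(base not in BASES for base in window):
            continue  # skip windows containing N or other symbols
        score = sum(pwm[col][base] for col, base in enumerate(window))
        if score >= threshold:
            hits.append((start, round(score, 2)))
    return hits


# Placeholder sites and sequence, invented for this example.
toy_sites = ["ACGTGA", "ACCTGA"]
toy_pwm = build_pwm(toy_sites)
toy_hits = scan("GGACGTGACCTTACCTGAGG", toy_pwm, threshold=5.0)
print(toy_hits)
```

Raising the threshold toward the maximum possible score makes the scan behave like exact matching on the conserved columns, while lowering it admits more degenerate windows. Sweeping the threshold and counting hits inside and outside the conserved region is one way to approach the specificity question.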