Investigating the sox2 N-1c enhancer
The goal of this project is to give you a feel for the basic mechanics
of comparative sequence analysis and position-weight matrices. It
also illustrates some of the possibilities and difficulties of genomic
searches.
One effective way of finding regulatory regions in the genome is to
use comparative sequence analysis to identify conserved non-coding
regions near the gene or genes of interest. A computational toolset
that can be used to do this is FamilyRelations/Cartwheel, a project
originally developed here at Caltech.
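To make the idea of a windowed pairwise comparison concrete, here is a minimal sketch in Python. It compares every fixed-width window in one sequence against every window in the other and keeps the pairs above an identity threshold. This illustrates the general technique only; it is not Cartwheel's or seqcomp's actual code, and the function name, parameters, and toy sequences are all made up for the example.

```python
def window_matches(seq_a, seq_b, window=10, min_identity=0.9):
    """Return (i, j, identity) for every window pair meeting the threshold.

    i and j are 0-based start positions in seq_a and seq_b.
    """
    hits = []
    for i in range(len(seq_a) - window + 1):
        win_a = seq_a[i:i + window]
        for j in range(len(seq_b) - window + 1):
            win_b = seq_b[j:j + window]
            # Count positions where the two windows carry the same base.
            same = sum(1 for x, y in zip(win_a, win_b) if x == y)
            identity = same / window
            if identity >= min_identity:
                hits.append((i, j, identity))
    return hits


# Toy sequences (invented): one shared 10-bp block in unrelated flanks.
seq_a = "TTTTT" + "ACGTGCATGC" + "TTTTT"
seq_b = "GGG" + "ACGTGCATGC" + "GGGGGGG"
hits = window_matches(seq_a, seq_b, window=10, min_identity=1.0)
print(hits)
```

The all-against-all loop is quadratic in sequence length, which is fine for a toy but is one reason real tools use faster methods for genomic-scale input.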
There is a tutorial to help you learn how to use FamilyRelations and
Cartwheel, and you should also watch the video screencast on weight
matrix searching with Cartwheel. I've written a review of
computational techniques used to find regulatory elements; the paper
is available online (PubMed ID: 18485306).
Using FR/Cartwheel, answer the questions below. Please feel free to
contact Titus Brown (ctb@msu.edu) for technical help.
Sox2
Sox2 is an important early regulator of chick midline development. In
this paper (PubMed ID: 12689590) on sox2 transcriptional regulation in
chick, Kondoh's lab does an excellent job of first finding the
regulatory regions that control sox2 expression, and then verifying
that they are conserved. In this follow-up paper (PubMed ID:
16354715), Takemoto et al. continue that line of work to show the
convergence of Wnt and FGF signals in the activation of the N-1c
enhancer.
Using the sequences contained in this file, sox2-project-sequences.txt,
do the following:
- Create a pairwise analysis group around the mouse and chick sox2 genomic sequences;
- Annotate the two genomic sequences with their respective gene
(mRNA) sequences;
- Add a pairwise comparison with one or more of the following
programs: blastn, seqcomp, LAGAN-VISTA, or blastz.
- Isolate and extract (copy/paste) the entire conserved region
containing the N-1c element. Then add this sequence as a new sequence
in the Cartwheel folder.
(Tip: Use the FRII motif search (right-click menu) on the chick
sequence to type in the first bit of the N-1c DNA sequence from Figure
2(A) of Takemoto et al.; this will help you locate the N-1c element.)
- Go into the Cartwheel folder 'motif search' section and add exact
motifs matching each of the A, B, C, D, and E regions in Figure 2(A)
in the Takemoto et al. paper. Search for those motifs in the entire
genomic sequence as well as in the conserved sequence alone. Do you
find anything? Try allowing one mismatch; how many extra motif
matches do you find? Can you explain these results?
- Now build a Position-Weight Matrix out of the two sequences from
the A and B regions, aligning them by hand. What threshold is the
most specific in terms of minimizing matches outside of the
conserved region?
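The exact and one-mismatch motif searches in the steps above can be sketched in a few lines of Python. This is a simplified stand-in for what a motif search does, not Cartwheel's implementation. The motif and the toy sequence are placeholders, not the A-E sites from Takemoto et al.; substitute the real sites and your extracted sequences. The sketch checks both strands, which is an assumption about how you want to search.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def find_motif(seq, motif, max_mismatches=0):
    """Return (position, strand, mismatches) for each match on either strand.

    Positions are 0-based starts on the forward strand.
    """
    seq, motif = seq.upper(), motif.upper()
    k = len(motif)
    targets = [("+", motif)]
    rc = reverse_complement(motif)
    if rc != motif:  # avoid reporting palindromic motifs twice
        targets.append(("-", rc))
    hits = []
    for pos in range(len(seq) - k + 1):
        window = seq[pos:pos + k]
        for strand, target in targets:
            mismatches = sum(1 for x, y in zip(window, target) if x != y)
            if mismatches <= max_mismatches:
                hits.append((pos, strand, mismatches))
    return hits


# Placeholder data: one exact copy and one near-copy of a made-up motif.
toy = "GG" + "CTTTGAT" + "GG" + "CTTAGAT" + "GG"
exact = find_motif(toy, "CTTTGAT")
relaxed = find_motif(toy, "CTTTGAT", max_mismatches=1)
print(exact)
print(relaxed)
```

Here the relaxed search also picks up the near-copy at position 11. On a real genomic sequence, allowing a mismatch in a short motif typically adds many more matches, which is the effect the step above asks you to explain.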
Specific questions:
- Is the N-1c enhancer sequence clearly discriminated from neighboring
sequence by the comparative sequence analysis technique you used?
- Taking any one of the five binding sites (A, B, C, D, and E from
Figure 2(A) in the Takemoto et al. paper) as a simple motif in
Cartwheel's motif search, are you able to pick out the N-1c enhancer
from the chick genomic sequence? Does searching for all five sites
together change anything?
- Does allowing one mismatch in each site change the specificity or
sensitivity of your motif searches at all? Explain why or why not.
- Can you use a Position-Weight Matrix approach to look for
the shared binding site in A and B, and is there a threshold at which
this approach is more specific (fewer false positives/other binding site
matches) than simple motif matching?
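For the Position-Weight Matrix question, the following Python sketch shows one common way to build a log-odds matrix from hand-aligned sites and scan a sequence at different thresholds. It is an illustration under stated assumptions, not Cartwheel's scoring scheme: the pseudocount of 0.5, the uniform background of 0.25, and the two aligned sites are all placeholders. Replace the sites with your own alignment of the A and B regions.

```python
import math


def build_pwm(sites, pseudocount=0.5, background=0.25):
    """Build a log-odds (base 2) matrix from equal-length aligned sites."""
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("aligned sites must all have the same length")
    pwm = []
    for col in range(length):
        counts = {base: pseudocount for base in "ACGT"}
        for site in sites:
            counts[site[col]] += 1
        total = sum(counts.values())
        pwm.append({base: math.log2((counts[base] / total) / background)
                    for base in "ACGT"})
    return pwm


def scan(seq, pwm, threshold):
    """Return (position, score) for forward-strand windows at or above threshold."""
    k = len(pwm)
    hits = []
    for pos in range(len(seq) - k + 1):
        window = seq[pos:pos + k]
        # Unknown characters (e.g. N) get the column's lowest score.
        score = sum(col.get(base, min(col.values()))
                    for col, base in zip(pwm, window))
        if score >= threshold:
            hits.append((pos, score))
    return hits


# Placeholder alignment of two made-up sites differing at the last position.
pwm = build_pwm(["CTTTGAT", "CTTTGAA"])
max_score = sum(max(col.values()) for col in pwm)

toy = "GG" + "CTTTGAT" + "GG" + "CTTAGAT" + "GG"
for threshold in (0.0, max_score / 2, max_score - 1e-9):
    print(round(threshold, 2), len(scan(toy, pwm, threshold)))
```

Raising the threshold can only remove matches, so sweeping it as in the loop above, once on the whole genomic sequence and once on the conserved region, is one way to look for the cutoff that keeps the sites inside the conserved region while dropping those outside it.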