FamilyRelations II: A Tutorial

(Comparative Sequence Analysis Made Easy, take II)

C. Titus Brown, titus@caltech.edu
Davidson Laboratory of Embryonic Gene Expression
California Institute of Technology

Last revised April 2004


Introduction

This is a tutorial for people who would like to use FamilyRelations II to compare two or three genomic sequences. It will lead you through several steps:
  • checking to make sure you can run FamilyRelations II [1];
  • creating an analysis on the Cartwheel Web server [2];
  • and viewing the analysis in FamilyRelations II [3].

But first, I should explain what's what!

FamilyRelations II (henceforth, "FR II") is a program which runs on Windows (and other computers) and lets you visually explore the results of various analyses; it's what you use once you've set up an analysis and want to see the results. (FR II is the successor to the original FamilyRelations; see the family.caltech.edu Web site for more information on that.)

Cartwheel is a Web site that lets you set up analyses. It acts as an interface between you and the various programs that are available for doing analyses, like BLAST and seqcomp, and keeps track of all of the information there; you can use FR II from any computer with an Internet connection and you'll always see the same set of analyses. Plus, we do the backups for you ;).

Note that we run a public Cartwheel site that you'll use in this tutorial.


Running FamilyRelations II

In order to run FR II, you'll need to download it for your computer. It's freely available, and runs on Windows, Mac OS X, and most UNIX computers. Go here to download it, and then unpack it like you would any other downloaded software.

If you double-click on the FRII.exe application (on Windows) or on the FRII application (on Mac OS X), it should pop up a screen asking for a username and password. If you get this screen (shown below), congratulations -- it worked!

[Screenshot: login window]

If something went awry, you may need to find a local computer geek to go over things with you. Please don't contact me without first finding someone local, unless I am your local computer geek, in which case you should remember that I like beer. And chocolate.


Creating an Analysis Using Cartwheel

Now that you've done the hard part, you want to relax by actually looking at some data!

I'd like to start you off with some sequences that I know are interesting; please download el.txt and br.txt. These are orthologous chunks of sequence from the mec-3 regions of C. elegans and C. briggsae, respectively, that were kindly donated by Erich Schwarz. Make sure you save them somewhere where you can find them again!

Now, go to the Web page cartwheel.caltech.edu/ and then enter the public Cartwheel server site.

You are now at "the analysis site". On the front page of the analysis site, there's a button that says "create new account"; select it, fill out the form, and hit submit. Assuming no errors, go ahead and log in for the first time; you should now be presented with a list of labs. Pick "Public Space for Testing", unless your lab actually works on nematode mechanoreceptors.

The first thing you want to do is to create an analysis folder. Analysis folders contain related sequences and analyses within a lab; in general, for each new set of orthologous sequences, you'll want to create a new analysis folder, or "project". (There's no limit on the number of projects an individual lab can have.) There are also directory folders, which let you group your analysis folders (and other directory folders) under an overall project.

To create a project, find the box that says "New folder name" and enter something -- I'll call the project "wallawalla", which (hopefully) will distinguish it from a real project in your lab! Then click 'add folder'. You should see it appear on the list of folders in your lab. Click on the folder name, and you should now see a new screen -- the "analysis folder" page.

Within each project, you can do a bunch of different things. We're going to focus on getting started, which in this case means (a) uploading some sequences and (b) creating some analyses.

To upload some sequences, select the 'manage sequences' link. You should now be on a page with an empty list; hit "upload a FASTA file". Use the "Browse..." button to find the el.txt file that you downloaded at the start of this section. Now hit the 'upload' button. Note that Cartwheel will do its best to figure out what type of sequence it is (DNA or protein) unless you tell it what type it is.
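If you're curious what "figuring out the sequence type" might involve, here is a minimal sketch in Python. It is not Cartwheel's actual code: the function names and the 90% cutoff are my own assumptions, chosen only to illustrate the idea of reading a FASTA file and guessing DNA versus protein from the letters it contains.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a list of (name, sequence) pairs."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A '>' line starts a new record; store the previous one first.
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:].strip(), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def guess_sequence_type(sequence, dna_fraction=0.9):
    """Guess 'DNA' or 'protein' from letter composition (a heuristic only).

    The 0.9 cutoff is an illustrative assumption, not a Cartwheel setting.
    """
    letters = [c for c in sequence.upper() if c.isalpha()]
    if not letters:
        raise ValueError("empty sequence")
    dna_like = sum(1 for c in letters if c in "ACGTN")
    return "DNA" if dna_like / len(letters) >= dna_fraction else "protein"
```

A heuristic like this can be fooled by short or unusual sequences, which is a good reason to tell Cartwheel the type yourself when you know it.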

You should see a log reporting that Cartwheel found and added one sequence. Return to the sequence list, and there it is: 'C. e. mec-3 promoter'.

Now add br.txt as well.

On your sequence list you should now have two sequences, each about 350 bp long, named "C. e. mec-3 promoter" and "C. b. mec-3 promoter". (If you don't, you skipped a step or something didn't work and you should have seen an error message...)

Now return to folder view. Next, we want to set up a group of analyses; select 'create an analysis group'. We're going to set up a pairwise analysis, so select 'Pairwise analysis of two sequences' from the list, and enter a name. ("Tutorial pairwise" is a simple suggestion here.)

At this point, you can choose the sequences you want to compare. Pick the C. elegans sequence for the top, and the C. briggsae sequence for the bottom. Then hit 'set sequences'.

You are now in the analysis group view. This is the place where you can add annotations to either of the individual sequences -- such as cDNA locations, or motif searches -- and, most importantly of all, you can add pairwise comparisons.

In fact, let's do that. Hit 'add pairwise comparison'.

You now have two choices: you can either create a seqcomp analysis, or do a pairwise BLAST. A pairwise BLAST is the same as the two-sequence BLAST on the NCBI Web site; seqcomp is a fixed-window comparison algorithm like those used to create dot plots. Let's start by setting up a seqcomp comparison.

Choose 'seqcomp', and hit 'create analysis'. Name the analysis 'seqcomp 20bp' (or whatever you want!).

Hit 'set name and continue'.

You can now set parameters for the analysis; in this case, the default parameters are probably fine. (See the Parameters section, below, to see what the seqcomp parameters are for.) So, hit 'submit to queue'.

At this point you should be back at the analysis group page, with one analysis -- the seqcomp -- on the list. The status should be 'running', although if you hit reload it will probably change to 'completed' -- such small analyses are usually done quickly!

Now go back and set up a pairwise BLAST analysis by following the same steps as for the seqcomp analysis, but selecting BLAST. The only option on the BLAST page is analysis type -- you can pick either 'blastn' or 'tblastx'. Pick 'blastn'.

Now you should be back at the pairwise analysis group page; let's set up a single-sequence analysis. This is what you'll want to do if you have cDNA or protein sequence for coding regions that are present in your sequences, or if you want to put the matches to various databases on your sequence when it's displayed; for now, though, I just want to show you the basic principles, and so we're going to do something boring (but informative!): we're going to put BLAST features from the C. briggsae sequence on the C. elegans sequence.

On the top of the analysis group page, there should be links that say "edit sequence analyses" next to both of your sequences. Pick the link next to the C. elegans sequence.

Hit 'add analysis', and then pick 'Add BLAST matches from another sequence'. (You might want to look down the menu options here -- there are a lot of useful things, ranging from adding motif searches to searching against databases.)

Now set the sequence to run against the C. elegans promoter to 'C. b. mec-3 promoter'; leave the program at 'blastn'. Also remember to put in a name ('blastn single'). Submit the analysis to the queue, and return to the single sequence analysis list.

If you've followed instructions faithfully, you should see one single-sequence analysis in the list. When you click 'done' here, you will go back to the pair analysis group, and you should see two pairwise comparisons: a seqcomp, and a BLAST.

Now you're ready to take a look at the results of all your hard work!

Viewing an Analysis using FamilyRelations II

Start out by running FR II.

Log in to the Cartwheel server.

Go into your lab space (e.g. "Public Space for Testing"), find your analysis, and then double-click on it.

The main screen: dot plot & pairview

The first view you should see is the dot plot view. In this view, the coordinates of the first sequence in your analysis are displayed on the top, and the coordinates of the second sequence are displayed on your left. Points of similarity are graphed in the 2-D plot below and to the right of the sequences, respectively.

You can also switch to a separate view, called the "pair view", which presents the same data as the dot plot in a different format. Use the tabs at the upper left of the screen to switch between the views.

There are a variety of things you can do on this screen.

  • On either of the sequences, or in the square in the middle of the dotplot, you can left-click and drag to select a region of the sequences.

    Left-click within the selected region to zoom in to that region; SHIFT-left-click will zoom back out.

  • Right-click or CTRL-click will bring up a menu; from the menu you can zoom/unzoom, copy the selected sequence into the paste buffer, and bring up a motif search window (described below).

  • On the right side of the window, there are control widgets for the various analyses you set up. On your 'seqcomp 20bp' analysis, you can increase the threshold at which matches are shown, as well as zoom to a closeup view of the sequence (described below); for BLAST matches, you can increase the expectation threshold. For all analyses you can turn off the display of features, as well as change the colors.

    Try moving the seqcomp threshold slider up to 80%; you should see some features disappear. For a 20bp window, this corresponds to requiring that 16 of 20 bp match before a match is displayed on the screen.

  • The controls for the pairwise analyses are shown in different sections than the controls for the single sequence analyses; use the tabs in the control window to switch over to the top sequence, and watch how the BLAST matches displayed in red on the top sequence go away as you increase the stringency.
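The arithmetic behind the seqcomp threshold slider can be sketched in a few lines of Python. The function name is made up for illustration, and rounding up when the percentage doesn't divide evenly is my assumption; the tutorial only gives cases that come out even.

```python
def min_matching_bases(windowsize, threshold_percent):
    """Smallest number of matching bases that meets a percentage threshold.

    Integer arithmetic avoids floating-point surprises; the result is
    rounded up (an assumption for thresholds that do not divide evenly).
    """
    return -(-windowsize * threshold_percent // 100)


print(min_matching_bases(20, 80))  # 16 of 20 bp must match
```

So an 80% threshold on a 20bp window asks for 16 matching bases, as described above.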

Select & zoom to sequence

To actually look at some of the sequence matches, select an interesting patch of matches and click "View closeup" in the Comparisons tab on the right of the screen.

You should see a zoomed view in a new window within which you can slide around and view the actual matches.

Motif search

Another useful feature is the ability to search for motifs on the sequence. To open the motif search view, pull up the menu by right-clicking on a sequence & selecting "search for motifs..."

The motif search window allows you to search for up to five motifs. It lets you search for IUPAC-style motifs (e.g. "WGATAR") with up to five mismatches allowed anywhere in the sequence.
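To make "IUPAC-style motif with mismatches" concrete, here is a rough Python sketch of such a search. It is not the code FR II uses; the function name is hypothetical, and it only scans the forward strand. The letter table is the standard IUPAC nucleotide ambiguity code (W means A or T, R means A or G, and so on).

```python
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def find_motif(sequence, motif, max_mismatches=0):
    """Return (start, mismatches) for each site that matches the motif.

    A site matches if at most max_mismatches positions disagree with the
    IUPAC code at that position. Forward strand only.
    """
    sequence, motif = sequence.upper(), motif.upper()
    hits = []
    for start in range(len(sequence) - len(motif) + 1):
        mismatches = 0
        for offset, code in enumerate(motif):
            if sequence[start + offset] not in IUPAC[code]:
                mismatches += 1
                if mismatches > max_mismatches:
                    break  # too many mismatches; give up on this site
        else:
            # The inner loop finished without a break, so the site passes.
            hits.append((start, mismatches))
    return hits
```

For example, find_motif("CCAGATAAGG", "WGATAR") finds the site AGATAA starting at position 2 (counting from zero) with no mismatches.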

Triple view

It is also possible to set up a three-way analysis on Cartwheel; this does three analyses (sequence A vs B, B vs C, and A vs C). When loaded into FR II, the analyses can be filtered to show only matches that transitively match between all three sequences.

If you'd like to try this out, you can upload re.txt (the mec-3 region from yet another worm, C. remanei) to Cartwheel and set up a new analysis group; use the three-way analysis group, and set the C. e. sequence as the top sequence, C. b. sequence as the middle sequence, and the C. r. sequence as the bottom sequence. Because of the way filtering works, there is only one windowsize/threshold that can be set for all three comparisons; leave them at 20bp/70% threshold for this analysis.

Now load the analysis into FR II.

In this analysis, you can view the three-way analysis, both filtered and unfiltered, as well as each of the three pairwise analyses. The only new trick here is that when you're viewing the filtered three-way, changing any of the seqcomp threshold sliders affects the entire three-way.
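One way to picture the transitive filter is the Python sketch below. This is my own reading of "matches that transitively match between all three sequences", not FR II's actual code: I assume each pairwise comparison is a set of (start, start) window positions, and I keep a match only when all three comparisons agree on it.

```python
def transitive_matches(ab, bc, ac):
    """Keep only matches supported by all three pairwise comparisons.

    Each argument is a set of (start_1, start_2) window start pairs.
    Returns the set of (a, b, c) triples for which A-B, B-C and A-C
    all report a match.
    """
    # Index the B-C matches by their position on sequence B.
    c_by_b = {}
    for b, c in bc:
        c_by_b.setdefault(b, set()).add(c)

    triples = set()
    for a, b in ab:
        for c in c_by_b.get(b, ()):
            if (a, c) in ac:
                triples.add((a, b, c))
    return triples
```

In this picture, raising the threshold on any one comparison shrinks its set of matches, which in turn can only remove triples. That is consistent with any of the sliders affecting the entire filtered three-way.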


The End

That's all, folks -- it's not very complicated, is it?

A few notes:

  • if you run into problems, or have any questions, please e-mail me. Odds are that if something breaks, it's my fault, not yours -- and it's certainly my problem!
  • if you have any requests or bright ideas for future features, let me know.
  • there are several other comparative sequence programs out there -- two of the best known are Vista and PipMaker. I encourage you to try them out, as well!

Appendix: Parameters for seqcomp

The seqcomp program does a dot-plot style analysis in which every window of a fixed size on one sequence is compared to every window of the same size on the other sequence. The number of matching base pairs is counted and all windows that have at least a given threshold number of matching base pairs are recorded for display by FR II.

(This is quite different from what a global alignment program such as CLUSTALW does, which is try to uniquely match each base pair on one sequence to a unique base pair on the other sequence. It's somewhat similar to what a local alignment program such as BLAST does, but BLAST allows for gapped matches and also prioritizes regions based on how well the whole region matches; with seqcomp, all matches are recorded as long as they're above the threshold you set.)
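The fixed-window comparison described above can be sketched in Python as follows. This is a deliberately naive illustration, not the real seqcomp: the function name and defaults are mine, it only looks at the forward strand, and it makes no attempt to be fast.

```python
def window_matches(seq_a, seq_b, windowsize=20, threshold=14):
    """Compare every window of seq_a with every window of seq_b.

    Returns (start_a, start_b, matching_bases) for each pair of windows
    with at least `threshold` identical positions. No gaps are allowed:
    position k of one window is only ever compared to position k of the
    other.
    """
    seq_a, seq_b = seq_a.upper(), seq_b.upper()
    hits = []
    for i in range(len(seq_a) - windowsize + 1):
        window_a = seq_a[i:i + windowsize]
        for j in range(len(seq_b) - windowsize + 1):
            window_b = seq_b[j:j + windowsize]
            matching = sum(1 for x, y in zip(window_a, window_b) if x == y)
            if matching >= threshold:
                hits.append((i, j, matching))
    return hits
```

Each hit is one dot on the dot plot: start_a gives the position on one sequence and start_b the position on the other. Every window pair above the threshold is kept, with no ranking of regions.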

What windowsize and threshold you use is up to you. A few notes worth considering:

  1. the threshold you set is linked to the windowsize you use. A 70% threshold with a 20bp window is very different than a 70% threshold with a 50bp window: in the first case, you're asking for any windows of 20bp with 14 or more bp in common to be displayed; in the second case you're asking for any windows of 50 bp with 35 or more bp in common. The second case is considerably more stringent.
  2. at Caltech, we routinely use 20bp windows with 70% or higher starting thresholds, or 50bp windows with 50% or higher starting thresholds.
  3. don't go below a windowsize of 12 for any sequence larger than a kb or so. The statistics of pairwise matching mean that any matches between windows of that size are statistically meaningless.
  4. the amount of random background seen increases exponentially (like a bacterial population) as you decrease the threshold. It also increases like the product of the lengths of the sequences -- so you'll see 100 times as much random background with 100k x 100k sequences as you will with 10k by 10k sequences.
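For a rough feel for notes 1 and 4, here is a back-of-the-envelope Python sketch. It is not the calculation from my unpublished article; it assumes random DNA in which each position matches with probability 0.25, and it treats overlapping windows as if they were independent, which they are not. The function names are invented for the example.

```python
from math import comb


def window_match_probability(windowsize, min_matches, p_match=0.25):
    """Chance that at least min_matches of windowsize positions agree.

    Binomial tail, assuming independent positions that each match with
    probability p_match (0.25 for uniformly random DNA).
    """
    return sum(
        comb(windowsize, k) * p_match**k * (1 - p_match)**(windowsize - k)
        for k in range(min_matches, windowsize + 1)
    )


def expected_background(len_a, len_b, windowsize, min_matches):
    """Rough expected number of chance window matches between two sequences."""
    n_pairs = (len_a - windowsize + 1) * (len_b - windowsize + 1)
    return n_pairs * window_match_probability(windowsize, min_matches)


# Note 1: 70% of 20bp (14 matches) versus 70% of 50bp (35 matches).
print(window_match_probability(20, 14), window_match_probability(50, 35))

# Note 4: 100k x 100k versus 10k x 10k at the same windowsize/threshold.
print(expected_background(100000, 100000, 20, 14)
      / expected_background(10000, 10000, 20, 14))
```

Under these assumptions the second ratio comes out close to 100, and lowering min_matches by even one base raises the per-window probability noticeably.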

The thresholds in the table below are those at which you have a 5% chance of seeing a match, in random DNA of the given length, to a window of the given windowsize. There is no point in going below these thresholds when setting up a comparison, and this is enforced in the Cartwheel server.

In the table below, windowsizes are on the left, and sequence sizes are along the top.

windowsize     1000    10000   100000  1000000
        10      80%      90%     100%     100%
        15      73%      80%      86%      86%
        20      65%      70%      75%      80%
        25      60%      68%      72%      76%
        30      56%      63%      66%      70%
        35      54%      60%      62%      68%
        40      52%      57%      60%      65%
        45      51%      55%      60%      62%
        50      50%      54%      57%      60%
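If you want to check a windowsize/threshold combination against the table before setting up a comparison, a small lookup like the Python sketch below would do it. The numbers are copied from the table; everything else is an assumption for illustration. In particular, I don't describe here how Cartwheel treats sequence lengths that fall between columns, so this sketch simply rounds up to the next listed size, which is the more cautious choice.

```python
from bisect import bisect_left

SEQUENCE_SIZES = [1000, 10000, 100000, 1000000]

# windowsize -> minimum threshold (%) for each entry in SEQUENCE_SIZES
MIN_THRESHOLDS = {
    10: [80, 90, 100, 100],
    15: [73, 80, 86, 86],
    20: [65, 70, 75, 80],
    25: [60, 68, 72, 76],
    30: [56, 63, 66, 70],
    35: [54, 60, 62, 68],
    40: [52, 57, 60, 65],
    45: [51, 55, 60, 62],
    50: [50, 54, 57, 60],
}


def minimum_threshold(windowsize, sequence_length):
    """Look up the lowest sensible threshold (%) from the table.

    Sequence lengths between columns are rounded up to the next listed
    size (an assumption made for this sketch).
    """
    if windowsize not in MIN_THRESHOLDS:
        raise KeyError(f"windowsize {windowsize} is not in the table")
    column = bisect_left(SEQUENCE_SIZES, sequence_length)
    if column == len(SEQUENCE_SIZES):
        raise ValueError("sequence is longer than the table covers")
    return MIN_THRESHOLDS[windowsize][column]
```

For the roughly 350 bp tutorial sequences and a 20bp window, minimum_threshold(20, 350) gives 65.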

The mathematical details behind this are available in a short unpublished article I wrote, available here. Please send me e-mail if you have any additional questions.


Credits and Acknowledgements

Tristan De Buysscher wrote the seqcomp program; Dr. Barbara Wold suggested that I write this tutorial; and my advisor, Dr. Eric Davidson, was integral to the design and implementation of the FamilyRelations program.

Please send notices of error and omission to Titus Brown, titus@caltech.edu. He welcomes them.