Seqcomp Documentation

Seqcomp Program

Seqcomp is a small program that runs a simple yet exhaustive analysis of two genomic sequences.

Sequence Comparison Algorithm

The basis of the seqcomp algorithm is a comparison of a region, or window, of fixed size from one sequence against a window of the same size in the second sequence. Each base in the window from the first sequence is compared to the base at the same position in the window from the second sequence (e.g. 1st vs. 1st, 2nd vs. 2nd, etc.). Each base that matches scores one point; each mismatch scores zero.
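The per-window score described above can be sketched in a few lines of Python. This is only an illustration, not the seqcomp source: the function name window_score is hypothetical, and the sketch assumes bases are compared as exact characters (so case differences and ambiguity codes would count as mismatches).

```python
def window_score(window_a: str, window_b: str) -> int:
    """Count the positions at which two equal-size windows hold the same base."""
    if len(window_a) != len(window_b):
        raise ValueError("windows must be the same size")
    # One point per matching position, zero per mismatch.
    return sum(1 for a, b in zip(window_a, window_b) if a == b)


print(window_score("ACGT", "ACCT"))  # 3 (only the third base differs)
```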

Every window in the first sequence is compared to every window in the second sequence. The first window of the first sequence is compared against all windows in the second sequence; then the second window of the first sequence (shifted forward one base) is compared against all windows in the second sequence; and so on, until all possible pairs of windows have been compared.
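A minimal brute-force sketch of this all-against-all scan is shown here. The function name all_window_scores and the zero-based window indices are assumptions made for illustration; the actual program may organize the work differently.

```python
from typing import Iterator, Tuple


def all_window_scores(seq_a: str, seq_b: str, window: int) -> Iterator[Tuple[int, int, int]]:
    """Yield (index_in_a, index_in_b, score) for every pair of windows."""
    n_a = len(seq_a) - window + 1  # number of windows in the first sequence
    n_b = len(seq_b) - window + 1  # number of windows in the second sequence
    for i in range(n_a):
        win_a = seq_a[i:i + window]
        for j in range(n_b):
            win_b = seq_b[j:j + window]
            # One point per matching position.
            score = sum(x == y for x, y in zip(win_a, win_b))
            yield i, j, score


for i, j, score in all_window_scores("ACGT", "ACGA", 2):
    print(i, j, score)
```

Written this way, the cost grows with the product of the two window counts times the window size, which is what makes the comparison exhaustive.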

Algorithm Example

Take two sequences, A and B, and choose a window size, say 10 bp. Scan across all the windows in sequence A, comparing each window in A to all the windows in B.


A:
GGACGCCCCAGGACACGACTGCTTTCTTCACCACACCTCTGACAGGACAGGACAGGGAGGAGGGGTAGAG
B:
GGACAAATCAGGCCGGACAGGAGAGGGAGGGGTGGGGGACAGTGGGTGGGGATTCAGACTGCCAGCACTT



A1:
GGACGCCCCA
||||    ||     Score = 6
B1:
GGACAAATCA


A1:
GGACGCCCCA
|      |       Score = 2
B2:
GACAAATCAG


And so on, A1 vs. B3...Bn, where n = sequence length - window size + 1 is the number of windows in B.

And then A2 vs. B1...Bn

etc.
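The two scores worked out above can be reproduced with a short Python sketch. The helper names windows and score are hypothetical and are not taken from seqcomp itself; the sequences are the A and B shown in this example.

```python
seq_a = "GGACGCCCCAGGACACGACTGCTTTCTTCACCACACCTCTGACAGGACAGGACAGGGAGGAGGGGTAGAG"
seq_b = "GGACAAATCAGGCCGGACAGGAGAGGGAGGGGTGGGGGACAGTGGGTGGGGATTCAGACTGCCAGCACTT"
window = 10


def windows(seq, size):
    """Return every window of `size` bases, shifting forward one base at a time."""
    return [seq[i:i + size] for i in range(len(seq) - size + 1)]


def score(win_a, win_b):
    """One point for each position where the two windows agree."""
    return sum(x == y for x, y in zip(win_a, win_b))


a_windows = windows(seq_a, window)
b_windows = windows(seq_b, window)

print(len(b_windows))                     # 61 windows: 70 - 10 + 1
print(score(a_windows[0], b_windows[0]))  # A1 vs. B1: 6
print(score(a_windows[0], b_windows[1]))  # A1 vs. B2: 2
```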


Keep the best 10 matches for each window in A. If there is a tie for 10th place, keep all matches with that score.
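One way to express the tie rule is to find the 10th-best score and keep everything at or above it. The sketch below assumes the scores for a single window in A are held as (window index in B, score) pairs; the function name best_matches and that data layout are illustrative assumptions, not the program's actual internals.

```python
def best_matches(scored, keep=10):
    """Return the `keep` best (b_index, score) pairs, plus any ties for last place."""
    # sorted() is stable, so equal scores stay in their original order.
    ranked = sorted(scored, key=lambda item: item[1], reverse=True)
    if len(ranked) <= keep:
        return ranked
    cutoff = ranked[keep - 1][1]  # score of the match in last kept place
    return [item for item in ranked if item[1] >= cutoff]


# With keep=2, both matches scoring 3 survive because they tie for 2nd place.
print(best_matches([(0, 5), (1, 3), (2, 3), (3, 1)], keep=2))
# [(0, 5), (1, 3), (2, 3)]
```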



Command:

seqcomp human_mck_pro.fa mouse_mck_pro.fa 50 10000 25 1 HvM_ckm_small_pro.xml


Seven Arguments:

First Sequence             human_mck_pro.fa
Second Sequence            mouse_mck_pro.fa
Window Size                50
Max Sequence to read in    10000
Hard Threshold             25
Output Type                1
Output File                HvM_ckm_small_pro.xml

Tristan De Buysscher, tristanfamily.caltech.edu