1997-98 by
Input sequence:
The sequence that will be analyzed. Multiple sequences can be entered
at once with several sequence formats.
Format:
Input sequence format. Various standards
are supported.
Sequence type:
Input sequence type
Purge sequences (highly recommended)
When checked, large duplicated regions (alignments of >= 40 bp with
fewer than 3 mismatches) are filtered out before analysis. Purging is essential
for any motif discovery process, to avoid a bias due to
non-independence of sequences. Purging is performed with the programs
mkvtree and vmatch developed by Stefan Kurtz (kurtz@zbh.uni-hamburg.de).
Oligonucleotide size:
The analysis can be performed with oligonucleotides of any size between
1 and 8. Selecting size 1 amounts to counting residue usage
within the input sequences. For the detection of regulatory sites, we
recommend starting with an analysis of hexanucleotides (size=6), and
then scanning sizes between 4 and 8. When a pattern is significantly
overrepresented, it generally shows up in the analyses at several
sizes.
Count on: (single or both strands)
By selecting "both strands", the occurrences of each oligonucleotide
are summed on both strands. This makes it possible to detect elements which act
in an orientation-insensitive way (as is generally the case for yeast
upstream elements).
Group reverse complement together in the output
(only valid for two-strand analysis). This parameter does not affect the
counting itself, only the output format. If this option is NOT
checked, two separate lines are used to show a word and its reverse
complement. This is redundant but might be useful for compatibility
with other programs.
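The two options above can be illustrated with a minimal Python sketch. The function names and the convention of keying a grouped pair by its alphabetically first member are assumptions made for illustration, not the program's actual implementation. Reverse-palindromic words are counted once, in line with the double-strand treatment described further down.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(word):
    """Return the reverse complement of a DNA word."""
    return word.translate(COMPLEMENT)[::-1]

def count_oligos(seq, k, both_strands=True, group_rc=True):
    """Count oligonucleotides of size k (hypothetical helper).

    With both_strands, the count of each word is summed with that of its
    reverse complement. With group_rc, each pair is reported on one line.
    """
    direct = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    if not both_strands:
        return dict(direct)
    words = set(direct) | {reverse_complement(w) for w in direct}
    result = {}
    for word in words:
        rc = reverse_complement(word)
        # A reverse-palindromic word keeps its single-strand count.
        total = direct[word] if word == rc else direct[word] + direct[rc]
        # Assumed convention: a grouped pair is keyed by its first member
        # in alphabetical order.
        key = min(word, rc) if group_rc else word
        result[key] = total
    return result

print(count_oligos("GATAAG", 6, group_rc=False))
```

Without grouping, the word and its reverse complement each get their own line with the same summed count, which is the redundancy mentioned above.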
Prevent overlapping matches
Periodic patterns (e.g. AAAAAA, ATATAT) have an aggregative tendency,
i.e. each occurrence of such a pattern strongly favors additional
occurrences in its immediate vicinity. This introduces a bias to most
statistics (binomial, log-likelihood). A simple way to correct for
this bias is to count mutually overlapping occurrences only
once.
For example, TATATATATATA would represent
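The difference between the two counting modes can be sketched in Python as follows. This is only an illustration: the function name is hypothetical, the word TATATA is chosen arbitrarily, and the left-to-right scan that skips past each match is one simple way to forbid overlaps, not necessarily the one used by the program.

```python
def count_occurrences(seq, word, allow_overlap=True):
    """Count occurrences of word in seq, scanning from left to right."""
    count = 0
    pos = seq.find(word)
    while pos != -1:
        count += 1
        # Move one position forward to allow overlaps, or jump past the
        # whole match to forbid them.
        step = 1 if allow_overlap else len(word)
        pos = seq.find(word, pos + step)
    return count

seq = "TATATATATATA"
print(count_occurrences(seq, "TATATA", allow_overlap=True))   # 4
print(count_occurrences(seq, "TATATA", allow_overlap=False))  # 2
```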
Background model
Various probabilistic models can be used to estimate the expected
frequency of each oligonucleotide.
Attention! The results are dramatically affected by the choice of expected frequencies, which is the main distinguishing feature of this program. It has been shown that, for the detection of regulatory sites in yeast upstream sequences, the best choice is to estimate the expected oligonucleotide frequencies from the frequencies observed in the set of all upstream non-coding sequences of the genome. For the same purpose, choosing "equiprobable residues" would be totally inefficient, and "Residue frequencies from input sequence" works poorly.
Pre-calculated tables are used to estimate expected oligonucleotide frequencies (background frequencies). These tables were obtained by counting all oligonucleotide frequencies (from size 1 to 8) in different sequence types, separately for each organism.
For example, with a Markov chain of order 4:
P(GATAAC) = P(GATAA) * P(C|ATAA)
          = P(GATAA) * P(ATAAC) / P(ATAA)
Expected(GATAAC) = observed(GATAA) * observed(ATAAC) / observed(ATAA)
For words of size k, the highest possible order is k-2. A Markov order of 0 amounts to using the observed residue frequencies to calculate expected oligomer frequencies (no dependency between neighbouring residues).
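The Markov estimate above generalizes to any order between 0 and k-2. The following Python sketch assumes a hypothetical dictionary `freq` that maps oligonucleotides of the required sizes to their observed relative frequencies. It illustrates the formula and is not the program's own code.

```python
from math import prod

def markov_expected_freq(word, order, freq):
    """Expected frequency of word under a Markov chain of the given order.

    freq is an assumed input: a dict of observed relative frequencies for
    oligonucleotides of sizes order and order+1.
    """
    k = len(word)
    if not 0 <= order <= k - 2:
        raise ValueError("order must be between 0 and len(word) - 2")
    # Numerator: all windows of size order+1 sliding over the word.
    numerator = prod(freq[word[i:i + order + 1]] for i in range(k - order))
    if order == 0:
        # Independent residues: simple product of residue frequencies.
        return numerator
    # Denominator: the windows of size order shared by consecutive
    # numerator windows.
    denominator = prod(freq[word[i:i + order]] for i in range(1, k - order))
    return numerator / denominator

# Toy frequencies, invented for the example.
toy = {"GATAA": 0.002, "ATAAC": 0.001, "ATAA": 0.005}
print(markov_expected_freq("GATAAC", 4, toy))
```

With order 4 and the word GATAAC, the numerator windows are GATAA and ATAAC and the denominator window is ATAA, as in the worked example.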
The higher the Markov order, the more stringent the analysis: specificity is increased, but there is a loss of sensitivity, i.e. some relevant patterns might be overlooked. The optimal Markov order depends on the size of the sequence set. For small gene families (e.g. 10 sequences of 800 bp), taking an order > 1 would result in a loss of sensitivity. For sequence sets of 1 Mb, a Markov chain of order 3 is optimal for hexanucleotides.
The word GATAAG has five segmented pairs:
G & ATAAG
GA & TAAG
GAT & AAG
GATA & AG
GATAA & G
The expected frequency of each segmented pair is the product of the expected frequencies of its members. The expected word frequency is the maximum expected pair frequency.
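The segmentation rule can be written in a few lines of Python. The function names and the `exp_freq` dictionary (expected frequencies of all shorter oligonucleotides) are assumptions made for this illustration.

```python
def segmented_pairs(word):
    """Split a word into all (prefix, suffix) pairs of non-empty segments."""
    return [(word[:i], word[i:]) for i in range(1, len(word))]

def max_pair_expected_freq(word, exp_freq):
    """Maximum, over all segmented pairs, of the product of the expected
    frequencies of the two members (exp_freq is an assumed lookup table)."""
    return max(exp_freq[a] * exp_freq[b] for a, b in segmented_pairs(word))

print(segmented_pairs("GATAAG"))
```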
You can upload your own table of expected frequencies. This option can be useful if you are working with an organism which is not supported on the web server.
File format: The expected frequency file must be a tab-delimited text file, with one row per oligonucleotide. The first column contains the oligonucleotide, the second column the expected frequency. Oligonucleotides must be of the size selected for the analysis. Examples can be found in the Data folder.
How to generate an expected frequency file?
An expected frequency file can be generated with oligo-analysis
itself.
Pseudo-frequency for the background model
When the background frequencies are based on a small sequence set, there is a risk of observing in the test sequences some oligomers which were totally absent from the background sequences. This is a problem, since these words are then considered to have a probability of 0.
To circumvent this problem, a pseudo-frequency can be defined, which must be a number between 0 and 1. Expected frequencies are then corrected by a pseudo-frequency, which is the pseudo-weight divided by the number of possible patterns.
EXPECTED OCCURRENCES
Exp_occ = p * T = p * SUM[j=1..S] (Lj + 1 - k)
where p = probability of the pattern
Several models are supported for estimating the
prior probability (see options -a, -expfreq and
-bg).
S = number of sequences in the sequence set.
Lj = length of the jth regulatory region
k = length of oligomer
T = the number of possible matching positions.
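As a small worked illustration of the formula above, the following Python sketch computes T and Exp_occ from a list of sequence lengths. The function names are hypothetical, and ignoring sequences shorter than k is an assumption added to keep T non-negative.

```python
def possible_positions(seq_lengths, k):
    """T: total number of positions where an oligomer of length k can match."""
    # Assumption: a sequence shorter than k contributes no position.
    return sum(max(length + 1 - k, 0) for length in seq_lengths)

def expected_occurrences(p, seq_lengths, k):
    """Exp_occ = p * T."""
    return p * possible_positions(seq_lengths, k)

# Ten sequences of 800 bp, hexanucleotides, equiprobable words (toy values).
lengths = [800] * 10
print(possible_positions(lengths, 6))
print(expected_occurrences(1 / 4096, lengths, 6))
```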
PROBABILITY OF SEQUENCE MATCHING
The probability to find at least one occurrence of the pattern within
a single sequence is :
q = 1 - (1-p)^T
with the same abbreviations as above
EXPECTED NUMBER OF MATCHING SEQUENCES
In this counting mode, only the first occurrence in each
sequence is taken into consideration. We thus have to
calculate a probability of first occurrence.
Exp_ms = n (1 - (1 - p)^T)
with the same abbreviations as above
Correction for autocorrelation (from Mireille Regnier)
Exp_ms_corrected = n (1 - (1 - p/a)^T)
Where
a is the coefficient of autocorrelation
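The three formulas above translate directly into Python. In this sketch the function names are hypothetical, n is taken to be the number of sequences and T the number of possible positions within one sequence. Both readings are assumptions, since the text does not define n.

```python
def prob_sequence_match(p, T):
    """q = 1 - (1-p)^T: probability of at least one occurrence."""
    return 1.0 - (1.0 - p) ** T

def expected_matching_sequences(n, p, T, a=1.0):
    """Exp_ms = n (1 - (1 - p/a)^T).

    a is the coefficient of autocorrelation; a = 1 gives the uncorrected
    formula.
    """
    return n * (1.0 - (1.0 - p / a) ** T)

# Toy values: 10 sequences, 795 positions each, p = 1/4096.
print(prob_sequence_match(1 / 4096, 795))
print(expected_matching_sequences(10, 1 / 4096, 795))
```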
PROBABILITY OF THE OBSERVED NUMBER OF OCCURRENCES (BINOMIAL)
The probability to observe exactly obs occurrences in the whole family
of sequences is calculated by the binomial
P(obs) = bin(p,T,obs) = T! / (obs! * (T-obs)!) * p^obs * (1-p)^(T-obs)
where obs is the observed number of occurrences,
p is the expected frequency for the pattern,
T is the number of possible matching positions,
as defined above.
The probability to observe obs or more occurrences in the whole family
of sequences is calculated by the sum of binomials:
P(>=obs) = SUM[i=obs..T] P(i) = 1 - SUM[i=0..obs-1] P(i)
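The binomial probabilities above can be computed with the Python standard library alone. This is a minimal sketch with hypothetical function names. Working with logarithms (via lgamma) is a design choice made here to avoid overflow in the factorials when T is large, and the plain summation is only practical for moderate T.

```python
from math import exp, lgamma, log, log1p

def binomial_pmf(obs, T, p):
    """P(obs): probability of exactly obs occurrences among T positions."""
    if p <= 0.0:
        return 1.0 if obs == 0 else 0.0
    if p >= 1.0:
        return 1.0 if obs == T else 0.0
    # log of T! / (obs! * (T-obs)!)
    log_coef = lgamma(T + 1) - lgamma(obs + 1) - lgamma(T - obs + 1)
    return exp(log_coef + obs * log(p) + (T - obs) * log1p(-p))

def binomial_right_tail(obs, T, p):
    """P(>=obs): sum of P(i) for i = obs..T."""
    return sum(binomial_pmf(i, T, p) for i in range(obs, T + 1))

# Toy values: 7950 positions, p = 1/4096, 8 observed occurrences.
print(binomial_right_tail(8, 7950, 1 / 4096))
```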
OVER/UNDER-REPRESENTATION
By default, the program calculates the probability of having
at least obs occurrences:
P(>=obs) = SUM[i=obs..T] P(i)
With the option -under, the program calculates the
probability of having fewer than obs occurrences:
P(<obs) = SUM[i=0..obs-1] P(i)
The option -under does not affect the other statistics
(z-score, log-likelihood). For the z-score, negative
values can be used to assess word under-representation.
SPECIFIC TREATMENT FOR DOUBLE STRAND COUNTS
When occurrences are counted on both strands, each pattern is
grouped with its reverse complement.
For reverse-palindromic patterns, probabilities are calculated
on the basis of the single strand count, since the occurrence
on the reverse complement strand is completely dependent on
that on the direct strand.
A more biological justification for this is that, although the
word is found on both strands in a string representation of
the sequences, at the structural level, there is a single
binding site for the factor.
In contrast, for non-palindromic patterns, occurrences on
the direct and reverse complement strand represent distinct
binding sites. Thus,
Obs_occ(W|Wr) = Obs_occ(W) + Obs_occ(Wr)
Exp_freq(W|Wr) = Exp_freq(W) + Exp_freq(Wr)
where
W is a given word
Wr is the reverse complement of W
Probabilities are then calculated as above, on the basis of
the event W|Wr instead of simply W.
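The double-strand rule can be sketched as follows in Python. The function names and the two input dictionaries (single-strand observed counts and single-strand expected frequencies) are assumptions made for illustration.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(word):
    """Return the reverse complement of a DNA word."""
    return word.translate(COMPLEMENT)[::-1]

def double_strand_stats(word, obs_occ, exp_freq):
    """Return (Obs_occ, Exp_freq) for the event W|Wr.

    obs_occ and exp_freq are assumed to hold single-strand values.
    """
    rc = reverse_complement(word)
    if word == rc:
        # Reverse-palindromic word: keep the single-strand values.
        return obs_occ.get(word, 0), exp_freq[word]
    # Otherwise sum the values of the word and its reverse complement.
    observed = obs_occ.get(word, 0) + obs_occ.get(rc, 0)
    expected = exp_freq[word] + exp_freq[rc]
    return observed, expected

# Toy values, invented for the example.
obs = {"GATAAG": 5, "CTTATC": 3, "GAATTC": 4}
exp = {"GATAAG": 0.0002, "CTTATC": 0.0003, "GAATTC": 0.0001}
print(double_strand_stats("GATAAG", obs, exp))
print(double_strand_stats("GAATTC", obs, exp))
```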
E-VALUE
The probability of occurrence by itself is not fully
informative, because the threshold must be adapted depending
on the number of patterns considered. Indeed, a simple
hexanucleotide analysis amounts to testing 4096
hypotheses.
The E-value represents the expected number of patterns which
would be returned at random for a given P-value (probability).
E-value = NPO * P(>=obs)
where NPO is the number of possible oligomers of the
chosen length (e.g. 4096 for hexanucleotides).
Note that when searches are performed on both strands, NPO is
corrected for the fact that non-palindromic patterns are
grouped by pairs (for example, there are 2080 patterns when
hexanucleotides are counted on both strands).
SIGNIFICANCE INDEXES
The significance index is simply the negative base-10 logarithm
of the E-value:
Sig_occ = -log10(E-value)
This index is convenient to interpret: higher values
correspond to more exceptional patterns.
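The E-value and significance index can be illustrated with a short Python sketch. The function names are hypothetical. The two-strand count of possible patterns is obtained by pairing each non-palindromic word with its reverse complement, which reproduces the figure of 2080 quoted above for hexanucleotides.

```python
from math import log10

def n_possible_oligos(k, both_strands=False):
    """NPO: number of distinct patterns of length k."""
    total = 4 ** k
    if not both_strands:
        return total
    # Reverse-palindromic words only exist for even k; each is fully
    # determined by its first k/2 residues.
    palindromes = 4 ** (k // 2) if k % 2 == 0 else 0
    return (total - palindromes) // 2 + palindromes

def e_value(p_value, k, both_strands=False):
    """E-value = NPO * P(>=obs)."""
    return n_possible_oligos(k, both_strands) * p_value

def sig_index(p_value, k, both_strands=False):
    """Sig_occ = -log10(E-value)."""
    return -log10(e_value(p_value, k, both_strands))

print(n_possible_oligos(6), n_possible_oligos(6, both_strands=True))
print(sig_index(1e-8, 6, both_strands=True))
```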
OVERLAP COEFFICIENT
The overlap coefficient is calculated as follows
(after Pevzner et al. (1989). J. Biomol. Struct & Dynamics
5:1013-1026):
Kov = SUM[j=1..l] kj (1/4)^j
where l is the pattern length,
j is the overlap position, between 0 and l,
kj takes the value 1 if there is an overlap at pos j,
0 otherwise.
When counts are performed on both strands, overlaps between
the pattern and its reverse complement are also taken into account
in the same formula.
Z-SCORE
The Z-score is calculated in the following way
Zsc = (obs_occ - exp_occ)/sd_occ
= (obs_occ - exp_occ)/sqrt(var_occ)
where
obs_occ is the observed number of occurrences
exp_occ is the expected number of occurrences
sd_occ and var_occ
are the estimated standard deviation and variances
for the occurrences, respectively.
The estimation of the variance is derived from Pevzner et al. (1989),
J Biomol Struct & Dynamics 5:1013-1026:
var_occ = exp_occ(2*Kov - 1 - (2*w-1)*exp_occ)
In random sequences, Z-scores are normally distributed. The probability
of observing a given number of occurrences can thus be read from the
normal table in any statistics textbook.
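Instead of a printed normal table, the tail probability of a Z-score can be computed directly. The Python sketch below uses hypothetical function names and takes the variance as an input rather than computing it from the overlap coefficient.

```python
from math import erfc, sqrt

def z_score(obs_occ, exp_occ, var_occ):
    """Zsc = (obs_occ - exp_occ) / sqrt(var_occ)."""
    return (obs_occ - exp_occ) / sqrt(var_occ)

def upper_tail_probability(z):
    """Probability that a standard normal variable is >= z."""
    return 0.5 * erfc(z / sqrt(2.0))

# Toy values, invented for the example.
z = z_score(obs_occ=20, exp_occ=10, var_occ=25)
print(z, upper_tail_probability(z))
```

For under-represented words the Z-score is negative, and the lower-tail probability is `1 - upper_tail_probability(z)`.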
Advantages of the Z-score:
- Z-score corrects the bias due to self-overlapping of a word, which
often leads to overestimating the overrepresentation of such words
(e.g. AAAAAA, TATATA).
- its calculation is very fast. This is especially critical when analyzing
very big sequences (whole genomes), where the expected oligonucleotide
occurrences are very high (and binomial calculation very slow).
- Z-score provides a way to detect both over- and under-represented
patterns.
Disadvantages:
- the use of Z-score assumes that the sequences are infinite
Recommended thresholds:
=======================
strand  w  P(>=oc)  z-score
-------------------------------
1str    3  0.98437  2.155
1str    4  0.99609  2.66
1str    5  0.99902  3.095
1str    6  0.99976  3.49
1str    7  0.99994  3.83
1str    8  0.99998  4.1
2str    3
2str    4
2str    5
2str    6  0.99952  3.30
2str    7
2str    8
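The filled-in rows of this table appear consistent with a probability of 1 - 1/N, where N is the number of possible patterns (e.g. 64 for trinucleotides on one strand, 2080 for hexanucleotides on both strands), and with the matching quantile of the normal distribution. This reading is an inference from the numbers, not something stated in the text. The hypothetical helper below only shows how such values could be recomputed under that assumption.

```python
from statistics import NormalDist

def z_threshold(n_patterns):
    """Return (P, z) under the assumed rule P = 1 - 1/n_patterns."""
    p = 1.0 - 1.0 / n_patterns
    # Quantile of the standard normal distribution for probability p.
    return p, NormalDist().inv_cdf(p)

for strand, w, n_patterns in [("1str", 3, 64), ("1str", 6, 4096), ("2str", 6, 2080)]:
    p, z = z_threshold(n_patterns)
    print(f"{strand} {w} {p:.5f} {z:.2f}")
```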