RSAT: oligonucleotide analysis manual

Name

1997-98 by

Description

Options

Input sequence:
The sequence that will be analyzed. Multiple sequences can be entered at once with several sequence formats.

Format:
Input sequence format. Various standards are supported.

Sequence type:
Input sequence type

DNA (default)
Only A, C, G, and T residues are accepted. Oligomers that contain undefined (N) or partly defined (IUPAC code) nucleotides are discarded from the counts.
protein
Oligopeptide analysis instead of oligonucleotide analysis. This deactivates the grouping of oligomers with their reverse complements and changes the alphabet size.
other
Any letter found in the input sequence is considered valid. This makes it possible to analyze texts in human language.

Purge sequences (highly recommended)
When checked, large duplicated regions (>= 40 bp alignment with fewer than 3 mismatches) are filtered out before analysis. Purging is essential for any motif discovery process, to avoid a bias due to non-independence of sequences. Purging is performed with the programs mkvtree and vmatch developed by Stefan Kurtz (kurtz@zbh.uni-hamburg.de).

Oligonucleotide size:
The analysis can be performed with oligonucleotides of any size between 1 and 8. Selecting size 1 amounts to counting the alphabet utilization within the input sequences. For the detection of regulatory sites, we recommend starting with an analysis of hexanucleotides (size=6), and scanning sizes between 4 and 8. When a pattern is significantly overrepresented, it generally shows up in the analyses with several sizes.

Count on: (single or both strands)
By selecting "both strands", the occurrences of each oligonucleotide are summed on both strands. This makes it possible to detect elements which act in an orientation-insensitive way (as is generally the case for yeast upstream elements).

Group reverse complement together in the output
(only valid for two strand analysis). This parameter does not affect the counting itself, but only the format of output. If this option is NOT checked, two separate lines are used to show a word and its reverse complement. This is redundant but might be useful for compatibility with other programs.

Prevent overlapping matches
Periodic patterns (e.g. AAAAAA, ATATAT) have an aggregative tendency, i.e. each occurrence of such a pattern strongly favors additional occurrences in its immediate vicinity. This introduces a bias in most statistics (binomial, log-likelihood). A simple way to correct for this bias is to avoid counting mutually overlapping occurrences more than once.
For example, TATATATATATA would represent

2 occurrences of TATATA when self-overlap is prevented
4 occurrences of TATATA when self-overlap is allowed
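
The two counting modes can be sketched in a few lines of Python. This is only an illustration under simple assumptions (left-to-right scan, single strand), not the RSAT implementation; the function name `count_occurrences` is hypothetical.

```python
def count_occurrences(sequence: str, word: str, allow_overlap: bool = True) -> int:
    """Count occurrences of word in sequence, scanning from left to right."""
    count = 0
    pos = sequence.find(word)
    while pos != -1:
        count += 1
        # Move one position forward to allow overlapping matches, or skip
        # the whole match so the next occurrence cannot overlap this one.
        step = 1 if allow_overlap else len(word)
        pos = sequence.find(word, pos + step)
    return count


seq = "TATATATATATA"  # 12 residues
print("overlap allowed:  ", count_occurrences(seq, "TATATA", allow_overlap=True))
print("overlap prevented:", count_occurrences(seq, "TATATA", allow_overlap=False))
```

Preventing overlaps with a greedy left-to-right scan is one possible convention; other ways of choosing which occurrences to keep would give the same count for strictly periodic words such as this one.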

Background model
Various probabilistic models can be used to estimate the expected frequency of each oligonucleotide.

Warning: the results are dramatically affected by the choice of expected frequencies, which is the main distinctive feature of this program. It has been shown that, for the detection of regulatory sites in yeast upstream sequences, the best choice is to estimate the expected oligonucleotide frequencies on the basis of the frequencies observed in the set of all upstream non-coding sequences from the genome. For the same purpose, choosing "equiprobable residues" would be totally inefficient, and "Residue frequencies from input sequence" works poorly.

Predefined background frequencies : Compare oligo frequencies observed in the query sequence to those of a reference sequence (the background model).
Pre-calculated tables are used to estimate expected oligonucleotide frequencies (background frequencies). These tables were obtained by counting all oligonucleotide frequencies (from size 1 to 8) in different sequence types, for each organism.
- upstream: all upstream regions, allowing overlap with upstream ORFs.
- upstream-noorf: all upstream regions, preventing overlap with upstream ORFs (sequences are clipped to discard upstream ORF sequences).
Markov model : expected word (oligonucleotide) frequencies are calculated on the basis of the subword frequencies observed in the input sequence set. This calculation takes into account the higher order dependencies between neighbouring residues.
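For example, with a Markov chain of order 4, the expected frequency of a hexanucleotide is estimated from the observed frequencies of its pentanucleotide and tetranucleotide subwords.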
For words of size k, the highest possible order is k-2. A Markov order of 0 amounts to using observed residue frequencies for calculating expected oligomer frequencies (no dependency between neighbouring residues).
The higher the Markov order, the more stringent the analysis: specificity is increased, but there is a loss of sensitivity, i.e. some relevant patterns might be overlooked. The optimal Markov order depends on the size of the sequence set. For small gene families (e.g. 10 sequences of 800 bp), taking an order > 1 would result in a loss of sensitivity. For sequence sets of 1 Mb, a Markov chain of order 3 is optimal for hexanucleotides.
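
As an illustration, the sketch below applies the standard Markov chain estimate: the frequency of the first (order+1)-letter subword, multiplied by the conditional probability of each following residue given the preceding `order` residues. It is a minimal example, assuming that subword frequencies are counted on a single strand with overlapping windows; the function names are hypothetical and the code is not taken from RSAT.

```python
from collections import Counter


def subword_frequencies(sequences, size):
    """Relative frequencies of all overlapping subwords of a given size."""
    counts = Counter(
        seq[i:i + size]
        for seq in sequences
        for i in range(len(seq) - size + 1)
    )
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


def markov_expected_frequency(word, sequences, order):
    """Expected frequency of word under a Markov chain of the given order."""
    if not 0 <= order <= len(word) - 2:
        raise ValueError("order must lie between 0 and len(word) - 2")
    long_freq = subword_frequencies(sequences, order + 1)
    # With order 0 the conditioning context is empty, so its frequency is 1.
    short_freq = subword_frequencies(sequences, order) if order > 0 else {"": 1.0}
    expected = long_freq.get(word[:order + 1], 0.0)
    for i in range(1, len(word) - order):
        context = short_freq.get(word[i:i + order], 0.0)
        if context == 0.0:
            return 0.0  # context never observed: the estimate collapses to 0
        expected *= long_freq.get(word[i:i + order + 1], 0.0) / context
    return expected


toy_sequences = ["ACGTACGTACGT"]  # toy input, for illustration only
for order in (0, 1):
    print(order, markov_expected_frequency("ACG", toy_sequences, order))
```

With order k-2 the loop runs exactly once, so a word of size k is estimated from its two subwords of size k-1 and their shared subword of size k-2.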
Lexicon partitioning : Expected word frequencies are calculated on the basis of subword frequencies, in a similar (but not identical) way to the "dictionary" approach developed by Harmen Bussemaker. Each word is segmented into two subwords in all possible ways:
```
			GATAAG	G & ATAAG
				GA & TAAG
				GAT & AAG
				GATA & AG
				GATAA & G
```
The expected frequency of each segmented pair is the product of expected frequencies of its members. The expected word frequency is the maximum expected pair frequency.
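
The rule above (maximum over all two-way splits of the product of subword frequencies) can be sketched as follows. This is a minimal illustration: the function names are hypothetical, and the frequency table is made up for the example rather than taken from any real genome.

```python
def two_way_splits(word):
    """All ways of cutting a word into a non-empty prefix and suffix."""
    return [(word[:cut], word[cut:]) for cut in range(1, len(word))]


def lexicon_expected_frequency(word, subword_freq):
    """Maximum, over all two-way splits, of the product of subword frequencies."""
    return max(
        subword_freq.get(left, 0.0) * subword_freq.get(right, 0.0)
        for left, right in two_way_splits(word)
    )


# Toy subword frequencies (invented values, for illustration only).
toy_freq = {
    "G": 0.2, "ATAAG": 0.001,
    "GA": 0.06, "TAAG": 0.004,
    "GAT": 0.02, "AAG": 0.02,
    "GATA": 0.005, "AG": 0.06,
    "GATAA": 0.002,
}
print(two_way_splits("GATAAG"))
print(lexicon_expected_frequency("GATAAG", toy_freq))
```

Subwords missing from the table are treated here as having frequency 0, which is a simplifying assumption of this sketch.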
Residue frequencies from input sequence : (Note: this is equivalent to a Markov chain with order 0).
Equiprobable residues: This option gives very poor results and should never be used in practice. It is kept only for didactic purposes (to let anyone test how badly it performs).
Upload your own expected frequency file:

Pseudo-frequency for the background model

When the background frequencies are based on a small sequence set, some oligomers observed in the test sequences may be totally absent from the background sequences. This is a problem, since these words are then considered to have a probability of 0.

To circumvent this problem, a pseudo-weight can be defined, which must be a number between 0 and 1. Expected frequencies are then corrected by a pseudo-frequency, which is the pseudo-weight divided by the number of possible patterns.
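
The exact correction formula is not given here. The sketch below shows one common way to apply such a correction, and it is an assumption rather than the documented RSAT behaviour: the original frequency is scaled by (1 - pseudo-weight) and the pseudo-frequency is added, so that the corrected frequencies still sum to 1. The function name is hypothetical.

```python
def apply_pseudo_weight(expected_freq, pseudo_weight, n_patterns):
    """Mix an expected frequency with a uniform pseudo-frequency.

    ASSUMPTION: the linear mixing below is one plausible correction scheme;
    this manual only states that the pseudo-frequency equals the
    pseudo-weight divided by the number of possible patterns.
    """
    if not 0.0 <= pseudo_weight <= 1.0:
        raise ValueError("pseudo_weight must be between 0 and 1")
    pseudo_freq = pseudo_weight / n_patterns
    return (1.0 - pseudo_weight) * expected_freq + pseudo_freq


# A hexanucleotide absent from the background no longer has probability 0.
print(apply_pseudo_weight(0.0, pseudo_weight=0.01, n_patterns=4096))
```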

Return:

Occurrences: a simple count of the number of occurrences of each oligonucleotide. Overlapping matches are detected and summed in the counting.
Frequencies: relative frequencies, i.e. the number of occurrences of the oligonucleotide divided by the sum of occurrences for all oligonucleotides.
Matching sequences: the number of sequences from the input set which contain at least one occurrence of the oligonucleotide.
Ratio: observed/expected occurrence ratio. This ratio can be used as a rough indicator of over-representation, but its weakness is that it overrates patterns with a very low number of expected occurrences. For instance, observing 1 occurrence when expecting 0.1 gives a very high ratio of 10, although this is quite likely to occur at random (probability ~10%). For comparison, observing 20 occurrences when expecting 10 has a probability of ~0.3%, although the ratio is only 2!
Proba: probabilities. Different statistics are calculated (see below for details of calculation).
- Expected occurrences (exp_occ): the number of occurrences expected for the considered oligonucleotide within the set of sequences. The calculation of this value depends on the probabilistic model selected by the user (see above).
- Occurrence probability (occ_pro): the probability of observing N or more occurrences, given the expected number of occurrences (where N is the observed number of occurrences).
- Expected matching sequences (exp_ms): the expected number of sequences with at least one occurrence.
- Matching sequence probability (ms_pro): the probability of observing L or more sequences with at least one occurrence of the oligonucleotide, given the probabilistic model (where L is the observed number of matching sequences).
- Significance index (sig): this is a conversion of the occurrence probability, taking into account the number of possible oligonucleotides (which varies with oligo size) and applying a logarithmic transformation. The highest sig corresponds to the most overrepresented oligonucleotide. Sig values higher than 0 indicate overrepresentation.
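
The contrast between ratio and probability mentioned for the Ratio field can be checked numerically. The sketch below uses a Poisson approximation of the binomial, which is an assumption made here for simplicity (it is reasonable when the number of positions is large and the word probability small); the function name is hypothetical.

```python
import math


def poisson_tail(expected: float, observed: int) -> float:
    """P(X >= observed) for a Poisson variable with the given mean."""
    below = sum(
        math.exp(-expected) * expected ** i / math.factorial(i)
        for i in range(observed)
    )
    return 1.0 - below


for observed, expected in [(1, 0.1), (20, 10.0)]:
    ratio = observed / expected
    print(f"obs={observed} exp={expected} ratio={ratio:.1f} "
          f"P(>=obs)={poisson_tail(expected, observed):.4f}")
```

The first case has the larger ratio but the larger (less significant) probability, which is why the probability-based statistics are preferred for ranking patterns.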

Thresholds:

Probabilities

    EXPECTED OCCURRENCES
	                          S
	   Exp_occ = p * T = p * SUM (Lj + 1 - k)
	                         j=1
	
	where	p  = probability of the pattern
		     Several models are supported for estimating the
		     prior probability (see options -a, -expfreq and
		     -bg).
		S  = number of sequences in the sequence set. 
		Lj = length of the jth regulatory region
		k  = length of oligomer
                T = the number of possible matching positions.
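
A minimal sketch of this calculation, with hypothetical function names. Sequences shorter than the oligomer contribute no position, which is a convention assumed here.

```python
def possible_positions(seq_lengths, k):
    """T: total number of positions where a word of length k can start."""
    # A sequence of length L offers L + 1 - k start positions (none if L < k).
    return sum(max(length + 1 - k, 0) for length in seq_lengths)


def expected_occurrences(p, seq_lengths, k):
    """Exp_occ = p * T."""
    return p * possible_positions(seq_lengths, k)


# Example: 10 sequences of 800 bp, hexanucleotide with probability 1/4096.
lengths = [800] * 10
print(possible_positions(lengths, 6))
print(expected_occurrences(1 / 4096, lengths, 6))
```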
		


    PROBABILITY OF SEQUENCE MATCHING
	The probability to find at least one occurrence of the pattern within
	a single sequence is :
	
	                 T
	    q = 1 - (1-p)
	    
	with the same abbreviations as above, T being counted here within
	that single sequence (T = Lj + 1 - k)


    EXPECTED NUMBER OF MATCHING SEQUENCES

	In this counting mode, only the first occurrence in each
	sequence is taken into consideration. We thus have to
	calculate a probability of first occurrence.

	   Exp_ms = S (1 - (1 - p)^T)
	
	with the same abbreviations as above (S is the number of
	sequences, T the number of possible matching positions in a
	single sequence).
	
	Correction for autocorrelation (from Mireille Regnier)
		Exp_ms_corrected = S (1 - (1 - p/a)^T)
	   where 
		 a is the coefficient of autocorrelation
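
The sketch below computes the matching probability and the expected number of matching sequences. It sums the per-sequence probabilities, which reduces to the formula above when all sequences have the same length; handling unequal lengths this way is an assumption of this example, and the function names are hypothetical.

```python
def match_probability(p, positions):
    """Probability of at least one occurrence among `positions` positions."""
    return 1.0 - (1.0 - p) ** positions


def expected_matching_sequences(p, seq_lengths, k, autocorrelation=1.0):
    """Expected number of sequences with at least one occurrence.

    autocorrelation is the coefficient `a` of the correction; the default
    value 1.0 leaves the probability uncorrected.
    """
    return sum(
        match_probability(p / autocorrelation, max(length + 1 - k, 0))
        for length in seq_lengths
    )


lengths = [800] * 10
print(expected_matching_sequences(1 / 4096, lengths, 6))
print(expected_matching_sequences(1 / 4096, lengths, 6, autocorrelation=1.3))
```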
		
    
    PROBABILITY OF THE OBSERVED NUMBER OF OCCURRENCES (BINOMIAL)
	
	The probability to observe exactly obs occurrences in the whole family
  	of sequences is calculated by the binomial
	
	    P(obs) = bin(p,T,obs) = T! / (obs! * (T-obs)!) * p^obs * (1-p)^(T-obs)
	
	where   obs is the observed number of occurrences,
                p   is the expected frequency for the pattern,
                T   is the number of possible matching positions,
                    as defined above. 
	
	The probability to observe obs or more occurrences in the whole family
  	of sequences is calculated by the sum of binomials:
	
	                 T              obs-1
	    P(>=obs) =  SUM P(i) =  1 -  SUM  P(i)
	               i=obs             i=0
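
A direct transcription of these two formulas, as a minimal sketch with hypothetical function names. It is adequate for small counts only: with very large T or obs, the binomial coefficient no longer fits in a float, and very small tail probabilities lose precision in the subtraction from 1, so a production implementation would typically work in log space.

```python
import math


def binomial_pmf(p, T, i):
    """P(i): probability of exactly i occurrences among T positions."""
    return math.comb(T, i) * p ** i * (1.0 - p) ** (T - i)


def binomial_tail(p, T, obs):
    """P(>=obs), computed as 1 minus the sum of P(i) for i = 0 .. obs-1."""
    return 1.0 - sum(binomial_pmf(p, T, i) for i in range(obs))


# Example: 7950 possible positions, word probability 1/4096, 8 occurrences.
print(binomial_tail(1 / 4096, 7950, 8))
```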

    OVER/UNDER-REPRESENTATION

		By default, the program calculates the probability of
		observing at least obs occurrences:

			                 T
			    P(>=obs) =  SUM P(i)
			               i=obs

		With the option -under, the program calculates the
		probability of observing at most obs occurrences:

			                obs
			    P(<=obs) =  SUM P(i)
			                i=0

		The option -under does not affect the other statistics
		(z-score, log-likelihood). For the z-score, negative
		values can be used to assess word under-representation.

	                        
    SPECIFIC TREATMENT FOR DOUBLE STRAND COUNTS

	When occurrences are counted on both strands, each pattern is
	grouped with its reverse complement. 

	For reverse-palindromic patterns, probabilities are calculated
	on the basis of the single strand count, since the occurrence
	on the reverse complement strand is completely dependent on
	that on the direct strand. 

        A more biological justification for this is that, although the
        word is found on both strands in a string representation of
        the sequences, at the structural level, there is a single
        binding site for the factor. 


	In contrast, for non-palindromic patterns, occurrences on
	the direct and reverse complement strands represent distinct
	binding sites. Thus, 

		 Obs_occ(W|Wr) = Obs_occ(W) + Obs_occ(Wr)
		 Exp_freq(W|Wr) = Exp_freq(W) + Exp_freq(Wr)

	   where
		 W     is a given word
		 Wr    is the reverse complement of W

	Probabilities are then calculated as above, on the basis of
	the event W|Wr instead of simply W.

    E-VALUE

	The probability of occurrence by itself is not fully
	informative, because the threshold must be adapted to
	the number of patterns considered. Indeed, a simple
	hexanucleotide analysis amounts to considering 4096
	hypotheses. 

	The E-value represents the expected number of patterns which
	would be returned at random for a given P-value (probability).

	      E-value = NPO * P(>=obs)

	where	NPO	 is the number of possible oligomers of the 
	                 chosen length (eg 4096 for hexanucleotides). 

        Note that when searches are performed on both strands, NPO is
        corrected for the fact that non-palindromic patterns are
        grouped by pairs (for example, there are 2080 patterns when
        hexanucleotides are counted on both strands).
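
The number of possible oligomers, with and without grouping of reverse complements, can be obtained by brute-force enumeration, as in this sketch. The function names are hypothetical, and the enumeration is only practical for the small oligomer sizes discussed here.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(word):
    """Reverse complement of a DNA word."""
    return word.translate(COMPLEMENT)[::-1]


def number_of_possible_oligos(k, both_strands=False):
    """NPO: number of distinct patterns of length k."""
    if not both_strands:
        return 4 ** k
    # Keep one representative (the alphabetically smaller word) for each
    # word / reverse-complement pair; palindromic words represent themselves.
    representatives = {
        min(word, reverse_complement(word))
        for word in map("".join, product("ACGT", repeat=k))
    }
    return len(representatives)


def e_value(p_value, k, both_strands=False):
    """E-value = NPO * P(>=obs)."""
    return number_of_possible_oligos(k, both_strands) * p_value


print(number_of_possible_oligos(6), number_of_possible_oligos(6, both_strands=True))
```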


    SIGNIFICANCE INDEXES

        The significance index is simply a negative logarithm
        conversion of the E-value (in base 10).


	The significance indexes are calculated as follows:
	
	      Sig_occ = -log10(E-value);

	This index is very convenient to interpret : highest values
	correspond to the most exceptional patterns.
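
A short sketch of the conversion, with a hypothetical function name. It shows that a significance of 0 corresponds to an E-value of 1, i.e. one pattern expected at random at that probability level.

```python
import math


def significance(p_value, n_possible_oligos):
    """Sig_occ = -log10(E-value), with E-value = NPO * P-value."""
    e_value = n_possible_oligos * p_value
    # A P-value of exactly 0 cannot be converted (log10 is undefined there).
    return -math.log10(e_value)


# Single-strand hexanucleotides: NPO = 4096.
for p_value in (1 / 4096, 1e-5, 1e-8):
    print(p_value, round(significance(p_value, 4096), 2))
```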


    OVERLAP COEFFICIENT
        The overlap coefficient is calculated as follows 
        (after Pevzner et al. (1989). J. Biomol. Struct & Dynamics 
        5:1013-1026):

                  l-1
            Kov = SUM kj (1/4)^j
                  j=0

        where l  is the pattern length,
              j  is the overlap position (the shift between two
                 copies of the pattern), comprised between 0 and l-1,
              kj takes the value 1 if there is an overlap at pos j,
                 0 otherwise (k0 is always 1, since a pattern
                 matches itself).

        When counts are performed on both strands, overlaps between
        the pattern and its reverse complement are also taken into account
        into the same formula.			
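
A minimal single-strand sketch of the overlap coefficient. It assumes that the sum runs over shifts j = 0 .. l-1, with the j = 0 term always equal to 1 (so that a word without self-overlap has Kov = 1); overlaps with the reverse complement are not handled. The function name is hypothetical.

```python
def overlap_coefficient(word, alphabet_size=4):
    """Kov: sum of (1/alphabet_size)^j over the shifts j at which the word
    overlaps with itself (shift 0 always counts)."""
    length = len(word)
    return sum(
        (1.0 / alphabet_size) ** j
        for j in range(length)
        # Overlap at shift j: the suffix starting at j equals the prefix
        # of the same length.
        if word[j:] == word[:length - j]
    )


for word in ("AAAAAA", "TATATA", "GATAAG", "ACGTTC"):
    print(word, overlap_coefficient(word))
```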

    Z-SCORE
	The Z-score is calculated in the following way

		Zsc = (obs_occ - exp_occ)/sd_occ
	            = (obs_occ - exp_occ)/sqrt(var_occ)

	where
		obs_occ	is the observed number of occurrences
		exp_occ	is the expected number of occurrences
		sd_occ and var_occ
                    are the estimated standard deviation and variances
                    for the occurrences, respectively.
    

	The estimation of the variance is derived from Pevzner et al. (1989),
	J Biomol Struct & Dynamics 5:1013-1026:
		var_occ = exp_occ(2*Kov - 1 - (2*w-1)*p)

	where
		w	is the word length
		p	is the expected frequency of the word

	In random sequences, Z-scores are normally distributed. The probability 
	to observe a given number of occurrences can thus be read in the 
	normal table from any book of statistics.
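
The Z-score calculation can be sketched as follows. The variance uses the word's expected frequency p in its last term, following the form var = exp_occ * (2*Kov - 1 - (2*w - 1)*p); this reading of the Pevzner et al. estimate is an assumption of the example, and the function name is hypothetical.

```python
import math


def z_score(obs_occ, exp_occ, exp_freq, kov, w):
    """Zsc = (obs_occ - exp_occ) / sqrt(var_occ).

    ASSUMPTION: var_occ = exp_occ * (2*kov - 1 - (2*w - 1) * exp_freq),
    where exp_freq is the expected frequency of the word, kov its overlap
    coefficient (1 for a word without self-overlap) and w its length.
    """
    var_occ = exp_occ * (2 * kov - 1 - (2 * w - 1) * exp_freq)
    return (obs_occ - exp_occ) / math.sqrt(var_occ)


# Hexanucleotide without self-overlap: 20 observed, 10 expected.
print(z_score(20, 10.0, 1 / 4096, kov=1.0, w=6))
# A self-overlapping word (larger Kov) gets a lower score for the same counts.
print(z_score(20, 10.0, 1 / 4096, kov=1.333, w=6))
```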

	Advantages of the Z-score:
	- Z-score corrects the bias due to self-overlapping of a word, which 
	  often leads to overestimate the overrepresentation of such words
	  (eg AAAAAA, TATATA). 
	- its calculation is very fast. 
	  This is especially critical when analyzing 
	  very big sequences (whole genomes), where the expected oligonucleotide 
	  occurrences are very high (and binomial calculation very slow).
	- Z-score provides a way to detect both over- and under-represented 
	  patterns. 

	Disadvantages:	
	- the use of Z-score assumes that the sequences are infinite

	Recommended thresholds:
	=======================
	strand	w	P(>=oc)	z-score
	-------------------------------
	1str	3	0.98437	2.155
	1str	4	0.99609	2.66
	1str	5	0.99902	3.095
	1str	6	0.99976	3.49	
	1str	7	0.99994	3.83
	1str	8	0.99998	4.1

	2str	3
	2str	4
	2str	5
	2str	6	0.99952	3.30
	2str	7	
	2str	8