RSAT: dyad analysis manual

Name

dyad-analysis
1998 by

Description

Detects overrepresented spaced dyads in a set of DNA sequences. A dyad is a pair of oligonucleotides of the same size, separated by a fixed number of unspecified bases (the spacer).
This algorithm can detect binding sites that oligo-analysis misses because of the variability within their spacer region. A typical example of a pattern that is efficiently detected by dyad analysis is the binding site for the yeast Gal4p transcription factor, which has the consensus CGGNNNNNWNNNNNCCG.
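
As an illustration of what a spaced dyad looks like in practice, the sketch below turns a dyad into a regular expression and locates it in a sequence. It is not the program's own implementation. The helper name dyad_to_regex and the test sequence are hypothetical, and the dyad CGGn{11}CCG is a simplification of the Gal4p consensus that ignores the central W.

```python
import re

def dyad_to_regex(first: str, spacing: int, second: str) -> str:
    """Build a regular expression for the dyad first-n{spacing}-second."""
    return f"{first}[ACGT]{{{spacing}}}{second}"

# Simplified Gal4p-like dyad: CGG, 11 unspecified bases, CCG.
pattern = dyad_to_regex("CGG", 11, "CCG")

# Hypothetical sequence carrying a single site (the spacer has 11 bases).
sequence = "TTA" + "CGG" + "AGCACTGTTGA" + "CCG" + "AAT"

# A lookahead lets the scan report every start position, even overlapping ones.
matches = [m.start() for m in re.finditer(f"(?={pattern})", sequence)]
print(pattern, matches)
```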

Options
Title
(optional)
Title of the data set. This information is returned as the title of the result page.

Input sequence:
The sequences to be analyzed. Multiple sequences can be entered at once, in any of the supported sequence formats.

Format:
Input sequence format. Various standards are supported.

Purge sequences (highly recommended)
When checked, large duplicated regions (alignments of at least 40 bp with fewer than 3 mismatches) are filtered out before analysis. Purging is essential for any motif discovery process, to avoid the bias caused by non-independent sequences. Purging is performed with the programs mkvtree and vmatch, developed by Stefan Kurtz (kurtz@zbh.uni-hamburg.de).

Oligonucleotide size
This is the size of a single element (half of a dyad).

Spacing
Spacing between the elements of the dyad. The spacing is the number of bases between the end of the first element and the start of the second one.
A single integer value means that the spacing is fixed. Variable spacing can be introduced by entering the min and max values separated by a hyphen. For example, 8-12 means that all occurrences of the dyad with a spacing between 8 and 12 will be counted together and their significance estimated globally. Warning: this is different from scanning the spacing values 8 to 12 one by one.
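
The difference between pooled and one-by-one spacing can be sketched as follows. This is a minimal single-strand illustration, not the program's code. The function count_dyad and the test sequence are hypothetical. With a range, one pooled count is obtained and tested once, whereas scanning each spacing separately gives one count (and one test) per spacing value.

```python
import re

def count_dyad(seq: str, first: str, second: str, min_sp: int, max_sp: int) -> int:
    """Count start positions matching the dyad with any spacing in [min_sp, max_sp]."""
    total = 0
    for sp in range(min_sp, max_sp + 1):
        regex = f"(?={first}[ACGT]{{{sp}}}{second})"
        total += sum(1 for _ in re.finditer(regex, seq))
    return total

# Hypothetical sequence: one site with spacing 8 and one with spacing 10.
site8 = "CGG" + "A" * 8 + "CCG"
site10 = "CGG" + "T" * 10 + "CCG"
seq = "AT" + site8 + "ATAT" + site10 + "TA"

# Pooled count for the range 8-12 (a single value, tested once).
pooled = count_dyad(seq, "CGG", "CCG", 8, 12)

# One count per spacing value (five values, each tested separately).
per_spacing = {sp: count_dyad(seq, "CGG", "CCG", sp, sp) for sp in range(8, 13)}
print(pooled, per_spacing)
```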

Dyad type
To speed up execution, the program can be asked to restrict its analysis to symmetric dyads, with 3 possibilities:

direct repeats: the second element is the same as the first one
inverted repeats: the second element is the reverse complement of the first one.
any repeat: analyse both direct and inverted repeats

When the option any dyad is selected, the analysis is performed on all dyads, symmetric as well as non-symmetric. Warning: the number of dyads increases dramatically with this option, and it should not be used for elements wider than 3 nucleotides.
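
To make the dyad types concrete, the sketch below enumerates the pairs of elements each type would consider. The type labels "dir", "inv" and "any" and the function names are hypothetical choices for this example, not names taken from the program.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(oligo: str) -> str:
    """Return the reverse complement of a DNA oligonucleotide."""
    return oligo.translate(COMPLEMENT)[::-1]

def enumerate_dyads(w: int, dyad_type: str) -> list:
    """List (first, second) element pairs for a given dyad type."""
    oligos = ["".join(p) for p in product("ACGT", repeat=w)]
    if dyad_type == "dir":   # direct repeats: second element equals the first
        return [(o, o) for o in oligos]
    if dyad_type == "inv":   # inverted repeats: second is the reverse complement
        return [(o, reverse_complement(o)) for o in oligos]
    if dyad_type == "any":   # every combination of two elements
        return [(a, b) for a in oligos for b in oligos]
    raise ValueError(f"unknown dyad type: {dyad_type}")

for dyad_type in ("dir", "inv", "any"):
    print(dyad_type, len(enumerate_dyads(3, dyad_type)))
```

For trinucleotides this gives 64 direct repeats, 64 inverted repeats and 4096 dyads of any type, which illustrates why the last option becomes expensive for wider elements.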

Count on:
(single or both strands)
By selecting "both strands", the occurrences of each oligonucleotide are summed on both strands. This allows the detection of elements that act in an orientation-insensitive way (as is generally the case for yeast upstream elements).
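
A minimal sketch of two-strand counting is given below, assuming that the reverse strand is scanned by searching the reverse complement of the sequence. The function names and the test sequence are hypothetical. A pattern that is identical to its own reverse complement would be found at the same location on both strands, and this sketch does not attempt to treat that case specially.

```python
import re

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]

def count_single_strand(seq: str, regex: str) -> int:
    """Count all start positions (overlaps included) on the given strand."""
    return sum(1 for _ in re.finditer(f"(?={regex})", seq))

def count_both_strands(seq: str, regex: str) -> int:
    """Sum the occurrences found on the direct and on the reverse strand."""
    return count_single_strand(seq, regex) + count_single_strand(reverse_complement(seq), regex)

# Hypothetical sequence: one site on the direct strand, one on the reverse strand.
site = "ACGAATTC"                      # dyad ACGn{2}TTC
seq = "GG" + site + "GG" + reverse_complement(site) + "GG"

dyad = "ACG[ACGT]{2}TTC"
single = count_single_strand(seq, dyad)
both = count_both_strands(seq, dyad)
print(single, both)
```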

Prevent overlapping matches
Periodic patterns (e.g. AAAn{0}AAA, TATn{1}TAT) have an aggregative tendency, i.e. each occurrence of such a pattern strongly favours additional occurrences in its immediate vicinity. This introduces a bias in most statistics (binomial, log-likelihood). A simple way to correct for this bias is to avoid counting mutually overlapping occurrences twice.
For example, the string AAAAAAAAAAAAAA would represent

7 occurrences of AAAn{1}AAA when self-overlap is allowed
2 occurrences of AAAn{1}AAA when self-overlap is prevented
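
The two counting modes can be sketched with regular expressions, assuming a left-to-right scan on a single strand. The function count_occurrences and the test sequence (a run of 12 A, searched for the periodic pattern AAAn{0}AAA) are hypothetical and only serve to show the principle.

```python
import re

def count_occurrences(seq: str, regex: str, allow_overlap: bool) -> int:
    """Count matches of regex in seq, with or without mutually overlapping matches."""
    if allow_overlap:
        # A zero-width lookahead reports every start position.
        return sum(1 for _ in re.finditer(f"(?={regex})", seq))
    # A plain scan resumes after the end of each match, so matches cannot overlap.
    return sum(1 for _ in re.finditer(regex, seq))

seq = "A" * 12                      # hypothetical poly-A run
dyad = "AAA[ACGT]{0}AAA"            # AAAn{0}AAA

with_overlap = count_occurrences(seq, dyad, allow_overlap=True)
without_overlap = count_occurrences(seq, dyad, allow_overlap=False)
print(with_overlap, without_overlap)
```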

Expected frequency calibration
Background model
Compare dyad frequencies observed in the query sequence to those of a reference sequence (the background model).
Pre-calculated tables are used to estimate expected dyad frequencies (background frequencies). These tables were obtained by counting all dyad frequencies (monad size 3, spacing from 0 to 20) in different sequence types, for each organism.

upstream: all upstream regions, allowing overlap with upstream ORFs.
upstream-noorf: all upstream regions, preventing overlap with upstream ORFs (sequences are clipped to discard upstream ORF sequences).

Monad frequencies from the input sequence
The frequency expected for each dyad is the product of the frequencies observed for its two monads (oligonucleotides) in the sequence file.
	exp(dyad) = exp(oligo1)*exp(oligo2)
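
A minimal sketch of this calibration is shown below, assuming that monad frequencies are relative frequencies of overlapping windows counted on a single strand. The function names and the toy input are hypothetical.

```python
from collections import Counter

def monad_frequencies(sequences: list, w: int) -> dict:
    """Relative frequency of each oligonucleotide of size w in the input sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - w + 1):
            counts[seq[i:i + w]] += 1
    total = sum(counts.values())
    return {oligo: n / total for oligo, n in counts.items()}

def expected_dyad_frequency(freqs: dict, oligo1: str, oligo2: str) -> float:
    """exp(dyad) = exp(oligo1) * exp(oligo2); unseen monads get frequency 0."""
    return freqs.get(oligo1, 0.0) * freqs.get(oligo2, 0.0)

# Toy input: ACGTACGT has 6 trinucleotide windows, with ACG and CGT seen twice each.
freqs = monad_frequencies(["ACGTACGT"], 3)
exp_dyad = expected_dyad_frequency(freqs, "ACG", "CGT")
print(exp_dyad)
```
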
Threshold of significance:

Thresholds can be imposed to select the most significantly overrepresented patterns. A threshold of 0 on the occurrence significance index is selected by default. This is the most efficient way we found to automatically select the biologically significant regulatory sites, irrespective of oligonucleotide size and of the number and size of the sequences in the input set.

Output columns

Expected frequency (exp_frq): the probability of observing the dyad at each position. This value is calculated on the basis of the expected frequency table (see Expected frequency calibration).
Observed occurrences (obs_occ): the number of occurrences observed for each dyad. Overlapping matches are detected and summed in the counting.
Expected number of occurrences (exp_occ): the number of occurrences expected for each dyad. This value is calculated on the basis of the frequency table selected.
Occurrence probability (occ_pro): the probability of observing N or more occurrences, given the expected number of occurrences (where N is the observed number of occurrences).
Occurrence significance (occ_sig): this is a conversion of the occurrence probability, taking into account the number of possible dyads (which varies with oligo size) and applying a logarithmic transformation. The highest sig values correspond to the most overrepresented dyads. Sig values higher than 0 indicate overrepresentation.

Probabilities
Various calibration models can be used to estimate the probability of each dyad (see above). From there, an expected number of occurrences is calculated and compared to the observed number of occurrences. The significance of the observed number of occurrences is calculated with the binomial formula.
	
    EXPECTED DYAD FREQUENCY
	If exp(oligo1) is the expected frequency for the first element, and
	   exp(oligo2) is the expected frequency for the second element
	
	Then
	   exp(dyad) = exp(oligo1)*exp(oligo2)

    NUMBER OF POSSIBLE DYADS
	This number depends on the dyad type selected by the user. 
	When the analysis is restricted to inverted repeats, or to direct 
	repeats, the first element univocally determines the second one, 
	thus:
		nb_poss_dyads = nb_poss_oligo
		              = 4^w
		where w is the oligonucleotide length.

	When any dyad is allowed, each oligonucleotide can combine with any 
	other or itself, thus:
		nb_poss_dyads = nb_poss_oligo * nb_poss_oligo 
		              = 4^(2w)


    EXPECTED OCCURRENCES
	                      r
	   Exp_occ = p * 2 * SUM (Lj + 1 - d) = p * T
	                     j=1
	
	where	p  = expected dyad frequency
		r  = number of input sequences
		Lj = length of the jth input sequence
		d  = length of the dyad, calculated as follows:
			d = 2w + s
			where w is the oligonucleotide length
			      s is the spacer length
                T  = the number of possible matching positions in the 
		     whole set of input sequences.

		The factor 2 stands for the fact that occurrences are summed
		on both strands (it is omitted when the option -1str 
                is active).
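
	As a worked illustration of the formula above, the sketch below computes T
	and Exp_occ. The function names, the dyad frequency and the sequence lengths
	are hypothetical. Clipping negative terms to 0 for sequences shorter than
	the dyad is an assumption of this sketch.

```python
def matching_positions(lengths: list, w: int, s: int, both_strands: bool = True) -> int:
    """T: number of possible matching positions in the whole sequence set."""
    d = 2 * w + s                                        # total dyad length
    # Assumption: a sequence shorter than the dyad contributes 0 positions.
    per_strand = sum(max(L + 1 - d, 0) for L in lengths)
    return 2 * per_strand if both_strands else per_strand

def expected_occurrences(p: float, lengths: list, w: int, s: int,
                         both_strands: bool = True) -> float:
    """Exp_occ = p * T."""
    return p * matching_positions(lengths, w, s, both_strands)

# Hypothetical input: 10 sequences of 800 bp, trinucleotide pairs with spacing 11.
lengths = [800] * 10
T = matching_positions(lengths, w=3, s=11)               # 2 * 10 * (800 + 1 - 17)
exp_occ = expected_occurrences(1e-4, lengths, w=3, s=11)
print(T, exp_occ)
```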

    PROBABILITY OF THE OBSERVED NUMBER OF OCCURRENCES
	
	The probability to observe exactly obs occurrences in the whole set
  	of sequences is calculated by the binomial
	
	                                  T!
	    P(obs) = bin(p,T,obs) = --------------- * p^obs * (1-p)^(T-obs)
	                            obs! * (T-obs)!
	
	where   obs is the observed number of dyad occurrences,
                p   is the expected dyad frequency,
                T   is the number of possible matching positions,
                    as defined above. 
	
	The probability to observe obs or more occurrences in the whole set
	of sequences is calculated by the sum of binomials:
	
	                    obs-1
	    P(>=obs) =  1 - SUM P(j)
	                    j=0
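
	The two formulas above can be sketched as follows. This is only an
	illustration: the function names are hypothetical, and the binomial term
	is computed in log space (an implementation choice of this sketch) so that
	large values of T do not overflow. It assumes 0 < p < 1. Because the tail
	is obtained as 1 minus a sum, very small probabilities lose precision.

```python
import math

def binomial_pmf(p: float, T: int, k: int) -> float:
    """P(k) = T! / (k! (T-k)!) * p^k * (1-p)^(T-k), evaluated in log space."""
    log_pmf = (math.lgamma(T + 1) - math.lgamma(k + 1) - math.lgamma(T - k + 1)
               + k * math.log(p) + (T - k) * math.log1p(-p))
    return math.exp(log_pmf)

def prob_at_least(p: float, T: int, obs: int) -> float:
    """P(>=obs) = 1 - SUM of P(j) for j = 0 .. obs-1."""
    lower_tail = sum(binomial_pmf(p, T, j) for j in range(obs))
    return max(0.0, 1.0 - lower_tail)       # guard against rounding below 0

# Small check: with T = 10 and p = 0.5, P(>=8) = 56/1024, about 0.0547.
print(prob_at_least(0.5, 10, 8))
```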
	                        
    SIGNIFICANCE INDEX
	The significance index is a conversion of the occurrence probability,
	calculated as follows:
	
	      Sig_occ = -log10(NPD * P(>=obs))

	where	NPD	is the number of possible dyads, calculated as above.
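
	The conversion can be sketched as below. The function names and the type
	labels "dir", "inv" and "any" are hypothetical. The product NPD * P(>=obs)
	is the number of dyads expected to reach such a probability by chance, so
	a significance above 0 corresponds to fewer than one such dyad.

```python
import math

def nb_possible_dyads(w: int, dyad_type: str) -> int:
    """NPD: 4^w for direct or inverted repeats, 4^(2w) when any dyad is allowed."""
    n_oligos = 4 ** w
    if dyad_type in ("dir", "inv"):
        return n_oligos
    if dyad_type == "any":
        return n_oligos ** 2
    raise ValueError(f"unknown dyad type: {dyad_type}")

def occurrence_significance(p_value: float, npd: int) -> float:
    """Sig_occ = -log10(NPD * P(>=obs)); p_value must be greater than 0."""
    return -math.log10(npd * p_value)

# Hypothetical case: trinucleotide pairs of any type, occurrence probability 1e-6.
npd = nb_possible_dyads(3, "any")
sig = occurrence_significance(1e-6, npd)
print(npd, round(sig, 2))
```
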
For information, contact