local-word-analysis
2006-2008 by Matthieu Defrance
Calculates oligonucleotide occurrences in a set of sequences, and detects locally overrepresented or underrepresented oligonucleotides.
Input sequence:
The sequence to be analyzed. Multiple sequences can be entered at once, in any of several sequence formats.
Format:
Input sequence format. Various standard formats are supported.
Sequence type:
Only A, C, G, and T residues are accepted. Oligomers that contain undefined (N) or partly defined (IUPAC code) nucleotides are discarded from the counts.
Purge sequences (highly recommended)
When checked, large duplicated regions (>= 40 bp alignment with fewer than 3 mismatches) are filtered out before analysis. Purging is essential for any motif discovery process, to avoid a bias due to non-independence of sequences. Purging is performed with the programs mkvtree and vmatch developed by Stefan Kurtz (kurtz@zbh.uni-hamburg.de).
Motif length:
The analysis can be performed with:
- oligonucleotides of any size between 1 and 8. For the detection of regulatory sites, we recommend starting with an analysis of hexanucleotides (size=6), then scanning sizes between 4 and 8. When a pattern is significantly overrepresented, it generally shows up in the analyses at several sizes.
- spaced motifs (dyads) with 2 monads of any size between 1 and 3, separated by a spacing. The spacing is the number of bases between the end of the first element (monad) and the start of the second one. A single integer value means that the spacing is fixed (example: from=10 to=10). Variable spacing can be introduced by entering different from and to values. For example, from 8 to 12 means that all occurrences of the dyad with a spacing between 8 and 12 are counted together and their significance is estimated globally. Warning: this is different from scanning the spacing values 8, 9, 10, 11 and 12 one by one.
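The pooled counting of dyads over a spacing range can be illustrated with a minimal Python sketch. This is not the program's actual code: the function name count_dyads and its parameters are hypothetical, and only a single strand is scanned.

```python
from collections import Counter

def count_dyads(seq, monad_len=3, spacing_from=8, spacing_to=12):
    """Count dyads on one strand, pooling all spacings in [spacing_from, spacing_to].

    Hypothetical helper for illustration; names and defaults are assumptions.
    """
    seq = seq.upper()
    counts = Counter()
    for spacing in range(spacing_from, spacing_to + 1):
        span = 2 * monad_len + spacing  # total length covered by the dyad
        for i in range(len(seq) - span + 1):
            first = seq[i:i + monad_len]
            second = seq[i + monad_len + spacing:i + span]
            # Discard dyads containing anything other than A, C, G, T.
            if set(first + second) <= set("ACGT"):
                counts[(first, second)] += 1
    return counts

# One CGG...CCG pair with spacing 8 and another with spacing 12.
example = "CGG" + "A" * 8 + "CCG" + "T" + "CGG" + "A" * 12 + "CCG"
pooled = count_dyads(example, spacing_from=8, spacing_to=12)
fixed = count_dyads(example, spacing_from=8, spacing_to=8)
print(pooled[("CGG", "CCG")], fixed[("CGG", "CCG")])
```

With the pooled range, both occurrences contribute to a single count, whereas a fixed spacing of 8 only sees the first one.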
Count on: (single or both strands)
By selecting "both strands", the occurrences of each oligonucleotide are summed on both strands. This makes it possible to detect elements that act in an orientation-insensitive way (as is generally the case for yeast upstream elements).
Align: (Right or Left)
By selecting "Right", the positions in all input sequences are computed relative to the right bound of each sequence. Align "Right" should be used when dealing with a set of upstream sequences that have different lengths. By selecting "Left", the positions in all input sequences are computed relative to the left bound of each sequence. Align "Left" should be used when dealing with a set of downstream sequences that have different lengths.
Window Width:
Different window sizes can be used to search for locally overrepresented motifs:
- No window. In this case, the search is performed as in oligo-analysis or dyad-analysis and does not use local overrepresentation.
- Window of fixed size. When this option is selected, the program searches for overrepresented motifs in each window of the given size. When the option "group windows" is checked, windows can be merged to form longer windows.
- Variable window size. The window size is automatically adjusted to best fit the data. This option can slow down the search considerably.
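To make the interaction between alignment and fixed-size windows concrete, here is a small Python sketch that bins the occurrences of one word into positional windows. The function window_counts and its parameters are hypothetical and are not taken from the program; window merging and variable window sizes are not covered.

```python
from collections import Counter

def window_counts(sequences, word, window=50, align="right"):
    """Count occurrences of `word` per fixed-size positional window.

    Hypothetical helper for illustration. With align="right", positions are
    negative coordinates measured from the right bound of each sequence,
    so window -1 covers positions -window..-1.
    """
    counts = Counter()
    k = len(word)
    for seq in sequences:
        seq = seq.upper()
        for start in range(len(seq) - k + 1):
            if seq[start:start + k] != word:
                continue
            # Position of the word start relative to the chosen bound.
            pos = start - len(seq) if align == "right" else start
            counts[pos // window] += 1
    return counts

# Two upstream sequences of different lengths; the site lies 16 bp from the right end in both.
upstream_seqs = ["A" * 60 + "GATAAG" + "A" * 10, "A" * 10 + "GATAAG" + "A" * 10]
print(window_counts(upstream_seqs, "GATAAG", window=50, align="right"))
print(window_counts(upstream_seqs, "GATAAG", window=50, align="left"))
```

With right alignment both sites fall into the same window; with left alignment they are split across two windows, which is why right alignment suits upstream sequences of different lengths.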
Background Window Width:
The background model can be split into several segments of fixed width.
Prevent overlapping matches
Periodic oligonucleotides (e.g. AAAAAA, ATATAT) have an aggregative tendency, i.e. each occurrence of such a pattern strongly favors additional occurrences in its immediate vicinity. This introduces a bias in most statistics (binomial, log-likelihood). A simple way to correct for this bias is to avoid counting mutually overlapping occurrences twice.
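The two counting modes can be sketched in Python as follows. This is a simple greedy left-to-right scan on a single strand, written for illustration; the function name count_occurrences is hypothetical and the program's own procedure may differ.

```python
def count_occurrences(seq, word, allow_overlap=True):
    """Count occurrences of `word` in `seq`, optionally skipping self-overlapping matches.

    Hypothetical helper: a greedy left-to-right scan on one strand.
    """
    count = 0
    i = 0
    k = len(word)
    while i <= len(seq) - k:
        if seq[i:i + k] == word:
            count += 1
            # After a match, either move one base (overlaps allowed)
            # or jump past the whole match (overlaps prevented).
            i += 1 if allow_overlap else k
        else:
            i += 1
    return count

ta_repeat = "TA" * 7  # a 14-nt TA repeat
print(count_occurrences(ta_repeat, "TATATA", allow_overlap=False))  # 2
print(count_occurrences(ta_repeat, "TATATA", allow_overlap=True))   # 5
```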
For example, TATATATATATATA would represent
- 2 occurrences of TATATA when self-overlap is prevented
- 5 occurrences of TATATA when self-overlap is allowed
Expected frequency:
Various probabilistic models can be used to estimate the expected frequency of each oligonucleotide.
Warning! The results are strongly affected by the choice of expected frequencies, which is the main distinguishing feature of this program. It has been shown that for the detection of regulatory sites in yeast upstream sequences, the best choice is to estimate the expected oligonucleotide frequencies on the basis of the frequencies observed in the set of all non-coding upstream sequences from the genome. For the same purpose, choosing "equiprobable residues" would be totally ineffective, and "Residue frequencies from input sequence" works poorly.
- Predefined background frequencies: Compare oligo frequencies observed in the query sequence to those of a reference sequence (the background model). Pre-calculated tables are used to estimate expected oligonucleotide frequencies (background frequencies). These tables were obtained by counting all oligonucleotide frequencies (from size 1 to 8) in different sequence types, for each organism:
- upstream: all upstream regions, allowing overlap with upstream ORFs.
- upstream-noorf: all upstream regions, preventing overlap with upstream ORFs (sequences are clipped to discard upstream ORF sequences).
- Markov models: expected word (oligonucleotide) frequencies are calculated on the basis of the subword frequencies observed in the input sequence set. This calculation takes into account the higher-order dependencies between neighboring residues. For example, with a Markov chain of order 4:
P(GATAAC) = P(GATAA) * P(C|GATAA) = P(GATAA) * P(C|ATAA) = P(GATAA) * P(ATAAC) / P(ATAA)
Thus
Expected(GATAAC) = observed(GATAA) * observed(ATAAC) / observed(ATAA)
For words of size k, the highest possible order is k-2. A Markov order of 0 amounts to using observed residue frequencies for calculating expected oligomer frequencies (no dependency between neighboring residues). The higher the Markov order, the more stringent the analysis: specificity is increased, but there is a loss of sensitivity, i.e. some relevant motifs might be overlooked. The optimal Markov order depends on the size of the sequence set. For small gene families (e.g. 10 sequences of 800 bp), taking an order > 1 would result in a loss of sensitivity. For sequence sets of 1 Mb, a Markov chain of order 3 is optimal for hexanucleotides.
- Equiprobable residues from input sequence: (Note: this is equivalent to a Markov chain with order 0).
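The Markov estimate described above can be written as a short Python sketch. It assumes that subword frequencies are simple relative frequencies over all valid subwords of a given size in the input set; the function names subword_freqs and markov_expected_freq are hypothetical, and the program may normalize or handle strands differently.

```python
from collections import Counter

def subword_freqs(sequences, k):
    """Relative frequencies of all k-letter subwords made only of A, C, G, T."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            w = seq[i:i + k]
            if set(w) <= set("ACGT"):
                counts[w] += 1
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def markov_expected_freq(word, sequences, order):
    """Expected frequency of `word` under a Markov model of the given order.

    Numerator: product of the frequencies of all subwords of size order+1.
    Denominator: product of the frequencies of the inner subwords of size order.
    """
    k = len(word)
    if not 0 <= order <= k - 2:
        raise ValueError("order must be between 0 and len(word) - 2")
    high = subword_freqs(sequences, order + 1)
    low = subword_freqs(sequences, order) if order > 0 else {}
    p = 1.0
    for i in range(k - order):
        p *= high.get(word[i:i + order + 1], 0.0)
    if order > 0:
        for i in range(1, k - order):
            denom = low.get(word[i:i + order], 0.0)
            if denom == 0.0:
                return 0.0  # prefix never observed, so the word is not expected
            p /= denom
    return p

seqs = ["TTGATAACGGATAAGCATAACC"]
# Order 4 on a hexanucleotide: observed(GATAA) * observed(ATAAC) / observed(ATAA)
print(markov_expected_freq("GATAAC", seqs, order=4))
```

With order 0 the same function reduces to the product of the observed residue frequencies.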
Thresholds:
Thresholds can be imposed to select the most significantly overrepresented motifs. For probabilities, upper thresholds are used (i.e. values smaller than the threshold are returned). For occurrence numbers, matching sequences and significance indices, lower thresholds are used (i.e. all values higher than the threshold are returned). A threshold of 0 on the occurrence significance index is selected by default. This is the most efficient way we found to automatically select the biologically significant regulatory sites, irrespective of the oligonucleotide size and of the number and size of the sequences in the input set.
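As an illustration of a lower threshold on a significance index, the sketch below assumes (this is not stated on this page) that the index is the negative base-10 logarithm of an E-value, and that the E-value is a binomial P-value multiplied by the number of tested words. All function names are hypothetical.

```python
import math

def binomial_upper_tail(occ, n, p):
    """P(X >= occ) for X ~ Binomial(n, p), by direct summation (fine for small n)."""
    return sum(math.comb(n, x) * p**x * (1 - p)**(n - x) for x in range(occ, n + 1))

def significance(occ, n_positions, expected_freq, n_tests):
    """Assumed definition: sig = -log10(P-value * number of tested words)."""
    e_value = binomial_upper_tail(occ, n_positions, expected_freq) * n_tests
    if e_value == 0.0:
        return math.inf  # P-value underflowed to zero
    return -math.log10(e_value)

def select_words(rows, n_tests, sig_threshold=0.0):
    """Keep words whose significance is at least the lower threshold.

    `rows` holds (word, occurrences, positions, expected frequency) tuples.
    """
    kept = []
    for word, occ, n_positions, expected_freq in rows:
        sig = significance(occ, n_positions, expected_freq, n_tests)
        if sig >= sig_threshold:
            kept.append((word, sig))
    return kept

# Made-up numbers for illustration only.
rows = [("GATAAG", 12, 400, 0.005), ("ACGTAC", 2, 400, 0.005)]
print(select_words(rows, n_tests=4096))
```

Under this assumed definition, a threshold of 0 keeps the words whose E-value is at most 1.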