RSAT - local-word-analysis manual



2006-2008 by Matthieu Defrance


Calculates oligonucleotide occurrences in a set of sequences, and detects locally overrepresented or underrepresented oligonucleotides.


Input sequence:
The sequence(s) to analyze. Multiple sequences can be entered at once, in any of the supported sequence formats.

Input sequence format:
Various standards are supported.

Sequence type:
Only A, C, G, and T residues are accepted. Oligomers that contain undefined (N) or partly defined (IUPAC code) nucleotides are discarded from the counts.
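
The filtering rule above can be illustrated with a minimal Python sketch. This is not the program's own code: the function name count_oligos and the choice to convert the input to upper case are assumptions made for the example.

```python
from collections import Counter

VALID = set("ACGT")

def count_oligos(sequence, k):
    """Count overlapping k-mers, skipping any k-mer that contains a
    residue other than A, C, G or T (e.g. N or an IUPAC code)."""
    seq = sequence.upper()  # assumption: case is ignored
    counts = Counter()
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if set(word) <= VALID:
            counts[word] += 1
    return counts

# Trinucleotides overlapping the N or the R are not counted.
counts = count_oligos("ACGTNACGRAC", 3)
print(dict(counts))
```

Only the words made entirely of defined residues appear in the result; every window that overlaps the N or the R is dropped.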

Purge sequences (highly recommended)
When checked, large duplicated regions (alignments of at least 40 bp with fewer than 3 mismatches) are filtered out before analysis. Purging is essential for any motif discovery process, to avoid a bias due to non-independence of sequences. Purging is performed with the programs mkvtree and vmatch developed by Stefan Kurtz (

Motif length:
The analysis can be performed with:

Count on: (single or both strands)
By selecting "both strands", the occurrences of each oligonucleotide are summed on both strands. This makes it possible to detect elements that act in an orientation-insensitive way (as is generally the case for yeast upstream elements).
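
As a rough illustration of strand summing, the sketch below groups each word with its reverse complement and adds their single-strand counts. The function names and the convention of labelling each pair by its alphabetically smaller member are assumptions for the example, not a description of the program's output format.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(word):
    """Return the reverse complement of a DNA word."""
    return word.translate(COMPLEMENT)[::-1]

def sum_both_strands(single_strand_counts):
    """Sum the counts of each word and of its reverse complement.
    Each pair is stored under its alphabetically smaller member
    (an arbitrary labelling choice made for this sketch)."""
    merged = Counter()
    for word, n in single_strand_counts.items():
        key = min(word, reverse_complement(word))
        merged[key] += n
    return merged

# CTTATC is the reverse complement of GATAAG; GAATTC is its own
# reverse complement, so its count is left unchanged.
merged = sum_both_strands({"GATAAG": 3, "CTTATC": 2, "GAATTC": 1})
print(dict(merged))
```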

Align: (Right or Left)
By selecting "Right", the positions in all input sequences are computed relative to the right bound of each sequence. Align "Right" should be used when dealing with a set of upstream sequences that have different lengths. By selecting "Left", the positions in all input sequences are computed relative to the left bound of each sequence. Align "Left" should be used when dealing with a set of downstream sequences that have different lengths.
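
The effect of the alignment choice can be sketched as a simple coordinate change. The function below is hypothetical, and the coordinate convention (0-based start indices, negative coordinates counted back from the right bound) is an assumption chosen for the example.

```python
def aligned_position(start, seq_length, align="right"):
    """Convert a 0-based start index into a coordinate relative to
    the chosen bound of the sequence.

    align="right": coordinates run from -seq_length to -1.
    align="left":  coordinates run from 0 to seq_length - 1.
    """
    if align == "right":
        return start - seq_length
    if align == "left":
        return start
    raise ValueError("align must be 'right' or 'left'")

# Two upstream sequences of different lengths: an occurrence located
# 100 bp before the right bound gets the same coordinate in both.
print(aligned_position(700, 800, "right"))
print(aligned_position(400, 500, "right"))
```

With right alignment, occurrences at the same distance from the right bound fall in the same positional window even when the sequences differ in length.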

Window Width:
Different window sizes can be used to search for locally overrepresented motifs:

Background Window Width:
The background model can be split into several segments of fixed width.

Prevent overlapping matches
Periodic oligonucleotides (e.g. AAAAAA, ATATAT) have an aggregative tendency, i.e. each occurrence of such a pattern strongly favors additional occurrences in its immediate vicinity. This introduces a bias in most statistics (binomial, log-likelihood). A simple way to correct for this bias is to avoid counting mutually overlapping occurrences twice.
For example, TATATATATATA would represent
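
The difference between the two counting modes can be sketched in Python as follows. The function name is hypothetical, and resolving overlaps by scanning greedily from left to right is an assumption for this example, not necessarily the rule the program applies.

```python
def count_occurrences(sequence, word, allow_overlap=True):
    """Count occurrences of word in sequence.

    With allow_overlap=False, the search resumes after the end of
    each match, so mutually overlapping occurrences are counted once
    (greedy left-to-right scan, an assumption of this sketch)."""
    step = 1 if allow_overlap else len(word)
    count = 0
    i = sequence.find(word)
    while i != -1:
        count += 1
        i = sequence.find(word, i + step)
    return count

seq = "TATATATATATA"
print(count_occurrences(seq, "TATATA", allow_overlap=True))   # 4
print(count_occurrences(seq, "TATATA", allow_overlap=False))  # 2
```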

Expected frequency:
Various probabilistic models can be used to estimate the expected frequency of each oligonucleotide.

Warning! The results are strongly affected by the choice of expected frequency, which is the main specificity of this program. It has been shown that, for the detection of regulatory sites in yeast upstream sequences, the best choice is to estimate the expected oligonucleotide frequencies on the basis of the frequencies observed in the set of all non-coding upstream sequences from the genome. For the same purpose, choosing "equiprobable residues" would be ineffective, and "Residue frequencies from input sequence" works poorly.
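
To make the two simplest models named above concrete, the sketch below computes the expected frequency of a word as the product of its residue frequencies. It assumes that residues are independent, which is our reading of the "equiprobable residues" and "Residue frequencies from input sequence" options rather than a statement of the program's internals. The function names are hypothetical, and the recommended genome-wide upstream background, which requires precomputed frequency tables, is not covered.

```python
from collections import Counter
from math import prod

def residue_frequencies(sequences):
    """Estimate A, C, G, T frequencies from the input sequences,
    ignoring any other character."""
    counts = Counter()
    for seq in sequences:
        counts.update(c for c in seq.upper() if c in "ACGT")
    total = sum(counts.values())
    return {r: counts[r] / total for r in "ACGT"}

def expected_word_frequency(word, residue_freq):
    """Expected frequency of a word, assuming independent residues."""
    return prod(residue_freq[r] for r in word)

equiprobable = {r: 0.25 for r in "ACGT"}
from_input = residue_frequencies(["AAAAAATT"])  # toy AT-rich input

print(expected_word_frequency("AAA", equiprobable))
print(expected_word_frequency("AAA", from_input))
```

On an AT-rich input, the two models give very different expectations for the same word, which is one way to see why the choice of background model affects the results so strongly.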

Thresholds can be imposed to select the most significantly overrepresented motifs. For probabilities, upper thresholds are used (i.e. values smaller than the threshold are returned). For occurrence numbers, matching sequences and significance indices, lower thresholds are used (i.e. all values higher than the threshold are returned). A threshold of 0 on the occurrence significance index is selected by default. This is the most efficient way we found to automatically select the biologically significant regulatory sites, irrespective of oligonucleotide size and of the number and size of the sequences in the input set.
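
The way upper and lower thresholds combine can be sketched as a simple filter. Everything in this example is hypothetical: the field names (occ, pval, sig), the function name, the made-up rows, and the decision to keep values that are exactly equal to a threshold.

```python
def select_motifs(rows, max_pval=None, min_occ=None, min_sig=0.0):
    """Keep the rows that pass every threshold that is set.
    A threshold set to None is not applied."""
    kept = []
    for row in rows:
        if max_pval is not None and row["pval"] > max_pval:
            continue  # upper threshold: smaller probabilities pass
        if min_occ is not None and row["occ"] < min_occ:
            continue  # lower threshold: higher occurrence counts pass
        if min_sig is not None and row["sig"] < min_sig:
            continue  # lower threshold: higher significance passes
        kept.append(row)
    return kept

# Made-up rows for illustration only, not program output.
rows = [
    {"word": "CACGTG", "occ": 12, "pval": 1e-6, "sig": 3.2},
    {"word": "AAAAAA", "occ": 40, "pval": 0.2, "sig": -1.5},
    {"word": "GATAAG", "occ": 2, "pval": 1e-3, "sig": 0.4},
]

default_selection = [r["word"] for r in select_motifs(rows)]
strict_selection = [r["word"] for r in select_motifs(rows, min_occ=5)]
print(default_selection)
print(strict_selection)
```

With only the default threshold of 0 on the significance index, the frequent but non-significant word is dropped; adding a lower threshold on occurrences narrows the selection further.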