info-gibbs
2009 by Matthieu Defrance
info-gibbs is a motif discovery program based on a Gibbs sampling strategy. Given a set of sequences, a motif length, and a background model, it searches for the motifs (PSSMs) that have the highest relative entropy (information content).
Defrance M. and van Helden J. info-gibbs: a motif discovery algorithm that directly optimizes information content during sampling, Bioinformatics. 2009;25:2715-2722.
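To illustrate the quantity mentioned above, here is a minimal sketch of how the information content (relative entropy) of a set of aligned sites can be computed. It is not the info-gibbs source code: the function name `information_content`, the Bernoulli (order 0) background, the pseudocount scheme, and the choice of bits (log base 2) are assumptions made for this example.

```python
import math

ALPHABET = "ACGT"


def information_content(sites, background, pseudocount=1.0):
    """Relative entropy of aligned sites against an order-0 background.

    sites       -- list of equal-length strings over A, C, G, T
    background  -- dict mapping each residue to its prior probability
    pseudocount -- total pseudo-weight per column, shared out
                   according to the background (assumption)
    Returns the sum over columns of sum_r f(r) * log2(f(r) / p(r)).
    """
    width = len(sites[0])
    total = 0.0
    for i in range(width):
        # Count residues observed in column i.
        counts = {r: 0 for r in ALPHABET}
        for site in sites:
            counts[site[i]] += 1
        n = len(sites) + pseudocount
        for r in ALPHABET:
            f = (counts[r] + pseudocount * background[r]) / n
            if f > 0:  # a zero frequency contributes nothing
                total += f * math.log2(f / background[r])
    return total


uniform = {r: 0.25 for r in ALPHABET}
sites = ["TGACTC", "TGACTA", "TGAGTC", "TGACTC"]
print(information_content(sites, uniform))
```

A perfectly conserved column scores 2 bits against a uniform background (with no pseudocount), and a column that matches the background scores 0, so motifs that stand out from the background get the highest totals.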
Input sequence:
The sequences to analyze. Multiple sequences can be entered at once, in any of several sequence formats.
Format:
Input sequence format. Various standards are supported.
Sequence type:
Only A, C, G, and T residues are accepted. Oligomers that contain undefined (N) or partly defined (IUPAC code) nucleotides are discarded.
Purge sequences (highly recommended)
When checked, large duplicated regions (>= 40 bp alignment with fewer than 3 mismatches) are filtered out before analysis. Purging is essential for any motif discovery process, to avoid the bias caused by non-independent sequences. Purging is performed with the programs mkvtree and vmatch, developed by Stefan Kurtz (kurtz@zbh.uni-hamburg.de).
Search both strands: (single or both strands)
When "search both strands" is selected, occurrences of the motif are searched on both strands. This makes it possible to detect elements that act in an orientation-insensitive way (as is generally the case for yeast upstream elements).
Matrix length:
The length of the motif. For the detection of regulatory sites, we recommend starting with a length between 6 and 16.
Expected number of matches per sequences:
This option specifies the number of motif occurrences expected in each input sequence.
Number of motif to extract:
This option makes it possible to search for more than one motif in the input sequences.
Maximum number of iterations:
This option sets the maximum number of iterations of the algorithm.
Number of runs:
Because info-gibbs is stochastic, it can return different results each time it is run. The core algorithm should be repeated a sufficient number of times to produce useful results.
Background:
The background model used to compute expected frequencies can be computed from the input sequences or loaded from a predefined Markov model. In the latter case, a target organism and a Markov order should be selected.
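As an illustration of the first option, the following sketch estimates a Markov background model from a set of input sequences. It is a hedged example and not the procedure actually used by info-gibbs: the function name `markov_background`, the add-one pseudocount, and the dictionary layout of the model are assumptions made here. Words that contain residues other than A, C, G, or T are skipped.

```python
from collections import defaultdict

ALPHABET = "ACGT"


def markov_background(sequences, order=1, pseudocount=1.0):
    """Estimate P(residue | previous `order` residues) from sequences.

    Returns a dict mapping each observed prefix (a string of length
    `order`) to a dict of transition probabilities over A, C, G, T.
    With order=0 the only prefix is the empty string, and the model
    reduces to plain residue frequencies.
    """
    counts = defaultdict(lambda: {r: 0.0 for r in ALPHABET})
    valid = set(ALPHABET)
    for seq in sequences:
        seq = seq.upper()
        for i in range(order, len(seq)):
            prefix, residue = seq[i - order:i], seq[i]
            # Skip words containing N or other IUPAC ambiguity codes.
            if set(prefix + residue) <= valid:
                counts[prefix][residue] += 1

    model = {}
    for prefix, row in counts.items():
        total = sum(row.values()) + len(ALPHABET) * pseudocount
        model[prefix] = {r: (row[r] + pseudocount) / total for r in ALPHABET}
    return model


model = markov_background(["ACGTACGTTTGACN", "GGCATGCA"], order=1)
for prefix in sorted(model):
    print(prefix, model[prefix])
```

Higher orders capture more of the sequence context but need more data: an order-k model has 4^k prefixes, each of which must be observed often enough for its transition probabilities to be reliable.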