RSAT - peak-motifs manual

NAME
VERSION
DESCRIPTION
AUTHORS
CATEGORY
USAGE
INPUT FORMAT
OUTPUT FORMAT
OPTIONS
REFERENCES
SEE ALSO
WISH LIST

NAME

peak-motifs

VERSION

$program_version

DESCRIPTION

Pipeline for discovering motifs from ChIP-seq (or ChIP-chip, or ChIP-PET) peak sequences.

AUTHORS

Jacques van Helden <Jacques.van-Helden@univ-amu.fr>: Conception and implementation of the work flow + testing.
Morgane Thomas-Chollier <thomas-c@molgen.mpg.de>: Conception of the work flow + implementation of Web interface + testing.
Matthieu Defrance <defrance@ccg.unam.mx>: Implementation of the efficient algorithms used in the work flow (count-words, matrix-scan-quick, local-word-analysis).
Olivier Sand <oly@bigre.ulb.ac.be>: Web services.
Carl Herrmann <carl.herrmann@univmed.fr> and Denis Thieffry <thieffry@tagc.univ-mrs.fr>: Analysis of the case studies. Definition of optimal conditions of utilization. Motif comparisons and clustering.

USAGE

peak-motifs [-i inputfile] [-o outputfile] [-v #] [...]

INPUT FORMAT

The program takes as input either one (test) or two sequence files (test versus control).

All sequence formats supported as input by convert-sequences are accepted.

OUTPUT FORMAT

The pipeline runs a series of programs, each generating one or several result files. An HTML index is generated to summarize the results and give access to the individual result files.

The path of the index file is built from the output directory (option -outdir) and the file prefix (option -prefix).

  [output_dir]/[prefix]_synthesis.html
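
As an illustration, the path of the index file can be derived as follows. This is a minimal sketch: the function name and the example values are hypothetical, and only the naming pattern comes from the description above.

```python
from pathlib import Path

def synthesis_index_path(outdir, prefix):
    """Build [output_dir]/[prefix]_synthesis.html from -outdir and -prefix."""
    return Path(outdir) / f"{prefix}_synthesis.html"

# Hypothetical output directory and prefix.
print(synthesis_index_path("results/my_peaks", "my_peaks"))
```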

OPTIONS

-v #

Level of verbosity (detail in the warning messages during execution)

-h

Display full help message

-help

Same as -h

-i test_seq_file

Test peak sequence file (mandatory).

For single-set analysis, this file contains the peak sequences of the unique set. For test versus control analysis, it contains the test sequences.

-ctrl control_seq_file

Control peak sequence file (optional).

The control sequence file is used:

- as control sequence for oligo-diff

- to estimate the background models for oligo-analysis and dyad-analysis.

The control set is expected to contain a large number of sequences with no particular enrichment for any motif. The choice of appropriate background sequences is crucial to detect relevant motifs.

The file should be sufficiently large (several Mb) to provide a robust estimate of prior probabilities (frequencies expected at random) for oligonucleotides and dyads.

Typical examples of control sequences:

- random fragments of the genome of interest (e.g. obtained with random-genome-fragments)

- sets of sequences pulled down in a mock experiment (without the antibody) and characterized by ChIP-seq or ChIP-chip.

- sets of peaks for a compendium of transcription factors different from the factor of interest.

-max_seq_len msl

Maximal sequence length. Longer sequences are truncated to the specified length around the sequence center (from -msl/2 to +msl/2).
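
The truncation around the sequence center can be sketched as follows. This is an illustration only, not the code used by peak-motifs: the function name is hypothetical, and the rounding applied when the excess length is odd is an assumption.

```python
def truncate_around_center(seq, max_len):
    """Keep at most max_len residues centered on the middle of seq."""
    if max_len <= 0 or len(seq) <= max_len:
        # Nothing to truncate (a non-positive limit is treated as "no limit").
        return seq
    # Drop the same number of residues on both sides (assumed rounding:
    # when the excess is odd, the extra residue is dropped on the right).
    start = (len(seq) - max_len) // 2
    return seq[start:start + max_len]

print(truncate_around_center("AACCGGTT", 4))
```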

-top_peaks N

Restrict the analysis to the N peaks at the top of the test sequence file. Some peak calling programs return the peaks sorted by score. In such cases, the -top_peaks option restricts the analysis to the highest scoring peaks. The top-scoring peaks sometimes contain a higher density of binding sites, so that motifs are detected with a higher significance.

This option can also be convenient for performing quick tests, parameter selection and debugging before running the full analysis of large sequence sets.

-ref_motifs reference_motif

Reference motif (optional).

In some cases, a reference motif is already available, for example the motif annotated in some transcription factor database (e.g. RegulonDB, Jaspar, TRANSFAC) for the transcription factor of interest. These annotations may come from low-throughput experiments and rely on a small number of sites, but the reference motif may nevertheless be informative, because it is based on several independent studies.

Each discovered motif can be compared to the reference motif, in order to evaluate its correspondence with the binding motif of the factor of interest.

Reference motifs should be provided in TRANSFAC format (see convert-matrix for interconversions between matrix formats).

-motif_db db_name db_format db_file

File containing a database of transcription factor binding motifs (e.g. JASPAR, TRANSFAC, RegulonDB, ...) which will be compared to the discovered motifs (task motifs_vs_db).

The option requires three arguments:

 - DB name

 - matrix format. The supported formats are the same as for
   convert-matrix, but we recommend using a format that includes an
   ID and a name for each motif (e.g. TRANSFAC)

 - file containing the DB motifs

The option can be used repeatedly on the same command line in order to compare discovered motifs with several databases.

Examples:

 -motif_db TRANSFAC transfac transfac_download_dir/cgi-bin/data/matrix.dat

   will load a file containing all matrices from the TRANSFAC
   database.

 -motif_db JASPAR transfac jaspar_file.tf

   will load a file containing motifs from the JASPAR database that
   have previously been converted to TRANSFAC format.
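
The behaviour of a repeatable three-argument option can be illustrated with the Python argparse module. This is only a sketch of the command-line convention described above: peak-motifs itself does not use this code, and the function name, database name "MyDB" and file names are hypothetical.

```python
import argparse

def parse_motif_dbs(argv):
    """Collect every -motif_db occurrence as [db_name, db_format, db_file]."""
    parser = argparse.ArgumentParser(prog="motif-db-sketch")
    parser.add_argument(
        "-motif_db",
        nargs=3,              # the option requires exactly three arguments
        action="append",      # the option may be repeated
        default=[],
        metavar=("DB_NAME", "DB_FORMAT", "DB_FILE"),
    )
    return parser.parse_args(argv).motif_db

dbs = parse_motif_dbs([
    "-motif_db", "TRANSFAC", "transfac", "matrix.dat",
    "-motif_db", "MyDB", "transfac", "my_motifs.tf",
])
for name, fmt, path in dbs:
    print(name, fmt, path)
```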

-outdir output_directory

Output directory (mandatory).

The result files and index files produced by the different programs will be stored in this directory.

-prefix output_prefix

Prefix for the output files.

-title graph_title

Title displayed on top of the graphs.

-img_format img_format

Image format.

All the formats supported by XYgraph can be used.

-task

Specify a subset of tasks to be executed.

By default, the program runs all necessary tasks. However, in some cases, it can be useful to select one or several tasks to be executed separately.

Beware: task selection requires expertise, because most tasks depend on the prior execution of some other tasks in the workflow. Selecting tasks before their prerequisite tasks have been completed will cause fatal errors.

Default tasks

all (default)

Run all the default tasks.

purge

Purge input sequences (test set and, if specified, control set) to mask redundant fragments before applying pattern discovery algorithms. Sequence purging is necessary because redundant fragments would violate the hypothesis of independence underlying the binomial significance test, resulting in a large number of false positive patterns.
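
The binomial significance test mentioned above can be sketched as follows. This is a minimal illustration, not the implementation used by the pattern discovery programs: the function name and the example numbers are hypothetical, and the direct summation is only suitable for small numbers of trials. It computes the probability of observing at least x occurrences among n independent trials, each with prior probability p. Redundant fragments add occurrences that are not independent trials, which is why an unpurged set inflates significance.

```python
from math import comb

def binomial_upper_tail(x, n, p):
    """P(X >= x) for X following a binomial distribution B(n, p)."""
    first = max(x, 0)
    # Sum the probabilities of x, x+1, ..., n successes.
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(first, n + 1))

# Hypothetical example: at least 3 occurrences in 10 trials with p = 0.1.
print(binomial_upper_tail(3, 10, 0.1))
```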

seqlen

Compute sequence lengths and their distribution.

Sequence lengths are useful for the negative control (selection of random genome fragments).

Sequence length distribution is informative to get an idea about the variability of peak lengths.

composition

Compute compositional profiles, i.e. distributions of residues and dinucleotide frequencies per position (using position-analysis).

Residue profiles may reveal composition biases in the neighborhood of the peak sequences. Dinucleotide profiles can reveal (for example) an enrichment in CpG islands.

Note that peak-motifs also runs position-analysis with larger oligonucleotide length (see option -l) to detect motifs on the basis of positionally biased oligonucleotides (see task positions).

ref_motifs

This task combines various operations.

Formatting of the reference motif: Perform various format conversions for the reference motif (compute parameters, consensus, logo).
Motif enrichment: Generate an enriched motif by scanning the peak sequence set with the reference motif.
Motif comparison: Compare all discovered motifs with the reference motif.

oligos

Run oligo-analysis to detect over-represented oligonucleotides of a given length (k, specified with option -l) in the test set (van Helden et al., 1998). Prior frequencies of oligonucleotides are taken from a Markov model of order m (see option -markov) estimated from the test set sequences themselves.
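
The way a Markov model of order m yields the prior probability of a word can be sketched as follows. This is an illustration under simplifying assumptions (single strand, no pseudo-counts, hypothetical function name), not the code of oligo-analysis. The probability of the word is the frequency of its first m residues multiplied by the transition probabilities of each following residue given the m preceding ones.

```python
from collections import Counter

def markov_word_probability(word, sequences, order):
    """Probability of word under a Markov model of the given order,
    estimated from the sequences themselves (no pseudo-counts)."""
    m = order
    if m < 0 or len(word) < m:
        raise ValueError("order must be between 0 and the word length")
    prefixes = Counter()     # all m-mers, for the prefix probability
    contexts = Counter()     # m-mers that are followed by a residue
    transitions = Counter()  # (m+1)-mers: context plus next residue
    for seq in sequences:
        for i in range(len(seq) - m + 1):
            prefixes[seq[i:i + m]] += 1
        for i in range(len(seq) - m):
            contexts[seq[i:i + m]] += 1
            transitions[seq[i:i + m + 1]] += 1
    total = sum(prefixes.values())
    if total == 0:
        return 0.0
    prob = prefixes[word[:m]] / total
    for i in range(m, len(word)):
        context = word[i - m:i]
        if contexts[context] == 0:
            return 0.0  # context never observed in the sequences
        prob *= transitions[word[i - m:i + 1]] / contexts[context]
    return prob

# Order 0 (independent residues) on a toy sequence.
print(markov_word_probability("AC", ["ACGT"], 0))
```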

dyads

Run dyad-analysis to detect over-represented dyads, i.e. pairs of short oligonucleotides (monads) spaced by a region of fixed width but variable content (van Helden et al., 2000). Spaced motifs are typical of certain classes of transcription factors forming homo- or heterodimers.

By default, peak-motifs analyzes pairs of trinucleotides with any spacing between 0 and 20.

The expected frequency of each dyad is estimated as the product of its monad frequencies in the test sequences (option -bg monads of dyad-analysis).
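
The monad-based estimate can be sketched as follows. This is a simplified illustration (single strand, hypothetical function names), not the code of dyad-analysis: the expected frequency of a dyad is obtained by multiplying the relative frequencies of its two trinucleotides, irrespective of the spacing.

```python
from collections import Counter

def trinucleotide_frequencies(sequences):
    """Relative frequency of each trinucleotide, counting all occurrences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - 2):
            counts[seq[i:i + 3]] += 1
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()} if total else {}

def expected_dyad_frequency(monad1, monad2, freqs):
    """Expected dyad frequency = product of the two monad frequencies."""
    return freqs.get(monad1, 0.0) * freqs.get(monad2, 0.0)

freqs = trinucleotide_frequencies(["AAAACCCC"])
print(expected_dyad_frequency("AAA", "CCC", freqs))
```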

positions

Run position-analysis to detect oligonucleotides showing a positional bias, i.e. having a non-homogeneous distribution in the peak sequence set.

This method was initially developed to analyze termination and poly-adenylation signals in downstream sequences (van Helden, del Olmo and Perez-Ortin, 2000), and it turns out to be very efficient for detecting motifs centred on the ChIP-seq peaks. For ChIP-seq analysis, the reference position is the center of each sequence.

Note that peak-motifs also uses position-analysis for the task composition, in order to detect compositional biases (residues, dinucleotides) in the test sequence set.
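
The positional distribution of a word relative to the sequence centers can be sketched as follows. This is an illustration only (hypothetical function name, single strand, occurrence start used as position), not the code of position-analysis. Occurrences are grouped into windows whose width corresponds to the class interval (option -ci); a positionally biased word shows markedly uneven counts across windows.

```python
from collections import Counter

def positional_profile(sequences, word, class_interval):
    """Count occurrences of word per position window, relative to each center."""
    if not word or class_interval <= 0:
        raise ValueError("word must be non-empty and class_interval positive")
    profile = Counter()
    for seq in sequences:
        center = len(seq) // 2
        pos = seq.find(word)
        while pos != -1:
            relative = pos - center
            # Lower bound of the window containing this position.
            window = (relative // class_interval) * class_interval
            profile[window] += 1
            pos = seq.find(word, pos + 1)
    return dict(sorted(profile.items()))

print(positional_profile(["ACGTACGT"], "ACG", 2))
```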

local_words

Run local-word-analysis to detect locally over-represented oligonucleotides and dyads.

The program local-word-analysis (Matthieu Defrance, unpublished) tests the over-representation of each possible word (oligo, dyad) in positional windows in the test sequence set.

Two types of background models are supported: (i) Markov model of order m estimated locally (within the window under consideration); (ii) the frequency observed for a word in the whole sequence set is used as estimator of the prior probability of this word in the window.

In our first trials, this program gave excellent results with ChIP-seq datasets, because its sensitivity increases with large numbers of sequences (several hundreds/thousands), and its background model is more stringent than for programs computing the global over-representation (oligo-analysis, dyad-analysis).

motifs_vs_ref

Compare each discovered motif to the reference motifs.

motifs_vs_db

Compare each discovered motif to a database of known motifs (e.g. Jaspar, TRANSFAC, RegulonDB, UniProbe, ...)

timelog

Generate a log file summarizing the time spent in the different tasks.

synthesis

Generate the HTML file providing a synthesis of the results and pointing towards the individual result files.

Extra tasks

A few extra tasks are available, which are not executed by default. Those tasks are executed only when they are explicitly invoked with the option -task; they are not run by the option "-task all".

clean_seq

Delete the purged sequence files after the analysis, in order to save space.

meme_bg

Compute a MEME background model from the test sequences.

meme

Run the motif discovery program MEME on the test sequences.

Beware: the complexity of MEME is quadratic: the computing time increases as the square of sequence size. It is thus not recommended to use MEME for data sets exceeding 1Mb. If the test set contains many peaks, the option -task meme can be combined with a restriction on the number of top peaks to be considered (e.g. -top_peaks 500).

-nmotifs max_motif_number

Maximal number of motifs (matrices) to return for motif discovery algorithms. Note the distinction between the maximal number of motifs (matrices) and the maximal number of patterns (words, dyads): a motif generally corresponds to several mutually overlapping patterns (dyads, words).

-l oligo_len

Oligonucleotide length for word-counting approaches (oligo-analysis, position-analysis, local-word-analysis, oligo-diff).

In our experience, optimal results are obtained with hexanucleotides and heptanucleotides.

Note: the monad length used for dyad-analysis is not affected by this option. Instead, it is fixed to 3. Indeed, dyad-analysis can detect larger motifs by sampling various spacings between the two trinucleotide monads.

-minol oligo_min_len

-maxol oligo_max_len

Minimal (-minol) and maximal (-maxol) oligonucleotide lengths. If those options are used, the program iterates over the specified range of oligonucleotide lengths.

-markov

Order of the Markov model used to estimate expected oligonucleotide frequencies for oligo-analysis and local-word-analysis.

Higher-order Markov models are more stringent. Lower-order models are more sensitive, but tend to return a large number of false positives.

Markov models can be specified with either a positive or a negative value. A positive value indicates the length of the prefix in the transition matrix. A negative value indicates the order of the Markov model relative to the oligonucleotide length. For example, the option -markov -2 gives a model of order m=k-2 (thus, order 5 for heptanucleotides and order 4 for hexanucleotides).
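
The conversion from the value of -markov to the actual order can be sketched as follows. The function name is hypothetical, and treating 0 as an absolute order and rejecting negative results are assumptions of this sketch.

```python
def resolve_markov_order(markov, oligo_len):
    """Return the Markov order m for oligonucleotides of length k = oligo_len.

    Non-negative values are taken as the order itself; negative values
    are relative to the oligonucleotide length (m = k + markov).
    """
    order = markov if markov >= 0 else oligo_len + markov
    if order < 0:
        raise ValueError("relative Markov order exceeds the oligonucleotide length")
    return order

for k in (6, 7):
    print(k, resolve_markov_order(-2, k))
```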

The optimal Markov order depends on the number of sequences in the test set. Since ChIP-seq data typically contain hundreds to thousands of peaks, high Markov orders generally work well, because they are stringent yet still sensitive enough. In our experience, motifs are well detected with the most stringent Markov order (-markov -2).

-min_markov min_markov_order

-max_markov max_markov_order

A minimal and a maximal value can be specified for the Markov order. The program then iterates over all Markov orders between min_markov_order and max_markov_order.

-1str | -2str

Single-strand (-1str) or double-strand (-2str) analysis.

The default is double-strand analysis, since ChIP-seq results have no particular strand orientation.

-noov | -ovlp

Treatment of self-overlapping words for motif discovery: count (-ovlp) or do not count (-noov) overlapping occurrences. In -noov mode, only renewing occurrences are counted.

It is recommended to use the -noov mode (default) to avoid the effect of self-overlap, which violates the hypothesis of independence of successive occurrences underlying the binomial significance test (oligo-analysis, dyad-analysis).

Beware: the options -noov and -ovlp only apply to motif discovery, not to compositional profiles. Dinucleotide frequencies are always computed with the option -ovlp (count all occurrences). Since those composition profiles further serve to estimate the probability of larger words, which may include repeated residues, all dinucleotide occurrences must be counted. With the -noov mode (renewing occurrences only), the transition tables of the first-order Markov model would be unbalanced: the expected frequency of all the repeated dinucleotides (AA, TT, CC, GG) would be under-estimated, leading to an under-estimation of the expected frequency of repeat-containing words (e.g. AAAAAA, AAAGGG, ...).
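
The difference between the two counting modes can be sketched as follows. This is a single-strand illustration with a hypothetical function name, not the counting code of the RSAT programs. In the renewing mode, the search resumes after the end of each counted occurrence, so a self-overlapping word such as AA is counted fewer times than it actually occurs.

```python
def count_occurrences(seq, word, allow_overlap):
    """Count occurrences of word in seq.

    allow_overlap=True  counts every occurrence (as with -ovlp);
    allow_overlap=False counts renewing occurrences only (as with -noov).
    """
    if not word:
        raise ValueError("word must not be empty")
    step = 1 if allow_overlap else len(word)
    count = 0
    pos = seq.find(word)
    while pos != -1:
        count += 1
        pos = seq.find(word, pos + step)
    return count

print(count_occurrences("AAAAAA", "AA", True), count_occurrences("AAAAAA", "AA", False))
```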

-ci class_interval

Class interval for position-analysis.

REFERENCES

The program peak-motifs combines a series of tried-and-tested programs which have been detailed in the following publications.

oligo-analysis: van Helden, J., Andre, B. and Collado-Vides, J. (1998). Extracting regulatory sites from the upstream region of yeast genes by computational analysis of oligonucleotide frequencies. J Mol Biol 281, 827-42.
dyad-analysis: van Helden, J., Rios, A. F. and Collado-Vides, J. (2000). Discovering regulatory elements in non-coding sequences by analysis of spaced dyads. Nucleic Acids Res 28, 1808-18.
position-analysis: van Helden, J., del Olmo, M. and Perez-Ortin, J. E. (2000). Statistical analysis of yeast genomic downstream sequences reveals putative polyadenylation signals. Nucleic Acids Res 28, 1000-10.
matrix-scan: Turatsinze, J. V., Thomas-Chollier, M., Defrance, M. and van Helden, J. (2008). Using RSAT to scan genome sequences for transcription factor binding sites and cis-regulatory modules. Nat Protoc 3, 1578-88.

WISH LIST

background models from ctrl sequences

Estimate background models from control sequences, for oligo-analysis, dyad-analysis, and local-word-analysis. This should in principle reduce the rate of false positives.

partial synthesis

For the Web server: generate temporary synthetic table showing the results already obtained so far, and finishing by a message "Partial results, please don't forget to reload the file later".

motif_cluster

Compare all discovered motifs (plus reference motif if specified) and cluster them in order to extract a consensus motif.

weeder

Add a task to run Weeder on the peak sequences.

 weederlauncher.out input organism large S M T5

all_oligos

Run oligo-analysis without any threshold in order to produce a plot of observed versus expected occurrences for all the oligonucleotides. This analysis is performed with the option -two_tails, which detects both under- and over-represented patterns.

full HTML report

- link to the directories for each algorithm/task

- link from the result page to the link table returned by position-analysis (file *_graph_index.html).