peak-motifs
$program_version
Pipeline for discovering motifs from ChIP-seq (or ChIP-chip, or ChIP-PET) peak sequences.
Conception and implementation of the work flow + testing.
Conception of the work flow + implementation of Web interface + testing.
Implementation of the efficient algorithms used in the work flow (count-words, matrix-scan-quick, local-word-analysis).
Web services.
Analysis of the case studies. Definition of optimal conditions of utilization. Motif comparisons and clustering.
motif discovery
peak-motifs [-i inputfile] [-o outputfile] [-v #] [...]
The program takes as input either one (test) or two sequence files (test versus control).
All sequence formats accepted as input by convert-sequences are supported.
The pipeline runs a series of programs, each generating one or several result files. An HTML index is generated to summarize the results and give access to the individual result files.
The path of the index file is built from the output directory (option -outdir) and the file prefix (option -prefix).
[output_dir]/[prefix]_synthesis.html
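As an illustration, the index path can be derived from the two options as follows. This is a minimal sketch: the function name and the example values ("results", "my_peaks") are hypothetical, and only the [output_dir]/[prefix]_synthesis.html pattern comes from the description above.

```python
import os


def synthesis_index_path(outdir: str, prefix: str) -> str:
    """Build [output_dir]/[prefix]_synthesis.html from -outdir and -prefix."""
    return os.path.join(outdir, f"{prefix}_synthesis.html")


# Hypothetical values for -outdir and -prefix.
print(synthesis_index_path("results", "my_peaks"))
```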
Level of verbosity (detail in the warning messages during execution)
Display full help message
Same as -h
Test peak sequence file (mandatory).
For single-set analysis, this file contains the peak sequences of the unique set. For test versus control analysis, it contains the test sequences.
Control peak sequence file (optional).
The control sequence file is used:
- as control sequence for oligo-diff
- to estimate the background models for oligo-analysis and dyad-analysis.
Control sequences are supposed to contain a large number of sequences without particular enrichment for any motif. The choice of appropriate background sequences is crucial to detect relevant motifs.
The file should be sufficiently large (several Mb) to provide a robust estimate of prior probabilities (frequencies expected at random) for oligonucleotides and dyads.
Typical examples of control sequences:
- random fragments of the genome of interest (e.g. obtained with random-genome-fragments)
- sets of sequences pulled down in a mock experiment (without the antibody) and characterized by ChIP-seq or ChIP-chip.
- sets of peaks for a compendium of transcription factors different from the factor of interest.
Maximal sequence length. Larger sequences are truncated to the specified length around the sequence center (from -msl/2 to +msl/2).
Restrict the analysis to the N peaks at the top of the test sequence file. Some peak calling programs return the peaks sorted by score. In such cases, the -top_peaks option makes it possible to restrict the analysis to the highest-scoring peaks. The top-scoring peaks may contain a higher density of binding sites, so that motifs are detected with a higher significance.
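The truncation can be sketched as follows. The function name is hypothetical, and the rounding used when the sequence length and the maximal length have different parities is an assumption, not necessarily what peak-motifs does.

```python
def clip_around_center(seq: str, max_len: int) -> str:
    """Keep at most max_len residues centred on the middle of the sequence."""
    if max_len <= 0:
        raise ValueError("max_len must be a positive integer")
    if len(seq) <= max_len:
        return seq  # shorter sequences are left untouched
    # Assumption: the clipped window is shifted to the left when the
    # excess length cannot be split evenly between the two ends.
    start = (len(seq) - max_len) // 2
    return seq[start:start + max_len]


print(clip_around_center("AAACGTTT", 4))  # prints ACGT
```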
This option can also be convenient for performing quick tests, parameter selection and debugging before running the full analysis of large sequence sets.
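The effect of restricting the analysis to the top peaks can be sketched as below, assuming a FASTA input already sorted by decreasing peak score. The helper names are hypothetical and the parser is deliberately minimal.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def top_peaks(lines: Iterable[str], n: int) -> List[Tuple[str, str]]:
    """Return the first n records, i.e. the top of the sequence file."""
    return list(islice(read_fasta(lines), n))


fasta = [">p1", "ACGT", "AC", ">p2", "GG", ">p3", "TT"]
print(top_peaks(fasta, 2))
```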
Reference motif (optional).
In some cases, a reference motif is already available, for example the motif annotated in a transcription factor database (e.g. RegulonDB, Jaspar, TRANSFAC) for the transcription factor of interest. These annotations may come from low-throughput experiments and rely on a small number of sites, but the reference motif may nevertheless be informative, because it is based on several independent studies.
Each discovered motif can be compared to the reference motif, in order to evaluate its correspondence with the binding motif of the factor of interest.
Reference motifs should be provided in TRANSFAC format (see convert-matrix for interconversions between matrix formats).
File containing a database of transcription factor binding motifs (e.g. JASPAR, TRANSFAC, RegulonDB, ...) which will be compared to the discovered motifs (task motifs_vs_db).
The option requires three arguments:
- DB name
- matrix format. The supported formats are the same as for convert-matrix, but we recommend using a format that includes an ID and a name for each motif (e.g. TRANSFAC)
- file containing the DB motifs
The option can be used repeatedly on the same command line in order to compare discovered motifs with several databases.
Examples:
-motif_db TRANSFAC transfac transfac_download_dir/cgi-bin/data/matrix.dat
will load a file containing all matrices from the TRANSFAC database.
-motif_db JASPAR jaspar jaspar_file.tf
will load a file containing motifs from the JASPAR database that have previously been converted to TRANSFAC format.
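When peak-motifs is launched from a script, the repeated option can be assembled programmatically. The sketch below only builds an argument list and does not run anything. The database entries are copied from the examples above, whereas the input file, output directory and prefix are hypothetical.

```python
from typing import List, Sequence, Tuple

MotifDb = Tuple[str, str, str]  # (DB name, matrix format, file)


def motif_db_args(databases: Sequence[MotifDb]) -> List[str]:
    """Emit one '-motif_db name format file' group per database."""
    args: List[str] = []
    for name, matrix_format, path in databases:
        args.extend(["-motif_db", name, matrix_format, path])
    return args


databases = [
    ("TRANSFAC", "transfac", "transfac_download_dir/cgi-bin/data/matrix.dat"),
    ("JASPAR", "jaspar", "jaspar_file.tf"),
]
# Hypothetical input file, output directory and prefix.
command = ["peak-motifs", "-i", "peaks.fasta",
           "-outdir", "results", "-prefix", "my_peaks"]
command += motif_db_args(databases)
print(" ".join(command))
```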
Output directory (mandatory).
The result files and index files produced by the different programs will be stored in this directory.
Prefix for the output files.
Title displayed on top of the graphs.
Image format.
All the formats supported by XYgraph can be used.
Specify a subset of tasks to be executed.
By default, the program runs all necessary tasks. However, in some cases, it can be useful to select one or several tasks to be executed separately.
Beware: task selection requires expertise, because most tasks depend on the prior execution of other tasks in the workflow. Selecting tasks before their prerequisite tasks have been completed will cause fatal errors.
Default tasks
Run all the default tasks.
Purge input sequences (test set and, if specified, control set) to mask redundant fragments before applying pattern discovery algorithms. Sequence purging is necessary because redundant fragments would violate the hypothesis of independence underlying the binomial significance test, resulting in a large number of false positive patterns.
Compute sequence lengths and their distribution.
Sequence lengths are useful for the negative control (selection of random genome fragments).
Sequence length distribution is informative to get an idea about the variability of peak lengths.
Compute compositional profiles, i.e. distributions of residues and dinucleotide frequencies per position (using position-analysis).
Residue profiles may reveal composition biases in the neighborhood of the peak sequences. Dinucleotide profiles can reveal (for example) an enrichment in CpG islands.
Note that peak-motifs also runs position-analysis with larger oligonucleotide lengths (see option -l) to detect motifs on the basis of positionally biased oligonucleotides (see task positions).
This task combines various operations.
Perform various format conversions for the reference motif (compute parameters, consensus, logo).
Generate an enriched motif by scanning the peak sequence set with the reference motif.
Compare all discovered motifs with the reference motif.
Run oligo-analysis to detect over-represented oligonucleotides of a given length (k, specified with option -l) in the test set (van Helden et al., 1998). Prior frequencies of oligonucleotides are taken from a Markov model of order m (see option -markov) estimated from the test set sequences themselves.
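To make the role of the Markov model concrete, the sketch below estimates a model of order m from a set of sequences and uses it to compute the prior probability of a word. It is a simplified illustration, not the oligo-analysis implementation: it counts on a single strand, uses no pseudo-counts, and all function names are hypothetical.

```python
from collections import Counter
from typing import List

ALPHABET = "ACGT"


def count_kmers(seqs: List[str], k: int) -> Counter:
    """Count all (overlapping) k-mers on the given strand of each sequence."""
    counts: Counter = Counter()
    for seq in seqs:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def markov_word_probability(word: str, seqs: List[str], order: int) -> float:
    """P(word) = P(prefix of length m) * product of transition probabilities."""
    if not 0 <= order < len(word):
        raise ValueError("order must satisfy 0 <= order < len(word)")
    transitions = count_kmers(seqs, order + 1)
    prob = 1.0
    if order > 0:
        prefixes = count_kmers(seqs, order)
        total = sum(prefixes.values())
        if total == 0:
            return 0.0
        prob = prefixes[word[:order]] / total
    for i in range(order, len(word)):
        context = word[i - order:i]
        denom = sum(transitions[context + r] for r in ALPHABET)
        if denom == 0:
            return 0.0  # context never observed in the training sequences
        prob *= transitions[context + word[i]] / denom
    return prob


print(markov_word_probability("AC", ["ACGT"], 0))  # prints 0.0625
```

With order 0 the model reduces to the product of residue frequencies (Bernoulli model). Higher orders condition each residue on the m preceding ones.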
Run dyad-analysis to detect over-represented dyads, i.e. pairs of short oligonucleotides (monads) spaced by a region of fixed width but variable content (van Helden et al., 2000). Spaced motifs are typical of certain classes of transcription factors forming homo- or heterodimers.
By default, peak-motifs analyzes pairs of trinucleotides with any spacing between 0 and 20.
The expected frequency of each dyad is estimated as the product of its monad frequencies in the test sequences (option -bg monads of dyad-analysis).
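The monad-based background can be sketched as follows. This is a simplified illustration with hypothetical function names: monads are counted on a single strand with overlapping occurrences, which is not necessarily how dyad-analysis estimates them.

```python
from collections import Counter
from typing import Dict, List


def monad_frequencies(seqs: List[str], monad_len: int = 3) -> Dict[str, float]:
    """Relative frequency of each monad (trinucleotide by default)."""
    counts = Counter(
        seq[i:i + monad_len]
        for seq in seqs
        for i in range(len(seq) - monad_len + 1)
    )
    total = sum(counts.values())
    return {monad: n / total for monad, n in counts.items()} if total else {}


def expected_dyad_frequency(freqs: Dict[str, float],
                            monad1: str, monad2: str) -> float:
    """Expected dyad frequency = product of the two monad frequencies."""
    return freqs.get(monad1, 0.0) * freqs.get(monad2, 0.0)


freqs = monad_frequencies(["ACGTAC"])
print(expected_dyad_frequency(freqs, "ACG", "TAC"))  # prints 0.0625
```

Under this model the expected frequency of a dyad does not depend on its spacing, since only the two monad frequencies enter the product.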
Run position-analysis to detect oligonucleotides showing a positional bias, i.e. have a non-homogeneous distribution in the peak sequence set.
This method was initially developed to analyze termination and poly-adenylation signals in downstream sequences (van Helden, del Olmo and Perez-Ortin, 2000), and it turns out to be very efficient for detecting motifs centred on the ChIP-seq peaks. For ChIP-seq analysis, the reference position is the center of each sequence.
Note that peak-motifs also uses position-analysis for the task composition, in order to detect compositional biases (residues, dinucleotides) in the test sequence set.
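The notion of positional bias can be illustrated by binning the occurrences of a word according to their position relative to the sequence center. The sketch below only builds the histogram and performs no statistical test. The function name, the use of the word midpoint as its position, and the labelling of each class by its lower bound are assumptions rather than the position-analysis conventions.

```python
import math
from collections import Counter
from typing import Dict, List


def positional_histogram(seqs: List[str], word: str,
                         class_interval: int) -> Dict[int, int]:
    """Count occurrences of word per class of positions relative to the center."""
    if not word or class_interval <= 0:
        raise ValueError("word must be non-empty and class_interval positive")
    hist: Counter = Counter()
    for seq in seqs:
        center = len(seq) / 2
        i = seq.find(word)
        while i != -1:
            # Position of the word midpoint relative to the sequence center.
            rel = i + len(word) / 2 - center
            lower_bound = math.floor(rel / class_interval) * class_interval
            hist[lower_bound] += 1
            i = seq.find(word, i + 1)
    return dict(sorted(hist.items()))


seqs = ["TTTTACGTTTTT", "ACGTTTTTTTTT"]
print(positional_histogram(seqs, "ACGT", 5))  # prints {-5: 1, 0: 1}
```

A word concentrated around the peak centers would show most of its counts in the classes close to 0, whereas a word without positional bias would be spread evenly over the classes.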
Run local-word-analysis to detect locally over-represented oligonucleotides and dyads.
The program local-word-analysis (Matthieu Defrance, unpublished) tests the over-representation of each possible word (oligo, dyad) in positional windows in the test sequence set.
Two types of background models are supported: (i) a Markov model of order m estimated locally (within the window under consideration); (ii) the frequency observed for a word in the whole sequence set, used as estimator of the prior probability of this word in the window.
In our first trials, this program gave excellent results on ChIP-seq datasets, because its sensitivity increases with the number of sequences (several hundreds to thousands), and its background model is more stringent than that of programs computing global over-representation (oligo-analysis, dyad-analysis).
Compare each discovered motif to the reference motifs.
Compare each discovered motif to a database of known motifs (e.g. Jaspar, TRANSFAC, RegulonDB, UniProbe, ...)
Generate a log file summarizing the time spent in the different tasks.
Generate the HTML file providing a synthesis of the results and pointing towards the individual result files.
Extra tasks
A few extra tasks are available, which are not executed by default. Those tasks are executed only when they are explicitly invoked with the option -task; they are not called with the option "-task all".
Delete the purged sequence files after the analysis, in order to save space.
Compute meme background model from the test sequences.
Run the motif discovery program MEME on the test sequences.
Beware: the complexity of MEME is quadratic: the computing time increases as the square of sequence size. It is thus not recommended to use MEME for data sets exceeding 1Mb. If the test set contains many peaks, the option -task meme can be combined with a restriction on the number of top peaks to be considered (e.g. -top_peaks 500).
Maximal number of motifs (matrices) to return for motif discovery algorithms. Note the distinction between the maximal number of motifs (matrices) and the maximal number of patterns (words, dyads): a motif generally corresponds to several mutually overlapping patterns (dyads, words).
Oligonucleotide length for word-counting approaches (oligo-analysis, position-analysis, local-word-analysis, oligo-diff).
In our experience, optimal results are obtained with hexanucleotides and heptanucleotides.
Note: the monad length used for dyad-analysis is not affected by those options. Instead, it is fixed to 3. Indeed, dyad-analysis can detect larger motifs by sampling various spacings between the two trinucleotide monads.
Minimal (-minol) and maximal (-maxol) oligonucleotide lengths. If those options are used, the program iterates over the specified range of oligonucleotide lengths.
Order of the Markov model used to estimate expected oligonucleotide frequencies for oligo-analysis and local-word-analysis.
Higher-order Markov models are more stringent; lower-order models are more sensitive but tend to return a large number of false positives.
Markov models can be specified with either a positive or a negative value. Positive values indicate the length of the prefix in the transition matrix. Negative values indicate the order of the Markov model relative to the oligonucleotide length. For example, the option -markov -2 gives a model of order m=k-2 (thus, an order 5 for heptanucleotides, an order 4 for hexanucleotides).
The optimal Markov order depends on the number of sequences in the test set. Since ChIP-seq data typically contain hundreds to thousands of peaks, high Markov orders are generally appropriate, because they are stringent and still sensitive enough. In our experience, motifs are well detected with the most stringent Markov order (-markov -2).
A minimal and a maximal value can be specified for the Markov order. The program then iterates over all Markov orders between min_markov_order and max_markov_order.
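The relation between the value passed to -markov and the effective order can be sketched as follows. The function name is hypothetical, and rejecting a negative resulting order is an assumption about how invalid combinations would be handled.

```python
def resolve_markov_order(markov: int, k: int) -> int:
    """Convert a -markov value into an absolute order for oligos of length k."""
    # Positive (or zero) values are absolute orders; negative values are
    # relative to the oligonucleotide length (m = k + markov).
    order = markov if markov >= 0 else k + markov
    if order < 0:
        raise ValueError(f"-markov {markov} is not valid for oligos of length {k}")
    return order


print(resolve_markov_order(-2, 7))  # prints 5 (heptanucleotides)
print(resolve_markov_order(-2, 6))  # prints 4 (hexanucleotides)
```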
Single-strand (-1str) or double-strand (-2str) analysis.
The default is double-strand analysis, since ChIP-seq results have no particular strand orientation.
Treatment of self-overlapping words for motif discovery: count (-ovlp) or do not count (-noov) overlapping occurrences. In -noov mode, only renewing occurrences are counted.
It is recommended to use the -noov mode (default) to avoid the effect of self-overlap, which violates the hypothesis of independence of successive occurrences underlying the binomial significance test (oligo-analysis, dyad-analysis).
Beware: the options -noov and -ovlp only apply to motif discovery, and not to compositional profiles. Dinucleotide frequencies are always computed with the option -ovlp (count all occurrences) to avoid estimation artifacts. Since those composition profiles further serve to estimate the probability of larger words, which may include repeated residues, all dinucleotide occurrences must be counted. Indeed, with the -noov mode (renewing occurrences only), the transition tables of the first-order Markov model would be unbalanced: the expected frequency of all the repeated dinucleotides (AA, TT, CC, GG) would be under-estimated, leading to an under-estimation of the expected frequency of repeat-containing words (e.g. AAAAAA, AAAGGG, ...).
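The difference between the two counting modes can be illustrated with a short sketch. It counts occurrences on a single strand only, and the function name is hypothetical. Renewing occurrences are obtained by restarting the search after the end of each match.

```python
def count_occurrences(seq: str, word: str, overlap: bool) -> int:
    """Count occurrences of word in seq, with or without self-overlap."""
    if not word:
        raise ValueError("word must be non-empty")
    # With overlap, the search restarts one residue after each match;
    # without overlap (renewing occurrences), it restarts after its end.
    step = 1 if overlap else len(word)
    count = 0
    i = seq.find(word)
    while i != -1:
        count += 1
        i = seq.find(word, i + step)
    return count


print(count_occurrences("AAAAAA", "AA", overlap=True))   # prints 5
print(count_occurrences("AAAAAA", "AA", overlap=False))  # prints 3
```

The example shows how a repeated dinucleotide such as AA loses occurrences in the renewing mode, which is the source of the under-estimation described above.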
Class interval for position-analysis.
The program peak-motifs combines a series of tried-and-tested programs which have been detailed in the following publications.
van Helden, J., Andre, B. and Collado-Vides, J. (1998). Extracting regulatory sites from the upstream region of yeast genes by computational analysis of oligonucleotide frequencies. J Mol Biol 281, 827-42.
van Helden, J., Rios, A. F. and Collado-Vides, J. (2000). Discovering regulatory elements in non-coding sequences by analysis of spaced dyads. Nucleic Acids Res 28, 1808-18.
van Helden, J., del Olmo, M. and Perez-Ortin, J. E. (2000). Statistical analysis of yeast genomic downstream sequences reveals putative polyadenylation signals. Nucleic Acids Res 28, 1000-10.
Turatsinze, J. V., Thomas-Chollier, M., Defrance, M. and van Helden, J. (2008). Using RSAT to scan genome sequences for transcription factor binding sites and cis-regulatory modules. Nat Protoc 3, 1578-88.
Estimate background models from control sequences, for oligo-analysis, dyad-analysis, and local-word-analysis. This should in principle reduce the rate of false positives.
For the Web server: generate a temporary synthetic table showing the results obtained so far, ending with the message "Partial results, please don't forget to reload the file later".
Compare all discovered motifs (plus reference motif if specified) and cluster them in order to extract a consensus motif.
Add a task to run Weeder on the peak sequences.
weederlauncher.out input organism large S M T5
Run oligo-analysis without any threshold in order to produce a plot of observed versus expected occurrences for all the oligonucleotides. This analysis is performed with the option -two_tails, which detects both under- and over-represented patterns.
- link to the directories for each algorithm/task
- link from the result page to the link table returned by position-analysis (file *_graph_index.html).