RSAT - matrix-quality manual

NAME
DESCRIPTION

Positive set : annotated binding sites
Matrix sites
Cross-validation

k-fold cross-validation
Leave-One-Out (LOO) test
LOO or k-fold ?

Negative set

Random selection of biological sequences
Artificial sequences
Biological sequences scanned with column-permuted matrices

HOW TO USE THIS PROGRAM ?

Comparing the scores of the matrix sites to the theoretical distribution
Assessing matrix sites with a Leave-One-Out (LOO) procedure

AUTHORS
CATEGORY
USAGE
OPTIONS
SEE ALSO
WISH LIST

NAME

matrix-quality

DESCRIPTION

Evaluate the quality of a Position-Specific Scoring Matrix (PSSM), by comparing score distributions obtained with this matrix in various sequence sets.

The most classical use of the program is to compare score distributions between "positive" sequences (e.g. true binding sites for the considered transcription factor) and "negative" sequences (e.g. intergenic sequences between convergently transcribed genes).

Positive set : annotated binding sites

The typical positive set is a collection of sites that have been shown (with experimental methods) to bind the transcription factor of interest.

Matrix sites

A particular case of positive control is to estimate the distribution of scores of the sites that served to build the matrix. This however introduces a bias (over-estimation of the scores), since the matrix is used to score the sites on which it was "trained". This bias can be circumvented by applying a cross-validation.

Cross-validation

An important evaluation bias (and a frequent trap in published articles) results from over-fitting the matrix to the positive set, which happens when the same sites are used both to build the PSSM and to evaluate it. To avoid this bias, matrix-quality supports two modes of cross-validation (CV):

 1. Leave-one-out (LOO)
 2. k-fold cross-validation (kfold)

The cross-validation can only be performed when the matrix is specified in a format that includes both the matrix and the sites (sequences) that were used to build this matrix. This is the case for matrices in MEME, consensus, transfac and MotifSampler formats.

k-fold cross-validation

The set of input sequences (matrix site sequences) is partitioned into k randomly selected subsets of approximately equal size (the number of sites is not always an exact multiple of k).

The program then iterates over the testing sets in the following way. All the sites that are not part of the current testing set are used as training sites to build a partial matrix. The testing sites are then scored with this partial matrix.
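The procedure described above can be sketched in Python as follows. This is only an illustration of the principle, not the code used by matrix-quality: the function names, the equiprobable background (prior of 0.25 per residue) and the pseudo-count of 1 are assumptions made for the example.

```python
import math
import random

ALPHABET = "ACGT"

def build_weight_matrix(sites, pseudo=1.0, prior=0.25):
    """Build log-odds weights from aligned sites (assumed equiprobable background)."""
    n = len(sites)
    matrix = []
    for pos in range(len(sites[0])):
        column = {}
        for residue in ALPHABET:
            count = sum(1 for s in sites if s[pos] == residue)
            freq = (count + pseudo * prior) / (n + pseudo)
            column[residue] = math.log(freq / prior)
        matrix.append(column)
    return matrix

def score(matrix, site):
    """Sum the weights of the residues found at each position of the site."""
    return sum(column[residue] for column, residue in zip(matrix, site))

def kfold_scores(sites, k, seed=0):
    """Score each site with a partial matrix built without the sites of its own fold."""
    indices = list(range(len(sites)))
    random.Random(seed).shuffle(indices)
    # Fold sizes differ by at most one when len(sites) is not a multiple of k.
    folds = [indices[i::k] for i in range(k)]
    scores = {}
    for fold in folds:
        held_out = set(fold)
        training = [sites[i] for i in indices if i not in held_out]
        partial_matrix = build_weight_matrix(training)
        for i in fold:
            scores[i] = score(partial_matrix, sites[i])
    return folds, scores

# Hypothetical sites, for illustration only.
example_sites = ["TTGACA", "TTGACT", "TTTACA", "TAGACA", "TTGCCA", "CTGACA", "TTGATA"]
example_folds, example_scores = kfold_scores(example_sites, k=3)
```

The sketch assumes k is at least 2 and not larger than the number of sites, so that every training set is non-empty.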

Leave-One-Out (LOO) test

In LOO cross-validation mode, one sequence (the "left-out sequence") is temporarily discarded from the positive set, and the remaining sequences are used to build a matrix, which is then used to score the left-out sequence. The process iterates over all the sequences of the positive set.

If the left-out sequence has one or more "twins" (identical sites) in the positive set, they are also temporarily excluded from the positive set and not included in the matrix used to score the left-out sequence.
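The treatment of twins can be illustrated with a minimal Python sketch. The function name and the example sites are hypothetical, and the sketch only shows how the training set is selected for each left-out sequence, not how matrix-quality implements it.

```python
def loo_training_sets(sites):
    """Return (left_out, training_sites) pairs; twins of the left-out site are discarded."""
    pairs = []
    for left_out in sites:
        # Comparing by sequence removes the left-out site and all its identical twins.
        training = [s for s in sites if s != left_out]
        pairs.append((left_out, training))
    return pairs

# Hypothetical positive set in which the first two sites are twins.
example_sites = ["TTGACA", "TTGACA", "TTGACT", "TAGACA"]
example_pairs = loo_training_sets(example_sites)
```

In this example, the partial matrix used to score either copy of TTGACA would be built from the two other sites only, whereas the matrix used to score TTGACT would be built from three sites.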

LOO or k-fold ?

The LOO is actually a particular case of k-fold cross-validation, where k equals the total number of sites used to build the original matrix. The LOO is particularly well suited for matrices built from a very small number of sites (e.g. matrices built from a handful of well-documented sites, as usually found in transcription factor databases).

Conversely, the k-fold cross-validation is useful to save computing time for matrices built from large collections of sites (e.g. thousands of sites resulting from ChIP-seq experiments).

Negative set

It is sometimes difficult to find a good negative set, i.e. a collection of sequences which supposedly do not contain any binding site for the transcription factor of interest.

Random selection of biological sequences

One possibility is to select a random set of genome fragments (e.g. use random-genes to select promoters of 100 randomly selected genes). However, some of these randomly selected sequences might contain effective binding sites for the transcription factor.

Artificial sequences

Another possibility is to generate artificial sequences according to some background model (using random-seq), but there is always a risk that the model is an over-simplification of the real sequences.

Biological sequences scanned with column-permuted matrices

Yet another approach to perform the negative test is to scan biological sequences (e.g. upstream regions of 100 randomly picked genes) with column-permuted matrices. The advantage of this approach is that the sequences are realistic, but the permuted matrices hopefully do not correspond to any actual motif, and their empirical distribution observed in the test sequences is thus supposed to fit the theoretical distribution.

This approach may however pose a problem in the specific case of low-complexity motifs (e.g. CCGCCC, AATTTT), since many permutations will give motifs that are similar, if not equal, to the original motif.
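The low-complexity problem can be made concrete with a small Python sketch. It is only an illustration under simplifying assumptions: the function names are hypothetical, a matrix is represented as a list of columns, and each motif is reduced to its consensus, as if every column were perfectly conserved.

```python
import itertools
import random

def permute_columns(matrix, rng):
    """Return a copy of the matrix (a list of columns) with its columns shuffled."""
    columns = list(matrix)
    rng.shuffle(columns)
    return columns

def distinct_column_orders(consensus):
    """Count the different motifs obtained by reordering the columns of a consensus."""
    return len(set(itertools.permutations(consensus)))

# Hypothetical count matrix: one (A, C, G, T) tuple per column.
example_matrix = [(10, 0, 0, 0), (0, 10, 0, 0), (0, 0, 10, 0)]
example_permuted = permute_columns(example_matrix, random.Random(1))

# Number of distinct motifs among the 720 column orders of a 6-column motif.
low_complexity_orders = distinct_column_orders("CCGCCC")
richer_motif_orders = distinct_column_orders("TTGACA")
```

Under these assumptions, the 720 possible column orders of CCGCCC yield only 6 distinct motifs, so a randomly chosen permutation reproduces the original motif one time in six, whereas a motif such as TTGACA yields 180 distinct motifs.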

HOW TO USE THIS PROGRAM ?

Let us be frank: this program can do many things, but it requires a bit of expertise. A good strategy to get familiar with its multiple results is to start by running the simplest possible analysis, and then to progressively add the more advanced tasks.

We propose hereafter a step-by-step procedure, in which tasks are progressively added.

We assume here that the user has a PSSM in a format that includes both the matrix and the aligned sites used to compute the matrix (e.g. MEME format). Beware, the sites actually incorporated in the matrix may differ from the collection of sites used as input for the matrix-building program. For instance, if you use MEME (with the option -mod zoops) to build a matrix from a collection of annotated TFBS, some sites may be incorporated in the matrix, and some others skipped. We use hereafter the expression "matrix sites" to refer to the sites used in the alignment from which the residue frequencies of the matrix were computed.

Comparing the scores of the matrix sites to the theoretical distribution

 matrix-quality -v 1 -ms my_matrix.meme -matrix_format meme \
   -no_cv -perm matrix_sites 0 -bgfile my_background.txt \
   -o my_matrix_quality

This will produce the simplest possible analysis: computing the score distribution of the matrix sites, and comparing it to the theoretical distribution.

Beware: the score distribution of matrix sites is misleading. Indeed, those are the very sites that were used to build the matrix. Each site partly contributed to the matrix scores (weights) that will serve to score it. There is thus a problem of over-fitting: we train a matrix with some data, and we evaluate the matrix with the same data.
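The size of this effect can be illustrated with a minimal Python sketch, which scores one site with a matrix built from all the sites, and then with a matrix built without it. The sites, the function names, the equiprobable background and the pseudo-count of 1 are all assumptions made for the example; they do not reproduce the computations of matrix-quality.

```python
import math

def weight_matrix(sites, pseudo=1.0, prior=0.25):
    """Log-odds weights per position (assumed equiprobable background)."""
    n = len(sites)
    matrix = []
    for pos in range(len(sites[0])):
        column = {}
        for residue in "ACGT":
            count = sum(1 for s in sites if s[pos] == residue)
            column[residue] = math.log((count + pseudo * prior) / (n + pseudo) / prior)
        matrix.append(column)
    return matrix

def site_score(matrix, site):
    """Sum of the weights of the residues of the site."""
    return sum(column[residue] for column, residue in zip(matrix, site))

# Hypothetical matrix sites; the last one is the most divergent.
sites = ["TTGACA", "TTGACT", "TTTACA", "TAGACA", "CTGCCA"]
left_out = sites[4]

# Score obtained when the site contributed to the matrix that scores it.
full_score = site_score(weight_matrix(sites), left_out)

# Score obtained when the site is excluded from the training set.
training = [s for s in sites if s != left_out]
held_out_score = site_score(weight_matrix(training), left_out)
```

In this toy example the score of the divergent site drops markedly once it no longer contributes to the matrix, which is the over-estimation that the cross-validation is meant to remove.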

Assessing matrix sites with a Leave-One-Out (LOO) procedure

To circumvent the over-fitting problem mentioned above, we need to perform the Leave-One-Out (LOO) procedure. Actually, matrix-quality runs the leave-one-out test by default. It was not done in the previous section only because we used the option -no_cv, for the sole purpose of illustrating the over-fitting problem. We will now run matrix-quality in the normal way, without inactivating the LOO procedure.

 matrix-quality -v 1 -ms my_matrix.meme -matrix_format meme \
   -perm matrix_sites 0 -bgfile my_background.txt \
   -o my_matrix_quality

The result distributions now contain 3 curves:

theory: The theoretical distribution of scores, computed according to the background model.
matrix_sites: The score distribution of the matrix sites (which is biased by the fact that these sites were used to build the matrix).
matrix_sites_cv: The distribution of scores for the matrix sites, evaluated with the LOO procedure.

AUTHORS

Jacques van Helden <Jacques.van-Helden[at]univ-amu.fr>
Alejandra Medina-Rivera <amedina[at]liigh.unam.mx> (CCG, UNAM, Mexico)
Morgane Thomas-Chollier <morgane[at]bigre.ulb.ac.be>

USAGE

matrix-quality [-i inputfile] [-o outputfile] [-v]

OPTIONS

-v #

Level of verbosity (detail in the warning messages during execution)

-h

Display full help message

-dry

Dry run: print the commands but do not execute them.

-help

Same as -h

-m matrix_file

Matrix file. If the file includes several matrices, it will only take the first one.

-ms matrix_sites

File containing both a matrix and its sites. The sites are then used as positive sequence set, and labelled as "matrix_sites" in the distribution tables and graphs.

The option -ms is only valid with the file formats which contain both the matrix and its sites (e.g. consensus, MotifSampler, meme, infogibbs and transfac). The format of the matrix+site file can be specified with the option '-matrix_format'.

If the matrix and its sites are only available in separate files, an equivalent effect can be obtained by combining the options "-m my_matrix.tab" and "-seq matrix_sites site_sequences.fasta". Note, however, that the LOO test is not performed in this case.

If matrix-scan-quick is available on the machine, this program will be used instead of matrix-scan. For matrix-scan-quick, the matrix must be in infogibbs or tab format.

If the file includes several matrices, it will only take the first one.

-matrix_format matrix_format

Format of the matrix file.

-seq seq_type seq_file

File containing a sequence set of a given type. The first argument indicates the type of the sequence (which will appear in the legend of the plots), and the second argument the file name.

-scanopt seq_type "option1 option2 ..."

Sequence set-specific options for matrix-scan. These options are added at the end of the matrix-scan command for scanning the specified sequence set.

-no_cv

Do not apply the leave-one-out (LOO) test on the matrix site sequences.

-kfold k

k-fold cross-validation.

Divide the matrix sites into k chunks for cross-validation. The chunks are sampled randomly.

-noperm

Skip the matrix permutation step. This option is mainly used for debugging, or to run the last steps (comparison + graph generation) without re-running the time-consuming scanning steps.

-noscan

Skip the matrix-scan step. This option is mainly used for debugging, or to run the last steps (comparison + graph generation) without re-running the time-consuming scanning steps.

-nocompa

Skip the step of comparisons between distributions. This option is mainly used for debugging.

-nograph

Skip the step of drawing comparison graphs.

-noicon

Do not generate the small graphs (icons) used for the galleries in the indexes.

-export_hits

Return matrix-scan scores in addition to the distribution of scores. Beware: this option can produce very large files and use lots of disk space.

-perm seq_type #

Number of permutations for a specific set (default 0).

-perm_sep

Calculate the distributions for each permuted matrix separately. This provides an estimate of the variability between permutations, but the resulting graph is less readable, because of the multiplicity of curves.

Note: the option to merge permutations (-perm_merged) has been deactivated since we switched from matrix-scan to matrix-scan-quick. The option -perm_sep is thus currently the only mode of presentation. We still need to implement the merging of the distributions in order to re-activate the option -perm_merged (see WISH LIST).

-seq_format sequence_format

Sequence format.

-pseudo pseudo_counts

Pseudo-counts. The pseudo-count reflects the possibility that residues that were not (yet) observed in the model might however be valid for future observations. The pseudo-count is used to compute the corrected residue frequencies.
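As an illustration, one common way to compute corrected frequencies distributes the pseudo-count over the residues according to their prior frequencies. The Python sketch below shows this formulation; it is an assumption made for the example (as are the function name and the values), since the exact formula is not stated in this section.

```python
def corrected_frequencies(counts, priors, pseudo=1.0):
    """Corrected frequency of each residue in one matrix column.

    Assumed formula: f'(r) = (n(r) + pseudo * prior(r)) / (N + pseudo),
    where n(r) is the count of residue r and N the total count of the column.
    """
    total = sum(counts.values())
    return {
        residue: (counts[residue] + pseudo * priors[residue]) / (total + pseudo)
        for residue in counts
    }

# Hypothetical column in which C and T were never observed.
example_counts = {"A": 8, "C": 0, "G": 2, "T": 0}
example_priors = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
example_freqs = corrected_frequencies(example_counts, example_priors, pseudo=1.0)
```

With this formulation the unobserved residues receive a small non-zero frequency, so their log-ratio weights remain finite, and the corrected frequencies of a column still sum to 1.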

-th_prior background_file

Background model to be used to calculate the theoretical distribution of the matrix. This theoretical distribution is calculated with matrix-distrib.

-bg_format background_format

Format for the background model file.

Supported formats: all the input formats supported by convert-background-model.

-decimals #

Number of decimals for computing weight scores (default 2). This argument is passed to matrix-scan and matrix-distrib.

-o output_prefix

Prefix of the output files. The program generates various files, and automatically adds a specific suffix to each output file.

pos_scores: Scores of the positive sequence set.

-graph_option 'option1 options2 ...'

Specify options that will be passed to the program XYgraph for generating the distributions and the ROC curves.

Beware: if an option takes a value (e.g. -xsize 1000), you have to enclose the option and its value in quotes.

  Example
   -graph_option '-size 800 -title "LexA matrix" -bg blue'

This option can be used repeatedly on the same command line.

  Example
   -graph_option '-xsize 1000' -graph_option '-title "LexA matrix"'

-roc_ref

Reference distribution for the ROC curve.

-roc_option 'option1 options2 ...'

Specify options that will be passed to the program XYgraph for generating the ROC curves (not the distribution curves).

Beware: if an option takes a value (e.g. -xsize 1000), you have to enclose the option and its value in quotes.

  Example
   -roc_option '-ygstep1 0.1 -ygstep2 0.02'

This option can be used repeatedly on the same command line.

  Example
   -roc_option '-ygstep1 0.1' -roc_option '-ygstep2 0.02'

-distrib_option 'option1 options2 ...'

Specify options that will be passed to the program XYgraph for generating the distribution curves (not the ROC curves).

Beware: if an option takes a value (e.g. -xsize 1000), you have to enclose the option and its value in quotes.

  Example
   -distrib_option '-xmin -35 -xmax 20'

-img_format

Image format for the plots (ROC curve, score profiles, ...). To display the supported formats, type the following command: XYgraph -h.

Multiple image formats can be specified either by repeating the option, or by separating them by commas.

Example: -img_format png,pdf

-logo_format

Image format for the sequence logos.

Multiple image formats can be specified either by repeating the option, or by separating them by commas.

Example: -logo_format png,pdf

-nwd

This option calculates the NWD data for the score distribution of the specified sequence set (Medina-Rivera et al., 2010). At each frequency value (y-axis) we calculate the weight difference (WD), defined as the difference between the observed weight score (Ws) in the set of all upstream non-coding sequences and the expected Ws in the theoretical distribution of the PSSM for a given P-value.

The WD can be visualized as the horizontal distance between the distribution curves. As larger matrices allow higher scores, we divide the difference by the matrix width to obtain the normalized weight difference (NWD).

Usage: -nwd seq_type
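The definition above can be sketched in Python as follows. This is a simplified reading of the description, not the implementation of matrix-quality: the function names, the representation of each distribution as a list of (score, inverse cumulative frequency) pairs, the absence of interpolation between scores, and the numbers are all assumptions made for the example.

```python
def score_at_frequency(distribution, frequency):
    """Highest score whose inverse cumulative frequency is at least `frequency`."""
    candidates = [score for score, freq in distribution if freq >= frequency]
    return max(candidates)

def normalized_weight_difference(observed, theoretical, frequency, width):
    """Horizontal distance between the two curves, divided by the matrix width."""
    wd = score_at_frequency(observed, frequency) - score_at_frequency(theoretical, frequency)
    return wd / width

# Hypothetical distributions: (weight score, fraction of positions scoring at least that much).
theoretical = [(0, 1.0), (2, 0.1), (4, 0.01), (6, 0.001)]
observed = [(0, 1.0), (2, 0.2), (4, 0.05), (6, 0.01), (8, 0.001)]

# Matrix width of 8 columns, evaluated at a frequency of 0.01.
example_nwd = normalized_weight_difference(observed, theoretical, frequency=0.01, width=8)
```

In this toy case the observed curve reaches the frequency 0.01 at a score of 6 and the theoretical curve at a score of 4, so the weight difference is 2 and the normalized weight difference is 0.25.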

-archive

Compress the result directory into a zip archive of the same name (with suffix .zip).


-html_title

Title for the HTML page.

-task tasks

Specify one or several tasks to be run. If this option is not specified, all the tasks are run.

Note that some tasks depend on other ones. This option should thus be used with caution, by experienced users only.

Supported tasks:

scan

Scan sequences with matrix-scan

theor

Calculate the theoretical distribution

loo

Leave-one-out test on the matrix sites

theor_cv

Calculate the theoretical distribution of the LOO partial matrices

permute

Scan sequences with permuted matrices

compare

Compare distributions between the various input files

graphs

Draw the graphs with distrib comparisons

synthesis

Generate an HTML file with a synthetic report, which displays the main graphs (distribution curves and ROC curve) and provides links to the result files.

In order to be correctly indexed, the graphs have to be generated in png format.

nwd

Calculate the Normalized Weight Difference (NWD) between the theoretical distribution and the score distribution of a specified sequence type

Background model

matrix-quality requires a background model, which is passed to matrix-distrib and matrix-scan. This background model can be specified with the same options as for matrix-scan.

Other options

All the other options are automatically passed to matrix-scan, in order to specify the scanning parameters (strands, background model, ...).

Note that the option '-return' of matrix-scan cannot be used here, because matrix-quality specifies the return fields required for its statistics.

If the option '-bgfile' is specified, the specified background model will be used to calculate the theoretical distribution of the matrix. If another type of background model is specified for matrix-scan ('-bginput' or '-window'), use the '-th_prior' option to specify the background model to be used for the calculation of the theoretical distribution of the matrix.

WISH LIST

-perm_merged

Merge the permutations in order to obtain a more robust distribution of the permuted matrices. The figure is more readable than with the option -perm_sep (default), but does not reflect the variability between the different permutations.

-th_prior

File in oligo-analysis format.

This option should probably be removed, so that the user has to specify the background file with the option -bgfile. To check.