matrix-quality
Evaluate the quality of a Position-Specific Scoring Matrix (PSSM), by comparing score distributions obtained with this matrix in various sequence sets.
The most classical use of the program is to compare score distributions between "positive" sequences (e.g. true binding sites for the considered transcription factor) and "negative" sequences (e.g. intergenic sequences between convergently transcribed genes).
The typical positive set is a collection of sites that have been shown (with experimental methods) to bind the transcription factor of interest.
A particular case of positive control is to estimate the distribution of scores of the sites that served to build the matrix. This however provokes some bias (over-estimation of the scores), since the matrix is used to score the sites on which it was "trained". This bias can be circumvented by applying a cross-validation.
An important bias of evaluation (and a frequent trap in published articles) can result from an over-fitting of the matrix to the positive set, in case one would use the same sites for building the PSSM and for evaluating it. To avoid this bias, matrix-quality supports two modes of cross-validation (CV):
1. Leave-one-out (LOO)
2. k-fold cross-validation (kfold)
The cross-validation can only be performed when the matrix is specified in a format that includes both the matrix and the sites (sequences) that were used to build this matrix. This is the case for matrices in MEME, consensus, transfac and MotifSampler formats.
The set of input sequences (matrix site sequences) is partitioned into k randomly selected subsets of approximately equal size (the number of sites is not always an exact multiple of k).
The program then iterates over the testing sets in the following way. All the sites that are not part of the current testing set are used as training sites to build a partial matrix. The testing sites are then scored with this partial matrix.
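The k-fold procedure described above can be sketched as follows. This is a minimal illustration, not the program's actual implementation: the function names, the round-robin way of balancing subset sizes, and the example sites are all assumptions.

```python
import random

def kfold_partition(sites, k, seed=0):
    """Split sites into k randomly selected subsets of near-equal size."""
    shuffled = list(sites)
    random.Random(seed).shuffle(shuffled)
    # Dealing the shuffled sites round-robin makes subset sizes differ by at most 1.
    return [shuffled[i::k] for i in range(k)]

def count_matrix(sites, alphabet="ACGT"):
    """Count residues per column for aligned sites of equal length."""
    width = len(sites[0])
    return {r: [sum(s[j] == r for s in sites) for j in range(width)] for r in alphabet}

def kfold_rounds(sites, k, seed=0):
    """Yield (partial_matrix, testing_sites) for each of the k rounds."""
    folds = kfold_partition(sites, k, seed)
    for i, testing in enumerate(folds):
        # All sites outside the current testing subset serve as training sites.
        training = [s for j, fold in enumerate(folds) if j != i for s in fold]
        yield count_matrix(training), testing

# Hypothetical aligned sites, for illustration only.
sites = ["TTGACA", "TTGACT", "TTTACA", "TAGACA", "TTGATA", "CTGACA", "TTGCCA"]
for matrix, testing in kfold_rounds(sites, k=3):
    print(len(testing), matrix["T"])
```

With 7 sites and k=3, the subsets contain 3, 2 and 2 sites, which illustrates the case where the number of sites is not an exact multiple of k.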
In LOO cross-validation mode, one sequence (the "left-out sequence") is temporarily discarded from the positive set, and the remaining sequences are used to build a matrix, which is then used to score the left out sequence. The process iterates over all the sequences of the positive set.
If the left-out sequence has one or more "twins" (identical sites) in the positive set, they are also temporarily excluded from the positive set and not included in the matrix used to score the left-out sequence.
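The handling of twins can be sketched in a few lines. The function name and the example sites are hypothetical, and the sketch only selects the training sites for each round (building and applying the partial matrix is left out).

```python
def loo_rounds(sites):
    """Yield (left_out, training_sites) for each site of the positive set.

    Comparing by sequence removes the left-out site together with its twins.
    """
    for left_out in sites:
        training = [s for s in sites if s != left_out]
        yield left_out, training

# Hypothetical positive set where the first two sites are twins.
sites = ["TTGACA", "TTGACA", "TTGACT", "TATACA"]
for left_out, training in loo_rounds(sites):
    print(left_out, training)
```

When either copy of TTGACA is left out, both copies are absent from the training sites, so the left-out site never contributes to the matrix that scores it.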
The LOO is actually a particular case of k-fold cross-validation, where k equals the total number of sites used to build the original matrix. The LOO is particularly adapted for matrices built from a very small number of sites (e.g. matrices built from a handful of well-documented sites as usually found in transcription factor databases).
In contrast, the k-fold cross-validation is useful to save computing time for matrices built from large collections of sites (e.g. thousands of sites resulting from ChIP-seq experiments).
It is sometimes difficult to find a good negative set, i.e. a collection of sequences which supposedly do not contain any binding site for the transcription factor of interest.
One possibility is to select a random set of genome fragments (e.g. use random-genes to select promoters of 100 randomly selected genes). However, some of these randomly selected sequences might contain effective binding sites for the transcription factor.
Another possibility is to generate artificial sequences according to some background model (using random-seq), but there is always a risk that the model over-simplifies the real sequences.
Yet another approach to perform the negative test is to scan biological sequences (e.g. upstream regions of 100 randomly picked genes) with column-permuted matrices. The advantage of this approach is that the sequences are realistic, but the permuted matrices hopefully do not correspond to any actual motif, and their empirical distribution observed in the test sequences is thus supposed to fit the theoretical distribution.
This approach may however be problematic in the specific case of low-complexity motifs (e.g. CCGCCC, AATTTT), since many permutations will give motifs that are similar, if not identical, to the original motif.
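The problem with low-complexity motifs can be made concrete by counting how many different matrices column permutation can produce. The sketch below uses idealized matrices in which every site matches the consensus exactly, which is a simplifying assumption. The helper names and the third motif (TGACTA) are hypothetical.

```python
from itertools import permutations

def one_hot_columns(consensus, n_sites=10):
    """Idealized count columns (A, C, G, T): every site matches the consensus."""
    return [tuple(n_sites if r == c else 0 for r in "ACGT") for c in consensus]

def distinct_column_permutations(columns):
    """Number of different matrices obtainable by reordering the columns."""
    return len(set(permutations(columns)))

# CCGCCC and AATTTT come from the text; TGACTA is a hypothetical richer motif.
for consensus in ("CCGCCC", "AATTTT", "TGACTA"):
    print(consensus, distinct_column_permutations(one_hot_columns(consensus)))
```

Out of the 720 possible column orderings of a 6-column matrix, CCGCCC yields only 6 distinct matrices and AATTTT only 15, so a random permutation reproduces the original motif with probability 1/6 and 1/15 respectively. The more diverse TGACTA yields 180 distinct matrices.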
Let us be frank, this program can do many things, but requires a bit of expertise. A good strategy to get familiar with its multiple results is to start by running the simplest possible analysis, and to progressively add the more advanced tasks.
We propose hereafter a step-by-step schedule of utilization, where subsequent tasks are progressively added.
We assume here that the user has a PSSM in a format that includes both the matrix and the aligned sites used to compute the matrix (e.g. MEME format). Beware, the sites actually incorporated in the matrix may differ from the collection of sites used as input for the matrix-building program. For instance, if you use MEME (with the option -zoops) to build a matrix from a collection of annotated TFBS, some sites may be incorporated in the matrix, and some others skipped. We use hereafter the expression "matrix sites" to refer to the sites used in the alignment from which the residue frequencies of the matrix were computed.
  matrix-quality -v 1 -ms my_matrix.meme -matrix_format meme \
    -no_cv -perm matrix_sites 0 -bgfile my_background.txt \
    -o my_matrix_quality
This will produce the simplest possible analysis: computing the score distribution of the matrix sites, and comparing it to the theoretical distribution.
Beware: the score distribution of matrix sites is biased. Indeed, those are the very sites that were used to build the matrix. Each site partly contributed to the matrix scores (weights) that will serve to score it. There is thus a problem of over-fitting: we train a matrix with some data, and we evaluate the matrix with the same data.
To circumvent the problem of over-fitting mentioned above, we need to perform the Leave-One-Out (LOO) procedure. Actually, matrix-quality runs the leave-one-out test by default. It was not done in the previous section because we used the option -no_cv, for the sole purpose of illustrating the problem of over-fitting. We will now run matrix-quality in the normal way, without inactivating the LOO procedure.
  matrix-quality -v 1 -ms my_matrix.meme -matrix_format meme \
    -perm matrix_sites 0 -bgfile my_background.txt \
    -o my_matrix_quality
The resulting distributions now contain 3 curves:
1. The theoretical distribution of scores, computed according to the background model.
2. The score distribution of the matrix sites (which is biased by the fact that these sites were used to build the matrix).
3. The distribution of scores for the matrix sites, evaluated with the LOO procedure.
matrix-quality [-i inputfile] [-o outputfile] [-v]
Level of verbosity (detail in the warning messages during execution)
Display full help message
Dry run: print the commands but do not execute them.
Same as -h
Matrix file. If the file includes several matrices, it will only take the first one.
File containing both a matrix and its sites. The sites are then used as positive sequence set, and labelled as "matrix_sites" in the distribution tables and graphs.
The option -ms is only valid with the file formats which contain both the matrix and its sites (e.g. consensus, MotifSampler, meme, infogibbs and transfac). The format of the matrix+site file can be specified with the option '-matrix_format'.
If the matrix and its sites are only available in separate files, an equivalent effect can be obtained by combining the options "-m my_matrix.tab" and "-seq matrix_sites site_sequences.fasta". However, the LOO test is not performed when these options are used.
If matrix-scan-quick is available on the machine, this program will be used instead of matrix-scan. For matrix-scan-quick, the matrix must be in infogibbs or tab format.
If the file includes several matrices, it will only take the first one.
Format of the matrix file.
File containing a sequence set of a given type. The first argument after the option indicates the type of the sequence (which will appear in the legend of the plots), and the second argument the file name.
Sequence set-specific options for matrix-scan. These options are added at the end of the matrix-scan command for scanning the specified sequence set.
Do not apply the leave-one-out (LOO) test on the matrix site sequences.
k-fold cross-validation.
Divide the matrix sites in k chunks for cross-validation. The chunks are sampled in a random way.
Skip the matrix permutation step. This option is mainly used for debugging, or to run the last steps (comparison + graph generation) without re-running the time-consuming scanning steps.
Skip the matrix-scan step. This option is mainly used for debugging, or to run the last steps (comparison + graph generation) without re-running the time-consuming scanning steps.
Skip the step of comparisons between distributions. This option is mainly used for debugging, or to run the last steps (comparison + graph generation) without re-running the time-consuming scanning steps.
Skip the step of drawing comparison graphs.
Do not generate the small graphs (icons) used for the galleries in the indexes.
Return matrix-scan scores in addition to the distribution of scores. Beware: this option can produce very large files and use a lot of disk space.
Number of permutations for a specific set (default 0).
Calculate the distributions for each permuted matrix separately. This provides an estimate of the variability between permutations, but the resulting graph is less readable, because of the multiplicity of curves.
Note: the option to merge permutations (-perm_merged) has been deactivated since we switched from matrix-scan to matrix-scan-quick. The option -perm_sep is thus currently the only mode of presentation. We still need to implement the merging of the distributions, in order to re-activate the option -perm_merged (see with list).
Sequence format.
Pseudo-counts. The pseudo-count reflects the possibility that residues that were not (yet) observed in the model might however be valid for future observations. The pseudo-count is used to compute the corrected residue frequencies.
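As an illustration, one common way to compute corrected frequencies distributes the pseudo-count according to the prior residue frequencies: f'(r) = (n(r) + k * p(r)) / (N + k), where n(r) is the count of residue r in the column, N the total count, p(r) the prior and k the pseudo-count. Whether matrix-quality applies exactly this formula is an assumption, since this page does not state it. The function name and the example column are hypothetical.

```python
def corrected_frequencies(counts, priors, pseudo=1.0):
    """Pseudo-count corrected frequencies for one matrix column.

    Assumed formula (not confirmed by this page):
        f'(r) = (n(r) + pseudo * p(r)) / (N + pseudo)
    """
    total = sum(counts.values())
    return {r: (counts[r] + pseudo * priors[r]) / (total + pseudo) for r in counts}

# Hypothetical column where C and T were never observed.
column = {"A": 8, "C": 0, "G": 2, "T": 0}
priors = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
freqs = corrected_frequencies(column, priors, pseudo=1.0)
print(freqs)
```

The unobserved residues receive a small non-zero frequency, so their log-odds weights stay finite, and the corrected frequencies still sum to 1.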
Background model to be used to calculate the theoretical distribution of the matrix. The theoretical distribution is calculated with matrix-distrib.
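To give an idea of what a theoretical score distribution is, the sketch below computes the probability of each possible score by combining the columns one at a time, assuming a background model with independent residues and weights rounded to a fixed number of decimals. This is a generic approach shown under stated assumptions, not a description of the algorithm used by matrix-distrib. The function names, the natural-log weights and the example matrix are all hypothetical.

```python
import math
from collections import defaultdict

def weight_matrix(freqs, priors):
    """Log-odds weights (natural log, an assumption) for each column."""
    return [{r: math.log(col[r] / priors[r]) for r in col} for col in freqs]

def score_distribution(weights, priors, decimals=2):
    """Probability of each rounded total score under independent background residues."""
    scale = 10 ** decimals
    dist = {0: 1.0}  # integer score (units of 10**-decimals) -> probability
    for col in weights:
        nxt = defaultdict(float)
        for score, prob in dist.items():
            for r, w in col.items():
                # Extend every partial score with each residue of this column.
                nxt[score + round(w * scale)] += prob * priors[r]
        dist = dict(nxt)
    return {s / scale: p for s, p in sorted(dist.items())}

# Hypothetical 2-column matrix with pseudo-count corrected (non-zero) frequencies.
priors = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
freqs = [{"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
         {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1}]
dist = score_distribution(weight_matrix(freqs, priors), priors, decimals=2)
for score, prob in dist.items():
    print(score, prob)
```

Rounding the weights keeps the number of distinct scores small, which is what makes an exhaustive computation of this kind tractable for matrices of realistic width.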
Format for the background model file.
Supported formats: all the input formats supported by convert-background-model.
Number of decimals for computing weight scores (default 2). This argument is passed to matrix-scan and matrix-distrib.
Prefix of the output files. The program generates various files, and automatically adds a specific suffix to each output file.
Scores of the positive sequence set.
Specify options that will be passed to the program XYgraph for generating the distributions and the ROC curves.
Beware: if an option must be followed by a value (e.g. -xsize 1000), you have to enclose the option and its value in quotes.
Example -graph_option '-size 800 -title "LexA matrix" -bg blue'
This option can be used iteratively on a command line.
Example -graph_option '-xsize 1000' -graph_option '-title "LexA matrix"'
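The effect of the quotes can be checked by tokenizing the command line the way a POSIX shell would. The sketch below uses Python's standard shlex module for this. The command is deliberately incomplete (it is only tokenized, never executed), and the way the values are collected afterwards is an illustration rather than the program's own parsing.

```python
import shlex

# The second example above, written as a single command-line string.
command = ("matrix-quality -graph_option '-xsize 1000' "
           "-graph_option '-title \"LexA matrix\"'")

# Tokenize as a POSIX shell would: each quoted group becomes ONE argument.
args = shlex.split(command)
print(args)

# Collect the value following each occurrence of the repeated option.
graph_options = [args[i + 1] for i, a in enumerate(args) if a == "-graph_option"]
print(graph_options)
```

Each quoted group reaches the program as a single argument, so the option and its value stay together. Without the quotes, -xsize and 1000 would arrive as two separate arguments.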
Reference distribution for the ROC curve.
Specify options that will be passed to the program XYgraph for generating the ROC curves (not the distribution curves).
Beware: if an option must be followed by a value (e.g. -xsize 1000), you have to enclose the option and its value in quotes.
Example -roc_option '-ygstep1 0.1 -ygstep2 0.02'
This option can be used iteratively on a command line.
Example -roc_option '-ygstep1 0.1' -roc_option '-ygstep2 0.02'
Specify options that will be passed to the program XYgraph for generating the distribution curves (not the ROC curves).
Beware: if an option must be followed by a value (e.g. -xsize 1000), you have to enclose the option and its value in quotes.
Example -distrib_option '-xmin -35 -xmax 20'
Image format for the plots (ROC curve, score profiles, ...). To display the supported formats, type the following command: XYgraph -h.
Multiple image formats can be specified either by using iteratively the option, or by separating them by commas.
Example: -img_format png,pdf
Image format for the sequence logos.
Multiple image formats can be specified either by using iteratively the option, or by separating them by commas.
Example: -logo_format png,pdf
This option calculates the normalized weight difference (NWD) for the score distribution of the specified sequence set (Medina-Rivera, et al. 2010). At each frequency value (y-axis) we calculate the weight difference (WD), defined as the difference between the Ws observed in the sequence set (e.g. all upstream non-coding sequences) and the Ws expected from the theoretical distribution of the PSSM for a given P-value.
The WD can be visualized as the horizontal distance between the distribution curves. As larger matrices allow higher scores, we divide the difference by the matrix width to obtain the normalized weight difference.
Usage: -nwd seq_type
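The NWD definition above can be sketched as follows, assuming each distribution is available as a list of (score, frequency of sites scoring at least that much) pairs. The function names, the input layout and the numbers are hypothetical, and the lookup rule (highest score reached at the given frequency) is an assumption about how the two curves are aligned.

```python
def score_at_frequency(icdf, p):
    """Highest score whose inverse cumulative frequency is at least p.

    icdf: list of (score, frequency of scores >= score) pairs.
    """
    eligible = [score for score, freq in icdf if freq >= p]
    if not eligible:
        raise ValueError("frequency %g is not reached in this distribution" % p)
    return max(eligible)

def nwd(observed, theoretical, p, width):
    """Weight difference at frequency p, divided by the matrix width."""
    wd = score_at_frequency(observed, p) - score_at_frequency(theoretical, p)
    return wd / width

# Hypothetical distributions for a matrix of width 8.
theoretical = [(0, 0.1), (2, 0.01), (4, 0.001), (6, 0.0001)]
observed = [(0, 0.2), (2, 0.05), (4, 0.01), (6, 0.001)]
print(nwd(observed, theoretical, p=0.001, width=8))
```

In this example the observed curve lies 2 score units to the right of the theoretical curve at frequency 0.001, which gives an NWD of 2 / 8 = 0.25.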
Compress the result directory into a zip archive of the same name (with suffix .zip).
Get a title for the html page.
Specify one or several tasks to be run. If this option is not specified, all the tasks are run.
Note that some tasks depend on other ones. This option should thus be used with caution, by experienced users only.
Supported tasks:
Scan sequences with matrix-scan
Calculate the theoretical distribution
Leave-one-out test on the matrix sites
Calculate the theoretical distribution of LOO partial matrices
Scan sequences with permuted matrices
Compare distributions between the various input files
Draw the graphs with distrib comparisons
Generate a HTML file with a synthetic report, which displays the main graphs (distribution curves and ROC curve) and provides links to the result files.
In order to be correctly indexed, the graphs have to be generated in png format.
Calculate the Normalized Weight Distance between the theoretical distribution and a score distribution in a specified sequence_type
matrix-quality requires a background model, which is passed to matrix-distrib and matrix-scan. This background model can be specified with the same options as for matrix-scan.
All the other options are automatically passed to matrix-scan, in order to specify the scanning parameters (strands, background model, ...).
Note that the option '-return' of matrix-scan cannot be used here, because matrix-quality specifies the return fields required for its statistics.
If the option '-bgfile' is specified, the specified background model will be used to calculate the theoretical distribution of the matrix. If another type of background model is specified for matrix-scan ('-bginput' or '-window'), use the '-th_prior' option to specify the background model to be used for the calculation of the theoretical distribution.
Called by matrix-quality for scanning the different sets (positive, negative) with the input matrix.
Called by matrix-quality for computing the theoretical distribution of scores.
Called by matrix-quality to generate column-permuted matrices.
Merge the permutations in order to obtain a more robust distribution of the permuted matrices. The figure is more readable than with the option -perm_sep (default), but does not reflect the variability between the different permutations.
File in oligo-analysis format.
This option should better be removed, so the user has to specify the bg file with the option -bgfile. To check.