compare-matrices
$program_version
Compare two collections of position-specific scoring matrices (PSSM), and return various similarity statistics + matrix alignments (pairwise, one-to-n).
compare-matrices -file1 inputfile1 -file2 inputfile2 [-o outputfile] [-v #] [...]
The user has to specify exactly two input files (options -file1 and -file2), each containing one or several PSSMs. Each matrix of file1 is compared with each matrix of file2.
Any PSSM format supported in RSAT (type convert-matrix -h for a description).
By default, the output format is a tab-delimited file with one row per matrix comparison, and one column per statistic. Depending on the requested return fields, compare-matrices can also export a series of additional files.
Tab-delimited text file containing the primary result (comparison score table): one column per comparison (match or profile position), one row per field (score, matrix descriptor, ...).
HTML file presenting the comparison table in a user-friendly way. Clicking a column header re-orders the table according to that column.
Tab-delimited text file containing the shifted matrices resulting from pairwise alignments.
HTML file presenting the pairwise alignments in a user-friendly way: motifs are presented as sequence logos.
Tab-delimited text file containing the shifted matrices resulting from 1-to-n alignments.
HTML file presenting the 1-to-n alignments in a user-friendly way: motifs are presented as sequence logos.
The program successively computes one or several (dis)similarity metrics between each matrix of the first input file and each matrix of the second input file.
Since the matrices are not supposed to be in phase, for each pair of matrices, the program tests all possible offset (shift) values between the two matrices.
In the formulae below, symbols are defined as follows.
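The offset scan described above can be sketched in Python as follows. This is an illustration only, not the program's own code: the function names, the 0-based column indices and the sign convention of the offset (positive when the second matrix starts to the right of the first) are assumptions.

```python
def all_offsets(w1, w2):
    """Offsets that leave at least one aligned column (assumed convention)."""
    return range(-(w2 - 1), w1)

def aligned_columns(w1, w2, offset):
    """Return (start1, start2, w): first aligned column in each matrix
    (0-based) and the number of aligned columns for this offset."""
    start1 = max(0, offset)    # m2 shifted to the right of m1
    start2 = max(0, -offset)   # m2 shifted to the left of m1
    w = min(w1 - start1, w2 - start2)
    return start1, start2, w

# Example: a 4-column matrix against a 3-column matrix
for offset in all_offsets(4, 3):
    print(offset, aligned_columns(4, 3, offset))
```

With widths w1 and w2 there are w1 + w2 - 1 such offsets, from a single-column overlap on one flank to a single-column overlap on the other.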
Two position-specific scoring matrices.
Number of columns of matrices m1 and m2, respectively.
Number of rows in each matrix, which correspond to the number of residues in the alphabet (A,C,G,T for DNA motifs).
Number of aligned columns between matrices m1 and m2 (depends on the offset between the two matrices).
w <= w1 w <= w2
Total length of the alignment between matrices m1 and m2.
W = w1 + w2 - w
A measure of the mutual overlap between the aligned matrices.
Wr = w / W
This actually corresponds to the Jaccard coefficient (intersection / union), applied to the alignment lengths.
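A small worked example of W and Wr, written in Python for illustration (the function name is an assumption):

```python
def relative_length(w1, w2, w):
    """Wr = w / W, with W = w1 + w2 - w (total alignment length)."""
    W = w1 + w2 - w
    return w / W

# 10-column and 8-column matrices sharing 6 aligned columns:
# W = 10 + 8 - 6 = 12, so Wr = 6 / 12
print(relative_length(10, 8, 6))

# Two 8-column matrices aligned without shift: intersection equals union
print(relative_length(8, 8, 8))
```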
Number of sites in matrices m1 and m2, respectively.
Number of cells in the aligned portion of the matrices.
n = w * r
Index of a row of the aligned PSSM (corresponds to a residue).
Index of a column of the aligned PSSM (corresponds to an aligned position).
Frequency of residue r in the jth column of the aligned subset of the first matrix (taking the offset into account).
Frequency of residue r in the jth column of the aligned subset of the second matrix (taking the offset into account).
Mean frequency computed over all cells of matrices m1 and m2, respectively.
BEWARE: this metric is the real SSD, i.e. the simple sum of squared distances. It is a distance metric, in contrast with the "SSD" defined in STAMP, which is converted to a similarity metric (see Sandelin-Wasserman below).
SSD = SUM{i=1->r} SUM{j=1->w} [(f1{i,j} - f2{i,j})^2]
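As an illustration, the SSD of two already-aligned frequency matrices can be computed as below. This is a sketch, not the program's code; matrices are assumed to be lists of rows (one row per residue, in the order A, C, G, T, and one value per aligned column).

```python
def ssd(f1, f2):
    """Sum of squared differences over all aligned cells."""
    return sum((a - b) ** 2
               for row1, row2 in zip(f1, f2)
               for a, b in zip(row1, row2))

# Two aligned columns; consensus "AC" versus consensus "AG"
f1 = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]]
f2 = [[1.0, 0.0], [0.0, 0.0], [0.0, 1.0], [0.0, 0.0]]

# First column identical (0), second column fully different (1 + 1)
print(ssd(f1, f2))
```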
Also implemented in STAMP (under the name SSD) and TOMTOM (under the name Sandelin-Wasserman). This is a distance to similarity conversion of the SSD. The conversion is done by subtracting the squared distance of each aligned column from a constant 2 (the maximal distance between two columns containing relative frequencies, i.e. one residue has frequency 1 in one column, and another residue has frequency 1 in the other column).
SW = SUM{j=1->w} [2 - SUM{i=1->r} (f1{i,j} - f2{i,j})^2]
Source: Sandelin A & Wasserman WW (2004) J Mol Biol 338:207-215.
Sandelin-Wasserman (SW) similarity normalized by the number of aligned columns (w).
NSW = SW / (2*w)
NSW takes a value between 0 (not a single corresponding residue) and 1 (matrices are identical for all the aligned columns).
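A minimal Python sketch of SW and NSW. It assumes the per-column form of the score (the constant 2 minus the summed squared differences of each aligned column), which is the form that keeps NSW between 0 and 1. The function name and the matrix layout (list of rows A, C, G, T) are illustrative assumptions.

```python
def sw_scores(f1, f2):
    """Return (SW, NSW) for two aligned frequency matrices."""
    r, w = len(f1), len(f1[0])
    sw = 0.0
    for j in range(w):
        # squared distance between the two columns at aligned position j
        col_ssd = sum((f1[i][j] - f2[i][j]) ** 2 for i in range(r))
        sw += 2 - col_ssd
    return sw, sw / (2 * w)

# Consensus "AC" versus consensus "AG": one matching column, one mismatch
f1 = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]]
f2 = [[1.0, 0.0], [0.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
print(sw_scores(f1, f2))
```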
dEucl = sqrt( SUM{i=1->r} SUM{j=1->w} (f1{i,j} - f2{i,j})^2)
Since relative frequencies can take values from 0 to 1, the Euclidian distance can take values from 0 to sqrt(2)*w.
Euclidian distance normalized by the number of aligned columns (w).
NdEucl = dEucl / w
NdEucl can take values from 0 to sqrt(2).
Note that this differs from the definition provided in Pape et al. (2008).
A similarity metrics derived from the normalized Euclidian distance.
NsEucl = (Max(NdEucl) - NdEucl) / Max(NdEucl) = (sqrt(2) - NdEucl) / sqrt(2)
where Max(NdEucl)=sqrt(2) is the maximal possible Euclidian distance for the current pair of matrices. The Normalized Euclidian similarity can vary from 0 (matrices with a single residue per column, and those residues systematically differ between the two matrices) to 1 (identical matrices).
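The three Euclidian scores can be sketched in Python by transcribing the formulas above literally. This is an illustration under the assumption that matrices are lists of rows (A, C, G, T) restricted to the aligned columns; it is not the program's own code.

```python
import math

def euclidean_scores(f1, f2):
    """Return (dEucl, NdEucl, NsEucl) for two aligned frequency matrices."""
    w = len(f1[0])
    d = math.sqrt(sum((a - b) ** 2
                      for row1, row2 in zip(f1, f2)
                      for a, b in zip(row1, row2)))
    nd = d / w                                # normalized distance
    ns = (math.sqrt(2) - nd) / math.sqrt(2)   # similarity
    return d, nd, ns

f1 = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]]
f2 = [[1.0, 0.0], [0.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
print(euclidean_scores(f1, f1))  # identical matrices: distance 0, similarity 1
print(euclidean_scores(f1, f2))
```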
As defined in Aerts et al. (2003). Also called Mutual Information.
dKL = 1/(2w) * SUM{i=1->r} SUM{j=1->w} ( f1{i,j}*log(f1{i,j}/f2{i,j}) + f2{i,j}*log(f2{i,j}/f1{i,j}))
Note that the KL distance is problematic for matrices containing zero values: for example, if f1{i,j}=0 and f2{i,j}=1, we have: KL(i,j) = (0*log(0) + 1*log(1/0)) = 0 + log(Inf) = Inf
One can circumvent this problem by using pseudo-count corrected matrices (f'(i,j)), but then the KL distance is strongly dependent on the somewhat arbitrary choice of the pseudo-count value.
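The zero-value problem and its dependence on the pseudo-count can be illustrated with the Python sketch below. The pseudo-count correction used here (a pseudo-count k shared equally between the residues, for a matrix built from n_sites sites) is one common choice and an assumption of this example, as are all function names.

```python
import math

def kl_term(p, q):
    """p * log(p / q), with the usual conventions for zero values."""
    if p == 0:
        return 0.0          # 0 * log(0 / q) is taken as 0
    if q == 0:
        return math.inf     # p * log(p / 0) diverges
    return p * math.log(p / q)

def kl_distance(f1, f2):
    """Symmetrical KL distance averaged over the aligned columns."""
    w = len(f1[0])
    total = sum(kl_term(a, b) + kl_term(b, a)
                for row1, row2 in zip(f1, f2)
                for a, b in zip(row1, row2))
    return total / (2 * w)

def add_pseudo(f, n_sites, k):
    """Assumed correction: f' = (f * n_sites + k / r) / (n_sites + k)."""
    r = len(f)
    return [[(x * n_sites + k / r) / (n_sites + k) for x in row] for row in f]

# One aligned column: only A in the first matrix, only C in the second
f1 = [[1.0], [0.0], [0.0], [0.0]]
f2 = [[0.0], [1.0], [0.0], [0.0]]

print(kl_distance(f1, f2))                                    # infinite
for k in (1, 0.1):
    print(k, kl_distance(add_pseudo(f1, 10, k), add_pseudo(f2, 10, k)))
```

The two finite values differ markedly, which shows how strongly the result depends on the chosen pseudo-count.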
cov = 1/n * SUM{i=1->r} SUM{j=1->w} (f1{i,j} - f1m) * (f2{i,j} - f2m)
Beware: this is the classical covariance defined in statistical textbooks. It has nothing to do with the "natural covariance" of Pape (which still needs to be implemented here). What we compute here is simply the covariance between the counts in the aligned cells of the respective matrices.
v1 = 1/n * SUM{i=1->r} SUM{j=1->w} (f1{i,j} - f1m)^2
v2 = 1/n * SUM{i=1->r} SUM{j=1->w} (f2{i,j} - f2m)^2
cor = cov / sqrt(v1*v2)
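A Python sketch of the covariance and correlation computed over the aligned cells. It is illustrative only: the function name and the list-of-rows layout are assumptions, and the means are taken over the cells passed to the function. The score is undefined when one of the aligned segments is completely flat (zero variance), a case this sketch does not handle.

```python
import math

def cov_and_cor(f1, f2):
    """Return (cov, cor) over the n = w * r aligned cells."""
    x = [v for row in f1 for v in row]
    y = [v for row in f2 for v in row]
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    v1 = sum((a - mx) ** 2 for a in x) / n
    v2 = sum((b - my) ** 2 for b in y) / n
    return cov, cov / math.sqrt(v1 * v2)

# Consensus "AC" versus consensus "AG"
f1 = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]]
f2 = [[1.0, 0.0], [0.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
print(cov_and_cor(f1, f2))
```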
The normalized correlation prevents matches covering only a small fraction of the matrix (e.g. matches between the last column of the query matrix and the first column of the reference matrix, or matches of a very small motif against a large one).
The normalization factor is the relative length (Wr), i.e. the number of aligned columns divided by the total columns of the alignment.
Ncor = cor * Wr
This correction is particularly important to avoid selecting spurious alignments between short fragments of the flanks of the matrices (e.g. single-column alignments). For this reason, Ncor generally gives a better estimation of motif similarity than cor, and we recommend it as the similarity score.
Imposing too stringent a lower threshold on Ncor may however reduce the sensitivity, and in particular prevent the detection of matches between half-motifs (e.g. in the case of dimeric transcription factors recognizing composite motifs).
An alternative would be to use as normalizing factor the length of the alignment (w) relative to the length of the shorter motif.
Ncor = cor * w / min(w1,w2)
This however tends to favour matches between very short motifs (4-5 residues) which cover only a fraction of the query motif.
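The effect of the two normalizing factors can be compared on a few numeric cases. The Python sketch below is an illustration; the function names are assumptions and the correlation values are made up for the example.

```python
def ncor(cor, w1, w2, w):
    """cor multiplied by the relative length Wr = w / (w1 + w2 - w)."""
    return cor * w / (w1 + w2 - w)

def ncor_shorter(cor, w1, w2, w):
    """Alternative: cor multiplied by w / min(w1, w2)."""
    return cor * w / min(w1, w2)

# Single-column overlap between the flanks of a 10- and an 8-column matrix
print(ncor(0.9, 10, 8, 1), ncor_shorter(0.9, 10, 8, 1))

# 4-column motif entirely contained in a 12-column motif
print(ncor(0.9, 12, 4, 4), ncor_shorter(0.9, 12, 4, 4))
```

In the second case the alternative factor leaves the correlation unpenalized although the short motif covers only a third of the long one.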
Pearson's correlation computed on the information content matrices (I1, I2) rather than on the frequencies.
Icov = 1/n * SUM{i=1->r} SUM{j=1->w} (I1{i,j} - I1m) * (I2{i,j} - I2m)
Iv1 = 1/n * SUM{i=1->r} SUM{j=1->w} (I1{i,j} - I1m)^2
Iv2 = 1/n * SUM{i=1->r} SUM{j=1->w} (I2{i,j} - I2m)^2
Icor = Icov / sqrt(Iv1*Iv2)
where I1m and I2m are the means of I1 and I2, respectively, over the aligned cells.
The Icor score fixes a weakness of the cor score and of all the other metrics above, which only take into account the residue frequencies whilst ignoring the background frequencies.
A typical manifestation of this problem is that the cor score occasionally returns alignments between non-informative pieces of the matrices, which appear flat on the aligned logos. The reason why uninformative columns may have a good correlation is that, if both matrices have the same compositional bias (for example 30%A, 20%C, 20%G and 30%T), they will be correlated. Consequently, the columns reflecting the background will contribute to increase the correlation coefficient.
The information content corrects this bias by relativizing the matrix frequencies with respect to the background residue probabilities.
I{i,j} = f{i,j} * log(f{i,j}/p{i})
where p{i} is the prior probability of residue i.
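The following Python sketch converts a frequency matrix into an information content matrix and shows that a column matching the background becomes flat (all zeros). It is an illustration only: the natural logarithm, the convention that a zero frequency contributes 0, the function name and the list-of-rows layout (A, C, G, T) are assumptions.

```python
import math

def information_matrix(f, prior):
    """Cell-wise f * log(f / p), with p the prior of the residue of the row."""
    return [[x * math.log(x / p) if x > 0 else 0.0 for x in row]
            for row, p in zip(f, prior)]

prior = [0.3, 0.2, 0.2, 0.3]           # background for A, C, G, T
f = [[0.3, 1.0],                        # column 1 mirrors the background,
     [0.2, 0.0],                        # column 2 contains only A
     [0.2, 0.0],
     [0.3, 0.0]]

for row in information_matrix(f, prior):
    print(row)
```

The Icor score is then the Pearson correlation computed on two such information matrices instead of the frequency matrices.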
Distances between PSSMs have been treated in many ways. The most recent and relevant articles are cited hereafter.
Level of verbosity (detail in the warning messages during execution)
Display full help message
Same as -h
The first input file containing one or several matrices.
The second input file containing one or several matrices.
Use a single matrix file as input. Each matrix of this file is compared to each other. This is equivalent to: -file1 single_matrix_file -file2 single_matrix_file
The first input file containing a list of matrix files (given as paths).
The second input file containing a list of matrix files (given as paths). The reverse complement is computed for this set of matrices.
Specify the matrix format for the first input file only (requires -format2).
Specify the matrix format for the second input file only (requires -format1).
Specify the matrix format for both input files (alternatively, see options -format1 and -format2).
Background model file.
Format for the background model file.
Supported formats: all the input formats supported by convert-background-model.
Only analyze the first X motifs of the first file. This option is convenient for quick testing before starting the full analysis.
Only analyze the first X motifs of the second file. This option is convenient for quick testing before starting the full analysis.
Prefix for the output files. The output prefix is mandatory for some return fields (alignments, graphs, ...).
This prefix will be appended with a series of suffixes for the different output types (see section OUTPUT FORMATS above for the detail).
Return matches between any matrix of the file1 and any matrix of file2.
This is the typical use of compare-matrices: comparing one or several query motifs (e.g. obtained from motif discovery) with a collection of reference motifs (e.g. a database of experimentally characterized transcription factor binding motifs, such as JASPAR, TRANSFAC, RegulonDB, ...).
For a given pair of matrices (one from file1 and one from file2), the program tests all possible offsets, and measures one or several matching scores (see section "(Dis)similarity metrics" above). The program only returns the score of the best alignment between the two matrices. The "best" alignment is the combination of offset and strand (with the option -strand DR) that maximizes the default score (Ncor). Alternative scores can be used as optimality criteria with the option -sort.
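The selection of the best alignment can be sketched in Python as below. This is a simplified illustration, not the program's code: it scans the direct strand only, uses Ncor as the optimality criterion, and assumes a sign convention for the offset and a score of 0 for flat (zero-variance) segments. All names are hypothetical.

```python
import math

def cor(m1, m2):
    """Pearson correlation over the aligned cells (0 if a segment is flat)."""
    x = [v for row in m1 for v in row]
    y = [v for row in m2 for v in row]
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    v1 = sum((a - mx) ** 2 for a in x) / n
    v2 = sum((b - my) ** 2 for b in y) / n
    if v1 == 0 or v2 == 0:
        return 0.0  # assumption of this sketch
    return cov / math.sqrt(v1 * v2)

def best_alignment(m1, m2):
    """Return (Ncor, offset, w) for the offset maximizing Ncor."""
    w1, w2 = len(m1[0]), len(m2[0])
    best = None
    for offset in range(-(w2 - 1), w1):
        s1, s2 = max(0, offset), max(0, -offset)
        w = min(w1 - s1, w2 - s2)
        sub1 = [row[s1:s1 + w] for row in m1]
        sub2 = [row[s2:s2 + w] for row in m2]
        score = cor(sub1, sub2) * w / (w1 + w2 - w)   # Ncor = cor * Wr
        if best is None or score > best[0]:
            best = (score, offset, w)
    return best

# m1 has consensus "ACGT"; m2 has consensus "CGT" (m1 without its first column)
m1 = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0],
      [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
m2 = [row[1:] for row in m1]
print(best_alignment(m1, m2))
```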
Return a table with one row for each possible alignment offset between two matrices, and various columns indicating the matching parameters (offset, strand, aligned width,...), the matching scores, and the consensus of the aligned columns of the matrices.
Matching profiles are convenient for drawing the similarity profiles, or for analyzing the correlations between various similarity metrics, but they are too verbose for the typical use of compare-matrices (detect matches between a query matrix and a database of reference matrices). The formats "matches" and "table" are more convenient for basic use.
Skip comparison between a matrix and itself.
This option is useful when the program is used to compare all matrices of a given file to all matrices of the same file, to avoid comparing each matrix to itself.
Beware: the criterion for considering two matrices identical is that they have the same identifier. If two matrices have exactly the same content (in terms of occurrences per position) but different identifiers, they will be compared.
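The identifier-based criterion can be illustrated with a short Python sketch (the function name and the identifiers are hypothetical):

```python
def pairs_to_compare(ids1, ids2, skip_self=True):
    """List the (id1, id2) pairs to compare, skipping equal identifiers."""
    return [(a, b) for a in ids1 for b in ids2
            if not (skip_self and a == b)]

# "m1_copy" is assumed to hold the same counts as "m1" under another identifier
ids = ["m1", "m2", "m1_copy"]
pairs = pairs_to_compare(ids, ids)
print(len(pairs))                 # 9 pairs minus the 3 self-comparisons
print(("m1", "m1_copy") in pairs) # still compared: identifiers differ
```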
Perform matrix comparisons in direct (D), reverse complementary (R), or both orientations (DR, default option).
When the R or DR options are activated, all matrices of the second matrix file are converted to the reverse complementary matrix.
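For a DNA matrix whose rows are ordered A, C, G, T, the reverse complementary matrix is obtained by reversing both the order of the columns and the order of the rows (A swaps with T, C with G). A minimal Python sketch, assuming this row order and a list-of-rows layout:

```python
def reverse_complement_matrix(m):
    """Reverse complement of a matrix with rows ordered A, C, G, T."""
    return [row[::-1] for row in m[::-1]]

# Consensus "AAC"
m = [[1, 1, 0],   # A
     [0, 0, 1],   # C
     [0, 0, 0],   # G
     [0, 0, 0]]   # T

# Consensus of the result is "GTT"
for row in reverse_complement_matrix(m):
    print(row)
```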
This option is useful to answer very particular questions, for example
DNA-binding motifs are usually strand-insensitive. A motif may be detected in one given orientation by a motif-discovery algorithm, but annotated in the reverse complementary orientation in a motif database. For DNA binding motifs, we thus recommend the DR option.
On the contrary, RNA-related signals (termination, poly-adenylation, miRNA) are strand-sensitive, and should be compared in a single orientation (-strand D).
An example of reverse complementary palindromic motif is tCAGswwsGTGa. When a motif is reverse complementary palindromic, the matrix is correlated to its own reverse complement.
Remark about a frequent misconception of biological palindromes
Reverse complementary palindromes are frequent in DNA signals (e.g. transcription factor binding sites, restriction sites, ...) because they correspond to a rotational symmetry in the 3D structure. Such symmetrical motifs are often characteristic of sites recognized by homodimeric complexes.
By contrast, simple string-based palindromes (e.g. CAGTTGAC) do not correspond to any symmetry from the biochemical point of view, because the 3D structure of the corresponding double helix is not symmetrical. The apparent symmetry is an artifact of the string-based representation, but the corresponding molecule has neither rotational nor translational symmetry.
DNA signals can either be symmetrical (reverse complementary palindromes, tandem repeats) or asymmetrical.
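The difference between the two kinds of palindromes can be checked on plain sequences with a short Python sketch. The function names are illustrative; GAATTC (the EcoRI restriction site) is used as an example of a reverse complementary palindrome.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def reverse_complement(seq):
    return "".join(COMPLEMENT[base] for base in reversed(seq))

def is_rc_palindrome(seq):
    """True if the sequence equals its own reverse complement."""
    return seq == reverse_complement(seq)

def is_string_palindrome(seq):
    """True if the sequence reads the same in both directions."""
    return seq == seq[::-1]

for seq in ("GAATTC", "CAGTTGAC"):
    print(seq, is_rc_palindrome(seq), is_string_palindrome(seq))
```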
Obsolete option for returning matrix names. Replaced by -return matrix_name. Maintained for backward compatibility.
List of fields to return (only valid for the formats "profiles" and "matches").
Supported return fields:
ascending (default for the profile mode)
decreasing (default for the matching mode)
decreasing
decreasing
ascending
decreasing
decreasing
ascending
ascending
decreasing
ascending
Number of the matrices in the input files
Identifiers of the matrices
Names of the matrices
Width of the matrices and the alignment
Direct (D) or Reverse complementary (R) comparison
Offset between the positions of the first and second matrix
Relative positions of the aligned matrices (start, end, strand, width)
Shifted matrices resulting from the pairwise alignments.
Shifted matrices resulting from the 1-to-N alignments.
Shifted matrices resulting from the alignments (pairwise and 1-to-N).
All supported output fields, including all metrics.
Field to sort the results. The sorting direction depends on the metric: ascending for dissimilarity metrics, decreasing for similarity metrics.
Supported sort fields:
ascending (default for the profile mode)
decreasing (default for the matching mode)
decreasing
decreasing
ascending
decreasing
decreasing
ascending
ascending
decreasing
ascending
Threshold on some parameter (-lth: lower, -uth: upper threshold).
Supported threshold fields: rank, dEucl, cor, cov, ali_len, offset
We should check if this fixes the problems of 0 values that we have with the KL distance.
Pape, U. J., Rahmann, S. and Vingron, M. (2008). Natural similarity measures between position frequency matrices with an application to clustering. Bioinformatics 24, 350-7.
This metric measures the covariance between hits of two matrices above a given threshold for each of them.
Note that a condition of applicability of the chi2 P-value is that the expected value should be >= 5 for each cell of the matrix. This condition is usually not fulfilled for the PSSMs we use for motif scanning.
Source: Wang T & Stormo GD (2003) Bioinformatics 19:2369-2380. Also implemented in STAMP.
Pseudo-counts to be added to all matrices.
Cluster motifs (only valid with a single input file).
Export a table with one row per matrix of the file 1, one column per matrix of file 2, where each cell indicates the value of the selected field for the corresponding pair of matrices.
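The layout of such a table can be sketched in Python as below. This only illustrates the row/column arrangement: the header label, the number formatting, the matrix identifiers and the score values are assumptions, not the program's actual output.

```python
def comparison_table(scores, ids1, ids2):
    """Tab-delimited table: one row per file1 matrix, one column per file2 matrix."""
    lines = ["\t".join(["#id"] + ids2)]
    for a in ids1:
        cells = [f"{scores[(a, b)]:.2f}" for b in ids2]
        lines.append("\t".join([a] + cells))
    return "\n".join(lines)

# Hypothetical values of the selected field (e.g. Ncor) for each pair
scores = {("q1", "ref1"): 0.91, ("q1", "ref2"): 0.12,
          ("q2", "ref1"): 0.05, ("q2", "ref2"): 0.78}
table = comparison_table(scores, ["q1", "q2"], ["ref1", "ref2"])
print(table)
```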
Export a graph where nodes correspond to input matrices, and edges indicate similarities between them.