RSAT: convert-matrix manual

Name

2004 by

Description

Performs inter-conversions between various formats of position-specific scoring matrices (PSSM).

The program also performs a statistical analysis of the original matrix to provide different position-specific scores (weight, frequencies, information contents), general statistics (E-value, total information content), and synthetic descriptions (consensus).

PSSM can be used to represent the binding specificity of a transcription factor or the conserved residues of a protein domain.

Each row of the matrix corresponds to one residue (nucleotide or amino acid, depending on the sequence type). Each column corresponds to one position in the alignment. The value within each cell represents the frequency of the corresponding residue at the corresponding position.

INPUT/OUTPUT FORMATS

Some formats are supported only for input, others for output. There are more formats accepted for input, because the general use of this program is to convert a PSSM obtained from a database (e.g. TRANSFAC) or a pattern-discovery program (e.g. consensus, gibbs, meme, MotifSampler, ...) and obtain a matrix either for scanning (with matrix-scan) or for computing statistical parameters (see the return fields below). We generally use the TRANSFAC (tf) format, in which we can specify identifiers and names for the matrix.

TRANSFAC (input/output)

AC  MA0001.1
XX
ID  AGL3
XX
DE  MA0001.1 AGL3; from JASPAR
PO       A     C     G     T
1        0    94     1     2
2        3    75     0    19
3       79     4     3    11
4       40     3     4    50
5       66     1     1    29
6       48     2     0    47
7       65     5     5    22
8       11     2     3    81
9       65     3    28     1
10       0     3    88     6
XX
CC  program: jaspar
XX
//
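
As an illustration of the layout above, here is a minimal Python sketch that reads one TRANSFAC-style matrix into a dictionary of counts. It is not part of RSAT: the function name parse_transfac is hypothetical, and the sketch assumes a single matrix with a PO header row and whitespace-separated values.

```python
def parse_transfac(text):
    """Parse one TRANSFAC-style matrix; return (fields, counts)."""
    fields = {}      # two-letter field code -> text (AC, ID, DE, CC, ...)
    residues = []    # residue order, taken from the PO header row
    counts = {}      # residue -> list of counts, one per position
    for line in text.splitlines():
        parts = line.split()
        if not parts or parts[0] == "XX":
            continue                          # blank line or field separator
        if parts[0] == "//":
            break                             # matrix delimiter
        if parts[0] == "PO":
            residues = parts[1:]
            counts = {r: [] for r in residues}
        elif parts[0].isdigit() and residues:
            # One row per alignment position, one value per residue.
            for residue, value in zip(residues, parts[1:]):
                counts[residue].append(float(value))
        else:
            fields[parts[0]] = " ".join(parts[1:])
    return fields, counts


# First two positions of the AGL3 example shown above.
example = """AC  MA0001.1
XX
ID  AGL3
XX
PO       A     C     G     T
1        0    94     1     2
2        3    75     0    19
XX
//
"""
fields, counts = parse_transfac(example)
print(fields["ID"], counts["C"])
```

Note that TRANSFAC stores one alignment position per row, so the sketch transposes the values into one list per residue.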

tab (input/output)

The tab format accepts a user-specified set of return fields (option -return), providing different statistics on the matrix (counts, frequencies, weights, information, other parameters: see description below).

     
; MET4 matrix, from Gonze et al. (2005). Bioinformatics 21, 3490-500.
A |   7   9   0   0  16   0   1   0   0  11   6   9   6   1   8
C |   5   1   4  16   0  15   0   0   0   3   5   5   0   2   0
G |   4   4   1   0   0   0  15   0  16   0   3   0   0   2   0
T |   0   2  11   0   0   1   0  16   0   2   2   2  10  11   8
//
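
The count matrix in this layout can be read with a few lines of Python. The sketch below is only an illustration: the function name parse_tab is hypothetical, and it assumes that comment lines start with ";", that each matrix row has the form "residue | values", and that "//" ends the matrix.

```python
def parse_tab(text):
    """Parse a tab-format count matrix into {residue: [counts per position]}."""
    counts = {}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("//"):
            break                              # matrix delimiter
        if not line or line.startswith(";") or "|" not in line:
            continue                           # blank line, comment, or non-matrix row
        residue, _, values = line.partition("|")
        counts[residue.strip()] = [float(v) for v in values.split()]
    return counts


# First three columns of the MET4 matrix shown above.
example = """; toy matrix
A |   7   9   0
C |   5   1   4
G |   4   4   1
T |   0   2  11
//
"""
matrix = parse_tab(example)
# Column j of the alignment is the j-th entry of every row.
column_0 = {residue: row[0] for residue, row in matrix.items()}
```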

JASPAR (input/output)

  
            > Mycn
            A [ 0 29 0 2 0 0 ]
            C [31 0 30 1 3 0 ]
            G [ 0 0 0 28 0 31]
            T [ 0 2 1 0 28 0 ]

MSCAN (input)

           >mef2
          10  0  0  0 22  0  6  2  3  4 22 10
           0  2 12  0  0  0  0  0  0  0  0  0
           9 20  2  0  0  0  0  0  0  0  0 10
           3  0  8 22  0 22 16 20 19 18  0  2
          >myf
           7  9  4  0 16  7  0  6  0  0  6  0
           8  0  2 15  0  0 15  0  0 10  0  0
           1  7 10  1  0  9  1  0 16  6  0 16
           0  0  0  0  0  0  0 10  0  0 10  0

meme (input)

 	Background letter frequencies
	A 0.303 C 0.183 G 0.209 T 0.306 

	MOTIF crp alternative name
	letter-probability matrix: alength= 4 w= 19 nsites= 17 E= 4.1e-009 
 0.000000  0.176471  0.000000  0.823529 
 0.000000  0.058824  0.647059  0.294118 
 0.000000  0.058824  0.000000  0.941176 
 0.176471  0.000000  0.764706  0.058824 
 0.823529  0.058824  0.000000  0.117647 
 0.294118  0.176471  0.176471  0.352941 
 0.294118  0.352941  0.235294  0.117647 
 0.117647  0.235294  0.352941  0.294118 
 0.529412  0.000000  0.176471  0.294118 
 0.058824  0.235294  0.588235  0.117647 
 0.176471  0.235294  0.294118  0.294118 
 0.000000  0.058824  0.117647  0.823529 
 0.058824  0.882353  0.000000  0.058824 
 0.764706  0.000000  0.176471  0.058824 
 0.058824  0.882353  0.000000  0.058824 
 0.823529  0.058824  0.058824  0.058824 
 0.176471  0.411765  0.058824  0.352941 
 0.411765  0.000000  0.000000  0.588235 
 0.352941  0.058824  0.000000  0.588235

meme_block (input)
CIS-BP (input)
Cluster-Buster (cb) (input/output)

	>element1 /name=element1
	0  4 2 14
	12 0 0 8
	8  0 1 11
	20 0 0 0
	....

STAMP and STAMP-transfac (input/output)

- the fields ID and AC are absent, and the matrix ID comes in the field DE
- the header row (PO) is not supported
- the positions start at 0 instead of 1
- there is no matrix delimiter (the double slash)

- "TRANSFAC" format, which is actually not TRANSFAC (the fields AC and ID are not defined).

                      NA Mync
                      XX
                      DE Mync
                      XX
                      P0 A C G T
                      01 0 31 0 0 C
                      02 29 0 0 2 A
                      03 0 30 0 1 C
                      04 2 1 28 0 G
                      05 0 3 0 28 T
                      06 0 0 31 0 G
                      XX

- "TRANSFAC-like" (same as above, but the first two rows are missing)

sequences (input)
patser (output)

This is actually the same format as tab (described above), but the only return field is the count matrix.

consensus (input/output)
gibbs (input)
MotifSampler (input/output)
alignAce (input)
Uniprobe (input)

Protein: Cbf1	Seed k-mer: ATCACGTG	Enrichment Score: 0.499010437669239
A:	0.251714422716682	0.231020715440932	0.371175995676819	0.343515826416987	0.189181911178663	0.373249743142318	0.159425685466501	0.387398837326962	0.160370450851774	0.00579566973382471	0.984310428811586	0.000520578518462409	0.0512168242470759	0.00554791387069823	0.00108871328362558	0.436684281349379	0.106429865986653	0.0872652424535894	0.2779359708333	0.222894293683715	0.366796870220836	0.226022414885529
C:	0.10226033847082	0.315992694980937	0.148489261324769	0.182792315701972	0.406736016253256	0.213860951366744	0.324485360588445	0.0418650553618826	0.045745403962552	0.99055073718171	0.0105040001038691	0.975244090256611	0.0064124125195024	0.00433861322483728	0.0016092288172844	0.262472975122313	0.184027817720692	0.549338793818378	0.127202171464537	0.198102864294932	0.306135553163069	0.321957177096839
G:	0.130212757211399	0.266959960914667	0.154092799416608	0.241158374156534	0.15046928890062	0.0930127274890034	0.354968645980097	0.554463521652852	0.0956960762910429	0.00242503126912309	0.0010685902059978	0.00922202192948052	0.94180586404824	0.00940766558113401	0.973242502288466	0.0857562483507934	0.0600857360004131	0.129208087201377	0.253172180381345	0.437904236128241	0.1940215817188	0.116641876697499
T:	0.515812481601099	0.186026628663464	0.326241943581804	0.232533483724507	0.253612783667461	0.319876578001935	0.161120307964957	0.0162725856583034	0.698188068894632	0.00122856181534235	0.00411698087854669	0.0150133092954459	0.000564899185181348	0.98070580732333	0.0240595556106241	0.215086495177514	0.649456580292242	0.234187876526655	0.341689677320817	0.141098605893113	0.133045994897295	0.335378531320133

Encode

>SIX5_disc1 SIX5_GM12878_encode-Myers_seq_hsa_r1:MEME#1#Intergenic
G 0.008511 0.004255 0.987234 0.000000
A 0.902127 0.012766 0.038298 0.046809
R 0.455319 0.072340 0.344681 0.127660
W 0.251064 0.085106 0.085106 0.578724
T 0.000000 0.046809 0.012766 0.940425
G 0.000000 0.000000 1.000000 0.000000
T 0.038298 0.021277 0.029787 0.910638
A 0.944681 0.004255 0.051064 0.000000
G 0.000000 0.000000 1.000000 0.000000
T 0.000000 0.000000 0.012766 0.987234

infogibbs (input/output)
assembly (input)
feature (input)
clustal (input)

RETURN FIELDS FOR THE TAB-DELIMITED OUTPUT FORMAT

counts

Each cell of the matrix indicates the number of occurrences of the residue at a given position of the alignment.

profile

The matrix is printed vertically (each matrix column becomes a row in the output text). Additional parameters (consensus, information) are indicated beside each position, and a histogram is drawn.

crude frequencies

Relative frequencies are calculated as the counts of residues divided by the total count of the column.

Fij=Cij/SUMi(Cij)

where

Cij: is the absolute frequency (counts) of residue i at position j of the alignment
Fij: is the relative frequency of residue i at position j of the alignment
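
A minimal Python sketch of this formula, assuming the matrix is held as a dictionary mapping each residue to its list of counts (a representation chosen here for illustration, not taken from RSAT). The function name crude_frequencies is hypothetical.

```python
def crude_frequencies(counts):
    """Fij = Cij / SUMi(Cij), computed column by column."""
    residues = list(counts)
    width = len(counts[residues[0]])
    freqs = {r: [] for r in residues}
    for j in range(width):
        column_sum = sum(counts[r][j] for r in residues)
        for r in residues:
            # An empty column is reported as 0.0 (a choice made for this sketch).
            freqs[r].append(counts[r][j] / column_sum if column_sum else 0.0)
    return freqs


# Positions 1 and 2 of the AGL3 matrix from the TRANSFAC example.
counts = {"A": [0, 3], "C": [94, 75], "G": [1, 0], "T": [2, 19]}
freqs = crude_frequencies(counts)
```

With these counts, both columns sum to 97, so the frequency of C at the first position is 94/97.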

frequencies corrected with pseudo-weights

Relative frequencies can be corrected by a pseudo-weight (b) to reduce the bias due to the small number of observations.

F''ij=(Cij+b*Pi)/[SUMi(Cij)+b]

where

Pi: is the prior frequency for residue i
b: is the pseudo-weight, which is "shared" between residues according to their prior frequencies.
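
The correction can be sketched for a single column as follows. The numerator is Cij + b*Pi and the denominator is the column total plus b, so the corrected frequencies still sum to 1. The function name, the pseudo-weight of 1 and the equiprobable priors are assumptions made for the example, not documented defaults.

```python
def corrected_frequencies(column_counts, priors, pseudo=1.0):
    """F''ij = (Cij + b*Pi) / (SUMi(Cij) + b) for one matrix column."""
    column_sum = sum(column_counts.values())
    return {
        residue: (count + pseudo * priors[residue]) / (column_sum + pseudo)
        for residue, count in column_counts.items()
    }


# Position 1 of the AGL3 matrix from the TRANSFAC example.
column = {"A": 0, "C": 94, "G": 1, "T": 2}
priors = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}   # assumed equiprobable
corrected = corrected_frequencies(column, priors, pseudo=1.0)
```

Residue A was never observed at this position, yet its corrected frequency is 0.25/98 rather than 0, which keeps the logarithms used in later scores finite.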

weights

Weights are calculated according to the formula from Hertz (1999), as the natural logarithm of the ratio between the relative frequency (corrected for pseudo-weights) and the prior residue probability.

Wij=ln(F''ij/Pi)
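
A short Python sketch of the weight computation for one column, starting from counts. The names and the default pseudo-weight of 1 are hypothetical. The pseudo-weight matters here: with crude frequencies, a residue with zero counts would require ln(0), which is undefined.

```python
import math


def column_weights(column_counts, priors, pseudo=1.0):
    """Wij = ln(F''ij / Pi), with F''ij corrected by the pseudo-weight."""
    column_sum = sum(column_counts.values())
    weights = {}
    for residue, count in column_counts.items():
        corrected = (count + pseudo * priors[residue]) / (column_sum + pseudo)
        weights[residue] = math.log(corrected / priors[residue])
    return weights


column = {"A": 0, "C": 94, "G": 1, "T": 2}
priors = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}   # assumed equiprobable
weights = column_weights(column, priors)
```

In this example only C gets a positive weight, because it is the only residue whose corrected frequency exceeds its prior.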

information

The crude information content is calculated according to the formula from Hertz (1999).

Iij = Fij*ln(Fij/Pi)

In addition, we calculate a "corrected" information content which takes pseudo-weights into account.

I''ij = F''ij*ln(F''ij/Pi)
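
The per-cell formula can be sketched as below. The same function serves for crude and corrected frequencies, depending on which value is passed in. Returning 0 for a zero frequency is an assumption of this sketch (it is the limit of x*ln(x) as x approaches 0), and the function name is hypothetical.

```python
import math


def cell_information(freq, prior):
    """Iij = Fij * ln(Fij / Pi) for one cell of the matrix."""
    if freq == 0:
        return 0.0          # assumed convention: x*ln(x) -> 0 as x -> 0
    return freq * math.log(freq / prior)


prior = 0.25                                   # assumed equiprobable nucleotides
conserved = cell_information(1.0, prior)       # residue present in every site
expected = cell_information(0.25, prior)       # frequency equal to the prior
absent = cell_information(0.0, prior)          # residue never observed
```

A cell whose frequency equals the prior contributes no information, and a fully conserved residue contributes ln(4) under equiprobable priors. Because natural logarithms are used, the values are in nats; dividing by ln(2) gives bits.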

P-value

The P-value indicates the probability of observing at least Cij occurrences of a residue at a given position of the matrix. It is calculated with the binomial formula:

Pij = SUM(k=Cij..C.j) [ C.j! / (k!(C.j-k)!) ] * Pi^k * (1-Pi)^(C.j-k)

where

Cij: is the number of occurrences of residue i at position j of the matrix.
C.j: is the sum of all residue occurrences at position j of the matrix.
Pi: is the prior probability of residue i.
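
A minimal Python sketch of this P-value, written as the standard upper tail of the binomial distribution: the sum runs from k = Cij to k = C.j, and the exponent of (1-Pi) is C.j-k. The function name is hypothetical, and integer counts are assumed.

```python
from math import comb


def binomial_pvalue(c_ij, c_col, p_i):
    """Probability of observing at least c_ij occurrences among c_col sites."""
    return sum(
        comb(c_col, k) * p_i**k * (1 - p_i) ** (c_col - k)
        for k in range(c_ij, c_col + 1)
    )


# A residue found in all 16 aligned sites, with an assumed prior of 0.25.
p_all = binomial_pvalue(16, 16, 0.25)
# A residue never observed: "at least 0 occurrences" is certain.
p_none = binomial_pvalue(0, 16, 0.25)
```

The first case reduces to a single term, 0.25 to the power 16, and the second sums the whole distribution to 1.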

parameters

Returns a series of parameters associated with the matrix. The list of parameters to be exported depends on the input format (each pattern-discovery program returns specific parameters, which are more or less related to each other but not identical).

Some additional parameters are optionally calculated:

consensus

The degenerate consensus is calculated by collecting, at each position, the list of residues with a positive weight. Contrary to most applications, this consensus is thus weighted by prior residue frequencies: a residue with a high frequency might not be represented in the consensus if this frequency does not significantly exceed the expected frequency. Uppercase letters are used to highlight weights >= 1.

The consensus is exported as a regular expression, and with the IUPAC code for ambiguous nucleotides (http://www.chem.qmw.ac.uk/iupac/misc/naseq.html).

        A                       (Adenine) 
        C                       (Cytosine)
        G                       (Guanine)
        T                       (Thymine)
        R       = A or G        (puRines)
        Y       = C or T        (pYrimidines)
        W       = A or T        (Weak hydrogen bonding)
        S       = G or C        (Strong hydrogen bonding)
        M       = A or C        (aMino group at common position)
        K       = G or T        (Keto group at common position)
        H       = A, C or T     (not G)
        B       = G, C or T     (not A)
        V       = G, A, C       (not T)
        D       = G, A or T     (not C)
        N       = G, A, C or T  (aNy)

The strict consensus indicates, at each position, the residue with the highest positive weight.
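
The two consensus rules can be sketched in Python from a matrix of weights. Several details are assumptions of this sketch rather than documented behaviour: the case is decided by the strongest weight at the position, a position with no positive weight is written "n", and only the DNA alphabet is handled. All names are hypothetical.

```python
# IUPAC codes for sets of nucleotides, as listed in the table above.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C", frozenset("G"): "G",
    frozenset("T"): "T", frozenset("AG"): "R", frozenset("CT"): "Y",
    frozenset("AT"): "W", frozenset("CG"): "S", frozenset("AC"): "M",
    frozenset("GT"): "K", frozenset("ACT"): "H", frozenset("CGT"): "B",
    frozenset("ACG"): "V", frozenset("AGT"): "D", frozenset("ACGT"): "N",
}


def degenerate_consensus(weights):
    """One IUPAC letter per position, built from residues with weight > 0."""
    residues = list(weights)
    width = len(weights[residues[0]])
    letters = []
    for j in range(width):
        positive = frozenset(r for r in residues if weights[r][j] > 0)
        if not positive:
            letters.append("n")               # assumption: no residue qualifies
            continue
        code = IUPAC[positive]
        strong = max(weights[r][j] for r in positive) >= 1
        letters.append(code if strong else code.lower())
    return "".join(letters)


def strict_consensus(weights):
    """Residue with the highest positive weight at each position."""
    residues = list(weights)
    width = len(weights[residues[0]])
    letters = []
    for j in range(width):
        best = max(residues, key=lambda r: weights[r][j])
        letters.append(best if weights[best][j] > 0 else "n")
    return "".join(letters)


# Toy weights for three positions (illustrative values, not real data).
weights = {
    "A": [1.2, 0.3, -1.0],
    "C": [-2.0, -1.0, 0.5],
    "G": [-2.0, 0.2, -1.0],
    "T": [-2.0, -3.0, 0.7],
}
degenerate = degenerate_consensus(weights)
strict = strict_consensus(weights)
```

For these toy weights, the second position keeps A and G (both positive), which maps to the IUPAC code R, written in lowercase because neither weight reaches 1.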

information

The total information is calculated by summing the information content of all the cells of the matrix. This parameter is already returned by the program consensus (Hertz), but not by other programs.
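
As an illustration, the sum over all cells can be sketched as follows, using crude frequencies and skipping cells with a zero frequency. The function name, the dictionary representation and the equiprobable priors are assumptions, and every column is assumed to contain at least one count.

```python
import math


def total_information(counts, priors):
    """Sum of Fij * ln(Fij / Pi) over every cell of the count matrix."""
    residues = list(counts)
    width = len(counts[residues[0]])
    total = 0.0
    for j in range(width):
        column_sum = sum(counts[r][j] for r in residues)
        for r in residues:
            freq = counts[r][j] / column_sum
            if freq > 0:                      # zero-frequency cells add nothing
                total += freq * math.log(freq / priors[r])
    return total


# One fully conserved column followed by one uniform column.
counts = {"A": [16, 4], "C": [0, 4], "G": [0, 4], "T": [0, 4]}
priors = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
total = total_information(counts, priors)
```

The conserved column contributes ln(4) and the uniform column contributes nothing, so the total is ln(4), about 1.386 nats.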

logo

Sequence logo, a visual representation of the motif, where each column of the matrix is represented as a stack of letters whose size is proportional to the corresponding residue frequency. The total height of each column is proportional to its information content.

Sequence logos are generated using the freeware program WebLogo.