Name
Description
Performs inter-conversions between various formats of
position-specific scoring matrices (PSSM).
The program also performs a statistical analysis of the original
matrix to provide different position-specific scores (weight,
frequencies, information contents), general statistics (E-value, total
information content), and synthetic descriptions (consensus).
A PSSM can be used to represent the binding specificity of a transcription
factor or the conserved residues of a protein domain.
Each row of the matrix corresponds to one residue (nucleotide or
amino-acid depending on the sequence type). Each column corresponds
to one position in the alignment. The value within each cell
represents the frequency of the corresponding residue at that position.
INPUT/OUTPUT FORMATS
Some formats are supported only for input, others only for output. More
formats are accepted for input, because the program is typically used
to convert a PSSM obtained from a database (e.g. TRANSFAC)
or a pattern-discovery program (e.g. consensus, gibbs, meme,
MotifSampler, ...) into a matrix suitable either for scanning (with
matrix-scan) or for computing statistical parameters (see the return
fields below). We generally use the TRANSFAC (tf) format, in which identifiers and names can be specified for the matrix.
- TRANSFAC (input/output)
Format used in the TRANSFAC database
AC MA0001.1
XX
ID AGL3
XX
DE MA0001.1 AGL3; from JASPAR
PO A C G T
1 0 94 1 2
2 3 75 0 19
3 79 4 3 11
4 40 3 4 50
5 66 1 1 29
6 48 2 0 47
7 65 5 5 22
8 11 2 3 81
9 65 3 28 1
10 0 3 88 6
XX
CC program: jaspar
XX
//
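To illustrate the layout shown above, here is a minimal Python sketch that reads one TRANSFAC-formatted matrix into per-residue count lists. The function name parse_transfac is hypothetical and not part of the program. The sketch only interprets the AC, ID and PO fields and the numbered position rows, and ignores every other field.

```python
def parse_transfac(text):
    """Parse a single TRANSFAC-formatted matrix (illustrative sketch)."""
    matrix = {"ac": None, "id": None, "counts": {}}
    residues = []
    for line in text.splitlines():
        fields = line.split()
        if not fields or fields[0] == "XX":
            continue                      # blank line or field separator
        if fields[0] == "//":
            break                         # matrix delimiter
        if fields[0] == "AC":
            matrix["ac"] = fields[1]
        elif fields[0] == "ID":
            matrix["id"] = fields[1]
        elif fields[0] == "PO":
            residues = fields[1:]         # header row gives the residue order
            matrix["counts"] = {r: [] for r in residues}
        elif fields[0].isdigit() and residues:
            # one row per position, one column per residue
            for residue, value in zip(residues, fields[1:]):
                matrix["counts"][residue].append(float(value))
    return matrix


example = """AC MA0001.1
XX
ID AGL3
XX
DE MA0001.1 AGL3; from JASPAR
PO A C G T
1 0 94 1 2
2 3 75 0 19
3 79 4 3 11
4 40 3 4 50
5 66 1 1 29
6 48 2 0 47
7 65 5 5 22
8 11 2 3 81
9 65 3 28 1
10 0 3 88 6
XX
CC program: jaspar
XX
//"""

matrix = parse_transfac(example)
print(matrix["ac"], matrix["id"], matrix["counts"]["C"])
```

Note that TRANSFAC stores one position per row, so the parser transposes the data into one list per residue.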
tab (input/output)
Tab-delimited file. One row per residue, one column per position. The first
column of each row indicates the residue, the following columns
give the frequency of that residue at the corresponding position
of the matrix.
The tab format accepts a user-specified set of return fields (option
-return), providing different statistics on the matrix (counts,
frequencies, weights, information, other parameters: see description
below).
; MET4 matrix, from Gonze et al. (2005). Bioinformatics 21, 3490-500.
A | 7 9 0 0 16 0 1 0 0 11 6 9 6 1 8
C | 5 1 4 16 0 15 0 0 0 3 5 5 0 2 0
G | 4 4 1 0 0 0 15 0 16 0 3 0 0 2 0
T | 0 2 11 0 0 1 0 16 0 2 2 2 10 11 8
//
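The tab layout above can be read with a few lines of Python. This is a sketch under assumptions inferred from the example only: lines starting with ";" are treated as comments, "|" separates the residue from its values, and "//" ends the matrix. The names parse_tab and transpose are hypothetical.

```python
def parse_tab(text):
    """Read a tab-format count matrix into {residue: [values per position]}."""
    counts = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(";"):
            continue                      # skip blank lines and comments
        if line.startswith("//"):
            break                         # end of the matrix
        fields = line.replace("|", " ").split()
        counts[fields[0]] = [float(v) for v in fields[1:]]
    return counts


def transpose(counts, residues="ACGT"):
    """Return one row per position, one column per residue."""
    width = len(counts[residues[0]])
    return [[counts[r][j] for r in residues] for j in range(width)]


example = """; MET4 matrix, from Gonze et al. (2005). Bioinformatics 21, 3490-500.
A | 7 9 0 0 16 0 1 0 0 11 6 9 6 1 8
C | 5 1 4 16 0 15 0 0 0 3 5 5 0 2 0
G | 4 4 1 0 0 0 15 0 16 0 3 0 0 2 0
T | 0 2 11 0 0 1 0 16 0 2 2 2 10 11 8
//"""

met4 = parse_tab(example)
rows = transpose(met4)
print(rows[0])                  # counts of A, C, G, T at the first position
print({sum(row) for row in rows})
```

The transposed rows have the orientation used by position-per-row formats such as TRANSFAC.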
JASPAR (input/output)
http://jaspar.genereg.net/html/TEMPLATES/help.html
> Mycn
A [ 0 29 0 2 0 0 ]
C [31 0 30 1 3 0 ]
G [ 0 0 0 28 0 31]
T [ 0 2 1 0 28 0 ]
MSCAN (input)
http://www.cisreg.ca/cgi-bin/mscan/MSCAN
>mef2
10 0 0 0 22 0 6 2 3 4 22 10
0 2 12 0 0 0 0 0 0 0 0 0
9 20 2 0 0 0 0 0 0 0 0 10
3 0 8 22 0 22 16 20 19 18 0 2
>myf
7 9 4 0 16 7 0 6 0 0 6 0
8 0 2 15 0 0 15 0 0 10 0 0
1 7 10 1 0 9 1 0 16 6 0 16
0 0 0 0 0 0 0 10 0 0 10 0
meme (input)
Output file from MEME, the pattern-discovery program developed by
Tim Bailey. This file contains one or several matrices, plus
additional information on the parameters used for pattern
discovery (e.g. prior residue frequencies).
http://meme.nbcr.net/meme/doc/meme-format.html
Background letter frequencies
A 0.303 C 0.183 G 0.209 T 0.306
MOTIF crp alternative name
letter-probability matrix: alength= 4 w= 19 nsites= 17 E= 4.1e-009
0.000000 0.176471 0.000000 0.823529
0.000000 0.058824 0.647059 0.294118
0.000000 0.058824 0.000000 0.941176
0.176471 0.000000 0.764706 0.058824
0.823529 0.058824 0.000000 0.117647
0.294118 0.176471 0.176471 0.352941
0.294118 0.352941 0.235294 0.117647
0.117647 0.235294 0.352941 0.294118
0.529412 0.000000 0.176471 0.294118
0.058824 0.235294 0.588235 0.117647
0.176471 0.235294 0.294118 0.294118
0.000000 0.058824 0.117647 0.823529
0.058824 0.882353 0.000000 0.058824
0.764706 0.000000 0.176471 0.058824
0.058824 0.882353 0.000000 0.058824
0.823529 0.058824 0.058824 0.058824
0.176471 0.411765 0.058824 0.352941
0.411765 0.000000 0.000000 0.588235
0.352941 0.058824 0.000000 0.588235
meme_block (input) older format from MEME
CIS-BP (input)
Format used in the CIS-BP database.
Similar to TRANSFAC, but without the AC/ID lines, and with the position
line labeled Pos instead of PO.
Cluster-Buster (cb) (input/output)
cluster-buster output file (usual extension .cb), which can be
used as input by various other programs (clover, trap). The
header line starts with a > (like in fasta format). The matrix
is then printed "vertically" on the following lines: each
column corresponds to one residue, and each row to a position
in the alignment. For TRAP (Roider et al, Bioinformatics,
2007), the "/name=" is necessary for the program to work.
>element1 /name=element1
0 4 2 14
12 0 0 8
8 0 1 11
20 0 0 0
....
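The vertical layout described above can be produced from per-residue counts as follows. This is only a sketch: the function name to_cluster_buster is hypothetical, and both the column order (A, C, G, T) and the single-space separator are assumptions, since the description does not state them.

```python
def to_cluster_buster(name, counts, residues="ACGT"):
    """Write a matrix vertically: one row per position, one column per residue."""
    width = len(counts[residues[0]])
    lines = [f">{name} /name={name}"]     # header with the /name= tag
    for j in range(width):
        lines.append(" ".join(str(counts[r][j]) for r in residues))
    return "\n".join(lines)


# First four positions of the example above, assuming columns are A, C, G, T.
counts = {
    "A": [0, 12, 8, 20],
    "C": [4, 0, 0, 0],
    "G": [2, 0, 1, 0],
    "T": [14, 8, 11, 0],
}
cb_text = to_cluster_buster("element1", counts)
print(cb_text)
```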
STAMP and STAMP-transfac (input/output)
Converts the matrix from/to a string in STAMP format
(http://www.benoslab.pitt.edu/stamp/help.html).
STAMP is a dialect of the TRANSFAC format, with important differences:
- the fields ID and AC are absent, and the matrix ID comes in the field DE
- the header row (PO) is not supported
- the positions start at 0 instead of 1
- there is no matrix delimiter (the double slash)
In addition, STAMP admits two variants:
sequences (input)
Create a matrix from a FASTA sequence file containing the pre-aligned sites.
The method just reads the sequences and counts the residue frequencies
at each position.
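The counting step can be sketched in Python as follows. The function names read_fasta and count_matrix are hypothetical, the four sites are made up for the example, and residues outside the given alphabet are simply ignored, which is an assumption rather than documented behaviour.

```python
def read_fasta(text):
    """Return the sequences of a FASTA-formatted string, in file order."""
    sequences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if current:
                sequences.append("".join(current))
            current = []
        elif line:
            current.append(line)
    if current:
        sequences.append("".join(current))
    return sequences


def count_matrix(sites, residues="ACGT"):
    """Count each residue at each position of pre-aligned, equal-length sites."""
    width = len(sites[0])
    if any(len(site) != width for site in sites):
        raise ValueError("pre-aligned sites must all have the same length")
    counts = {r: [0] * width for r in residues}
    for site in sites:
        for j, residue in enumerate(site.upper()):
            if residue in counts:         # skip residues outside the alphabet
                counts[residue][j] += 1
    return counts


fasta = """>site1
TCACGTGA
>site2
GCACGTGC
>site3
TCACGTGG
>site4
ACACGTGA"""

sites = read_fasta(fasta)
site_counts = count_matrix(sites)
for residue, row in site_counts.items():
    print(residue, row)
```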
patser (output)
This format can be used as input to scan sequences with
patser, the pattern-matching program developed by Jerry Hertz.
This is actually the same format as tab (described above), but
the only return field is the count matrix.
consensus (input/output)
Output file from consensus, the pattern-discovery program
developed by Jerry Hertz (Hertz et al., Comput Appl Biosci,
1990:6, 81-92). This file contains one or several matrices, plus
additional information on the parameters used for pattern
discovery (e.g. prior residue frequencies).
gibbs (input)
Output file from gibbs, the pattern-discovery program
developed by Andrew Neuwald (Lawrence et al. Science, 1993:
262, 208-214; Neuwald, et al. Protein Sci, 1995: 4, 1618-1632)
MotifSampler (input/output)
Output file from MotifSampler, the pattern-discovery program
developed by Gert Thijs (Thijs et al. Bioinformatics, 2001:17,
1113-1122).
alignAce (input)
Uniprobe (input)
http://the_brain.bwh.harvard.edu/uniprobe/downloads.php
Protein: Cbf1 Seed k-mer: ATCACGTG Enrichment Score: 0.499010437669239
A: 0.251714422716682 0.231020715440932 0.371175995676819 0.343515826416987 0.189181911178663 0.373249743142318 0.159425685466501 0.387398837326962 0.160370450851774 0.00579566973382471 0.984310428811586 0.000520578518462409 0.0512168242470759 0.00554791387069823 0.00108871328362558 0.436684281349379 0.106429865986653 0.0872652424535894 0.2779359708333 0.222894293683715 0.366796870220836 0.226022414885529
C: 0.10226033847082 0.315992694980937 0.148489261324769 0.182792315701972 0.406736016253256 0.213860951366744 0.324485360588445 0.0418650553618826 0.045745403962552 0.99055073718171 0.0105040001038691 0.975244090256611 0.0064124125195024 0.00433861322483728 0.0016092288172844 0.262472975122313 0.184027817720692 0.549338793818378 0.127202171464537 0.198102864294932 0.306135553163069 0.321957177096839
G: 0.130212757211399 0.266959960914667 0.154092799416608 0.241158374156534 0.15046928890062 0.0930127274890034 0.354968645980097 0.554463521652852 0.0956960762910429 0.00242503126912309 0.0010685902059978 0.00922202192948052 0.94180586404824 0.00940766558113401 0.973242502288466 0.0857562483507934 0.0600857360004131 0.129208087201377 0.253172180381345 0.437904236128241 0.1940215817188 0.116641876697499
T: 0.515812481601099 0.186026628663464 0.326241943581804 0.232533483724507 0.253612783667461 0.319876578001935 0.161120307964957 0.0162725856583034 0.698188068894632 0.00122856181534235 0.00411698087854669 0.0150133092954459 0.000564899185181348 0.98070580732333 0.0240595556106241 0.215086495177514 0.649456580292242 0.234187876526655 0.341689677320817 0.141098605893113 0.133045994897295 0.335378531320133
Encode
http://compbio.mit.edu/encode-motifs/
>SIX5_disc1 SIX5_GM12878_encode-Myers_seq_hsa_r1:MEME#1#Intergenic
G 0.008511 0.004255 0.987234 0.000000
A 0.902127 0.012766 0.038298 0.046809
R 0.455319 0.072340 0.344681 0.127660
W 0.251064 0.085106 0.085106 0.578724
T 0.000000 0.046809 0.012766 0.940425
G 0.000000 0.000000 1.000000 0.000000
T 0.038298 0.021277 0.029787 0.910638
A 0.944681 0.004255 0.051064 0.000000
G 0.000000 0.000000 1.000000 0.000000
T 0.000000 0.000000 0.012766 0.987234
infogibbs (input/output)
Output file from RSAT infogibbs.
infogibbs is a gibbs sampler based on the optimization of the
information content of the matrix (rather than the weight of
the sampled segments). infogibbs was developed by Matthieu Defrance.
assembly (input)
Output file from the program RSAT pattern-assembly. One assembly
file can contain zero, one or several assemblies. Each
assembly is converted to a position-specific scoring matrix by
taking, for each residue at each position, the score of the
most significant pattern (oligonucleotide) containing that
residue in this position of the assembly.
feature (input)
Output file from RSAT convert-features.
This format makes it possible to obtain a PSSM from a list of (supposedly
pre-aligned) sites. These sites can themselves have been
collected by scanning sequences with a matrix (matrix-scan) or
by searching string-based patterns in a sequence
(dna-pattern).
Converting features to matrices can for example be useful for
iterative refinement of a matrix (collecting sites with a
matrix, and building a new matrix from those sites).
Another application is to detect oligomers or dyads in a
sequence set, and build a matrix from these.
clustal (input)
Output of the popular multiple alignment program clustalw.
RETURN FIELDS FOR THE TAB-DELIMITED OUTPUT FORMAT
- counts
-
Each cell of the matrix indicates the number of occurrences of the
residue at a given position of the alignment.
- profile
-
The matrix is printed vertically (each matrix column becomes a row in
the output text). Additional parameters (consensus, information) are
indicated beside each position, and a histogram is drawn.
- crude frequencies
-
Relative frequencies are calculated as the counts of residues divided
by the total count of the column.
-
Fij=Cij/SUMi(Cij)
-
where
- Cij
-
is the absolute frequency (counts) of residue i at position j of the alignment
- Fij
-
is the relative frequency of residue i at position j of the alignment
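As a worked illustration of Fij=Cij/SUMi(Cij), the following Python sketch divides each count by its column total. The function name crude_frequencies is hypothetical; the counts are those of the JASPAR Mycn example shown earlier. Columns with a total of zero are not handled.

```python
def crude_frequencies(counts):
    """Fij = Cij / SUMi(Cij), computed column by column."""
    residues = list(counts)
    width = len(counts[residues[0]])
    column_sums = [sum(counts[r][j] for r in residues) for j in range(width)]
    return {
        r: [counts[r][j] / column_sums[j] for j in range(width)]
        for r in residues
    }


mycn_counts = {
    "A": [0, 29, 0, 2, 0, 0],
    "C": [31, 0, 30, 1, 3, 0],
    "G": [0, 0, 0, 28, 0, 31],
    "T": [0, 2, 1, 0, 28, 0],
}
crude = crude_frequencies(mycn_counts)
print(crude["C"][0])    # C is the only residue observed at the first position
print(crude["A"][1])    # 29 occurrences out of 31
```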
- frequencies corrected with pseudo-weights
-
Relative frequencies can be corrected by a pseudo-weight (b) to reduce
the bias due to the small number of observations.
-
F''ij=(Cij+b*Pi)/[SUMi(Cij)+b]
-
where
- Pi
-
is the prior frequency for residue i
- b
-
is the pseudo-weight, which is "shared" between residues according to
their prior frequencies.
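The correction can be sketched in Python as F''ij = (Cij + b*Pi) / (SUMi(Cij) + b). The function name corrected_frequencies is hypothetical, and the pseudo-weight b=1 and the equiprobable priors are arbitrary choices for the example, not values taken from the program.

```python
def corrected_frequencies(counts, priors, pseudo=1.0):
    """F''ij = (Cij + b*Pi) / (SUMi(Cij) + b), with b the pseudo-weight."""
    residues = list(counts)
    width = len(counts[residues[0]])
    column_sums = [sum(counts[r][j] for r in residues) for j in range(width)]
    return {
        r: [
            (counts[r][j] + pseudo * priors[r]) / (column_sums[j] + pseudo)
            for j in range(width)
        ]
        for r in residues
    }


# First two columns of the JASPAR Mycn example (31 sites per column).
two_columns = {"A": [0, 29], "C": [31, 0], "G": [0, 0], "T": [0, 2]}
equal_priors = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
corrected = corrected_frequencies(two_columns, equal_priors, pseudo=1.0)
print(corrected["C"][0])   # (31 + 0.25) / 32
print(corrected["A"][0])   # (0 + 0.25) / 32, no longer zero
```

With the correction, residues that were never observed get a small non-zero frequency, and each column still sums to 1.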
- weights
-
Weights are calculated according to the formula from Hertz (1999), as
the natural logarithm of the ratio between the relative frequency
(corrected for pseudo-weights) and the prior residue probability.
-
Wij=ln(F''ij/Pi)
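A minimal Python sketch of the weight formula is given below. The function name weights is hypothetical. The input frequencies are the corrected values of a single column with counts A=0, C=31, G=0, T=0, assuming a pseudo-weight of 1 and equiprobable priors. Because the frequencies are corrected, none of them is zero and the logarithm is always defined.

```python
import math


def weights(corrected, priors):
    """Wij = ln(F''ij / Pi) for every cell of the matrix."""
    return {
        r: [math.log(f / priors[r]) for f in row]
        for r, row in corrected.items()
    }


corrected_column = {
    "A": [0.0078125],
    "C": [0.9765625],
    "G": [0.0078125],
    "T": [0.0078125],
}
uniform_priors = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
w = weights(corrected_column, uniform_priors)
print(round(w["C"][0], 4))   # positive: C is more frequent than expected
print(round(w["A"][0], 4))   # negative: A is less frequent than expected
```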
- information
-
The crude information content is calculated according to the formula
from Hertz (1999).
-
Iij = Fij*ln(Fij/Pi)
-
In addition, we calculate a "corrected" information content which
takes pseudo-weights into account.
-
I''ij = F''ij*ln(F''ij/Pi)
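The per-cell information can be sketched in Python as follows. The same function applies to crude or corrected frequencies. The function name information is hypothetical, the frequencies are made up for the example, and a cell with zero frequency is given an information of 0 (the usual limit of x*ln(x)), which is an assumption about how empty cells are treated.

```python
import math


def information(freqs, priors):
    """Iij = Fij * ln(Fij / Pi); cells with Fij = 0 contribute 0."""
    return {
        r: [f * math.log(f / priors[r]) if f > 0 else 0.0 for f in row]
        for r, row in freqs.items()
    }


toy_freqs = {
    "A": [0.5, 1.0],
    "C": [0.5, 0.0],
    "G": [0.0, 0.0],
    "T": [0.0, 0.0],
}
flat_priors = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
info = information(toy_freqs, flat_priors)

# Information per column, then summed over all cells of the matrix.
column_info = [sum(info[r][j] for r in info) for j in range(2)]
total_info = sum(column_info)
print(column_info, total_info)
```

In this example the first column (two equiprobable residues) carries ln(2) and the second (a single residue) carries ln(4).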
- P-value
-
The P-value indicates the probability of observing at least Cij
occurrences of a residue at a given position of the matrix. It is
calculated with the binomial formula:
-
Pij = SUM(k=Cij..C.j) [ C.j! / (k!(C.j-k)!) ] * Pi^k * (1-Pi)^(C.j-k)
-
where
- Cij
-
is the number of occurrences of residue i at position j of
the matrix.
- C.j
-
is the sum of all residue occurrences at position j of the
matrix.
- Pi
-
is the prior probability of residue i.
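The binomial tail can be computed directly in Python. This sketch uses the standard binomial distribution, in which the exponent of (1-Pi) is C.j-k. The function name binomial_pvalue is hypothetical, and math.comb requires Python 3.8 or later.

```python
from math import comb


def binomial_pvalue(c_ij, c_j, p_i):
    """Probability of at least c_ij occurrences among c_j sites, given prior p_i."""
    return sum(
        comb(c_j, k) * p_i ** k * (1 - p_i) ** (c_j - k)
        for k in range(c_ij, c_j + 1)
    )


# A residue observed in all 16 sites of a column, with a prior of 0.25.
p_all = binomial_pvalue(16, 16, 0.25)
# A residue never observed: "at least 0 occurrences" is always true.
p_none = binomial_pvalue(0, 16, 0.25)
print(p_all, p_none)
```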
- parameters
-
Returns a series of parameters associated with the matrix. The list of
parameters to be exported depends on the input format (each pattern
discovery program returns specific parameters, which are more or less
related to each other but not identical).
-
Some additional parameters are optionally calculated:
- consensus
-
The degenerate consensus is calculated by collecting, at each
position, the list of residues with a positive weight. Contrary to
most applications, this consensus is thus weighted by prior residue
frequencies: a residue with a high frequency might not be represented
in the consensus if this frequency does not significantly exceed the
expected frequency. Uppercase letters are used to highlight weights >= 1.
-
The consensus is exported as a regular expression, and with the IUPAC
code for ambiguous nucleotides (http://www.chem.qmw.ac.uk/iupac/misc/naseq.html).
-
A (Adenine)
C (Cytosine)
G (Guanine)
T (Thymine)
R = A or G (puRines)
Y = C or T (pYrimidines)
W = A or T (Weak hydrogen bonding)
S = G or C (Strong hydrogen bonding)
M = A or C (aMino group at common position)
K = G or T (Keto group at common position)
H = A, C or T (not G)
B = G, C or T (not A)
V = G, A or C (not T)
D = G, A or T (not C)
N = G, A, C or T (aNy)
-
The strict consensus indicates, at each position, the residue with the
highest positive weight.
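The two consensus rules can be sketched in Python as follows. The function name consensus is hypothetical and the weights are made up. Two details are assumptions, since the description does not specify them: a position with no positive weight is written "n", and a position is written in uppercase when its highest weight is >= 1.

```python
# IUPAC code for each set of residues with a positive weight.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y",
    frozenset("AT"): "W", frozenset("CG"): "S",
    frozenset("AC"): "M", frozenset("GT"): "K",
    frozenset("ACT"): "H", frozenset("CGT"): "B",
    frozenset("ACG"): "V", frozenset("AGT"): "D",
    frozenset("ACGT"): "N",
}


def consensus(weights):
    """Return (degenerate, strict) consensus strings from a weight matrix."""
    residues = list(weights)
    width = len(weights[residues[0]])
    degenerate, strict = [], []
    for j in range(width):
        positive = {r: weights[r][j] for r in residues if weights[r][j] > 0}
        if not positive:
            degenerate.append("n")        # assumption: no positive weight
            strict.append("n")
            continue
        best = max(positive, key=positive.get)
        # assumption: uppercase when the highest weight is >= 1
        case = str.upper if positive[best] >= 1 else str.lower
        degenerate.append(case(IUPAC[frozenset(positive)]))
        strict.append(case(best))
    return "".join(degenerate), "".join(strict)


toy_weights = {
    "A": [1.2, 0.4, -0.1],
    "C": [-2.0, -2.0, -0.2],
    "G": [-2.0, 0.6, -0.3],
    "T": [-2.0, -2.0, -0.1],
}
degenerate_consensus, strict_consensus = consensus(toy_weights)
print(degenerate_consensus, strict_consensus)
```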
- information
-
The total information is calculated by summing the information content
of all the cells of the matrix. This parameter is already returned by
the program consensus (Hertz), but not by other programs.
- logo
-
Sequence logo, a visual representation of the motif, where each column
of the matrix is represented as a stack of letters whose size is
proportional to the corresponding residue frequency. The total height
of each column is proportional to its information content.
Sequence logos are generated using the freeware
program WebLogo.