PATSER manual

RSAT - Help about patser

Copyright 1990, 1994, 1995, 1996, 2000, 2001 Gerald Z. Hertz
May be copied for noncommercial purposes.

Author:
  Gerald Z. Hertz
  hertz@colorado.edu



PATSER (version 3d)

This program scores the L-mers (subsequences of length L) of the
indicated sequences against the indicated alignment or weight matrix.
The elements of an alignment matrix are simply the number of times
that the indicated letter is observed at the indicated position of a
sequence alignment.  Such elements must be processed before the matrix
can be used to score an L-mer (e.g., Hertz and Stormo, 1999,
Bioinformatics, 15:563-577).  A weight matrix is a matrix whose
elements are in a form considered appropriate for scoring an L-mer.

Each element of an alignment matrix is converted to an element of a
weight matrix by first adding pseudo-counts in proportion to the a
priori probability of the corresponding letter (see option "-b" in
section 1 below) and dividing by the total number of sequences plus
the total number of pseudo-counts.  The resulting frequency is
normalized by the a priori probability for the corresponding letter.
The final quotient is converted to an element of a weight matrix by
taking its natural logarithm.  The use of pseudo-counts here differs
from previous versions of this program by being proportional to the a
priori probability.
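
The conversion just described can be sketched in a few lines of Python.
This is an illustration only, not PATSER's source code: the function name
alignment_to_weights and its argument layout are hypothetical, and the a
priori probabilities are assumed to have been normalized to sum to 1.

```python
import math


def alignment_to_weights(counts, priors, pseudo_total=1.0):
    """Convert an alignment matrix to a weight matrix.

    counts: dict mapping each letter to a list of counts, one per position.
    priors: dict mapping each letter to its a priori probability (sum = 1).
    pseudo_total: total pseudo-counts per position (the "-b" value).
    """
    letters = list(counts)
    length = len(counts[letters[0]])
    weights = {letter: [] for letter in letters}
    for pos in range(length):
        # Total number of sequences contributing to this position.
        n_seqs = sum(counts[letter][pos] for letter in letters)
        for letter in letters:
            # Pseudo-counts are added in proportion to the prior.
            freq = (counts[letter][pos] + pseudo_total * priors[letter]) / (
                n_seqs + pseudo_total
            )
            # Normalize by the prior, then take the natural logarithm.
            # Note: with pseudo_total = 0, a zero count makes math.log fail.
            weights[letter].append(math.log(freq / priors[letter]))
    return weights


counts = {"A": [3, 0], "C": [0, 1], "G": [0, 1], "T": [1, 2]}
priors = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
weights = alignment_to_weights(counts, priors)
```

With four sequences and one pseudo-count, the element for A at the first
position is ln(((3 + 0.25) / 5) / 0.25) = ln(2.6).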

Version 3 of this program differs from previous versions by also
numerically estimating the p-value of the scores.  The p-value
calculated here is the probability of observing a particular score or
higher at a particular sequence position and does NOT account for the
total amount of sequence being scored.  P-values are estimated by the
method described in Staden, 1989, CABIOS, 5:89-96.  The relative
value for each element of the weight matrix is approximated by
integers in a range determined by the "-R" and "-M" options (section 7
below).  The p-value is calculated for each possible integer score and
the values are stored.  The actual scores for the sequences are
determined from the true weight matrix.  The true scores are converted
to their corresponding integer values and their p-values are looked up.
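
The table of p-values can be illustrated with a small Python sketch.  It
assumes that the integer approximation of each matrix column is already
available and that the letters of an L-mer are drawn independently
according to the a priori probabilities.  The function names
score_distribution and pvalue_table are hypothetical, and the code shows
only the general idea of building the score distribution one position at
a time; PATSER's own implementation may differ in its details.

```python
from collections import defaultdict


def score_distribution(int_columns, priors):
    """Probability of each attainable integer score.

    int_columns: list of dicts (one per position), letter -> integer score.
    priors: dict mapping each letter to its a priori probability.
    """
    dist = {0: 1.0}  # before any position is added, the score is 0
    for column in int_columns:
        nxt = defaultdict(float)
        for total, prob in dist.items():
            for letter, score in column.items():
                nxt[total + score] += prob * priors[letter]
        dist = dict(nxt)
    return dist


def pvalue_table(dist):
    """Map each attainable score s to P(score >= s)."""
    table = {}
    tail = 0.0
    for score in sorted(dist, reverse=True):
        tail += dist[score]
        table[score] = tail
    return table


priors = {"A": 0.5, "B": 0.5}
columns = [{"A": 1, "B": 0}, {"A": 2, "B": 0}]
table = pvalue_table(score_distribution(columns, priors))
```

In this toy case the four possible scores 0, 1, 2, and 3 are equally
likely, so the stored p-values are 1.0, 0.75, 0.5, and 0.25.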

Matrices can be either horizontal or vertical.  In a horizontal
matrix, the columns correspond to the positions within the pattern,
and the rows correspond to the letters.  Each row begins with the
corresponding letter (or integer, if the "-i" option is used).  In a
vertical matrix, the rows correspond to the positions within the
pattern, and the columns correspond to the letters.  The first row
contains the letters (or integers, if the "-i" option is used)
corresponding to each column.  In both types of matrices, spaces,
tabs, and vertical bars (|) are ignored.  The output of the "consensus"
and "wconsensus" programs consists of horizontal alignment matrices.
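
The two layouts can be read with the same small routine, sketched here in
Python.  The function name parse_matrix is hypothetical; the sketch
assumes letter symbols, assumes comments have already been removed, and
does not handle the case-sensitivity options.

```python
def parse_matrix(text, vertical=False):
    """Return a dict mapping each letter to its list of per-position values."""
    rows = []
    for line in text.splitlines():
        # Spaces, tabs, and vertical bars are only separators.
        tokens = line.replace("|", " ").split()
        if tokens:
            rows.append(tokens)
    if vertical:
        # First row holds the letters; each later row is one position.
        letters = rows[0]
        return {
            letter: [float(row[i]) for row in rows[1:]]
            for i, letter in enumerate(letters)
        }
    # Horizontal: each row starts with its letter; columns are positions.
    return {row[0]: [float(x) for x in row[1:]] for row in rows}


horizontal = parse_matrix("A | 3 0\nC | 0 1\n")
vertical = parse_matrix("A C\n3 0\n0 1\n", vertical=True)
```

Both calls describe the same two-position pattern, so they return the
same dictionary.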

The input files can contain comments according to the following
convention.  The portion of a line following a ';', '%', or '#' is
considered a comment and is ignored.  Comments can begin anywhere in a
line and always end at the end of the line.  The output of this
program is sent to the standard output.
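
The comment convention amounts to discarding everything from the first
';', '%', or '#' to the end of each line.  A minimal Python sketch (the
name strip_comments is hypothetical):

```python
import re

# Everything from the first ';', '%', or '#' to the end of the line.
COMMENT = re.compile(r"[;%#].*")


def strip_comments(text):
    """Remove comments from every line, keeping the line structure."""
    return "\n".join(COMMENT.sub("", line) for line in text.splitlines())


cleaned = strip_comments("A 1 2 ; counts\n# header\nC 3 4 % x")
```

Lines that consist only of a comment become empty lines rather than being
dropped, so line numbers in the input are preserved.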

The following options can be specified on the command line.

  0) -h: print these directions.


  1) Matrix options.
     -m filename: (default name is "matrix") file containing the matrix.
     -w: the matrix is a weight matrix (default: alignment matrix)

     -b number: a non-negative number indicating the total number of
         pseudo-counts added to each alignment position (default: 1).
         Before converting an alignment matrix to a weight matrix, the
         total pseudo-counts multiplied by the a priori probability
         (see section 3 below) of the corresponding letter is added
         to each matrix element.
     -v: the matrix is a vertical matrix (default: horizontal matrix).
     -p: print the weight matrix derived from the alignment matrix.

  2) -f filename: this file (default: read from the standard input) contains
        the names of the sequences.  The corresponding sequence may follow
        its name if the sequence is enclosed between backslashes (\).
        Otherwise, the sequence is assumed to be in a separate file having
        the indicated name.

        In the sequences, whitespace, slashes (/), periods, dashes (unless
        part of an integer when the "-i" option is used), and comments
        beginning with ';', '%', or '#' are ignored.  When using letter
        characters (i.e., with the "-a" or "-A" alphabet option), integers
        are also ignored so that the sequence file can contain positional
        information.  When using integer characters (i.e., with the "-i"
        alphabet option) the integers must be separated by whitespace.

        A "-c" preceding the name of a sequence file indicates that the
        corresponding sequence is circular.
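
As an illustration of the rules for letter sequences, the following
Python sketch reduces raw sequence text to the letters that would be
scored.  The function name clean_letter_sequence is hypothetical, and the
sketch covers only letter alphabets ("-a" or "-A"), not the "-i" option.

```python
import re


def clean_letter_sequence(raw):
    """Drop comments, whitespace, '/', '.', '-', and integers."""
    # Comments run from ';', '%', or '#' to the end of the line.
    no_comments = "\n".join(
        re.sub(r"[;%#].*", "", line) for line in raw.splitlines()
    )
    # Remaining ignorable characters, including positional numbers.
    return re.sub(r"[\s/.\-0-9]", "", no_comments)


sequence = clean_letter_sequence("1 ACGT-ACGT / 10 ; first line\n11 TT..GG")
```

Here the positional numbers, the dash, the slash, the periods, and the
comment are all discarded, leaving ACGTACGTTTGG.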


  3) Alphabet options---the three options in this section are mutually
     exclusive (default: "-a alphabet").  The a priori probabilities mentioned
     below are used when converting an alignment matrix to a weight matrix.
     -a filename: file containing the alphabet and normalization information.

        Each line contains a letter (a symbol in the alphabet) followed by an
        optional normalization number (default: 1.0).  The normalization is
        based on the relative a priori probabilities of the letters.  For
        nucleic acids, this might be the genomic frequency of the bases
        or the frequencies observed in the data used to generate the alignment.
        In nucleic acid alphabets, a letter and its complement appear on the
        same line, separated by a colon (a letter can be its own complement,
        e.g. when using a dimer alphabet). Complementary letters may use the
        same normalization number.  Only the standard 26 letters are
        permissible; however, when the "-CS" option is used, the alphabet is
        case sensitive so that a total of 52 different characters are possible.

        POSSIBLE LINE FORMATS WITHOUT COMPLEMENTARY LETTERS:
        letter
        letter normalization

        POSSIBLE LINE FORMATS WITH COMPLEMENTARY LETTERS:
        letter:complement
        letter:complement normalization
        letter:complement normalization:complement's_normalization

     -i filename: same as the "-a" option, except that the symbols of
        the alphabet are represented by integers rather than by letters.
        Any integer permitted by the machine is a permissible symbol.

     -A alphabet_and_normalization_information: same as "-a" option, except
        information appears on the command line (e.g., -A a:t 3 c:g 2).
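
The line formats above can be parsed as in the following Python sketch.
The function names are hypothetical.  Dividing each normalization number
by the sum of all of them is an assumption made here to obtain a priori
probabilities that sum to 1; the text above only states that the numbers
are relative.

```python
def parse_alphabet_line(line):
    """Return a list of (letter, complement_or_None, normalization)."""
    fields = line.split()
    symbols = fields[0].split(":")
    # The normalization defaults to 1.0 when it is omitted.
    norms = [float(x) for x in fields[1].split(":")] if len(fields) > 1 else [1.0]
    if len(symbols) == 1:
        return [(symbols[0], None, norms[0])]
    letter, complement = symbols
    if letter == complement:
        # A letter can be its own complement.
        return [(letter, letter, norms[0])]
    # The complement shares the normalization unless it has its own.
    comp_norm = norms[1] if len(norms) > 1 else norms[0]
    return [(letter, complement, norms[0]), (complement, letter, comp_norm)]


def priors_from_alphabet(lines):
    """Turn relative normalization numbers into probabilities."""
    entries = [e for line in lines if line.strip() for e in parse_alphabet_line(line)]
    total = sum(norm for _, _, norm in entries)
    return {letter: norm / total for letter, _, norm in entries}


# Same information as the command-line example "-A a:t 3 c:g 2".
priors = priors_from_alphabet(["a:t 3", "c:g 2"])
```

For this example the relative weights 3, 3, 2, and 2 give probabilities
of 0.3 for a and t and 0.2 for c and g.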

  4) Alphabet modifiers indicating whether ASCII alphabets are case
     sensitive---the two options in this section are mutually exclusive
     with each other and with the "-i" option (default: ASCII alphabets are
     case insensitive).
     -CS: ASCII alphabets are case sensitive.
     -CM: ASCII alphabets are case insensitive, but mark the location
          of lowercase letters by printing a line containing their locations.
          This option is useful when lowercase letters indicate a functional
          landmark such as a transcriptional start in a DNA sequence.

  5) Options for adjusting or restricting which scores are printed.
     The "-ls", "-li", and "-lp" options are mutually exclusive.

     -c: also score the complementary sequences.  The complements are
        determined by the program and are not explicitly stated in the
        sequence files.

     -ls number: lower threshold for printing scores, inclusive
                 (formerly the -l option).
     -li: assume that the maximum ln(p-value) for printing scores equals
          the negative of the sample-size adjusted information content;
          indirectly determines the lower threshold for printing scores.
     -lp number: the maximum ln(p-value) for printing scores; indirectly
                 determines the lower threshold for printing scores.

     -u number: upper threshold for printing scores, exclusive.


     -t: just print the top score for each sequence.
     -t number: print the indicated number of top scores for each sequence.
     -ds: if the "-t number" option is used, print the top scores for each
          sequence in the order of decreasing score (default: order the
          scores according to their position within the sequence).
     -e number: the small difference for considering 2 scores equal
                (default: 0.000001)
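
The effect of the score thresholds and the top-score options can be
sketched in Python as follows.  The function name select_hits and the
(position, score) tuple layout are hypothetical.  Applying the thresholds
before picking the top scores is an assumption, and the "-e" tolerance is
not modeled.

```python
def select_hits(hits, lower=None, upper=None, top=None, by_score=False):
    """Filter and order a list of (position, score) pairs.

    lower: inclusive lower threshold (as with "-ls").
    upper: exclusive upper threshold (as with "-u").
    top: number of top scores to keep (as with "-t number").
    by_score: order kept hits by decreasing score (as with "-ds").
    """
    kept = [
        (pos, score)
        for pos, score in hits
        if (lower is None or score >= lower) and (upper is None or score < upper)
    ]
    if top is not None:
        kept = sorted(kept, key=lambda hit: -hit[1])[:top]
        if not by_score:
            # Default: report in order of position within the sequence.
            kept.sort(key=lambda hit: hit[0])
    return kept


hits = [(1, 2.0), (2, 5.0), (3, 3.5), (4, 7.0)]
in_range = select_hits(hits, lower=3.5, upper=7.0)
top_two = select_hits(hits, top=2)
top_two_by_score = select_hits(hits, top=2, by_score=True)
```

The score 3.5 is kept because the lower threshold is inclusive, while the
score 7.0 is dropped because the upper threshold is exclusive.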

  6) Options indicating how unrecognized symbols are treated (default: -d1).
     Symbols are letters when option "-a" or "-A" is used;
     symbols are integers when option "-i" is used.
     The following three options are mutually exclusive.
     -d0: treat unrecognized symbols as errors and exit the program.
     -d1: treat unrecognized symbols as discontinuities, but print a warning.
          Treating a symbol as a discontinuity means that any L-mer
          containing the unrecognized symbol will be ignored.
     -d2: treat unrecognized symbols as discontinuities, and print NO warning.
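
The following Python sketch shows what treating a symbol as a
discontinuity means when scoring.  The function name score_lmers, the
mode strings, and the 0-based start positions are hypothetical choices
made for illustration.  The score of an L-mer is taken to be the sum of
the weight-matrix elements selected by its letters.

```python
import warnings


def score_lmers(sequence, weights, mode="d1"):
    """Score every L-mer; return a list of (start, score), start 0-based.

    weights: dict mapping each letter to a list of weights per position.
    mode: "d0" (error), "d1" (skip with warning), or "d2" (skip silently).
    """
    length = len(next(iter(weights.values())))
    hits = []
    for start in range(len(sequence) - length + 1):
        lmer = sequence[start:start + length]
        unknown = [ch for ch in lmer if ch not in weights]
        if unknown:
            if mode == "d0":
                raise ValueError(f"unrecognized symbol {unknown[0]!r}")
            if mode == "d1":
                warnings.warn(f"unrecognized symbol {unknown[0]!r} skipped")
            continue  # any L-mer containing the symbol is ignored
        score = sum(weights[ch][i] for i, ch in enumerate(lmer))
        hits.append((start, score))
    return hits


weights = {"A": [1.0, 0.0], "C": [0.0, 2.0]}
hits = score_lmers("ACNCA", weights, mode="d2")
```

In the sequence ACNCA, the two L-mers that overlap the N are ignored, so
only AC (start 0) and CA (start 3) are scored.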

  7) Options for adjusting the estimation of p-value.
     If the -R option is set to zero, the p-value is not estimated.
     -R number: the range for approximating a column of the weight matrix with
                integers (default: 10000).  This number is the difference
                between the largest and smallest integers used to estimate
                the scores.  Higher values increase precision, but will take
                longer to calculate the table of p-values.
     -M number: the minimum score for approximating p-values (default: 0).
                Higher values will increase precision,
                but may miss interesting scores.
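
The exact scaling used for the integer approximation is not spelled out
above.  The Python sketch below shows one plausible reading of the "-R"
option, under the assumption that the largest spread (maximum minus
minimum) within any column is mapped onto the requested range.  The
function name to_integer_columns is hypothetical, and the "-M" option is
not modeled.

```python
def to_integer_columns(weight_columns, int_range=10000):
    """Approximate weight-matrix columns with non-negative integers.

    weight_columns: list of dicts (one per position), letter -> weight.
    int_range: difference between the largest and smallest integer used
        for the widest column (assumed meaning of the "-R" value).
    Returns (integer columns, scale factor).
    """
    spread = max(max(col.values()) - min(col.values()) for col in weight_columns)
    scale = int_range / spread if spread > 0 else 1.0
    int_columns = []
    for col in weight_columns:
        low = min(col.values())
        # Shift each column so its smallest element becomes 0, then scale.
        int_columns.append(
            {letter: round((w - low) * scale) for letter, w in col.items()}
        )
    return int_columns, scale


columns = [{"A": 1.0, "C": -1.0}, {"A": 0.5, "C": 0.0}]
int_columns, scale = to_integer_columns(columns, int_range=100)
```

A larger range spreads the weights over more distinct integers, which is
why higher "-R" values give a finer approximation at the cost of a larger
table of p-values.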