Bio::Tools::Phylo::PAML HOWTO

This document is copyright Aaron Mackey, 2002. For reproduction other than personal use please contact me at amackey@virginia.edu

Abstract

PAML is a package of C programs that implement Phylogenetic Analysis by Maximum Likelihood, written by Dr. Ziheng Yang, University College London. These programs implement a wide variety of models to explore the evolutionary relationships between sequences at the protein, codon, or raw DNA level. This document aims to explore and document how the BioPerl PAML parser and result objects "work".

____________________________________________________

Table of Contents

Background
Accessing PAML results

Background

The PAML package consists of many different executable programs, but the BioPerl Bio::Tools::Phylo::PAML object (hereafter referred to simply as the PAML object) focuses on dealing with the output of the main analysis programs "baseml", "codeml" (sometimes called "aaml") and "codemlsites" (a batch version of "codeml"). All of these programs use maximum likelihood methods to fit a mathematical model of evolution to sequence data provided by the user. The main difference between these programs is the type of sequence on which they operate (baseml for raw DNA, codeml for DNA organized as codons, aaml for amino acids).

While the general maximum likelihood approach used by the PAML programs is the same for all of them, the specific evolutionary models available for each sequence type vary greatly, as do the parameters specific to each model. The programs function in a handful of disparate modes, each requiring slight variations of inputs that can include:

1. multiply-aligned sequences, representing 1 or more distinct genes [ PAML parameter Mgene = 1 ], in 1 or more distinct datasets [ PAML ndata > 1 ]
2. a user-provided tree topology (or multiple tree topologies to be evaluated and contrasted)
3.
a set of instructions in a control file that specify the model (or models) to be used, various options that specify how to handle the sequence data (e.g. whether to dismiss columns with gaps or not [ cleandata = 1 ]), initial or fixed values for model parameters, and the filenames for other input data.

The output from PAML is directed to multiple "targets": data is written to the user-specified primary output file (conventionally named with an .mlc extension), as well as to various accessory files with fixed names (e.g. 2ML.t, 2ML.dN, 2ML.dS for pairwise Maximum Likelihood calculations) that appear in the same directory as the output file.

The upshot of these comments is that one PAML program "run" can potentially generate results for many genes, many datasets, many tree topologies and many evolutionary models, spread across multiple output files. Currently, the PAML programs deal with the various categories of multiple analyses in the following "top-down" order: datasets, genes, models, tree topologies. So how shall the BioPerl PAML module treat these sources of multiple results?

_________________________________________________________________

Accessing PAML results

The BioPerl PAML result parser takes the view that a distinct "recordset" or single, top-level PAML::Result object represents a single dataset. Each PAML::Result object may therefore contain data from multiple genes, models, and/or tree topologies. To parse the output from a multiple-dataset PAML run, the familiar "next_result" iterator common to other BioPerl modules is invoked.

Example 1. Iterating over results with next_result

   use Bio::Tools::Phylo::PAML;

   # Point the parser at the primary output file, the directory holding
   # the accessory files, and the control file used for the run.
   my $parser = Bio::Tools::Phylo::PAML->new(-file => "./output.mlc",
                                             -dir  => "./",
                                             -ctlf => "./codeml.ctl");

   # Each call to next_result returns the results for one dataset.
   while (my $result = $parser->next_result) {
      # do something with the results from this dataset ...
   }

In this example, we've created a new top-level PAML parser, specifying PAML's primary output file, the directory in which any other accessory files may be found, and the control file. We then trigger the parser to begin parsing the data, returning a new PAML::Result object for each dataset found in the output.

The PAML::Result object provides access to the wide variety of data found in the output files. The specific kinds of data available depend on which PAML analysis program was run, and on the mode and models employed. Generally, these include a recapitulation of the input sequences and their multiple alignment (which may differ slightly from the original input sequences due to the data "cleansing" PAML performs), descriptive statistics of the input sequences (e.g. codon usage tables, nucleotide or amino acid composition), pairwise Nei & Gojobori (NG) calculation matrices (for codon models), fitted model parameter values (including branch-specific parameters associated with any provided tree topology), reconstructed ancestral sequences (again, associated with an accompanying tree topology), or statistical comparisons of multiple tree topologies.
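
As a concrete illustration of the control file described under Background, the following is a minimal sketch in plain Perl (no BioPerl required) that renders a few "variable = value" lines of the kind PAML reads. The helper name render_ctl and the input file names are hypothetical; the Mgene, ndata and cleandata values simply echo the ones quoted in this document and are not a recommended configuration.

```perl
use strict;
use warnings;

# Render control-file text from an ordered list of key/value pairs.
# render_ctl is a hypothetical helper written for this sketch; it is
# not part of BioPerl or PAML.
sub render_ctl {
    my @pairs = @_;
    my $text  = '';
    # Take the pairs two at a time so the caller's order is preserved.
    while (my ($key, $value) = splice(@pairs, 0, 2)) {
        $text .= sprintf("%10s = %s\n", $key, $value);
    }
    return $text;
}

my $ctl = render_ctl(
    seqfile   => 'alignment.nuc',    # hypothetical alignment file name
    treefile  => 'topology.trees',   # hypothetical tree file name
    outfile   => 'output.mlc',       # primary output file, as in Example 1
    ndata     => 2,                  # more than one dataset (ndata > 1)
    Mgene     => 1,                  # value quoted in Background
    cleandata => 1,                  # value quoted in Background
);

print $ctl;
```

The resulting text could be written to a file such as codeml.ctl, the name used in Example 1. A real run needs further settings (for instance, the model to fit), which are deliberately left out of this sketch.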