RSAT - get-orthologs-compara manual

DESCRIPTION

Returns orthologues, plus optionally paralogues and homeologues, for a set of genes in one or more organisms. Relies on primary data from Ensembl Compara.

AUTHORS

Bruno Contreras Moreira
Jacques van Helden

INPUT FORMAT

Query gene IDs can be entered directly or uploaded from an input file (option -i). Only Ensembl stable gene IDs, such as AT1G01140, are accepted; gene names such as CIPK9 are not valid. In an input file, the first word of each row is taken as a gene ID, and any additional information on the same row is ignored.
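
The "first word of each row" rule can be illustrated with a minimal Python sketch. This is not the tool's own parser: the function name read_query_ids is hypothetical, the second gene ID is only a placeholder, and skipping blank rows is an assumption.

```python
def read_query_ids(lines):
    """Return the first whitespace-delimited word of each non-empty row."""
    ids = []
    for line in lines:
        words = line.split()
        if words:  # assumption: blank rows are skipped
            ids.append(words[0])
    return ids


# Illustrative rows; only AT1G01140 comes from the manual, the rest is made up.
example_rows = [
    "AT1G01140\tCIPK9 free-text annotation",
    "",
    "AT1G01150 another comment",
]
print(read_query_ids(example_rows))
```

Everything after the first word (here the gene name and the comments) is discarded, so an annotated gene list can be used as input without editing.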

HOMOLOGY TYPES

Supported homology types and subtypes:

ortholog
paralog
homeolog

Note that in Compara, polyploids such as wheat are separated by subgenome (i.e. A, B, D), and labels (one2one, one2many, etc.) are not re-evaluated after the subgenomes are merged.

OUTPUT FORMAT and RETURN FIELDS

The output is a tab-separated file with seven columns. Each row describes a homology relation between a query gene and a target gene. The columns are:

target_id
ref_organism
type
query_id
query_organism
ident_target
ident_query
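
A downstream script could load such a file roughly as sketched below, in Python. The column names and their order are taken from the list above. Everything else is an assumption: the function name parse_homologs is hypothetical, the sample row is invented, and treating rows starting with ';' or '#' as comments or headers is a guess about the file layout.

```python
import csv
import io

# Column names in the order listed in this manual.
COLUMNS = ["target_id", "ref_organism", "type", "query_id",
           "query_organism", "ident_target", "ident_query"]


def parse_homologs(handle):
    """Parse tab-separated homology rows into a list of dictionaries."""
    rows = []
    for fields in csv.reader(handle, delimiter="\t"):
        # Assumption: blank rows and rows starting with ';' or '#' carry no data.
        if not fields or fields[0].startswith((";", "#")):
            continue
        if len(fields) != len(COLUMNS):
            raise ValueError(f"expected {len(COLUMNS)} columns, got {len(fields)}")
        row = dict(zip(COLUMNS, fields))
        # The two identity columns are percentages, so convert them to numbers.
        row["ident_target"] = float(row["ident_target"])
        row["ident_query"] = float(row["ident_query"])
        rows.append(row)
    return rows


# Invented sample; IDs, organism names and identities are placeholders.
sample = (
    "; hypothetical comment row\n"
    "TARGET_GENE_1\tzea_mays\tortholog\tAT1G01140\tarabidopsis_thaliana\t50.0\t75.0\n"
)
for row in parse_homologs(io.StringIO(sample)):
    print(row["query_id"], row["target_id"], row["type"], row["ident_query"])
```

Keying each row by column name rather than by position keeps later filtering (for example by type or by a minimum identity) readable.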

HOMOLOGY CRITERIA

Ensembl Compara gene orthology and paralogy predictions are generated by a pipeline in which maximum likelihood phylogenetic gene trees play a central role. These trees aim to represent the evolutionary history of gene families, i.e. genes that diverged from a common ancestor. The gene trees are reconciled with the species tree, and their internal nodes are annotated as either duplication or speciation events. This supports the annotation of orthologous and paralogous genes, which can be part of complex one-to-many and many-to-many relations.

Pairs of homologous sequences are scored in terms of % amino acid identity, which is calculated with respect to both query and target in order to capture differences in length and possibly in domain content. For example, if the selected species is Arabidopsis thaliana and the homologue is in maize, the query sequence is the A. thaliana protein and the target is the maize protein. In this case ident_query is the % of the query identical to the maize protein, and ident_target is the % of the maize protein identical to the A. thaliana protein. These % identities will only be the same if both sequences have the same length (number of amino acids).
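
The effect of sequence length on the two identity values can be made concrete with a small Python sketch. It assumes that each percentage is the number of identical aligned positions divided by the length of the respective protein; this is a simplified reading of the paragraph above, not code from Ensembl Compara. The function name percent_identity and all numbers are invented for illustration.

```python
def percent_identity(n_identical, query_len, target_len):
    """Return (ident_query, ident_target) as percentages.

    Assumption: both values share the same count of identical positions
    and differ only in the sequence length used as denominator.
    """
    ident_query = 100.0 * n_identical / query_len
    ident_target = 100.0 * n_identical / target_len
    return ident_query, ident_target


# Invented example: 150 identical residues, a 200 aa query and a 300 aa target.
print(percent_identity(150, 200, 300))  # (75.0, 50.0)
```

With equal lengths the two values coincide; the longer the target is relative to the query, the lower ident_target becomes for the same number of identical residues.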

[Adapted from: http://www.ensembl.org/info/genome/compara/homology_method.html].