RSAT: retrieve-seq manual

Name

1997-98 by

Description

Options

Genes

All: all genes of the selected organism are considered
Selection: user-specified list of genes
- gene list typed directly in the text area.
- Upload: select a text file on your computer that contains the list of genes

Several queries can be entered simultaneously, separated by carriage returns. The first word of each line is the query; any following information is ignored.
Valid queries are gene identifiers (e.g. YFL021W) or gene names (e.g. GAL4, NIL1).
By default, synonyms are accepted (e.g. NIL1 = GAT1), but only one gene name is returned as the sequence identifier (GAT1 in this case).
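
The query rules above can be sketched in a few lines of Python. This is only an illustration, not the code used by retrieve-seq: the function names and the one-entry synonym table are assumptions made for the example.

```python
# Toy synonym table (assumption for illustration; the real table is organism-specific).
SYNONYMS = {"NIL1": "GAT1"}

def parse_queries(text):
    """Return the first word of each non-empty line; anything after it is ignored."""
    queries = []
    for line in text.splitlines():
        words = line.split()
        if words:
            queries.append(words[0])
    return queries

def resolve(query):
    """Map a synonym to the single name reported as sequence identifier."""
    return SYNONYMS.get(query, query)

if __name__ == "__main__":
    text = "YFL021W  some free-text comment\nGAL4\n\nNIL1\n"
    print([resolve(q) for q in parse_queries(text)])
```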

Query contains only IDs (no synonyms)

The option IDs only indicates that the input queries contain only IDs, with no names or synonyms. This avoids loading the table of synonyms and thus reduces the time needed to obtain the result.

Organism

Single organism

Organism

Multiple organisms

The first column indicates the ID or the name of a query gene.
The second column indicates the organism to which the gene belongs.

Example

NP_310394.1     Escherichia_coli_O157H7
NP_313053.1     Escherichia_coli_O157H7
NP_416175.1     Escherichia_coli_K12
NP_418467.1     Escherichia_coli_K12
NP_753947.1     Escherichia_coli_CFT073
NP_756866.1     Escherichia_coli_CFT073
NP_288094.1     Escherichia_coli_O157H7_EDL933
NP_290677.1     Escherichia_coli_O157H7_EDL933
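
A two-column table of this kind can be read and grouped by organism as sketched below. The function name is hypothetical, and the sketch assumes that columns are separated by spaces or tabs and that lines with fewer than two columns can be skipped.

```python
from collections import defaultdict

def group_by_organism(text):
    """Group query genes (column 1) by organism name (column 2)."""
    groups = defaultdict(list)
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # assumption: incomplete lines are ignored
        gene, organism = fields[0], fields[1]
        groups[organism].append(gene)
    return dict(groups)

if __name__ == "__main__":
    table = (
        "NP_310394.1     Escherichia_coli_O157H7\n"
        "NP_416175.1     Escherichia_coli_K12\n"
        "NP_313053.1     Escherichia_coli_O157H7\n"
    )
    for organism, genes in group_by_organism(table).items():
        print(organism, genes)
```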

Warning

Feature type

CDS: coding sequences (from start to stop codon, unspliced)
mRNA: messenger RNA
tRNA: transfer RNA
rRNA: ribosomal RNA
scRNA: small cytoplasmic RNA

The availability of some sequence types depends on the genome. For example, some GenBank flat files contain annotations about CDSs but no mRNA (e.g. bacterial annotations from the NCBI). Other genomes contain separate annotations for CDS and mRNA (e.g. A.thaliana). When mRNAs are annotated in GenBank, their coordinates are stored and can be used.

The advantage of using mRNA is that, if the mRNA is complete (which is not always the case), the upstream regions are retrieved relative to the transcription initiation site rather than the start codon.

Remarks

One gene can be associated with multiple CDSs and with multiple mRNAs.
Many annotated "mRNAs" seem to be actually CDS (e.g. in June 2003, 12,000 out of 27,000 mRNAs from A.thaliana start with ATG).

Sequence type

Upstream: sequences located upstream of the coding region. The origin is at the start codon.
Downstream: sequences located downstream of the coding region. The origin is at the stop codon.
Unspliced CDS: DNA sequences located between the start and stop codons. WARNING: introns are not spliced out (this will be implemented in future versions).

Sequence limits (from, to)

Sign

Negative values return sequences located upstream of the origin.
Positive values return sequences located downstream of the origin.
The origin itself depends on the sequence type (see above).
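
As an illustration of the sign convention, the sketch below converts a pair of limits into a slice of a genome string. It is not the retrieve-seq implementation and rests on several assumptions: the gene lies on the direct strand, the origin is given as the 0-based index of the first base of the start codon, offset -1 is the base immediately upstream of the origin, and offset 0 is the origin itself.

```python
def extract_region(genome, origin, start, end):
    """Return the bases from offset `start` to offset `end` (inclusive),
    both relative to `origin`. Limits are clamped to the sequence ends."""
    if start > end:
        raise ValueError("'from' must not be larger than 'to'")
    lo = max(0, origin + start)
    hi = max(lo, min(len(genome), origin + end + 1))
    return genome[lo:hi]

if __name__ == "__main__":
    genome = "AAACCCGGGATGTTT"   # toy sequence, start codon ATG at index 9
    print(extract_region(genome, 9, -6, -1))  # six bases upstream of ATG
    print(extract_region(genome, 9, -3, 2))   # upstream bases plus the start codon
```

With a negative "from" and "to" set to -1, the slice ends just before the start codon, which corresponds to the usual upstream region.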

Default values for upstream sequence retrieval

For yeast, we generally obtain good results with upstream regions from -800 to -1. About 99% of the known upstream elements are located between these limits (source: Transfac).
For bacteria, the distribution of regulatory sites depends on the mode of regulation:
- Transcriptional repressors generally bind proximally, and may overlap the transcription initiation site or even be located downstream of it. A good guess is from -200 to +50.
- Binding sites for transcriptional activators have a more distal distribution (-400 to -1).
The default is from -400 to -1 relative to the start codon (since we currently do not have annotations about transcription initiation sites).
The default values for each organism can be obtained with the program supported-organisms.

Prevent overlap with neighbour genes:

When this option is checked, upstream sequences are automatically clipped when a predicted gene is located within the range defined by the "from" limit. The size actually retained for the upstream sequence is indicated in the sequence comments.

Note that in some cases a known regulatory element is located upstream of or within a predicted gene. This means either that the predicted gene is an artifact, or that the same sequence is bifunctional (coding and regulatory).

It is particularly important to activate this option when working with bacteria, since many genes are located in operons, and have a very close upstream neighbour.
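
The clipping described in this section can be sketched as follows. This is a simplified illustration under stated assumptions, not the actual algorithm: the gene is on the direct strand, positions are 0-based indices, `origin` is the first base of the start codon, and `neighbour_end` (a hypothetical parameter) is the last base of the nearest upstream gene, or None if there is none.

```python
def clip_upstream(origin, start, neighbour_end):
    """Return the 'from' offset actually used once the upstream neighbour
    is taken into account (never further upstream than requested)."""
    if neighbour_end is None:
        return start
    # First offset located after the neighbouring gene.
    first_free = neighbour_end + 1 - origin
    return max(start, first_free)

if __name__ == "__main__":
    # Neighbour ends 151 bases before the start codon: only 150 bases remain.
    new_start = clip_upstream(origin=1000, start=-400, neighbour_end=849)
    print(new_start, "retained size:", -new_start)
```

In an operon, where the neighbour ends only a few bases before the start codon, the retained size becomes correspondingly small.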

Admit imprecise positions:

Schizosaccharomyces pombe

Arabidopsis thaliana

By default, these genes are not loaded. The option "Admit imprecise positions" allows sequences to be retrieved for these genes as well, using the imprecise coordinate as the reference position.

Mask repeats

This option uses the version of the genome in which repeats are masked (i.e. replaced by 'N' characters). The presence of repetitive elements hampers the detection of motifs, especially in vertebrate genomes, because the composition of these repetitive sequences is very different from that of the rest of the genome. This option is only valid for organisms with annotated repeats.
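
To make the notion of masking concrete, the sketch below replaces annotated repeat intervals by 'N' characters. It is an illustration only: the masked genomes are prepared in advance rather than computed at query time, and the interval convention used here (0-based start, exclusive end) is an assumption of the example.

```python
def mask_repeats(sequence, repeats):
    """Replace every base inside a repeat interval by 'N'.
    `repeats` is a list of (start, end) pairs, 0-based, end exclusive."""
    masked = list(sequence)
    for start, end in repeats:
        for i in range(max(0, start), min(len(masked), end)):
            masked[i] = "N"
    return "".join(masked)

if __name__ == "__main__":
    print(mask_repeats("ACGTACGTAC", [(2, 5)]))
```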

List of organisms with annotated repeats

Anopheles_gambiae_EnsEMBL
Caenorhabditis_elegans_EnsEMBL
Canis_familiaris_EnsEMBL
Ciona_intestinalis_EnsEMBL
Danio_rerio_EnsEMBL
Drosophila_melanogaster_EnsEMBL
Gallus_gallus_EnsEMBL
Homo_sapiens_EnsEMBL
Mus_musculus_EnsEMBL
Oryzias_latipes_EnsEMBL
Pan_troglodytes_EnsEMBL
Rattus_norvegicus_EnsEMBL
Tetraodon_nigroviridis_EnsEMBL

Output sequence format:

raw: the raw sequence without any identifier or comment.
multi: several raw sequences concatenated.
IG: IntelliGenetics format.
FastA: the sequence format used by FastA, BLAST, Gibbs sampler and many other bioinformatics programs.
Wconsensus: the format defined by Jerry Hertz for his programs (patser, consensus, wconsensus).
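
For readers unfamiliar with these formats, the sketch below writes one sequence in raw and in FastA form. The function names and the line width of 60 are assumptions chosen for the example; they do not describe the exact layout produced by retrieve-seq.

```python
def to_raw(sequence):
    """Raw format: the sequence alone, without identifier or comment."""
    return sequence + "\n"

def to_fasta(identifier, sequence, width=60):
    """FastA format: a header line starting with '>', then the sequence
    split into lines of at most `width` characters."""
    lines = [">" + identifier]
    for i in range(0, len(sequence), width):
        lines.append(sequence[i:i + width])
    return "\n".join(lines) + "\n"

if __name__ == "__main__":
    print(to_fasta("GAT1", "ACGT" * 20), end="")
```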

Sequence label

gene identifier
gene name
gene id + gene name
full: a concatenation of gene identifier, gene name, sequence type, from, to and strand. This option gives a full description of the conditions of sequence retrieval.
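
The four label options can be sketched as below. The separator ("|"), the mode names, the strand code "D" and the identifier/name pair in the usage line are all assumptions made for illustration; the actual label layout may differ.

```python
def make_label(gene_id, gene_name, mode, seq_type=None,
               start=None, end=None, strand=None, sep="|"):
    """Build a sequence label according to the selected option."""
    if mode == "id":
        return gene_id
    if mode == "name":
        return gene_name
    if mode == "id+name":
        return sep.join([gene_id, gene_name])
    if mode == "full":
        fields = (gene_id, gene_name, seq_type, start, end, strand)
        return sep.join(str(field) for field in fields)
    raise ValueError("unknown label mode: " + mode)

if __name__ == "__main__":
    print(make_label("YFL021W", "GAT1", "full", "upstream", -800, -1, "D"))
```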