RSAT - retrieve-seq manual
Name
Description
Returns upstream, downstream or coding sequences for one or several genes.
Options
NP_310394.1 Escherichia_coli_O157H7
NP_313053.1 Escherichia_coli_O157H7
NP_416175.1 Escherichia_coli_K12
NP_418467.1 Escherichia_coli_K12
NP_753947.1 Escherichia_coli_CFT073
NP_756866.1 Escherichia_coli_CFT073
NP_288094.1 Escherichia_coli_O157H7_EDL933
NP_290677.1 Escherichia_coli_O157H7_EDL933
Warning
For the organism names, all spaces must be replaced by the underscore
character, as shown in the example.
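The naming rule above can be sketched in a few lines of Python. The helper name `to_rsat_organism_name` is hypothetical and not part of RSAT. Collapsing runs of whitespace into a single underscore is an assumption; the manual only states that spaces are replaced by underscores.

```python
def to_rsat_organism_name(name: str) -> str:
    """Return the organism name with whitespace replaced by underscores.

    Assumption: leading/trailing whitespace is dropped and any run of
    whitespace becomes one underscore.
    """
    return "_".join(name.split())


if __name__ == "__main__":
    print(to_rsat_organism_name("Escherichia coli K12"))
```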
Feature type
Currently supported:
- CDS: coding sequences (from start to stop codon, unspliced)
- mRNA: messenger RNA
- tRNA: transfer RNA
- rRNA: ribosomal RNA
- scRNA: small cytoplasmic RNA
The availability of some sequence types depends on the genome. For
example, some Genbank flat files contain annotations about CDSs but no
mRNA (e.g. bacterial annotations from the NCBI). Other genomes
contain separate annotations for CDS and mRNA
(e.g. A.thaliana). When mRNAs are annotated in Genbank, their
coordinates are stored and can be used.
The advantage of using mRNA is that, if the mRNA is complete (which is
not always the case), the upstream regions are retrieved relative to the
transcription initiation site, rather than the start codon.
Remarks
- One gene can be associated with multiple CDSs and with multiple mRNAs.
- Many annotated "mRNAs" actually appear to be CDSs (e.g. in June 2003,
12,000 out of 27,000 mRNAs from A.thaliana started with ATG).
Sequence type
Currently supported:
- Upstream: sequences located upstream of the coding region. The
origin is at the start codon.
- Downstream: sequences located downstream of the coding
region. The origin is at the stop codon.
- Unspliced CDS: DNA sequences located between the start and
stop codons. WARNING: introns are not spliced out (this will be
implemented in further versions).
Sequence limits (from, to)
Limits of the region to retrieve. Coordinates are calculated relative
to the start of the coding sequence.
Sign
- negative values return sequences located upstream of the origin
- positive values return sequences located downstream of the origin
(The origin itself depends on the sequence type, see above.)
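As an illustration of the sign convention, the following sketch converts a (from, to) pair into genomic coordinates. It is not the RSAT implementation. The function name, the 1-based inclusive coordinates, and the choice that offset 0 designates the origin base itself are all assumptions made for this example.

```python
def relative_to_genomic(origin: int, strand: str, rel_from: int, rel_to: int):
    """Convert limits relative to the origin into genomic coordinates.

    Assumptions (not taken from the manual): genomic positions are
    1-based and inclusive, and offset 0 is the origin base itself.
    Returns (left, right) with left <= right on the genome.
    """
    if rel_from > rel_to:
        raise ValueError("'from' must not be greater than 'to'")
    if strand == "+":
        # Upstream (negative offsets) lies at smaller genomic positions.
        return origin + rel_from, origin + rel_to
    if strand == "-":
        # On the reverse strand, upstream lies at larger genomic positions.
        return origin - rel_to, origin - rel_from
    raise ValueError("strand must be '+' or '-'")


if __name__ == "__main__":
    print(relative_to_genomic(1000, "+", -400, -1))
    print(relative_to_genomic(5000, "-", -400, -1))
```

Under these assumptions, a gene on the direct strand whose origin is at position 1000 yields the window 600 to 999 for limits -400 to -1, and a gene on the reverse strand yields a window located after its origin on the genome.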
Default values for upstream sequence retrieval
- For yeast, we generally obtain good results with upstream
regions from -800 to -1. About 99% of the known upstream elements are
located between these limits (source: Transfac).
- For bacteria, the distribution of regulatory sites depends
on the mode of regulation:
  - Transcriptional repressors generally bind proximally, and
  may overlap the transcription initiation site or even be located
  downstream. A good guess is from -200 to +50.
  - Binding sites for transcriptional activators have a more distal
  distribution (-400 to -1).
The default is from -400 to -1 from the start codon (since we
currently do not have annotations about transcription initiation
sites).
- The default values for each organism can be obtained with the
program supported-organisms.
Prevent overlap with neighbour genes:
It is quite frequent to find a predicted gene in close proximity
upstream from a query gene. If you want to discard these sequences from
your analysis, you should make sure this option is active.
When the option is checked, upstream sequences are automatically
clipped when a predicted gene is located within the range defined by
the option "from". The actual size retained for the upstream
sequence is indicated in the sequence comments.
Note that in some cases a known regulatory element is located
upstream of or within a predicted gene. This means either that the
predicted gene is an artifact, or that the same sequence is bifunctional
(coding and regulatory).
It is particularly important to activate this option when working with
bacteria, since many genes are located in operons, and have a very
close upstream neighbour.
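The clipping behaviour described in this section can be sketched as follows. This is a minimal illustration, not the RSAT code: the function name and the `intergenic_length` parameter (number of bases between the upstream neighbour gene and the origin) are hypothetical, and only the upstream limit is clipped.

```python
def clip_upstream(rel_from: int, rel_to: int, intergenic_length: int):
    """Clip the 'from' limit so the window stops at the neighbour gene.

    intergenic_length is the (assumed) number of bases separating the
    upstream neighbour gene from the origin of the query gene.
    Returns the clipped (from, to) pair, or None if nothing remains.
    """
    # A neighbour that touches or overlaps the query leaves no room.
    available = max(intergenic_length, 0)
    clipped_from = max(rel_from, -available)
    if clipped_from > rel_to:
        return None
    return clipped_from, rel_to


if __name__ == "__main__":
    # Neighbour gene 120 bases away: only 120 bases are retained.
    print(clip_upstream(-400, -1, 120))
    # Neighbour gene far away: the requested window is unchanged.
    print(clip_upstream(-400, -1, 1000))
    # Genes in an operon with no intergenic space: nothing to retrieve.
    print(clip_upstream(-400, -1, 0))
```

The third call mirrors the operon case mentioned above, where a very close upstream neighbour leaves little or no non-coding sequence.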
Admit imprecise positions:
In the annotations of some genomes, the limits of some genes are
imprecisely specified, by indicating an upper limit (e.g. <555245) or
a lower limit (e.g. >898098) rather than a precise value. Such
annotations can be found for example in the genomes
of Schizosaccharomyces pombe and Arabidopsis thaliana.
By default, these genes are not loaded. The option "Admit imprecise
positions" makes it possible to retrieve sequences for these genes as
well, using the imprecise coordinate as reference position.
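A minimal sketch of how such coordinates could be handled is shown below. The function name `parse_position` and its `admit_imprecise` flag are hypothetical and only mirror the behaviour described above: imprecise limits are skipped by default, and used as reference positions when the option is active.

```python
def parse_position(text: str, admit_imprecise: bool = False):
    """Parse a gene limit such as '555245', '<555245' or '>898098'.

    Returns (position, is_precise), or None when the limit is
    imprecise and imprecise positions are not admitted.
    """
    text = text.strip()
    imprecise = text.startswith(("<", ">"))
    if imprecise:
        if not admit_imprecise:
            return None  # default behaviour: the gene is not loaded
        text = text[1:]  # use the imprecise coordinate as reference
    return int(text), not imprecise


if __name__ == "__main__":
    print(parse_position("<555245"))
    print(parse_position("<555245", admit_imprecise=True))
    print(parse_position("898098"))
```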
Mask repeats
This option uses the genome version where repeats are masked (i.e. replaced by 'N' characters).
The presence of repetitive elements hampers the detection of motifs, especially for vertebrate genomes, because the composition of these repetitive sequences is very different from that of the rest of the genome.
This option is only valid for organisms with annotated repeats.
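To make the effect of masking concrete, here is a small sketch that replaces annotated repeat intervals by 'N' characters. RSAT uses pre-masked genome versions; this function is only an illustration, and its name and the 1-based inclusive interval convention are assumptions.

```python
def mask_repeats(sequence: str, repeats) -> str:
    """Replace every annotated repeat interval with 'N' characters.

    repeats is an iterable of (start, end) pairs, assumed here to be
    1-based and inclusive; parts outside the sequence are ignored.
    """
    masked = list(sequence)
    for start, end in repeats:
        first = max(start, 1) - 1       # convert to 0-based index
        last = min(end, len(masked))    # exclusive upper bound
        for i in range(first, last):
            masked[i] = "N"
    return "".join(masked)


if __name__ == "__main__":
    print(mask_repeats("ACGTACGTAC", [(3, 5)]))
```

The masked sequence keeps its original length, so coordinates relative to the origin remain valid after masking.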
List of organisms with annotated repeats
Anopheles_gambiae_EnsEMBL
Caenorhabditis_elegans_EnsEMBL
Canis_familiaris_EnsEMBL
Ciona_intestinalis_EnsEMBL
Danio_rerio_EnsEMBL
Drosophila_melanogaster_EnsEMBL
Gallus_gallus_EnsEMBL
Homo_sapiens_EnsEMBL
Mus_musculus_EnsEMBL
Oryzias_latipes_EnsEMBL
Pan_troglodytes_EnsEMBL
Rattus_norvegicus_EnsEMBL
Tetraodon_nigroviridis_EnsEMBL
Output sequence format:
The result can be displayed in various sequence formats.
- raw: the raw sequence without any identifier or comment.
- multi: several raw sequences concatenated.
- IG: IntelliGenetics format.
- FastA: the sequence format used by FastA, BLAST, Gibbs sampler and many other bioinformatics programs.
- Wconsensus: the format defined by Jerry Hertz for his programs (patser, consensus, wconsensus).
Sequence label
Sequences can be labeled (named) in different ways:
- gene identifier
- gene name
- gene id + gene name
- full: a concatenation of gene identifier, gene name, sequence type, from, to and strand. This option gives a full description of the conditions of sequence retrieval.
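The "full" label and the FastA output format can be illustrated together. In the sketch below, the "|" separator, the function names, and the example identifiers are hypothetical; the manual lists the fields of the full label but does not specify how they are joined.

```python
def full_label(gene_id, gene_name, seq_type, rel_from, rel_to, strand) -> str:
    """Concatenate the fields listed for the 'full' label.

    The '|' separator is an assumption made for this example.
    """
    fields = (gene_id, gene_name, seq_type, rel_from, rel_to, strand)
    return "|".join(str(field) for field in fields)


def to_fasta(label: str, sequence: str, width: int = 60) -> str:
    """Format one sequence as a FastA record with fixed-width lines."""
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return ">" + label + "\n" + "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Placeholder identifiers, not real annotations.
    label = full_label("GENE_ID_1", "geneA", "upstream", -400, -1, "+")
    print(to_fasta(label, "ACGT" * 20), end="")
```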