RSA-tools - tutorials - retrieve-seq

In the left frame, under the title Sequence retrieval, click on the tool "retrieve sequence". A form appears, allowing you to select the parameters of sequence retrieval.
Retrieving sequences for a selected set of genes
We will first retrieve upstream sequences for a set of 5 yeast genes involved in yeast phosphate metabolism.
In the pop-up menu "Organism", select Saccharomyces cerevisiae
In the text area under "Genes", type the following list of genes
YBR093C
PHO8
pho11
PHO81
Pho84
Leave all other options unchanged and click on the button GO
After a few seconds, the result is displayed in the form of a link to the sequences, followed by a series of buttons.
The sequences are not displayed directly because users usually don't need to see them: they are retrieved for subsequent analysis with other tools. This avoids transferring large amounts of data over the web when you are analyzing large sequence sets (e.g. several hundred genes). The default output mode (in the retrieve-seq form) is thus to keep the output on the server, rather than displaying it in your browser window.
Even if this server mode has been used, you can still check the sequences afterwards. In the result page, click on the link to see the upstream sequences of the selected genes. Check the sizes of the sequences obtained. Notice that one sequence is shorter than the others. Try to figure out the reason for this (if you don't know the answer, don't worry, it is given below).
Click on the Back button to come back to the previous result page. Below the link, you can see a series of buttons, which will allow you to send the retrieved sequences to the next task (as we will see in the next tutorials). For the time being, do not click these buttons.
Remarks

Gene names must be separated by carriage returns. Actually, only the first word of each line is considered as a query. You can add additional information on a line, but all the text following the first word of a line will be ignored.
Notice that gene names are case-insensitive.
Genes can be specified either by their identifier (e.g. YFL021W) or by their common name (e.g. GAT1).
Synonyms are supported: a gene can be associated with several names. For example, in S.cerevisiae, the ORF YFL021W is associated with the names GAT1, MEP80 and NIL1. Multiple names are quite frequent in yeast; they arise when the same gene was characterized independently in different laboratories, and people realized only later that they had isolated the same gene. The list of synonyms of each gene can be obtained with the tool gene-information.
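The remarks above can be illustrated with a short Python sketch. It is not the code used by retrieve-seq; the function name parse_gene_queries is hypothetical, and upper-casing is only one possible way to make names case-insensitive. Synonym lookup is not covered here.

```python
def parse_gene_queries(text):
    """Return one query per non-empty line: the first word, upper-cased."""
    queries = []
    for line in text.splitlines():
        words = line.split()
        if words:  # skip blank lines
            # Only the first word of the line is the query;
            # everything after it is ignored.
            queries.append(words[0].upper())
    return queries


example = """YBR093C
PHO8   any comment typed here is ignored
pho11
PHO81
Pho84"""

print(parse_gene_queries(example))
# ['YBR093C', 'PHO8', 'PHO11', 'PHO81', 'PHO84']
```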
Preventing overlap with neighbour genes

Come back to the retrieve sequence form.
Perform the same operations as above, but this time, inactivate the option Prevent overlap with neighbour genes.
Compare the size of the sequences with the previous result.

Interpretation of the results

The default size for yeast upstream sequences is 800 bp. The reason for this choice is that 99% of the regulatory elements found in TRANSFAC (the Transcription Factor Database) are located between -1 and -800, relative to the start codon.
However, the median distance between a yeast gene and its upstream neighbour is about 450 bp. Thus, with a fixed size of 800 bp, most upstream regions would include a fragment of coding sequence from the upstream ORF, which might bias subsequent analyses.
The situation is even worse in bacteria, due to the organization of genes into operons. In Escherichia coli, 25% of the genes have an upstream neighbour closer than 50 bp. When this neighbour is on the same strand as the gene, it might indicate that they belong to the same operon.
It is thus usually preferable to prevent overlap with neighbour genes.
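The effect of the option can be sketched in a few lines of Python. This is only an illustration of the clipping idea, not the program's actual implementation; the function upstream_range and its parameter neighbour_distance (the number of non-coding bp between the gene and its upstream neighbour) are hypothetical names.

```python
def upstream_range(neighbour_distance, max_size=800, prevent_overlap=True):
    """Return the (from, to) coordinates of the upstream region.

    Coordinates are relative to the start codon, as in the form.
    Returns None when no non-coding sequence is available.
    """
    size = max_size
    if prevent_overlap:
        # Stop at the neighbour gene instead of running into its ORF.
        size = min(max_size, neighbour_distance)
    if size <= 0:
        return None
    return (-size, -1)


# Neighbour at the median yeast distance (about 450 bp): the region is clipped.
print(upstream_range(450))                         # (-450, -1)
# Same gene with the option switched off: the full 800 bp are returned.
print(upstream_range(450, prevent_overlap=False))  # (-800, -1)
# Neighbour further away than the default size: nothing to clip.
print(upstream_range(1200))                        # (-800, -1)
```

This is why a gene with a close upstream neighbour yields a sequence shorter than the default size when the option is active.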

Retrieving sequences for all the genes of a genome

As an illustration of the retrieval of all sequences, we will now use the same program to retrieve sequences on the other side of the genes: downstream sequences.

Open the retrieve sequence form by clicking retrieve sequence in the left frame.
Make sure that the selected organism is Saccharomyces cerevisiae
Beside the label Genes, select "all".
Do not enter anything in the "Genes" box (it would be ignored anyway).
Select "downstream" as Sequence type.
In the From box, type "1".
In the To box, type "50".
In the pop-up menu Sequence label, select "gene identifier + name".
Click GO

With the above parameters, you retrieved the 50 bp located downstream of the stop codon, for each yeast gene. This sequence set thus mostly contains the 3' untranslated region (UTR). In yeast, the 3' UTR is involved in various functions such as termination of transcription, RNA maturation (cleavage and poly-adenylation), mRNA stability, and translational efficiency. Analyzing these sequences makes it possible to detect some of the signals involved in such functions (see van Helden, J., Olmo, M. & Perez-Ortin, J. E. (2000) for an application).

Reference position
The reference position (coordinate 0) depends on the type of sequences to be retrieved:
For upstream sequences:
     UPSTREAM         ORF
                    +--------> 
                    |start
--------------------------------
...  T  C  A  A  G  A  T  G ...
    -5 -4 -3 -2 -1  0  1  2
The reference position (coordinate 0) is the ORF start, i.e. the first nucleotide of the start codon.
Negative coordinates indicate sequences located upstream of the start codon, i.e. in the 5' non-coding flank of the gene (e.g. from -1 to -800 for yeast).
Positive coordinates indicate the 5' side of the coding sequence.
For downstream sequences:
      ORF        DOWNSTREAM
  ...------>
        stop
--------------------------------
...  T  A  G  A  C  G  T ...
    -2 -1  0  1  2  3  4
The reference position (coordinate 0) is the ORF end, i.e. the last nucleotide of the stop codon.
Positive coordinates are used to specify sequences located downstream of the stop codon, i.e. the (non-coding) 3' end flank of the gene.
Negative coordinates can be specified to retrieve the (coding) 3' side of the ORF.
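The two coordinate systems can be tried out with a small Python sketch. It assumes a gene on the forward strand of a plain sequence string (a gene on the reverse strand would additionally require reverse-complementing). The function relative_slice and the toy sequence are made up for illustration; the toy sequence simply joins the two diagrams above around a very short ORF.

```python
def relative_slice(seq, ref_index, frm, to):
    """Return seq from coordinate `frm` to `to` (inclusive), relative to
    `ref_index`, the 0-based index of the reference position (coordinate 0).
    """
    start = max(0, ref_index + frm)           # do not run off the left end
    end = min(len(seq), ref_index + to + 1)   # +1 because `to` is inclusive
    end = max(start, end)                     # empty result if range is outside
    return seq[start:end]


# Toy sequence: 5 bp upstream, ORF "ATGAAATAG", 4 bp downstream.
seq = "TCAAG" + "ATGAAATAG" + "ACGT"
orf_start = 5    # index of the A of ATG  (coordinate 0 for upstream)
orf_end = 13     # index of the G of TAG  (coordinate 0 for downstream)

print(relative_slice(seq, orf_start, -5, -1))  # TCAAG  (upstream flank)
print(relative_slice(seq, orf_start, 0, 2))    # ATG    (5' side of the ORF)
print(relative_slice(seq, orf_end, 1, 4))      # ACGT   (downstream flank)
print(relative_slice(seq, orf_end, -2, 0))     # TAG    (3' side of the ORF)
```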
Exercises

On the basis of the above information, extract all start codons for a given organism (say Saccharomyces cerevisiae). Check that the codons correspond to the expectation (ATG).

Extract all the stop codons for the same organism. Check that the stop codons correspond to the expectation (TAA, TAG or TGA).
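To check the result of these exercises, the retrieved sequences can be tallied with a few lines of Python. The sketch below assumes that the sequences were saved in FASTA format (a ">" header line followed by sequence lines); the function names and the gene names in the demo data are invented for illustration.

```python
from collections import Counter


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def count_sequences(lines):
    """Count how many times each distinct sequence occurs (case-insensitive)."""
    return Counter(seq.upper() for _, seq in read_fasta(lines))


# Demo with made-up records; with real data, pass an open file instead,
# e.g. count_sequences(open("start_codons.fasta")) for a hypothetical file.
demo = [">geneA", "ATG", ">geneB", "atg", ">geneC", "TTG"]
counts = count_sequences(demo)
print(counts.most_common())
# [('ATG', 2), ('TTG', 1)]
```

Any codon other than the expected ones stands out immediately in the counts.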

You can now come back to the tutorial main page and follow the next tutorials.
