RSA-tools - Tutorials - position-analysis
Introduction
Position-analysis calculates the positional distribution of each oligonucleotide in a set of sequences. It also calculates a chi2 statistic for each oligonucleotide by comparing its observed and expected positional distributions. The expected distribution is calculated according to a homogeneous model, i.e. by considering that the probability of finding the oligonucleotide is the same at every position.
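The comparison between observed and expected distributions can be illustrated with a minimal Python sketch. It assumes that all position classes are equally wide, so the homogeneous expectation is simply the total count divided by the number of classes. The function name positional_chi2 is hypothetical, and the actual program may correct for details (for example, classes with fewer valid positions) that are ignored here.

```python
def positional_chi2(observed):
    """Chi-squared statistic of per-class occurrence counts against a
    flat (homogeneous) expectation. Simplifying assumption: every
    position class is equally likely to contain the oligonucleotide."""
    total = sum(observed)
    n_classes = len(observed)
    if total == 0 or n_classes == 0:
        return 0.0
    expected = total / n_classes  # same expected count in every class
    return sum((obs - expected) ** 2 / expected for obs in observed)


# An evenly spread oligonucleotide scores 0; a positional peak scores high.
flat = [10, 10, 10, 10]
peaked = [1, 1, 37, 1]
print(positional_chi2(flat))
print(positional_chi2(peaked))
```

The more the occurrences concentrate in a few position classes, the larger the statistic becomes.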
Position-analysis is typically useful for detecting biological signals which occupy a specific position relative to some reference position. In the original paper, it was used to detect termination and polyadenylation signals in yeast downstream sequences. For details, see:
van Helden, J., Olmo, M. & Perez-Ortin, J. E. (2000). Statistical analysis of yeast genomic downstream sequences reveals putative polyadenylation signals. Nucleic Acids Res 28(4), 1000-1010. Pubmed 10648794
Examples of utilization
Single nucleotide profiles nearby the start codon
As a short test, we will calculate single nucleotide distributions in the immediate neighbourhood of the start codon, for all yeast genes.
- Retrieve upstream sequences from -20 to +11 for all the yeast genes (as you have seen in the tutorial on sequence retrieval). This includes 20 bp of upstream sequence and 12 bp of coding sequence (i.e. the first 4 codons).
- Once you get the result, go to the bottom of the sequences. You should see a list of buttons, which allow you to send these sequences as input for another program. Click the button labelled position analysis.
- In the form, select the following options:
- Oligonucleotide size: 1
- Count on: single strand
- Class grouping interval: 1
- Sort patterns: inactivated
- Origin: -12 (this will place the origin at the first nucleotide of the start codon)
- Output: display (since sequences are small, the execution time should be reasonable)
- Leave all other options unchanged and click GO.
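To make the effect of these options concrete, here is a minimal Python sketch of a positional profile. It is not the program's implementation: the function positional_profile and its parameters are hypothetical, and the handling of a negative origin (counted from the end of each sequence) is an assumption based on the description of the Origin option above. The two toy sequences are invented and only mimic the 20 bp upstream + 12 bp coding layout.

```python
from collections import defaultdict


def positional_profile(sequences, k=1, class_interval=1, origin=0):
    """Count occurrences of each k-mer per position class, single strand.

    Assumption: a negative origin is measured from the sequence end,
    so origin=-12 on a 32-base sequence puts position 0 at the 21st base.
    """
    profile = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        seq = seq.upper()
        zero_index = len(seq) + origin if origin < 0 else origin
        for i in range(len(seq) - k + 1):
            pattern = seq[i:i + k]
            position = i - zero_index
            # Group positions into classes of width class_interval.
            window = (position // class_interval) * class_interval
            profile[pattern][window] += 1
    return profile


# Toy input: 20 upstream bases, then ATG and three more codons.
toy_sequences = [
    "C" * 20 + "ATG" + "GCT" * 3,
    "T" * 20 + "ATG" + "AAA" * 3,
]
profile = positional_profile(toy_sequences, k=1, class_interval=1, origin=-12)
for position in (0, 1, 2):
    print(position, {nt: counts[position] for nt, counts in profile.items()
                     if counts.get(position)})
```

With oligonucleotide size 1 and a class interval of 1, each column of the profile is simply the nucleotide composition at one position, which is why the start codon stands out at positions 0, 1 and 2.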
Interpreting the results
The first columns indicate the pattern (in this case single nucleotides), the total number of occurrences, and the chi2 value. These values are very high, indicating that each nucleotide is significantly biased in position in the region including the start codon.
You can now inspect the position distributions. The ATG codon appears very clearly, at positions 0 (A), 1 (T) and 2 (G). You can notice that there are a few exceptions, i.e. genes for which the annotated start codon is not ATG.
Interestingly, some nucleotides are also biased at pre-start positions, for example at position -3, which is enriched in A, whilst C and T seem to be avoided.
Additional exercises
Trinucleotide frequencies around the start codon
As an additional exercise, analyse the positional distributions of trinucleotides in the -30:+2 range around the start codon, with a class interval of 1. Adapt the origin to associate position 0 with the first nucleotide of the start codon.
When you have the result, analyse the trinucleotides which have a peak just before the start codon (position -1). What do they have in common? How do you interpret this?
Perform the same analysis with upstream sequences without the start codon (-30:-1). Some trinucleotides are avoided or favoured in the immediate vicinity of the start codon. Analyze their sequence and relate this result to the analysis of single nucleotides you performed before.
Downstream signal detection
The following exercise is an example of realistic size, with a 1.2 Mb sequence set. The analysis will take some time, and you should thus use the email output.
Retrieve all yeast downstream sequences over 200 bp, and analyse the positional profiles of all hexanucleotides. Set the threshold for the chi2 statistic to 50, in order to select significantly biased hexanucleotides. Sort the results by their score (the chi2 value).
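The thresholding and sorting steps amount to a simple filter followed by a descending sort, sketched below in Python. The function select_biased is hypothetical, the chi2 values are made up for illustration only, and treating the threshold as inclusive is an assumption.

```python
def select_biased(chi2_by_pattern, threshold=50.0):
    """Keep patterns whose chi2 reaches the threshold (assumed inclusive)
    and return them sorted by decreasing chi2."""
    selected = [(pattern, chi2) for pattern, chi2 in chi2_by_pattern.items()
                if chi2 >= threshold]
    return sorted(selected, key=lambda item: item[1], reverse=True)


# Made-up scores, NOT results from the yeast downstream analysis.
toy_scores = {"AAAAAA": 120.5, "ACGTAC": 12.0, "GGGCCC": 75.0}
for pattern, chi2 in select_biased(toy_scores, threshold=50.0):
    print(pattern, chi2)
```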
You can now come back to the tutorial main page and follow the next tutorials.