RSAT - supported sequence formats
RSAT is dedicated to the analysis of regulatory sequences and thus only deals with DNA sequences. The following formats are supported.
raw
The text contains a single DNA sequence without any comment or other text. The sequence should only contain the letters corresponding to nucleotides (A, C, G, T). Uppercase and lowercase letters are considered equivalent. Tabs (\t), blank spaces and newline characters (\n) are accepted (they are automatically removed before analysis).
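The cleaning rules described above can be sketched in a few lines of Python. This is only an illustration, not the code used by RSAT: the function name `read_raw` is hypothetical, and raising an error on letters other than A, C, G, T is an assumption made for this sketch.

```python
def read_raw(text: str) -> str:
    """Return the single sequence held in a raw-format text."""
    # Tabs, blank spaces and newlines are removed before analysis.
    sequence = "".join(text.split())
    # Uppercase and lowercase are equivalent; normalise to uppercase here.
    sequence = sequence.upper()
    # Assumption of this sketch: anything other than A, C, G, T is rejected.
    invalid = set(sequence) - set("ACGT")
    if invalid:
        raise ValueError(f"unexpected characters: {sorted(invalid)}")
    return sequence


if __name__ == "__main__":
    print(read_raw("gcggtgcc CGGCCC\nAGCC\tACAT\n"))
```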
Example of raw format (a single 60 bp sequence)
GCGGTGCCCGGCCCAGCCACATATATATAGGTGTGTGCCA
CTCCCGGCCCCGGTATTAGC
multi
This is a variant of the raw format that allows several sequences to be included within the same text. As in the raw format, there are neither comments nor sequence identifiers. Tabs and white spaces are ignored. The difference is that each new line is considered to contain a distinct sequence.
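As an illustration of the one-sequence-per-line rule, here is a minimal Python sketch. It is not RSAT's implementation: the name `read_multi` is hypothetical, and skipping empty lines is an assumption of this sketch.

```python
from typing import List


def read_multi(text: str) -> List[str]:
    """Return one sequence per non-empty line of a multi-format text."""
    sequences = []
    for line in text.splitlines():
        # Tabs and white spaces inside a line are ignored.
        sequence = "".join(line.split()).upper()
        # Assumption of this sketch: empty lines do not produce a sequence.
        if sequence:
            sequences.append(sequence)
    return sequences


if __name__ == "__main__":
    for seq in read_multi("GCGG TGCC\nccct\ttcca\n"):
        print(seq)
```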
Example of multi format (two distinct sequences, 60 bp each)
GCGGTGCCCGGCCCAGCCACATATATATAGGTGTGTGCCACTCCCGGCCCCGGTATTAGC
CCCTTCCAGTTTCTTTTATTCTTTGCTGTTTCGAAGAATCACACCATCAATGAATAAATC
tab
Tab-delimited text file with at least two columns.
Each row contains information about one sequence.
- The first column contains the sequence ID.
- The second column contains the sequence, without any space.
- Additional columns are ignored.
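The column rules above could be read as follows in Python. This is a hedged sketch rather than RSAT code: `read_tab` is a hypothetical name, and both skipping blank rows and raising an error on rows with fewer than two columns are assumptions.

```python
from typing import List, Tuple


def read_tab(text: str) -> List[Tuple[str, str]]:
    """Return (identifier, sequence) pairs from tab-delimited text."""
    records = []
    for line in text.splitlines():
        if not line.strip():
            continue  # assumption: blank rows are skipped
        columns = line.split("\t")
        if len(columns) < 2:
            raise ValueError(f"expected at least two columns: {line!r}")
        # First column: sequence ID. Second column: the sequence.
        # Any additional columns are ignored.
        records.append((columns[0], columns[1]))
    return records


if __name__ == "__main__":
    print(read_tab("GAT1_up\tGCGGTGCC\tignored\nPUT4_up\tCCCTTCCA\n"))
```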
fasta
This format was originally defined for the program FastA, and is now used by a wide variety of programs (FastA, BLAST, Gibbs sampler, ...).
- A row starting with a ">" indicates that a new sequence will begin on the next row. The symbol ">" is immediately followed by the sequence identifier, a single word.
- After the ID, additional text on the same line is considered as comment (it is ignored in the sequence analysis).
- The sequence starts on the next line.
- White spaces, tabs and newline characters can be inserted within the sequence, and will be ignored for analysis.
- The sequence is read until the end of file is reached or another line is found which begins with a ">".
- Several sequences can be concatenated within the same text.
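The rules listed above translate into a short reader. The following Python sketch is illustrative only and is not taken from RSAT; the function name `read_fasta` and the choice to return the comment alongside the identifier are assumptions.

```python
from typing import List, Tuple


def read_fasta(text: str) -> List[Tuple[str, str, str]]:
    """Return (identifier, comment, sequence) triples from FASTA text."""
    records = []
    seq_id, comment, chunks = None, "", []
    for line in text.splitlines():
        if line.startswith(">"):
            # A new header closes the sequence that was being read.
            if seq_id is not None:
                records.append((seq_id, comment, "".join(chunks)))
            # The identifier is the first word; the rest is a comment.
            header = line[1:].split(None, 1)
            seq_id = header[0] if header else ""
            comment = header[1] if len(header) > 1 else ""
            chunks = []
        elif seq_id is not None:
            # White space inside the sequence is ignored.
            chunks.append("".join(line.split()))
    # The end of the text closes the last sequence.
    if seq_id is not None:
        records.append((seq_id, comment, "".join(chunks)))
    return records


if __name__ == "__main__":
    for record in read_fasta(">GAT1_up demo\nGCGG TGCC\nCGGC\n>PUT4_up\nCCCT\n"):
        print(record)
```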
Example of FastA format (2 sequences, 60 bp each)
>GAT1_up
GCGGTGCCCGGCCCAGCCACATATATATAGGTGTGTGCCA
CTCCCGGCCCCGGTATTAGC
>PUT4_up
CCCTTCCAGTTTCTTTTATTCTTTGCTGTTTCGAAGAATC
ACACCATCAATGAATAAATC
wconsensus
This is the format defined by Jerry Hertz for his programs patser, consensus and wconsensus. Lines beginning with a semicolon ";" or a "#" are considered as comments. On each non-comment line, the first word is the sequence identifier. The sequence follows, enclosed in backslashes (\).
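A minimal Python sketch of this layout is given below. It is not the parser used by RSAT or by Hertz's programs: the name `read_wconsensus` is hypothetical, and the sketch assumes that the identifier and the complete backslash-delimited sequence sit on the same line.

```python
from typing import List, Tuple


def read_wconsensus(text: str) -> List[Tuple[str, str]]:
    """Return (identifier, sequence) pairs from wconsensus-style text."""
    records = []
    for line in text.splitlines():
        stripped = line.strip()
        # Lines starting with ";" or "#" are comments.
        if not stripped or stripped[0] in ";#":
            continue
        parts = stripped.split(None, 1)
        seq_id = parts[0]
        rest = parts[1] if len(parts) > 1 else ""
        # The sequence is enclosed between two backslashes.
        start = rest.find("\\")
        end = rest.rfind("\\")
        if start == -1 or end == start:
            raise ValueError(f"no backslash-delimited sequence: {line!r}")
        records.append((seq_id, rest[start + 1:end]))
    return records


if __name__ == "__main__":
    demo = ";Locus GAT1\nGAT1_up \\GCGGTGCC\\\n"
    print(read_wconsensus(demo))
```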
Example of Wconsensus format (2 sequences, 60 bp each)
;sequence of the region upstream from gat1
;Locus GAT1
;ORF YFL021W coord: VI 95964 97496
;upstream region size: 60
;upstream region coord: VI 95904 95963
GAT1_up \GCGGTGCCCGGCCCAGCCACATATATATAGGTGTGTGCCACTCCCGGCCCCGGTATTAGC\
;sequence of the region upstream from put4
;Locus PUT4
;ORF YOR348C coord: XV 988773 986890
;upstream region size: 60
;upstream region coord: XV 988833 988774
PUT4_up \CCCTTCCAGTTTCTTTTATTCTTTGCTGTTTCGAAGAATCACACCATCAATGAATAAATC\
IG
IntelliGenetics format. All lines beginning with a semicolon ";" are considered as comments. The first non-comment line contains the sequence identifier (a single word without spaces). The sequence follows on the next lines. It can include spaces, tabs or newlines, which are ignored for sequence analysis. The end of each sequence is indicated by a termination character: 1 for linear, 2 for circular sequences. Several sequences can be concatenated within the same text.
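The reading logic described above can be sketched in Python as follows. This is an illustration under stated assumptions, not RSAT's code: the name `read_ig` is hypothetical, the terminator is assumed to be the last character of a sequence line, and a final sequence without terminator is silently dropped.

```python
from typing import List, Tuple


def read_ig(text: str) -> List[Tuple[str, str, str]]:
    """Return (identifier, sequence, topology) triples from IG text."""
    records = []
    seq_id, chunks = None, []
    for line in text.splitlines():
        stripped = line.strip()
        # Lines starting with ";" are comments.
        if not stripped or stripped.startswith(";"):
            continue
        if seq_id is None:
            # First non-comment line: the sequence identifier.
            seq_id = stripped.split()[0]
            continue
        # Spaces and tabs inside the sequence are ignored.
        chunk = "".join(stripped.split())
        if chunk[-1] in "12":
            # Termination character: 1 = linear, 2 = circular.
            topology = "linear" if chunk[-1] == "1" else "circular"
            chunks.append(chunk[:-1])
            records.append((seq_id, "".join(chunks), topology))
            seq_id, chunks = None, []
        else:
            chunks.append(chunk)
    return records


if __name__ == "__main__":
    demo = ";Locus GAT1\nGAT1_up\nGCGGTGCC\nCGGC1\n"
    print(read_ig(demo))
```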
Example of IG format
;sequence of the region upstream from gat1
;Locus GAT1
;ORF YFL021W coord: VI 95964 97496
;upstream region size: 60
;upstream region coord: VI 95904 95963
GAT1_up
GCGGTGCCCGGCCCAGCCACATATATATAGGTGTGTGCCA
CTCCCGGCCCCGGTATTAGC1
;sequence of the region upstream from put4
;Locus PUT4
;ORF YOR348C coord: XV 988773 986890
;upstream region size: 60
;upstream region coord: XV 988833 988774
PUT4_up
CCCTTCCAGTTTCTTTTATTCTTTGCTGTTTCGAAGAATC
ACACCATCAATGAATAAATC1
Mask option
Mask lowercase or uppercase letters, i.e. replace the characters of the selected case with N characters.
This option is useful with some sequence formats where lowercase letters indicate "soft masked" fragments. Soft masking is used by some genome centers to indicate exonic sequences.
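The effect of masking can be illustrated with a small Python sketch. The function `mask_case` and its `case` argument are hypothetical names chosen for this example; they do not correspond to actual RSAT option names.

```python
import re


def mask_case(sequence: str, case: str) -> str:
    """Replace every letter of the selected case with N."""
    if case == "lower":
        pattern = "[a-z]"
    elif case == "upper":
        pattern = "[A-Z]"
    else:
        raise ValueError("case must be 'lower' or 'upper'")
    # Each matching letter becomes a single N, so the length is preserved.
    return re.sub(pattern, "N", sequence)


if __name__ == "__main__":
    soft_masked = "GCGGtgcccggcCCAG"
    print(mask_case(soft_masked, "lower"))
    print(mask_case(soft_masked, "upper"))
```

Because each letter is replaced one for one, the coordinates of the unmasked fragments are unchanged.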
Remarks
- Most formats (except raw) support multiple sequences within the same text. This is convenient for analyzing a family of sequences (oligo analysis, pattern search).
- The program convert sequence performs interconversions between these sequence formats.