#### README #### IMPORTANT: For data queries you can use Ensembl's data mining tool, Biomart http://www.ensembl.org/biomart/martview/. Please send comments or questions to helpdesk@ensembl.org ----------------------------------------- Ensembl Multi Format (EMF) FLATFILE DUMPS ----------------------------------------- This directory contains EMF flatfile dumps for resequencing data. To ease download there is an EMF file for each chromosome. All files are then compacted with GNU Zip for storage efficiency. In Ensembl, there are EMF files for different types of data: 1. Whole-genome multiple alignments created by the Ensembl Comparative Genomics team (compara). 2. Gene-based multiple alignments created by the Ensembl Comparative Genomics team (gene_alignment). 3. Alignments of resequencing data from individuals or strains to the reference genome assembly. (These are created by the Ensembl variation team). The file format is very similar for all data types. This README represents verion 1.0 of the EMF specification for resequencing data. ---------- FILE NAMES ---------- The files are consistently named following this pattern: .....emf.gz EXAMPLE EMF resequencing data file names: Mus_musculus.NCBIM36.43.resequencing.chromosome.Y.emf.gz Homo_sapiens.NCBI36.43.resequencing.chromosome.X.emf.gz Rattus_norvegicus.RGSC3.4.43.resequencing.chromosome.7.emf.gz ----------- FILE FORMAT ----------- -- Example -- ##FORMAT (resequencing) ##DATE Fri Nov 6 11:38:11 2009 ##RELEASE 57 58 SEQ mouse reference 17 780000 790000 1 SEQ mouse 129S1/SvJ WGS SEQ mouse DBA WGS SCORE aligned 129S1/SvJ reads SCORE aligned DBA reads DATA A A A 2 1 T A T 2 1 C C C 2 1 G G ~ 1 0 C C ~ 1 0 // -- Explanation -- There are three sections to the file: - the file header lines, starting with ## - the column header lines - the data block Comment character is "#" and must be the first character in the line. -- The header lines: ##FORMAT (resequencing) ##DATE dump_date ##RELEASE ensembl_release_number (may contain multiple release numbers) -- The column header lines: These lines start with "SEQ" and "SCORE". Each line in this section explains, in order, the columns of data in the data block. The number of lines in this section match the number of columns of data so there may be multiple SEQ and SCORE lines. The information is as follows: SEQ organism individual chr sequence_start sequence_stop strand (chr_length=sequence_length) SEQ organism individual sequence_source (WGS, etc.) SCORE score_type (may also be confidence score) -- The data block: This section starts with the line only containing the word "DATA". The number of columns here corresponds to the number of lines in the column header section. Spaces between the columns are optional. The data block ends with //. An insertion or deletion is represented by "-". Positions in the genome which have no resequencing coverage are denoted by "~". Lowercase characters are used for masked sequence. Ambiguity codes are used at heterozygous base pairs. All coordinates are inclusive coordinates.