RSAT - retrieve-variation-seq manual

NAME

retrieve-variation-seq

VERSION

2.0

DESCRIPTION

Given a set of variants in varBed format (see convert-variations), a list of dbSNP IDs, or genomic coordinates in bed format, retrieve the corresponding variants and their flanking sequences so that they can be scanned with the tool variation-scan.

AUTHORS

Walter Santana-Garcia
Jacques van Helden
Alejandra Medina-Rivera

CATEGORY

Genetic variations

INPUT DATA

Genomic coordinate file

A genomic coordinate file in bed format. The program only takes into account the first three columns of the bed file, which specify the genomic coordinates.

Example of bed file

 chr1   3473041 3473370
 chr1   4380371 4380650
 chr1   4845581 4845781
 chr1   4845801 4846260

The definition of the BED format is provided on the UCSC Genome Browser web site (http://genome.ucsc.edu/FAQ/FAQformat#format1).

The three columns taken into account by this program are described below.

1. chrom

The name of the chromosome (e.g. chr3, chrY, chr2_random) or scaffold (e.g. scaffold10671).

2. chromStart

The starting position of the feature in the chromosome or scaffold. For RSAT programs, the first base in a chromosome is numbered 1 (this differs from the UCSC-specific zero-based notation for the start).

Note from Jacques van Helden: the UCSC genome browser adopts a somewhat inconsistent convention for start and end coordinates: the start position is zero-based (the first nucleotide of a chromosome/scaffold has coordinate 0), and the end position is not included in the selection. This is equivalent to having a zero-based coordinate for the start and a one-based coordinate for the end. We find this representation completely counter-intuitive, and we therefore decided to adopt a "normal" convention, where:

The start and end positions represent the first and last positions included in the region of interest.
The start and end positions are provided in one-based notation (the first base of a chromosome or contig has coordinate 1).

3. chromEnd

The ending position of the feature in the chromosome or scaffold.
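
The relation between the two coordinate conventions described above can be illustrated with a short Python sketch. It is not part of RSAT: the function names are hypothetical, and the code only shows the arithmetic for moving between the UCSC convention (zero-based start, end excluded) and a one-based convention in which start and end are both included.

```python
def parse_bed_line(line):
    """Return (chrom, start, end) from the first three columns of a bed line.

    Columns are split on any whitespace; further columns are ignored.
    """
    fields = line.split()
    if len(fields) < 3:
        raise ValueError(f"expected at least 3 columns, got {len(fields)}")
    return fields[0], int(fields[1]), int(fields[2])


def ucsc_to_one_based(start, end):
    """Zero-based start with excluded end -> one-based start and end, both included."""
    # Only the start shifts: the excluded zero-based end has the same
    # numeric value as the included one-based end.
    return start + 1, end


def one_based_to_ucsc(start, end):
    """One-based start and end, both included -> zero-based start with excluded end."""
    return start - 1, end


def region_length_one_based(start, end):
    """Number of positions covered when start and end are both included."""
    return end - start + 1
```

For instance, the first 10 nucleotides of a chromosome are written 0-10 in the UCSC convention and 1-10 in the one-based convention, so `ucsc_to_one_based(0, 10)` returns `(1, 10)`.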

Variation file

A variation file in varBed format, see convert-variations.

Variation ID list

A list of dbSNP IDs.

OUTPUT FORMAT

A tab-delimited file with the following columns.

1. chrom

The name of the chromosome (e.g. 1, X, 8...)

2. chromStart

The starting position of the feature in the chromosome

3. chromEnd

The ending position of the feature in the chromosome

4. chromStrand

The strand of the feature in the chromosome

5. variation id

ID of the variation

6. SO term

Sequence Ontology (SO) term of the variation

7. ref variant

Allele of the variation in the reference sequence

8. variant

Allele of the variation in the current sequence (the one reported in column 10)

9. allele frequency

Allele frequency

10. sequence

Sequence of the current variant, flanked by a user-specified neighbouring region
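
As an illustration of the column layout above, the following Python sketch reads such a tab-delimited file into a list of dictionaries. The field names (`chrom`, `chrom_start`, and so on) are labels chosen here for readability, not identifiers defined by RSAT. The sketch also assumes that lines starting with `;` or `#` are comments or headers, which may not match the actual output of the tool.

```python
# Labels for the 10 output columns, in order (names are illustrative only).
COLUMNS = [
    "chrom", "chrom_start", "chrom_end", "strand", "variation_id",
    "so_term", "ref_variant", "variant", "allele_freq", "sequence",
]


def read_variation_seq(handle):
    """Parse tab-delimited lines from a file-like object into dictionaries."""
    rows = []
    for line in handle:
        line = line.rstrip("\r\n")
        # Assumption: blank lines and lines starting with ';' or '#' carry no data.
        if not line.strip() or line.startswith((";", "#")):
            continue
        fields = line.split("\t")
        if len(fields) < len(COLUMNS):
            raise ValueError(f"expected {len(COLUMNS)} columns, got {len(fields)}")
        row = dict(zip(COLUMNS, fields))
        # Coordinates are converted to integers; the allele frequency is kept
        # as text because it may be missing or non-numeric.
        row["chrom_start"] = int(row["chrom_start"])
        row["chrom_end"] = int(row["chrom_end"])
        rows.append(row)
    return rows
```

A typical call would be `read_variation_seq(open(path))`, where `path` points to a result file saved from the tool.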

OPTIONS

Organism

Name of the organism from whose genome the flanking sequences will be retrieved.

Input a list of dbSNP variation IDs (rsID), a set of variants in varBed format, or genomic regions in bed format

Set of variants in varBed format (see convert-variations), list of dbSNP IDs or genomic coordinates in bed format that will be used to retrieve the corresponding variants and their flanking sequences.
The data can be provided either as text or as a file.

Input format

Format of the current input data (varBed, bed or id).

Length of flanking sequence on each side of the variant

Length of the flanking sequences that will be retrieved around the variants.
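
To make the effect of the flank length concrete, here is a minimal Python sketch that extracts a variant together with its flanks from a sequence held in memory. It is an illustration only. The function `flanked_sequence` is hypothetical, the coordinates are assumed to be one-based with start and end included, and truncating the flanks at the sequence boundaries is an assumption about edge handling, not documented behaviour of the tool.

```python
def flanked_sequence(chrom_seq, start, end, flank):
    """Return the region start..end (one-based, both included) of chrom_seq,
    extended by `flank` positions on each side."""
    if flank < 0:
        raise ValueError("flank length must be non-negative")
    if not 1 <= start <= end <= len(chrom_seq):
        raise ValueError("coordinates fall outside the sequence")
    # Assumption: flanks are truncated when they would run past either end.
    left = max(1, start - flank)
    right = min(len(chrom_seq), end + flank)
    # Convert the one-based inclusive window to a Python slice.
    return chrom_seq[left - 1:right]
```

With a flank length of 2, a single-nucleotide variant at position 5 of `"ACGTACGTAC"` yields the 5-letter string `"GTACG"`, with the variant position in the middle.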

CONTACT

For further inquiries, please contact Jacques van Helden (Jacques.van-Helden@univ-amu.fr) or ask a question to the RSAT team.