Summary of file formats

fa / fasta & fai

.fa and .fasta files contain text in the classic format for representing nucleotide sequences:

>chr17

AAGCTTCTCACCCTGTTCCTGCATAGATAATTGCATGACAATTGCCTTGT

CCCTGCTGAATGTGCTCTGGGGTCTCTGGGGTCTCACCCACGACCAACTC

CCTGGGCCTGGCACCAGGGAGCTTAACAAACATCTGTCCAGCGAATACCT

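The .fai file is an index which accompanies the fa/fasta file.  It is typically created with "samtools faidx" and records the name, length and byte offset of each sequence, so that a program can fetch a region without reading the whole file.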
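As a rough illustration of what such an index holds, the sketch below computes the five values that a samtools-style .fai line records for each sequence (name, length, byte offset of the first base, bases per line, bytes per line).  The function name build_fai is hypothetical, and the code assumes a well-formed FASTA held in memory with uniform line lengths; it is not a replacement for "samtools faidx".

```python
def build_fai(fasta_bytes):
    """Return one (name, length, offset, line_bases, line_width) tuple per sequence.

    Hypothetical helper for illustration only; assumes well-formed FASTA bytes.
    """
    entries = []
    name = None
    length = offset = line_bases = line_width = 0
    pos = 0  # byte position of the start of the current line
    for line in fasta_bytes.splitlines(keepends=True):
        if line.startswith(b">"):
            if name is not None:
                entries.append((name, length, offset, line_bases, line_width))
            # The sequence name is the first word after '>'.
            name = line[1:].split()[0].decode()
            length = 0
            offset = pos + len(line)  # first base starts right after the title line
            line_bases = line_width = 0
        elif name is not None and line.strip():
            bases = len(line.rstrip(b"\r\n"))
            if line_bases == 0:
                # The first sequence line defines the line layout.
                line_bases, line_width = bases, len(line)
            length += bases
        pos += len(line)
    if name is not None:
        entries.append((name, length, offset, line_bases, line_width))
    return entries


example = b">chr17\nACGT\nACGT\nAC\n"
for entry in build_fai(example):
    # Tab-separated, in the same column order as a .fai line.
    print("\t".join(str(field) for field in entry))
```

With the offset and line layout known, the byte position of any base can be calculated directly, which is what makes random access to a large reference fast.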

fq / fastq

A fq/fastq file contains sequence information along with quality information about the sequence.  It is the typical non-proprietary output from a next-generation sequencer.

A fastq file normally uses four lines per sequence.

@chr17:95885:F:237/1

GCGAACACATCCATGTGCCGGGAGGATGGTGCACCCCAACTCCACAAGGACCCTTCCAGACCTCACTCCCTGGGTGCCGTCATGAGAGCC

+

@<@?DDDD>FF:DAF9FFFCAGF<F3AAFD>2ACEF?CFC@?;FB:?@?;D@>86';EE;AE376?########################

 

Line 1 begins with a ‘@’ character and is followed by a sequence identifier and an optional description (like a FASTA title line).

Line 2 is the raw sequence letters.

Line 3 begins with a ‘+’ character and is optionally followed by the same sequence identifier (and any description) again.

Line 4 encodes the quality values for the sequence in Line 2, and must contain the same number of symbols as letters in the sequence.
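The four-line layout described above can be read with a few lines of Python.  This is a minimal sketch: the names parse_fastq and phred33 are hypothetical, it assumes exactly four lines per record (no wrapped sequences), and it assumes the quality line uses the Phred+33 encoding, which the text above does not state and which older files may not follow.

```python
def parse_fastq(lines):
    """Yield (identifier, sequence, quality) from an iterable of FASTQ lines.

    Hypothetical helper; assumes four lines per record.
    """
    it = iter(lines)
    for header in it:
        header = header.rstrip("\r\n")
        if not header:
            continue  # tolerate blank lines between records
        seq = next(it, "").rstrip("\r\n")
        plus = next(it, "").rstrip("\r\n")
        qual = next(it, "").rstrip("\r\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record near %r" % header)
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ for %r" % header)
        yield header[1:], seq, qual


def phred33(qual):
    """Decode a quality string into integer scores, assuming Phred+33."""
    return [ord(ch) - 33 for ch in qual]


record = ["@read1 example\n", "ACGT\n", "+\n", "II#5\n"]
for name, seq, qual in parse_fastq(record):
    print(name, seq, phred33(qual))
```

The length check mirrors the rule for Line 4: the quality string must contain one symbol per base.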

sam

SAM stands for Sequence Alignment/Map format.  The SAM format is a text format for storing read alignments: optional header lines beginning with ‘@’ are followed by one line per read, each made up of tab-delimited ASCII columns.

@HD    VN:1.0 SO:unsorted

@SQ    SN:chr17      LN:81195210

@PG    ID:Bowtie     VN:0.12.7     CL:"bowtie -S hg19_chr17 reads.fq bowtie.sam"

chr17:95885:F:237/1   0      chr17  41292204      255    90M    *      0      0       GCGAACACATCCATGTGCCGGGAGGATGGTGCACCCCAACTCCACAAGGACCCTTCCAGACCTCACTCCCTGGGTGCCGTCATGAGAGCC    @<@?DDDD>FF:DAF9FFFCAGF<F3AAFD>2ACEF?CFC@?;FB:?@?;D@>86';EE;AE376?########################     XA:i:0 MD:Z:66C1T0G2T0A1C0T1T4C0T1G0C0T0G0 NM:i:14

A full explanation of the format and the data in each column can be found at:

http://samtools.sourceforge.net/SAM1.pdf
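To make the column layout concrete, here is a small sketch that splits one alignment line into the eleven mandatory SAM fields plus any optional TAG:TYPE:VALUE fields.  The function name parse_sam_line is hypothetical, header lines (those starting with ‘@’) are expected to be skipped by the caller, and only the integer (i) and float (f) tag types are converted; everything else is left as text.

```python
SAM_FIELDS = ("QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL")
INT_FIELDS = ("FLAG", "POS", "MAPQ", "PNEXT", "TLEN")


def parse_sam_line(line):
    """Parse one SAM alignment line into a dict (hypothetical helper)."""
    cols = line.rstrip("\r\n").split("\t")
    if len(cols) < len(SAM_FIELDS):
        raise ValueError("expected at least 11 tab-delimited columns")
    record = dict(zip(SAM_FIELDS, cols[:11]))
    for key in INT_FIELDS:
        record[key] = int(record[key])
    tags = {}
    for item in cols[11:]:
        # Optional fields look like NM:i:14 or MD:Z:66C1T0G2
        tag, typ, value = item.split(":", 2)
        if typ == "i":
            value = int(value)
        elif typ == "f":
            value = float(value)
        tags[tag] = value
    record["TAGS"] = tags
    return record


line = "read1\t0\tchr17\t41292204\t255\t4M\t*\t0\t0\tACGT\tIIII\tNM:i:1\tMD:Z:2C1"
rec = parse_sam_line(line)
print(rec["RNAME"], rec["POS"], rec["CIGAR"], rec["TAGS"])
```

For real work a dedicated library is a safer choice, since it also handles the header, the FLAG bits and the CIGAR string.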

bam & bam.bai

The bam format is a BGZF compressed version of a sam file.  The bam.bai file is an index of the bam file which speeds up access to the compressed data.  Again, full details of the format can be found at:

http://samtools.sourceforge.net/SAM1.pdf
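Because BGZF is a series of gzip-compatible blocks, ordinary gzip tools can decompress a bam file.  The sketch below uses this to check whether some bytes look like a bam file by decompressing the start of the stream and comparing it with the four-byte magic string "BAM\1" from the specification.  The function name looks_like_bam is hypothetical, and the check says nothing about whether the rest of the file is valid.

```python
import gzip
import io


def looks_like_bam(raw_bytes):
    """Return True if the gzip/BGZF stream starts with the BAM magic bytes.

    Hypothetical helper for illustration only.
    """
    try:
        with gzip.GzipFile(fileobj=io.BytesIO(raw_bytes)) as handle:
            return handle.read(4) == b"BAM\x01"
    except (OSError, EOFError):
        # Not gzip data at all, or a truncated stream.
        return False


# Stand-in data: two concatenated gzip members, loosely imitating BGZF blocks.
fake_bam = gzip.compress(b"BAM\x01") + gzip.compress(b"remaining payload")
print(looks_like_bam(fake_bam))
print(looks_like_bam(b"@HD\tVN:1.0\n"))
```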

vcf & vcf.idx

VCF is a text file format containing information about variant calls. It contains meta-information lines (beginning with ##), a header line (beginning with #CHROM), and then data lines, each containing information about a position in the genome.  Genotype information on samples for each position is optional.

The vcf.idx file is an index file to enable faster searching of the information by certain programs.

##fileformat=VCFv4.1

...

##contig=<ID=chr17,length=81195210>

##reference=file:///home/pi/ngs/chr17.fa

#CHROM POS    ID     REF    ALT    QUAL   FILTER INFO   FORMAT 1

chr17  41201130      .      A      G      4167.77 .       AC=2;AF=1.00;AN=2;DP=110;Dels=0.00;FS=0.000;HaplotypeScore=2.8933;MLEAC=2;MLEAF=1.00;MQ=36.72;MQ0=0;QD=37.89       GT:AD:DP:GQ:PL       1/1:0,109:109:99:4196,325,0

chr17  41201198      .      C      T      3911.77 .       AC=2;AF=1.00;AN=2;DP=102;Dels=0.00;FS=0.000;HaplotypeScore=2.3113;MLEAC=2;MLEAF=1.00;MQ=36.80;MQ0=0;QD=38.35       GT:AD:DP:GQ:PL       1/1:0,101:101:99:3940,304,0

More information is available online at:

http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-41
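The data lines shown above can be taken apart as follows.  This is a minimal sketch: the function name parse_vcf_line is hypothetical, all values are kept as text rather than converted to numbers, and the caller is expected to skip lines beginning with ‘#’.

```python
VCF_FIXED = ("CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO")


def parse_vcf_line(line):
    """Split one VCF data line into fixed fields, INFO and per-sample values.

    Hypothetical helper; values are left as strings.
    """
    cols = line.rstrip("\r\n").split("\t")
    if len(cols) < len(VCF_FIXED):
        raise ValueError("expected at least 8 tab-delimited columns")
    record = dict(zip(VCF_FIXED, cols[:8]))

    # INFO is a ';'-separated list of KEY=VALUE pairs or bare flags.
    info = {}
    if record["INFO"] != ".":
        for item in record["INFO"].split(";"):
            key, sep, value = item.partition("=")
            info[key] = value if sep else True
    record["INFO"] = info

    # Genotype columns are optional: FORMAT names the keys, and each
    # following column holds the ':'-separated values for one sample.
    samples = []
    if len(cols) > 9:
        keys = cols[8].split(":")
        for sample in cols[9:]:
            samples.append(dict(zip(keys, sample.split(":"))))
    record["SAMPLES"] = samples
    return record


line = "chr17\t41201130\t.\tA\tG\t4167.77\t.\tAC=2;AF=1.00;DP=110\tGT:AD:DP\t1/1:0,109:109"
rec = parse_vcf_line(line)
print(rec["CHROM"], rec["POS"], rec["REF"], rec["ALT"])
print(rec["INFO"]["DP"], rec["SAMPLES"][0]["GT"])
```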

ebwt

Genome index files for bowtie are labeled with an ebwt suffix.  These files are built from a reference sequence and enable faster searching of reads against a reference genome.

fa.amb / fa.ann / fa.bwt / fa.pac / fa.sa

These are index files created for bwa mapping.  Their content is as follows:

.amb is a text file that records the occurrences of N (or other non-ATGC characters) in the reference fasta.

.ann is a text file that records the reference sequences: name, length, etc.

.bwt is a binary file containing the Burrows-Wheeler transformed sequence.

.pac is a binary file containing the packed sequence (four bases are encoded in one byte).

.sa is a binary file containing the suffix array index.
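The idea behind packing four bases into one byte is that each of A, C, G and T needs only two bits.  The sketch below illustrates the principle; the bit order and the A=0, C=1, G=2, T=3 coding are assumptions made for the example, the function names are hypothetical, and the code is not claimed to reproduce the exact on-disk layout of a bwa .pac file.

```python
BASE_TO_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
CODE_TO_BASE = "ACGT"


def pack_bases(seq):
    """Pack an A/C/G/T string into bytes, two bits per base (illustrative)."""
    packed = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq):
        shift = 6 - 2 * (i % 4)  # first base goes in the two highest bits
        packed[i // 4] |= BASE_TO_CODE[base] << shift
    return bytes(packed)


def unpack_bases(packed, n_bases):
    """Recover the first n_bases bases from bytes made by pack_bases."""
    bases = []
    for i in range(n_bases):
        shift = 6 - 2 * (i % 4)
        bases.append(CODE_TO_BASE[(packed[i // 4] >> shift) & 0b11])
    return "".join(bases)


seq = "GATTACA"
packed = pack_bases(seq)
print(len(seq), "bases stored in", len(packed), "bytes")
print(unpack_bases(packed, len(seq)))
```

Two bits cannot represent N or other ambiguity codes, which is one reason such positions have to be recorded separately, as the .amb file does.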