1. Home
  2. Docs
  3. Introduction to NGS analysis
  4. Variant discovery
  5. Variant discovery on bowtie data using GATK

Variant discovery on bowtie data using GATK

To run GATK we must first symbolically link to the jar files which run various various programs required. Before then indexing the genome fasta file and running the GATK haplotype caller.

module load java

ln -s /software/genomics/picard-1.45/bin/CreateSequenceDictionary.jar

java -jar CreateSequenceDictionary.jar R= chr17.fa O= chr17.dict

java -jar GenomeAnalysisTK.jar \
-R chr17.fa \
-T HaplotypeCaller \
-I bowtie.sorted.bam \
-L chr17:41,196,311-41,277,499 \
-o bowtie_snps.vcf

However… we get an error.

##### ERROR MESSAGE: SAM/BAM file bowtie.sorted.bam is malformed: SAM file doesn't have any read groups defined in the header.  The GATK no longer supports SAM files without read groups

Unfortunately, the later versions of GATK has tightened the compliance of each input file to the defined file formats.  In our case, the read group information is missing from the SAM file.  GATK detects our file has an incorrect/incomplete header, so we need to fix this using picard tools.

ln -s /software/genomics/picard-1.45/bin/AddOrReplaceReadGroups.jar

java -jar AddOrReplaceReadGroups.jar I=bowtie.sorted.bam O=bowtie_final.bam SORT_ORDER=coordinate RGPU=na RGID=1 RGLB=input RGPL=Illumina RGSM=Company CREATE_INDEX=True


We can then go on and run GATK as before, but using the revised bam file.

java -jar GenomeAnalysisTK.jar \
-R chr17.fa \
-T HaplotypeCaller \
-I bowtie_final.bam \
-L chr17:41,196,311-41,277,499 \
-o bowtie_snps.vcf

Explanation of parameters:

-R Reference genome
-T Select which tools to run
-I Input file
-O Output file
-L Limit analysis to defined region

 Example Output:


Check the number of variants identified

grep -v "^#" bowtie_snps.vcf | wc -l

Explanation of parameters:

-v Inverse results – return lines not matching
“^#”  Find lines starting with a # character
| wc Pipe into word count