To run GATK we must first symbolically link to the jar files which run various various programs required. Before then indexing the genome fasta file and running the GATK haplotype caller.
module load java
ln -s /software/genomics/picard-1.45/bin/CreateSequenceDictionary.jar
java -jar CreateSequenceDictionary.jar R= chr17.fa O= chr17.dict
java -jar GenomeAnalysisTK.jar \
-R chr17.fa \
-T HaplotypeCaller \
-I bowtie.sorted.bam \
-L chr17:41,196,311-41,277,499 \
-o bowtie_snps.vcf
However… we get an error.
##### ERROR MESSAGE: SAM/BAM file bowtie.sorted.bam is malformed: SAM file doesn't have any read groups defined in the header. The GATK no longer supports SAM files without read groups
Unfortunately, the later versions of GATK has tightened the compliance of each input file to the defined file formats. In our case, the read group information is missing from the SAM file. GATK detects our file has an incorrect/incomplete header, so we need to fix this using picard tools.
ln -s /software/genomics/picard-1.45/bin/AddOrReplaceReadGroups.jar
java -jar AddOrReplaceReadGroups.jar I=bowtie.sorted.bam O=bowtie_final.bam SORT_ORDER=coordinate RGPU=na RGID=1 RGLB=input RGPL=Illumina RGSM=Company CREATE_INDEX=True
We can then go on and run GATK as before, but using the revised bam file.
java -jar GenomeAnalysisTK.jar \
-R chr17.fa \
-T HaplotypeCaller \
-I bowtie_final.bam \
-L chr17:41,196,311-41,277,499 \
-o bowtie_snps.vcf
Explanation of parameters:
-R |
Reference genome |
-T |
Select which tools to run |
-I |
Input file |
-O |
Output file |
-L |
Limit analysis to defined region |
Example Output:

Check the number of variants identified
grep -v "^#" bowtie_snps.vcf | wc -l
Explanation of parameters:
-v | Inverse results – return lines not matching |
“^#” | Find lines starting with a # character |
| wc | Pipe into word count |
Output:
20