Following the best practices described on the GATK tutorial here.
First step after aligning is to mark duplicate reads:
From picard help: "duplicate reads are defined as originating from a single fragment of DNA. Duplicates can arise during sample preparation e.g. library construction using PCR"
So I encountered an error.
Based on some answers here, I think I need the re-run bwa with the -M option.
"The BWA-MEM algorithm performs local alignment. It may produce multiple primary alignments for different part of a query sequence. This is a crucial feature for long sequences. However, some tools such as Picard’s markDuplicates does not work with split alignments. One may consider to use option -M to flag shorter split hits as secondary."
So after re-aligning with that option, Mark duplicates seems to work. Took 15min.
Illa-Berenguer et al. (2015) proceed by indel re-alignment by BAQ.
however it seems that it is not necessary anymore. (here and here)
So the next step seems to be to re-calibrate the bases. This is a way to correct errors of quality scores per base. "Base quality score recalibration (BQSR) is a process in which we apply machine learning to model these errors empirically and adjust the quality scores accordingly. This allows us to get more accurate base qualities, which in turn improves the accuracy of our variant calls."
We could approach this in two ways, 1) to download the 115 genomes and look at variants called there or 2) to bootstrap the Gy genome and build a model from that.
I ran into another error.
It turns out that GATK needs bam and sam files to have read groups defined. It was written plain as day in the tutorial. I just missed it. Some help was found here. Read Groups are tags in the sam that help GATK know when sequences come from the same run etc. and thus might have similar error biases. More info in the SAM file format help.
GATK wants the read group format as such:
So we'll define ours as:
ID is a unique and helpful identifier
SM is the sample name
PL is the platform
LB is the library - all of ours were made at at the same time
PU is the lane - all of our samples were sequenced on one lane
So the final bash script is:
again: the -M is for picard compatibility, -R is our read groups.