"Variant calling algorithms rely heavily on the quality scores assigned to the individual base calls in each sequence read. These scores are per-base estimates of error emitted by the sequencing machines. Unfortunately the scores produced by the machines are subject to various sources of systematic technical error, leading to over- or under-estimated base quality scores in the data."
So basically we need to correct these errors.
If we have lots of lines sequenced we can use these to test if the variants are true or not. If we don't (currently the case) we can bootstrap the analysis and test vs are own variant calls.
"Here's how it works:
First do an initial round of SNP calling on your original, unrecalibrated data.
Then take the SNPs that you have the highest confidence in and use that set as the database of known SNPs by feeding it as a VCF file to the base quality score recalibrator.
Finally, do a real round of SNP calling with the recalibrated data. These steps could be repeated several times until convergence."
So let's start with calling variants:
Calling variants took about 8hrs.
The next step is to select the highest confidence SNPs to pass down to score recalibrator. Info here.
Extract the SNPs from the vcf, ...and apply a filter to these SNPs, I'm just using some recommended (?) values in the above tutorial. Info about the filter variables. There are a lot of WARN statements but according to this its ok, as some positions don't have the relevant filtering score.Do the same for indels:
Ok. So now that we have a set of filtered SNPs and Indels we can use it in BQSR.
So now I use those calibrated bases to call the haplotype:
Here is a final script that does one iteration of BQSR bootstrapping. It filters SNPs and INDELs then runs two rounds of BQSR, generates plots and calls the haplotype with the calibration data. It then filters SNPs and INDELs from this new haplotype.
After one round of BQSR was performed a second round is done and the plots are compared. If there is not much change between the AFTER plots then we have reached convergence and no more bootstrapping is needed.
Second round of BQSR
Finally, we filter out the variants from the last Haplotype caller run and merge them.