It seemed that after two rounds of bootstrapping we reached convergence, i.e. the correct quality scores were not changed much after the second round.
Variant quality score distributions:
However, bootstrapping BQSR is meant to mimic the existence of a dbSNP database, in humans genomics for example, that reflect confident variants previously described. That got me to thinking that perhaps a more stringent variant filter would be better for the first round of BQSR.
So I ran GATK VariantFiltration with a QD < 20.0 on both SNPs and INDELs before BQSR. It filtered 91793 SNPs vs. 35938 with QD<2.0 and 14575 INDELs vs 2083.
The recalibration plots seem to not display much change after using this stringent filtering.
So after a discussion with the GATK devs here, it turns out that "it is because the extra variants are not doing much to "mask out" the errors" and we can stick with the base filtration settings.