I am attempting to find a way to better describe the importance of sequencing depth when performing DNA methylation studies.
Quite often, the measure of interest when performing bisulfite sequencing is the ‘percentage methylated’ value (somewhere 0-100%) for a specific CG, CHG, or CHH site in the genome. This is often calculated as the number of reads showing the site methylated / total depth at the site.
The three contexts of DNA methylation often appear to show very different genome-wide averages depending on the species. For example, in grasses CG is always the highest % with somewhere between 60-90%. CHG comes next with 30-60%. Finally, CHH is often found at a much lower level than the others (1-4%).
When performing sequencing, you will often see methylation values that are spurious if sequencing depth is low. For example, if there is a single read covering a CHH site that happens to be methylated, you will get a proportion of 1/1 equaling 100% methylation at that site. Given the general expectation of this context, it is highly unlikely that this is a worthwhile estimate of methylation at this site. As evidence low coverage will often create a number of very specific methylation percentages in your dataset:
100% (1/1), 66.6% (2/3), 50% (1/2), 33/3% (1/3), 25% (1/4), etc etc.
For example, here is a simulation of methylation calculations across 20 independent sequencing libraries. The true methylation percentage has been set at 2%. The X axis indicates the number of reads (depth) covering this individual site and the y axis provides an up-to-date methylation percentage calculation at that given depth:
As you can see, there is a major change in percent methylation if even a single read indicates the site is methylated. In fact, there are multiple cases in which even after reaching a depth of 100 reads over the site, there has yet to be a methylated cytosine call, leaving the percentage at 0. In contrast, if you happen to be unlucky enough to pull a methylated read out of your sequencing lane with low coverage, it can take a substantial number of unmethylated calls to start bringing your percent methylation back down towards the true value.
We can also look to see how close we are getting to our true value (the grey line above on each plot) by bouncing some dots around:
Overall, we can see that the majority of our sequencing attempts do eventually lead to similar results. That being a calculation near 2%. However, at individual sites, the coverage often will not be anywhere near 100 reads.
For instance, here is the standing with 10-read coverage in this simulation:
You could have 20 biological replicates, or technical replicates for that matter, and at this coverage level none of the samples are really all that close to reality!
Now, let’s compare this to some simulations of higher true values. Here I have set the real value to be 80%:
We still have some of the jittering around our real value, however we are certainly clear to note that this site would be highly methylated.
Of course, if we looked at the actual differences in percent methylation for both of these examples, the numbers are quite similar, but we need to keep in mind that our possible expectations of methylation are different between these contexts. CG methylation appears to show a highly binary state in real data. It is either ‘on’ or ‘off’ with the overall distribution displaying sites at either ~0% or ~100%. In cases like this, a difference of 1-3% is likely not changing any site from being ‘on’ or ‘off’.
Compare this with expectations of CHH. We are now often expecting that real, biologically important, differences may be occurring in 1-10% range. Often, CHH DMRs are called across samples with changes in methylation going from single-digits to single-digits. All of a sudden, our jitter of 1-3% could be a major source of variation.
If in reality there are biological differences between a couple of methylation percentage points at some CHH sites, we really need to up our sequencing depth expectations. For instance, you don’t really have a chance to even measure your 2% CHH precisely with less than 50 reads covering that site. Our ~10x genome coverage calculations are not going to cut it in this case.
There may be some ways of getting around this from trying to gain power from other nearby sites or by filtering / weighting any calculations directly by sequencing depth. These may get rid of some of our poor values, however it also greatly reduces the number of sites we can assay.
I could be way off base, but right now my mind is saying:
Low coverage can tell you what CG is doing, but can’t tell you anything regarding CHH. If it turns out nuanced changes to CG are important, the same depth requirement holds true.