Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Approximation shortcut #107

Merged
merged 8 commits into from
Jan 6, 2021
Merged

Approximation shortcut #107

merged 8 commits into from
Jan 6, 2021

Conversation

bkille
Copy link
Contributor

@bkille bkille commented Nov 30, 2020

Hello!

This is a PR for an additional heuristic in the LoFreq algorithm. Many times, the Poisson Binomial calculation either yields numbers that are extremely high p-values extremely low p-values. Therefore, we can use an approximation to cut out performing an exact calculation when the p-value is sufficiently above the significance level. By specifying some threshold above the significance level, we can be extremely confident that the approximation shortcut will never skip exact calculation on columns which have significant p-values.

We use a Poisson approximation to the binomial distribution where the mean is the sum of the probabilities and the variable remains the same (most frequent variant count). There are indeed other approximations that we plan on trying, such as the refined normal approximation (see section 3 of this pub for a brief explanation of the approximations). To calculate the Poisson CDF, we use the GNU scientific library. I have not added a --with-gsl flag to the configure (I do not understand .m4 wizardry yet 🙂), so depending on where it is installed for users, they may need to add the location of the libopenblas.so.0 shared library to their LD_LIBRARY_PATH environment variable in order to run.

for (int i = 0; i < num_err_probs; ++i) {
mu += err_probs[i];
}
const long double poibin_approximation = 1 - gsl_cdf_poisson_P(max_noncons_count - 1, mu);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a gsl_cdf_poisson_Q for calculating the survival function of the Poisson that we may use instead.

mu += err_probs[i];
}
const long double poibin_approximation = 1 - gsl_cdf_poisson_P(max_noncons_count - 1, mu);
if (poibin_approximation > sig_level + 0.01) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The cutoff of 0.01 is rather loose and should probably also be a function of n, as the error decreases with increasing sample count.

@bkille
Copy link
Contributor Author

bkille commented Dec 14, 2020

Hello again! Just checking in. I have ran LoFreq on a set of different size bam files and the results are below. Diffing between the files shows identical variant calls. I have also increased the cutoff for the p-value (was sigma + 0.01 now sigma + 0.05 ) since it has no noticeable effect on the runtime. The optimization has little effect in the smaller files, as n and k are fairly small, our optimization doesn't even run if the n<100. Please let me know if you have any questions!

File Size Original Runtime (seconds) Improved Runtime (seconds) Speedup
41K 0.03 0.03 1.0
415K 0.24 0.24 1.0
644K 1.69 1.69 1.0
1M 6.33 6.3 1.0
1M 2.61 2.59 1.01
2M 11.62 11.19 1.04
4M 21.82 22.22 0.98
6M 7.0 6.95 1.01
10M 34.22 30.08 1.14
16M 88.08 71.72 1.23
25M 186.29 137.73 1.35
39M 303.53 213.41 1.42
58M 52.69 51.25 1.03
97M 977.3 567.82 1.72
151M 1992.68 1033.87 1.93
237M 3507.36 1583.38 2.22
368M 7425.59 2914.67 2.55
587M 17431.13 5843.03 2.98
935M 52686.3 16059.07 3.28
1G 118481.47 52941.52 2.24
2G 200245.87 43222.88 4.63
3G 645137.02 279217.28 2.31
25G 1495992.48 400483.63 3.74

@andreas-wilm
Copy link
Contributor

Hi Bryce, this is super cool! I have a couple of questions: the file size is only a proxy for coverage (n). Would you be able to generate the same table with coverage added? To that end you can just compute the average on the output of samtools depth -a your.bam . I'll get to testing this on some real date now

@andreas-wilm
Copy link
Contributor

This works perfectly on all high coverage datasets I have access to (all viral samples)! I'll add this as default into to the build process

@andreas-wilm
Copy link
Contributor

Sorry, one question. The returned poibin_approximation is a p-value, right? So I guess we should also use Bonferroni correction as further down below in the code:
if (pvalue * (double)bonf_factor > sig_level) {goto free_and_exit;}

and thus change

if (poibin_approximation > sig_level + 0.05) {goto free_and_exit;}

accordingly. Correct? Just double checking...

@andreas-wilm andreas-wilm merged commit 3ce3d3a into CSB5:master Jan 6, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants