Approximation shortcut #107

bkille · 2020-11-30T00:01:54Z

Hello!

This is a PR for an additional heuristic in the LoFreq algorithm. Many times, the Poisson Binomial calculation either yields numbers that are extremely high p-values extremely low p-values. Therefore, we can use an approximation to cut out performing an exact calculation when the p-value is sufficiently above the significance level. By specifying some threshold above the significance level, we can be extremely confident that the approximation shortcut will never skip exact calculation on columns which have significant p-values.

We use a Poisson approximation to the binomial distribution where the mean is the sum of the probabilities and the variable remains the same (most frequent variant count). There are indeed other approximations that we plan on trying, such as the refined normal approximation (see section 3 of this pub for a brief explanation of the approximations). To calculate the Poisson CDF, we use the GNU scientific library. I have not added a --with-gsl flag to the configure (I do not understand .m4 wizardry yet 🙂), so depending on where it is installed for users, they may need to add the location of the libopenblas.so.0 shared library to their LD_LIBRARY_PATH environment variable in order to run.

…the LD_LIBRARY_PATH

bkille · 2020-11-30T00:03:11Z

src/lofreq/snpcaller.c

+    for (int i = 0; i < num_err_probs; ++i) {
+        mu += err_probs[i];
+    }
+    const long double poibin_approximation = 1 - gsl_cdf_poisson_P(max_noncons_count - 1, mu);


There is a gsl_cdf_poisson_Q for calculating the survival function of the Poisson that we may use instead.

bkille · 2020-11-30T00:04:00Z

src/lofreq/snpcaller.c

+        mu += err_probs[i];
+    }
+    const long double poibin_approximation = 1 - gsl_cdf_poisson_P(max_noncons_count - 1, mu);
+    if (poibin_approximation > sig_level + 0.01) {


The cutoff of 0.01 is rather loose and should probably also be a function of n, as the error decreases with increasing sample count.

bkille · 2020-12-14T00:24:36Z

Hello again! Just checking in. I have ran LoFreq on a set of different size bam files and the results are below. Diffing between the files shows identical variant calls. I have also increased the cutoff for the p-value (was sigma + 0.01 now sigma + 0.05 ) since it has no noticeable effect on the runtime. The optimization has little effect in the smaller files, as n and k are fairly small, our optimization doesn't even run if the n<100. Please let me know if you have any questions!

File Size	Original Runtime (seconds)	Improved Runtime (seconds)	Speedup
41K	0.03	0.03	1.0
415K	0.24	0.24	1.0
644K	1.69	1.69	1.0
1M	6.33	6.3	1.0
1M	2.61	2.59	1.01
2M	11.62	11.19	1.04
4M	21.82	22.22	0.98
6M	7.0	6.95	1.01
10M	34.22	30.08	1.14
16M	88.08	71.72	1.23
25M	186.29	137.73	1.35
39M	303.53	213.41	1.42
58M	52.69	51.25	1.03
97M	977.3	567.82	1.72
151M	1992.68	1033.87	1.93
237M	3507.36	1583.38	2.22
368M	7425.59	2914.67	2.55
587M	17431.13	5843.03	2.98
935M	52686.3	16059.07	3.28
1G	118481.47	52941.52	2.24
2G	200245.87	43222.88	4.63
3G	645137.02	279217.28	2.31
25G	1495992.48	400483.63	3.74

andreas-wilm · 2021-01-06T08:34:41Z

Hi Bryce, this is super cool! I have a couple of questions: the file size is only a proxy for coverage (n). Would you be able to generate the same table with coverage added? To that end you can just compute the average on the output of samtools depth -a your.bam . I'll get to testing this on some real date now

andreas-wilm · 2021-01-06T09:42:38Z

This works perfectly on all high coverage datasets I have access to (all viral samples)! I'll add this as default into to the build process

andreas-wilm · 2021-01-06T09:45:02Z

Sorry, one question. The returned poibin_approximation is a p-value, right? So I guess we should also use Bonferroni correction as further down below in the code:
if (pvalue * (double)bonf_factor > sig_level) {goto free_and_exit;}

and thus change

if (poibin_approximation > sig_level + 0.05) {goto free_and_exit;}

accordingly. Correct? Just double checking...

bkille added 4 commits October 16, 2020 03:14

add flags for gsl

40fc133

approx benchmark

f6d1bee

remove rpath. User's will need to ensure that libopenblas.so.0 is in …

ef128f4

…the LD_LIBRARY_PATH

less changes

2a9f83e

bkille commented Nov 30, 2020

View reviewed changes

bkille added 4 commits December 11, 2020 11:42

add -O3

b2d2762

add minimum n cutoff

ba49787

change cutoffs

980f537

Use static gsl library

d41f10d

andreas-wilm merged commit 3ce3d3a into CSB5:master Jan 6, 2021

schorlton-bugseq mentioned this pull request Mar 1, 2025

Tag a release? #152

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Approximation shortcut #107

Approximation shortcut #107

bkille commented Nov 30, 2020

bkille Nov 30, 2020

bkille Nov 30, 2020

bkille commented Dec 14, 2020 •

edited

Loading

andreas-wilm commented Jan 6, 2021

andreas-wilm commented Jan 6, 2021

andreas-wilm commented Jan 6, 2021

Approximation shortcut #107

Approximation shortcut #107

Conversation

bkille commented Nov 30, 2020

bkille Nov 30, 2020

Choose a reason for hiding this comment

bkille Nov 30, 2020

Choose a reason for hiding this comment

bkille commented Dec 14, 2020 • edited Loading

andreas-wilm commented Jan 6, 2021

andreas-wilm commented Jan 6, 2021

andreas-wilm commented Jan 6, 2021

bkille commented Dec 14, 2020 •

edited

Loading