-
Notifications
You must be signed in to change notification settings - Fork 32
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Approximation shortcut #107
Conversation
src/lofreq/snpcaller.c
Outdated
for (int i = 0; i < num_err_probs; ++i) { | ||
mu += err_probs[i]; | ||
} | ||
const long double poibin_approximation = 1 - gsl_cdf_poisson_P(max_noncons_count - 1, mu); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a gsl_cdf_poisson_Q
for calculating the survival function of the Poisson that we may use instead.
src/lofreq/snpcaller.c
Outdated
mu += err_probs[i]; | ||
} | ||
const long double poibin_approximation = 1 - gsl_cdf_poisson_P(max_noncons_count - 1, mu); | ||
if (poibin_approximation > sig_level + 0.01) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The cutoff of 0.01
is rather loose and should probably also be a function of n
, as the error decreases with increasing sample count.
Hello again! Just checking in. I have ran LoFreq on a set of different size bam files and the results are below. Diffing between the files shows identical variant calls. I have also increased the cutoff for the p-value (was
|
Hi Bryce, this is super cool! I have a couple of questions: the file size is only a proxy for coverage ( |
This works perfectly on all high coverage datasets I have access to (all viral samples)! I'll add this as default into to the build process |
Sorry, one question. The returned and thus change
accordingly. Correct? Just double checking... |
Hello!
This is a PR for an additional heuristic in the LoFreq algorithm. Many times, the Poisson Binomial calculation either yields numbers that are extremely high p-values extremely low p-values. Therefore, we can use an approximation to cut out performing an exact calculation when the p-value is sufficiently above the significance level. By specifying some threshold above the significance level, we can be extremely confident that the approximation shortcut will never skip exact calculation on columns which have significant p-values.
We use a Poisson approximation to the binomial distribution where the mean is the sum of the probabilities and the variable remains the same (most frequent variant count). There are indeed other approximations that we plan on trying, such as the refined normal approximation (see section 3 of this pub for a brief explanation of the approximations). To calculate the Poisson CDF, we use the GNU scientific library. I have not added a
--with-gsl
flag to the configure (I do not understand.m4
wizardry yet 🙂), so depending on where it is installed for users, they may need to add the location of thelibopenblas.so.0
shared library to theirLD_LIBRARY_PATH
environment variable in order to run.