Zebra-Llama (v0.2) is a specialized version of the Llama-3.1-8B-Instruct model, fine-tuned on data specific to Ehlers-Danlos Syndrome (EDS), a rare connective tissue disorder. We used textual information from over 4,000 EDS papers from PubMed, more than 8,000 Reddit posts about EDS, and over 5,000 posts from the Inspire forum to gather real-world concerns and questions related to EDS, which were then used to fine-tune the model. As a result, the model is adept at providing accurate responses to questions about EDS.
The model was trained using an approach we call "context-aware training," in which each training question was paired with context retrieved from a custom vector database. This approach enabled the model to achieve high precision and recall at inference time when used with RAG context. The model also generated correct citations more often than the base model.
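To illustrate how RAG context is used at inference time, here is a minimal sketch of retrieval-augmented generation with the fine-tuned model. The retrieval step (`retrieve_context`) is a placeholder for whatever vector database and retriever you use, and the prompt wording is illustrative; only the model ID `zebraLLAMA/zebra-Llama-v0.2` comes from this card.

```python
# Minimal RAG-style inference sketch (not the official pipeline):
# retrieve context for an EDS question, prepend it to the prompt,
# and generate an answer with the fine-tuned model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "zebraLLAMA/zebra-Llama-v0.2"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

def retrieve_context(question: str) -> str:
    """Placeholder: query your own vector database (e.g. the EDS
    knowledge base behind the RAG API) and return the top passages."""
    return "Relevant EDS passages retrieved from the knowledge base."

def answer(question: str, max_new_tokens: int = 512) -> str:
    context = retrieve_context(question)
    messages = [
        {"role": "system",
         "content": "You are an assistant answering questions about "
                    "Ehlers-Danlos Syndrome. Use the provided context "
                    "and cite your sources."},
        {"role": "user",
         "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ]
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    output = model.generate(inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output[0, inputs.shape[-1]:], skip_special_tokens=True)

print(answer("What are common symptoms of hypermobile EDS?"))
```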
Here is the Jupyter Notebook Demo for Zebra-Llama.
Here is the API for the RAG knowledge base that we built for rare diseases, currently focusing on EDS.
https://huggingface.co/zebraLLAMA/zebra-Llama-v0.2
Refer to the config file for the training parameters.
We have also provided the training script that was used to fine-tune the Llama-3.1-8B-Instruct model.
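For orientation, fine-tunes of this kind are commonly run with parameter-efficient methods; the sketch below shows one hypothetical setup using LoRA via the peft library. The dataset path, hyperparameters, and LoRA settings are illustrative assumptions only; refer to the actual config file and training script for the values we used.

```python
# Hypothetical LoRA fine-tuning sketch (illustrative only; see the
# provided training script and config for the actual setup).
import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

BASE_MODEL = "meta-llama/Llama-3.1-8B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL, torch_dtype=torch.bfloat16, device_map="auto"
)

# Attach LoRA adapters to the attention projections (illustrative values).
lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

# Each record is assumed to hold a pre-formatted prompt that already
# includes the retrieved context (a "context-aware" training example).
dataset = load_dataset("json", data_files="eds_training_data.jsonl")["train"]

def tokenize(example):
    return tokenizer(example["text"], truncation=True, max_length=2048)

tokenized = dataset.map(tokenize, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="zebra-llama-lora",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        num_train_epochs=2,
        learning_rate=2e-4,
        bf16=True,
        logging_steps=10,
    ),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
trainer.save_model("zebra-llama-lora")
```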
@misc{soman2024zebrallamacontextawarelargelanguage,
      title={Zebra-Llama: A Context-Aware Large Language Model for Democratizing Rare Disease Knowledge},
      author={Karthik Soman and Andrew Langdon and Catalina Villouta and Chinmay Agrawal and Lashaw Salta and Braian Peetoom and Gianmarco Bellucci and Orion J Buske},
      year={2024},
      eprint={2411.02657},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2411.02657},
}