In the expansive realm of digital healthcare data, individuals encounter significant difficulties in finding accurate, pertinent, and trustworthy information. Conventional search methods often produce results that are overly broad, outdated, or lacking in reliability for making well-informed health-related decisions. The intricate nature of medical terminology exacerbates this challenge, along with the ever-evolving landscape of medical knowledge, where new insights and guidelines continually emerge. "Healthcare Mining" aims to bridge these critical gaps by harnessing the power of artificial intelligence (AI) and machine learning (ML) technologies.
Our objective is to develop a system capable of not only comprehending users' nuanced queries but also ensuring the accuracy, credibility, and timeliness of the healthcare information it provides.
Our system architecture is deliberately simple: a frontend application built with Streamlit interfaces directly with several LLM APIs and models, including OpenAI, Claude, Gemini, Mistral, and Llama 2.
- Streamlit: Provides an interactive user interface where users can enter queries, upload files, and receive information in real time.
- Direct Calls to LLMs: Query embeddings are generated through direct calls to LLM APIs and models. This approach keeps request handling efficient and draws on the natural language processing capabilities of each LLM.
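The direct-call design described above can be pictured as a small dispatch table that maps a provider name to a callable. The following is only a minimal sketch under stated assumptions: `build_registry`, `answer_query`, and the stub backends are hypothetical names introduced for illustration, and the stubs stand in for the real vendor SDK or local model calls, which are not shown in this README.

```python
from typing import Callable, Dict

# Hypothetical backend signature: takes a prompt, returns the model's reply.
Backend = Callable[[str], str]


def _stub(name: str) -> Backend:
    """Return a placeholder backend that only echoes the prompt.

    In a real app this would wrap the provider's SDK or a local model.
    """
    return lambda prompt: f"[{name} stub] {prompt}"


def build_registry() -> Dict[str, Backend]:
    """Map each provider named in this README to a (stubbed) backend."""
    providers = ("OpenAI", "Claude", "Gemini", "Mistral", "Llama 2")
    return {name: _stub(name) for name in providers}


def answer_query(provider: str, prompt: str, registry: Dict[str, Backend]) -> str:
    """Validate the query, look up the chosen backend, and call it directly."""
    prompt = prompt.strip()
    if not prompt:
        raise ValueError("Query must not be empty")
    if provider not in registry:
        raise ValueError(f"Unknown provider: {provider!r}")
    return registry[provider](prompt)


if __name__ == "__main__":
    registry = build_registry()
    print(answer_query("Claude", "What are common asthma triggers?", registry))
```

In a Streamlit frontend, the provider could plausibly come from `st.selectbox` and the prompt from `st.text_input`, with the result passed to `st.write`. Keeping the dispatch in a plain function makes it easy to test without starting the UI.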
We tested our healthcare chatbot with several of the most popular LLMs currently available: OpenAI, Gemini, Llama, Mistral, and Claude. By evaluating the responses from each one, we identified the LLM that gives the most accurate results for our chatbot.
We integrated TruLens as our evaluation framework, coding specific feedback functions to measure our chatbot's responses. We ran this evaluation across different chains and different LLMs to make sure the chatbot performs as well as it can.
- Web Scraping: We used the Beautiful Soup library in Python, together with the requests library, to scrape data from WebMD and Mayo Clinic.
- Clean HTML Tags
- Strip Whitespace
- Convert Data Types
- Standardize Values
- TruLens Analysis: Assesses Groundedness, Question/Answer Relevance, and Question/Context Relevance. This analysis lets us gauge how well the system’s responses are grounded in the factual content of our database and how relevant they are to the user’s queries.
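The cleaning steps listed above (clean HTML tags, strip whitespace, convert data types, standardize values) can be sketched roughly as follows. This is an illustration, not the project's actual code: it uses only the Python standard library instead of Beautiful Soup so that it runs without extra dependencies, and the function names, the synonym table, and the record fields (`title`, `body`, `views`, `category`) are all hypothetical.

```python
import re
from html.parser import HTMLParser
from typing import Dict, List, Optional


class _TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self._chunks: List[str] = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self) -> str:
        # Joining with a space keeps adjacent block elements from fusing;
        # the trade-off is a stray space around some inline tags.
        return " ".join(self._chunks)


def clean_html(raw: str) -> str:
    """Steps 1 and 2: drop HTML tags, then collapse and strip whitespace."""
    parser = _TextExtractor()
    parser.feed(raw)
    parser.close()
    return re.sub(r"\s+", " ", parser.text()).strip()


def to_number(value: str) -> Optional[float]:
    """Step 3: convert a scraped numeric string to float, or None on failure."""
    try:
        return float(value.replace(",", "").strip())
    except ValueError:
        return None


# Hypothetical synonym table for step 4.
_SYNONYMS = {"n/a": "unknown", "na": "unknown", "none": "unknown", "": "unknown"}


def standardize(value: str) -> str:
    """Step 4: lowercase, trim, and map known placeholder values to one label."""
    key = value.strip().lower()
    return _SYNONYMS.get(key, key)


def clean_record(record: Dict[str, str]) -> Dict[str, object]:
    """Apply all four steps to one scraped record (field names are assumed)."""
    return {
        "title": clean_html(record.get("title", "")),
        "body": clean_html(record.get("body", "")),
        "views": to_number(record.get("views", "")),
        "category": standardize(record.get("category", "")),
    }
```

With Beautiful Soup, the tag-stripping step would more commonly be written as `BeautifulSoup(raw, "html.parser").get_text()`; the standard-library parser is used here only to keep the sketch dependency-free.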
- Python 3.8 or higher
- Streamlit
- Beautiful Soup (for web scraping)
- Requests Library (for web scraping)
- Clone the repository:
git clone https://github.com/Sujithrt/healthcare_mining.git
- Install dependencies:
pip install -r requirements.txt
- Obtain the API keys for OpenAI, Claude, and Gemini and save them in the streamlit/secrets.toml file
- Create an account in Datastax Astra and create a database. Obtain the database ID and token and save them in the streamlit/secrets.toml file
- Download the pre-trained models (Mistral and Llama 2) from the links below into the models folder:
- Run the application:
streamlit run OpenAI.py
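Because the app depends on the API keys and the Datastax Astra credentials saved in the secrets file during setup, it can help to check at startup that none of them is missing. The helper below is a hedged sketch: the key names in `REQUIRED_SECRETS` are assumptions (this README does not say how the entries are named), and `missing_secrets` is a hypothetical function, not part of the repository.

```python
from typing import Iterable, List, Mapping

# Assumed key names; adjust them to match the entries in your secrets file.
REQUIRED_SECRETS = (
    "OPENAI_API_KEY",
    "CLAUDE_API_KEY",
    "GEMINI_API_KEY",
    "ASTRA_DB_ID",
    "ASTRA_DB_TOKEN",
)


def missing_secrets(
    secrets: Mapping[str, object],
    required: Iterable[str] = REQUIRED_SECRETS,
) -> List[str]:
    """Return the required names that are absent or blank in `secrets`."""
    missing = []
    for name in required:
        value = secrets.get(name)
        if value is None or not str(value).strip():
            missing.append(name)
    return missing


if __name__ == "__main__":
    # A plain dict stands in for the loaded secrets here.
    example = {"OPENAI_API_KEY": "sk-example", "ASTRA_DB_TOKEN": ""}
    print(missing_secrets(example))
```

Inside the Streamlit app, the same function could be called with `st.secrets`, which behaves like a read-only mapping; if the returned list is non-empty, the app might show it with `st.error` and halt with `st.stop()`.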