This repository provides a Q&A application that allows users to upload a PDF, parse its content, and query it in natural language. By combining the Google Gemini API for embeddings and content generation with ChromaDB for efficient text storage and retrieval, the system offers a seamless way to interact with static documents.
## Table of Contents

- Features
- Libraries Used
- Architecture
- Installation
- Usage
- Challenges and Pitfalls
- Safeguards for a Commercial Product
- Future Improvements
## Features

- PDF Parsing: Extracts text from PDFs, processing page-by-page for modularity.
- Semantic Search: Leverages embeddings to identify relevant document passages based on user queries.
- Dynamic Q&A: Generates answers using Google Gemini API based on user queries and relevant document content.
## Libraries Used

- Google Gemini API (`google-generativeai`)
  - Used for generating text embeddings and natural-language responses.
  - Enables semantic understanding of the document and queries.
- ChromaDB (`chromadb`)
  - A vector database for embedding storage and retrieval.
  - Offers scalability and fast similarity searches for document querying.
- PyPDF2 (`PyPDF2`)
  - A robust library for extracting text from PDFs.
  - Splits documents into manageable chunks (pages).
- python-dotenv (`python-dotenv`)
  - Manages environment variables securely, ensuring API keys are not hardcoded.
## Architecture

The system consists of three main components:
1. PDF Parsing
   - Extracts text from the PDF and organizes it page-by-page using `PyPDF2`.
   - Each page is stored as a document in the vector database (`ChromaDB`).
2. Semantic Embedding and Storage
   - Text embeddings are generated using the Google Gemini API.
   - These embeddings are stored in ChromaDB for similarity-based retrieval.
3. Q&A Workflow
   - The user's query is embedded using the same model and matched against stored embeddings in ChromaDB.
   - The most relevant passage is used as context for generating an answer via the Google Gemini API.
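The three components above can be sketched end-to-end as follows. This is a minimal illustration, not the repository's actual code: the function names, embedding model name, and prompt wording are assumptions, and the third-party imports are done lazily inside each function so the pure helpers can be used on their own.

```python
def extract_pages(pdf_path):
    """Component 1: parse the PDF page-by-page with PyPDF2."""
    from PyPDF2 import PdfReader  # lazy import: keeps pure helpers standalone
    reader = PdfReader(pdf_path)
    return [page.extract_text() or "" for page in reader.pages]

def embed(text, task_type="retrieval_document"):
    """Component 2: embed a passage (or query) via the Gemini API.
    The model name is illustrative."""
    import google.generativeai as genai
    result = genai.embed_content(model="models/text-embedding-004",
                                 content=text, task_type=task_type)
    return result["embedding"]

def page_ids(n):
    """Stable per-page document IDs for the ChromaDB collection."""
    return [f"page-{i}" for i in range(n)]

def index_pdf(collection, pdf_path):
    """Store one document per page, with its embedding, in ChromaDB."""
    pages = extract_pages(pdf_path)
    collection.add(ids=page_ids(len(pages)), documents=pages,
                   embeddings=[embed(p) for p in pages])

def build_prompt(passage, query):
    """Component 3: ground the Gemini generation step in the retrieved passage."""
    return ("Answer the question using only the passage below.\n\n"
            f"Passage: {passage}\n\nQuestion: {query}")
```

For queries, `embed(query, task_type="retrieval_query")` is matched against the stored page embeddings, and the best hit is passed through `build_prompt` to the generation model.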
## Installation

Prerequisites:

- Python 3.8 or higher.
- Google Cloud account with Gemini API access.

1. Install the required libraries:

   ```bash
   pip install google-generativeai chromadb PyPDF2 python-dotenv
   ```

2. Clone the repository:

   ```bash
   git clone https://github.com/your-username/pdf-qna.git
   cd pdf-qna
   ```

3. Create a `.env` file in the root directory and add your Google API key:

   ```
   GOOGLE_API_KEY=your-google-api-key
   ```

4. Run the application:

   ```bash
   python main.py
   ```
## Usage

1. If the database is empty, the system prompts you to upload a PDF:

   ```
   Enter the path to your PDF: example.pdf
   ```

   The text is parsed and stored as embeddings in ChromaDB.

2. Ask questions in natural language, such as:

   ```
   Your question: What is discussed in the introduction?
   ```

3. Receive an AI-generated answer and the relevant passage:

   ```
   Answer: The introduction outlines the importance of...
   Passage: In the introduction, the author discusses...
   ```

4. Type `exit` to quit the program.
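The interaction loop described above can be sketched as a small function. `answer_fn` stands in for the retrieval-plus-generation step and is an assumption, not the repository's actual API:

```python
def repl(answer_fn, input_fn=input, print_fn=print):
    """Prompt loop: ask for questions until the user types `exit`.
    answer_fn(query) is expected to return (answer, passage)."""
    while True:
        query = input_fn("Your question: ")
        if query.strip().lower() == "exit":
            break
        answer, passage = answer_fn(query)
        print_fn(f"Answer: {answer}")
        print_fn(f"Passage: {passage}")
```

Injecting `input_fn` and `print_fn` keeps the loop testable without a terminal.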
## Challenges and Pitfalls

- Handling Large PDFs
  - Large PDFs can overwhelm memory or processing capabilities.
  - Solution: Process and embed text page-by-page for modularity.
- Text Retrieval Accuracy
  - Embedding models may misinterpret queries or retrieve irrelevant passages.
  - Solution: Use high-quality embeddings and fine-tune retrieval parameters.
- API Dependency: Relies heavily on the Google Gemini API, making the system vulnerable to changes in service or pricing.
- Limited Context Window: Only one passage is retrieved at a time, which might miss broader context.
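Page-by-page processing can be pushed one step further when individual pages are still too long for the embedding model's input limit. A common mitigation (not taken from this repository; the sizes are illustrative) is to split each page into overlapping fixed-size chunks:

```python
def chunk_text(text, max_chars=2000, overlap=200):
    """Split text into chunks of at most max_chars characters.
    Consecutive chunks overlap so a sentence cut at one boundary
    still appears whole in the neighboring chunk."""
    if overlap >= max_chars:
        raise ValueError("overlap must be smaller than max_chars")
    step = max_chars - overlap
    return [text[i:i + max_chars] for i in range(0, len(text), step)] or [""]
```

Each chunk would then be embedded and stored as its own ChromaDB document, keeping any single embedding request small.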
## Safeguards for a Commercial Product

If this system were to be developed into a commercial product, the following safeguards would be critical:
- Encrypt PDF uploads and text data stored in databases.
- Ensure compliance with data privacy laws like GDPR and HIPAA for sensitive documents.
- Implement filters to prevent misuse of the system (e.g., inappropriate or malicious queries).
- Restrict API calls to prevent excessive usage or abuse, reducing costs and ensuring system availability.
- Use backup databases and failover mechanisms to ensure reliability in case of server or service downtime.
- Require users to authenticate before using the system to maintain security and track usage.
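Restricting API calls is commonly done with a token-bucket rate limiter. A stdlib-only sketch (the rates are illustrative, and per-user bookkeeping is left out):

```python
import time

class TokenBucket:
    """Allow bursts up to `capacity` requests, refilling at `rate`
    tokens per second; each allowed request consumes one token."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)
        self.clock = clock
        self.last = clock()

    def allow(self):
        """Return True if a request may proceed right now."""
        now = self.clock()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

Each Gemini call would then be gated behind `bucket.allow()`, returning an error (or a retry-after hint) when the budget is exhausted.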
## Future Improvements

- FAISS Integration: Use FAISS for faster and more scalable vector searches, improving performance on large datasets.
- LangChain Framework: Streamline interaction logic with LangChain's prompt engineering and chained workflows.
- Multi-Passage Retrieval: Retrieve multiple relevant passages for more comprehensive answers.
- Web Interface: Build a web app with frameworks like Flask or React for a better user experience.
- Mobile Support: Extend the system to support mobile platforms for on-the-go document querying.
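Of these, multi-passage retrieval is a small change, since ChromaDB's `query` already accepts an `n_results` parameter. A hedged sketch (the helper name is an assumption):

```python
def retrieve_passages(collection, query_embedding, k=3):
    """Return the top-k most similar stored passages instead of one,
    so the generation step sees broader context."""
    hits = collection.query(query_embeddings=[query_embedding], n_results=k)
    return hits["documents"][0]  # documents for the first (only) query
```

The retrieved passages would then be concatenated into the prompt in place of the single passage used today.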
This project demonstrates the potential of combining cutting-edge AI tools with efficient storage solutions to unlock static document content. Engineers and developers are encouraged to fork the repository, experiment, and contribute to the system’s growth!