This system provides analytics over General Insurance Company (GIC) premium data (FY24 & FY25, Apr-Oct). It uses Retrieval-Augmented Generation (RAG) to deliver grounded, hallucination-free insights.
- Zero Hallucinations: All answers grounded in actual data with citations
- Advanced Analytics: Growth metrics, volatility analysis, risk classifications
- Persistent Storage: ChromaDB vector database with disk persistence
pip install -r requirements.txt
Edit the .env file and add your Groq API key:
GROQ_API_KEY=your_actual_key_here
Get your free API key at: https://console.groq.com/keys
cd Rag
python document_generator.py
This creates 49 semantic documents from your insurance data:
- 34 company summaries
- 9 segment overviews
- 4 risk classifications
- 1 industry overview
- 1 growth insights document
streamlit run app.py
Run this from the project root, where app.py lives. The app will open at http://localhost:8501
Try asking:
- "Which insurers have risky growth?"
- "What is the total industry premium in FY25?"
- "Compare health and motor segments"
- "Which companies are exposed to crop insurance risk?"
- "Show me companies with high YoY growth"
- "What are the key trends in FY25?"
rag star/
├── app.py # Streamlit web interface
├── requirements.txt # Python dependencies
├── .env # API keys (gitignored)
│
├── Rag/ # Core RAG system
│ ├── __init__.py # Package initialization
│ ├── analytics.py # Data analysis logic
│ ├── document_generator.py # Semantic document creation
│ └── rag.py # RAG engine (Groq + ChromaDB)
│
├── data/
│ ├── processed/
│ │ ├── gic_ytd_master_apr_oct.csv # Main dataset
│ │ ├── rag_knowledge_base.csv # Generated documents
│ │ └── rag_metadata.json # Document metadata
│ ├── cleaned/ # Monthly Excel files
│ └── raw/ # Original data
│
├── analysis/
│ └── gic_ytd_premium_analysis_report.ipynb
│
└── chroma_db/ # Vector database (auto-created)
- Computes growth metrics at industry/segment/company levels
- Calculates volatility and risk classifications
- Portfolio concentration analysis
- Company ranking by growth/stability
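As a rough illustration of the growth and volatility metrics listed above, here is a minimal sketch. The function names and the volatility definition (population standard deviation of month-over-month percentage changes) are assumptions made for this example, not necessarily what analytics.py does, and the numbers are made up.

```python
from statistics import pstdev
from typing import Sequence


def yoy_growth(current: float, previous: float) -> float:
    """Year-over-year growth in percent."""
    if previous == 0:
        raise ValueError("previous-year premium must be non-zero")
    return (current - previous) / previous * 100.0


def monthly_volatility(monthly_premiums: Sequence[float]) -> float:
    """Std dev of month-over-month % changes (an assumed definition)."""
    changes = [
        (later - earlier) / earlier * 100.0
        for earlier, later in zip(monthly_premiums, monthly_premiums[1:])
    ]
    return pstdev(changes)


# Illustrative numbers only, not taken from the dataset.
growth = yoy_growth(current=1100.0, previous=1000.0)
vol = monthly_volatility([100.0, 110.0, 99.0, 108.9])
print(f"YoY growth: {growth:.1f}%, volatility: {vol:.2f}")
```

A risk classification could then be derived by comparing these two values against thresholds chosen for the dataset.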
- Transforms analytics into semantic text chunks
- Creates 5 document types optimized for retrieval
- Saves to CSV with metadata
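The step from analytics to retrievable text can be sketched as follows. This is a hypothetical example: the function name, column names (doc_id, text, metadata), and sentence template are assumptions for illustration, and the actual layout of rag_knowledge_base.csv may differ.

```python
import csv
import io
import json


def company_summary_doc(doc_id, company, ytd_premium, top_segment, yoy_pct):
    """Render one company's metrics as a text chunk plus JSON metadata."""
    text = (
        f"{company} wrote a YTD premium of {ytd_premium:.1f} in FY25 (Apr-Oct), "
        f"a YoY change of {yoy_pct:+.1f}%. Its largest segment is {top_segment}."
    )
    metadata = {"doc_type": "company_summary", "company": company}
    return {"doc_id": doc_id, "text": text, "metadata": json.dumps(metadata)}


# Placeholder values, not real figures from the dataset.
doc = company_summary_doc("company_001", "Example Insurer", 1234.5, "Motor (Total)", 8.2)

# Write to an in-memory buffer here; a real run would open a file instead.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["doc_id", "text", "metadata"])
writer.writeheader()
writer.writerow(doc)
csv_text = buffer.getvalue()
print(csv_text)
```

Writing metrics as full sentences rather than raw table rows gives the embedding model natural-language text to match against user questions.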
- Embeddings: sentence-transformers/all-MiniLM-L6-v2
- Vector DB: ChromaDB with persistent storage
- LLM: Groq Llama 3.3 70B (ultra-fast inference)
- Retrieval: Top-k semantic search
- Generation: Grounded answers with citations
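To make the top-k semantic search idea concrete, here is a dependency-free sketch that ranks documents by cosine similarity. The toy 3-dimensional vectors and document IDs are invented for illustration. In the real system the vectors would come from the embedding model and the search would be handled by ChromaDB, so treat this as a conceptual stand-in rather than the project's code.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, doc_vecs, k=4):
    """Return the k (doc_id, score) pairs most similar to the query."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy vectors standing in for real sentence embeddings.
toy_index = {
    "segment_health": [0.9, 0.1, 0.0],
    "segment_motor": [0.1, 0.9, 0.0],
    "risk_crop": [0.0, 0.2, 0.9],
}
results = top_k([1.0, 0.0, 0.0], toy_index, k=2)
print(results)
```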
- Streamlit chat interface
- Real-time query processing
- Knowledge base management
- System status monitoring
- Python 3.8+
- 8GB RAM minimum
- Internet connection (for Groq API)
- macOS, Linux, or Windows
- Ask a Question: Type your query in the chat input
- RAG Retrieval: System finds top-4 relevant documents
- Groq Generation: Llama 3.3 generates grounded answer
- Citations: All facts include document IDs
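The flow above depends on passing the retrieved documents to the model together with an instruction to cite them. A minimal sketch of such a prompt builder follows. The function name, prompt wording, and document IDs are assumptions for illustration, and the actual prompt in rag.py may look different.

```python
def build_grounded_prompt(question, retrieved):
    """Assemble a prompt that restricts the model to the retrieved context.

    `retrieved` is a list of (doc_id, text) pairs from the retrieval step.
    """
    context = "\n".join(f"[{doc_id}] {text}" for doc_id, text in retrieved)
    return (
        "Answer using only the context below. "
        "Cite the document ID in square brackets after each fact. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


# Placeholder documents; real text would come from the knowledge base.
retrieved = [
    ("industry_overview", "Placeholder text for the industry overview document."),
    ("segment_health", "Placeholder text for the health segment document."),
]
prompt = build_grounded_prompt("What is the total industry premium in FY25?", retrieved)
print(prompt)
```

Prefixing each chunk with its ID is what lets the model attach a citation to each fact in the answer.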
- Click "Rebuild Knowledge Base" in sidebar to regenerate documents
- Useful after updating the source CSV data
- Automatically re-ingests into vector database
If Groq API key is missing:
- System falls back to template-based responses
- Still functional but less sophisticated
- Add key to .env and restart app
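The fallback behaviour can be pictured with the sketch below. The function names, the "llm" and "template" mode labels, and the template wording are hypothetical. The only details taken from this README are the GROQ_API_KEY variable name and the your_actual_key_here placeholder.

```python
import os


def choose_mode(env=None):
    """Return "llm" when a usable Groq key is configured, else "template"."""
    env = os.environ if env is None else env
    key = env.get("GROQ_API_KEY", "").strip()
    # Treat the README placeholder as "no key configured".
    if not key or key == "your_actual_key_here":
        return "template"
    return "llm"


def template_answer(question, retrieved):
    """Fallback: list the retrieved documents without calling an LLM."""
    lines = [f"Relevant documents for: {question}"]
    lines += [f"- [{doc_id}] {text}" for doc_id, text in retrieved]
    return "\n".join(lines)


mode = choose_mode({"GROQ_API_KEY": "your_actual_key_here"})
print(mode)
```

Checking the key once at startup keeps the rest of the app free of repeated environment lookups.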
cd Rag
python document_generator.py
Expected output: Saved 49 documents to data/processed
cd Rag
python rag.py
Runs 5 test queries and displays answers.
cd Rag
python analytics.py
To regenerate the knowledge base documents, run: cd Rag && python document_generator.py
Edit .env file and add your API key. Restart the app.
Some packages may show version conflicts. These are non-critical and can be ignored unless functionality is broken.
Delete chroma_db/ folder and restart app to regenerate the vector database.
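If you prefer to script the reset, a small helper like the following would do it. The function name is hypothetical, and it assumes the vector store lives in a chroma_db directory relative to where the script runs.

```python
import shutil
from pathlib import Path


def reset_vector_store(path="chroma_db"):
    """Delete the persisted vector store directory; return True if removed."""
    store = Path(path)
    if store.is_dir():
        shutil.rmtree(store)
        return True
    return False
```

Calling reset_vector_store() from the project root and then restarting the app should trigger regeneration, as described above.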
Time Period: FY24 & FY25 (April-October)
Segments:
- Health Insurance
- Motor (Total)
- Miscellaneous (incl. Crop)
- Fire & Property
- Personal Accident
- Engineering
- Liability
- Marine
- Aviation
Companies: 34 insurance companies tracked
Metrics:
- Premium volumes (YTD)
- YoY growth rates
- Monthly premium trends
- Portfolio concentration
- Risk classifications
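Portfolio concentration can be measured in several ways. The sketch below shows two common ones: the share of the largest segment and the Herfindahl-Hirschman index (sum of squared shares). Both are assumptions chosen for illustration, since this README does not say which measure the project uses, and the premium figures are invented.

```python
def segment_shares(premiums):
    """Convert per-segment premiums into fractional shares of the total."""
    total = sum(premiums.values())
    if total <= 0:
        raise ValueError("total premium must be positive")
    return {segment: value / total for segment, value in premiums.items()}


def hhi(premiums):
    """Herfindahl-Hirschman index: 1.0 means a single-segment book."""
    return sum(share ** 2 for share in segment_shares(premiums).values())


# Illustrative book of business, not real data.
book = {"Health Insurance": 50.0, "Motor (Total)": 30.0, "Fire & Property": 20.0}
shares = segment_shares(book)
top_segment = max(shares, key=shares.get)
concentration = hhi(book)
print(top_segment, round(shares[top_segment], 2), round(concentration, 2))
```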
- Company Summaries: Premium, top segment, volatility, risk notes
- Segment Analysis: Market share, YoY growth, characteristics, risks
- Risk Classifications: Crop risk, health strategy types, concentration
- Industry Overview: Total premium, trends, strategic insights
- Growth Patterns: Momentum analysis, quality indicators
- API keys stored in .env (gitignored)
- No data is sent to any external service other than the Groq API
- All data processing happens locally
- ChromaDB runs locally (no cloud)
- Query Response: 1-3 seconds
- Knowledge Base Gen: ~5 seconds for 49 documents
- Vector Ingestion: ~10 seconds for full corpus
- Embedding Model: Runs on CPU, very efficient
This project uses:
- Groq API (subject to Groq's terms)
- ChromaDB (Apache 2.0)
- Streamlit (Apache 2.0)
- Sentence Transformers (Apache 2.0)
For issues or questions:
- Check the Troubleshooting section
- Verify Groq API key is valid
- Ensure all dependencies are installed
- Review data paths in code