🏮 Bridging 19th-century Peranakan rituals with 21st-century Agentic AI. AI-Powered Insights from "A Baba Wedding" by Cheo Kim Ban
An advanced Retrieval-Augmented Generation (RAG) application designed to preserve and share Peranakan wedding rituals. This project utilizes a local-first AI architecture to ensure data privacy and cost-efficiency while providing high-fidelity historical information.
🚀 Key Features
- Zero-Footprint Security: Built-in regex scrubbing for Singaporean NRIC numbers (including FIN, WP, and EP identifiers) and phone numbers, applied before any text reaches the LLM.
- Privacy-Centric RAG: Operates entirely on local infrastructure using LM Studio and ChromaDB, so no sensitive data leaves the local environment.
- Intelligent Guardrails: Custom PII scrubbing (Singaporean NRIC/phone patterns plus foreigners' WP and EP numbers) and prompt-injection detection.
- Optimized Memory: A sliding-window context passes only the last 5 messages to the LLM to minimize latency and token overhead. Combined with all-MiniLM-L6-v2 embeddings and LM Studio, this yields a fully offline, cost-effective RAG architecture.
- Multi-Modal Retrieval: When a user asks for visuals, a custom metadata dictionary maps historical wedding rituals from "A Baba Wedding" to archival imagery, retrieving relevant historical images or AI reconstructions.
- Performance Telemetry: Logs every interaction's latency and response time to a performance_logs.txt file for system benchmarking.
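The sliding-window memory described above can be sketched as a simple list slice. This is a minimal illustration, not the project's actual code; the function name and message format are assumptions.

```python
def sliding_window(history: list[dict], window: int = 5) -> list[dict]:
    """Return only the most recent `window` messages for the LLM prompt.

    Older turns are dropped to cap token usage and latency; `history`
    holds dicts like {"role": "user", "content": "..."}.
    """
    return history[-window:]

# Example: a 7-turn history is trimmed to the last 5 messages.
chat = [{"role": "user", "content": f"turn {i}"} for i in range(7)]
trimmed = sliding_window(chat)
```

Because only a fixed number of turns is ever sent, prompt size (and thus local inference latency) stays roughly constant regardless of conversation length.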
📂 Folder Structure

```
heritage-bot/
├── main.py                 # Primary Streamlit application and UI logic
├── ingest.py               # Script to process the PDF and build ChromaDB
├── requirements.txt        # Python dependencies (LangChain, Streamlit, etc.)
├── README.md               # Project documentation and setup guide
├── .gitignore              # Critical for excluding local/private data
├── data_dictionary.json    # Master metadata for image mapping and captions
├── data1/                  # [LOCAL ONLY] Archival images (not for upload)
├── chroma_db_v3/           # [LOCAL ONLY] Persisted vector database (not for upload)
├── history_*.json          # [LOCAL ONLY] User-specific chat history files
├── performance_logs.txt    # [LOCAL ONLY] Latency and benchmarking data
└── telemetry_log.txt       # [LOCAL ONLY] Image feedback and download logs
```
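As a rough illustration of how the data_dictionary.json metadata mapping might drive image retrieval (the ritual keys, file paths, and schema below are hypothetical, not the project's actual dictionary):

```python
import json

# Hypothetical metadata entries keyed by ritual name; the real
# data_dictionary.json schema may differ.
sample = json.loads("""
{
  "chiu thau": {"image": "data1/chiu_thau_01.jpg",
                "caption": "Hair-combing ceremony (AI reconstruction)"},
  "lap chai": {"image": "data1/lap_chai_01.jpg",
               "caption": "Exchange of betrothal gifts"}
}
""")

def find_images(query: str, metadata: dict) -> list[dict]:
    """Return metadata entries whose ritual name appears in the query."""
    q = query.lower()
    return [entry for ritual, entry in metadata.items() if ritual in q]

hits = find_images("Show me the chiu thau ritual", sample)
```

A keyword lookup like this keeps image retrieval deterministic and auditable, which matters when every displayed image must carry a citation.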
🛠️ Technical Stack
- Frontend: Streamlit
- Orchestration: LangChain (LCEL)
- LLM: Llama 3.1 8B (via LM Studio)
- Embeddings: sentence-transformers/all-MiniLM-L6-v2 (HuggingFace)
- Vector Store: ChromaDB
📋 Installation & Setup
- Clone the Repository:
  git clone https://github.com/okja88/heritage-bot.git
  cd heritage-bot
- Install Dependencies:
  pip install -r requirements.txt
- Configure LM Studio:
  - Load meta-llama-3.1-8b-instruct.
  - Start the Local Server at http://localhost:1234.
- Ingest Heritage Data:
  - Place your PDF and JSON metadata in the project root.
  - Run the ingestion script to build the vector database: python ingest.py
- Run the Application: streamlit run main.py
🛡️ Security & Compliance This bot is built with a Zero-Footprint strategy:
- Regex-based Masking: Automatically redacts Singaporean (NRIC) and foreigners' (Employment Pass and Work Pass) identifiers before they reach the LLM.
- Heuristic Injection Check: Blocks adversarial "jailbreak" attempts.
- Fair Dealing Compliance: All retrieved images include automated citations and Fair Dealing notices for educational use.
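A minimal sketch of the two guardrails above. The regex patterns and keyword list are illustrative assumptions, similar in spirit to (but not copied from) the project's implementation:

```python
import re

# Illustrative patterns: Singapore NRIC/FIN (prefix S/T/F/G/M, 7 digits,
# checksum letter) and local 8-digit phone numbers starting with 6, 8, or 9.
NRIC_RE = re.compile(r"\b[STFGM]\d{7}[A-Z]\b")
PHONE_RE = re.compile(r"(?:\+65[\s-]?)?\b[689]\d{7}\b")

# Hypothetical jailbreak markers for the heuristic injection check.
INJECTION_MARKERS = ("ignore previous instructions", "ignore all previous",
                     "you are now", "system prompt")

def scrub_pii(text: str) -> str:
    """Mask NRIC/FIN and phone numbers before the text reaches the LLM."""
    text = NRIC_RE.sub("[NRIC REDACTED]", text)
    return PHONE_RE.sub("[PHONE REDACTED]", text)

def looks_like_injection(text: str) -> bool:
    """Heuristic jailbreak check: flag known adversarial phrasings."""
    lowered = text.lower()
    return any(marker in lowered for marker in INJECTION_MARKERS)

safe = scrub_pii("My NRIC is S1234567A, call me at 91234567.")
```

Running the scrub before retrieval and generation means sensitive identifiers never enter the prompt, the vector store, or any log file.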
🙏 Acknowledgements
- Ngee Ann Polytechnic: For guidance during the Specialist Diploma in Applied Generative AI.
- Research & Reference: Special thanks to Cheo Kim Ban, author of "A Baba Wedding", whose detailed documentation of Peranakan customs provided the historical foundation for this project's AI reconstructions.
📜 License This project is licensed under the MIT License - see the LICENSE file for details.
AI Reconstruction of Peranakan Heritage. Generated for educational/research purposes using Google Gemini 3 Flash. Note: All rights regarding historical accuracy and original visual references are reserved to the original archive sources.
📊 Project Background The project focuses on bridging the gap between historical print archives and modern conversational AI.
📊 Performance Benchmarks (Demo Results) Based on internal testing of the RAG pipeline:
- Average Query Latency: ~2.45s (local Llama 3.1 8B via LM Studio).
- PII Scrubbing: 100% success rate in redacting Singaporean NRIC and phone formats during demo trials.
- Multi-Modal Accuracy: Successfully retrieved and displayed archival images from "A Baba Wedding" via metadata matching.
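The latency figures above could be captured with a simple timing wrapper around the RAG call; the log format shown is an assumption, not the project's exact schema.

```python
import time
from datetime import datetime, timezone

def log_latency(log_path: str, query: str, fn, *args, **kwargs):
    """Time a call (e.g. the RAG chain) and append the latency to a log file."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    latency = time.perf_counter() - start
    stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(f"{stamp}\t{latency:.3f}s\t{query}\n")
    return result, latency

# Usage with a stand-in lambda for the real RAG chain invocation:
answer, secs = log_latency("performance_logs.txt", "What is chiu thau?",
                           lambda q: f"echo: {q}", "What is chiu thau?")
```

Appending one tab-separated line per interaction keeps the log trivially parseable for computing averages like the ~2.45s figure reported above.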