🏮 Bridging 19th-century Peranakan rituals with 21st-century Agentic AI. AI-Powered Insights from "A Baba Wedding" by Cheo Kim Ban
An advanced Retrieval-Augmented Generation (RAG) application designed to preserve and share Peranakan wedding rituals. This project utilizes a local-first AI architecture to ensure data privacy and cost-efficiency while providing high-fidelity historical information.
🚀 Key Features
- Zero-Footprint Security: Built-in regex scrubbing for Singaporean NRIC numbers (including FIN, WP, and EP identifiers) and phone numbers, applied before any text reaches the LLM.
- Privacy-Centric RAG: Operates entirely on local infrastructure using LM Studio and ChromaDB, so no sensitive data leaves the local environment.
- Intelligent Guardrails: Custom PII scrubbing (Singaporean NRIC/phone patterns plus foreigners' WP and EP numbers) and prompt-injection detection.
- Optimized Memory: A sliding-window context passes only the last 5 messages to the LLM to minimize latency and token overhead. Combined with all-MiniLM-L6-v2 embeddings and LM Studio, this yields a fully offline, cost-effective RAG architecture.
- Multi-Modal Retrieval: When a user asks for visuals, a custom metadata dictionary maps historical wedding rituals from "A Baba Wedding" to archival imagery, retrieving relevant historical images or AI reconstructions.
- Performance Telemetry: Logs every interaction's latency and response time to a performance_logs.txt file for system benchmarking.
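The sliding-window memory described above can be sketched as a simple list slice. This is a minimal illustration, not the project's actual code; the function name and message format are assumptions.

```python
def sliding_window(history: list[dict], window: int = 5) -> list[dict]:
    """Return only the most recent `window` messages for the LLM prompt.

    Older turns are dropped to cap token usage and latency; `history`
    holds dicts like {"role": "user", "content": "..."}.
    """
    return history[-window:]

# Example: a 7-turn history is trimmed to the last 5 messages.
chat = [{"role": "user", "content": f"turn {i}"} for i in range(7)]
trimmed = sliding_window(chat)
```

Because only a fixed number of turns is ever sent, prompt size (and thus local inference latency) stays roughly constant regardless of conversation length.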
📂 Folder Structure

```
heritage-bot/
├── main.py                 # Primary Streamlit application and UI logic
├── ingest.py               # Script to process the PDF and build ChromaDB
├── requirements.txt        # Python dependencies (LangChain, Streamlit, etc.)
├── README.md               # Project documentation and setup guide
├── .gitignore              # Critical for excluding local/private data
├── data_dictionary.json    # Master metadata for image mapping and captions
├── data1/                  # [LOCAL ONLY] Archival images (not for upload)
├── chroma_db_v3/           # [LOCAL ONLY] Persisted vector database (not for upload)
├── history_*.json          # [LOCAL ONLY] User-specific chat history files
├── performance_logs.txt    # [LOCAL ONLY] Latency and benchmarking data
└── telemetry_log.txt       # [LOCAL ONLY] Image feedback and download logs
```
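As a rough illustration of how the data_dictionary.json metadata mapping might drive image retrieval (the ritual keys, file paths, and schema below are hypothetical, not the project's actual dictionary):

```python
import json

# Hypothetical metadata entries keyed by ritual name; the real
# data_dictionary.json schema may differ.
sample = json.loads("""
{
  "chiu thau": {"image": "data1/chiu_thau_01.jpg",
                "caption": "Hair-combing ceremony (AI reconstruction)"},
  "lap chai": {"image": "data1/lap_chai_01.jpg",
               "caption": "Exchange of betrothal gifts"}
}
""")

def find_images(query: str, metadata: dict) -> list[dict]:
    """Return metadata entries whose ritual name appears in the query."""
    q = query.lower()
    return [entry for ritual, entry in metadata.items() if ritual in q]

hits = find_images("Show me the chiu thau ritual", sample)
```

A keyword lookup like this keeps image retrieval deterministic and auditable, which matters when every displayed image must carry a citation.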
🛠️ Technical Stack
- Frontend: Streamlit
- Orchestration: LangChain (LCEL)
- LLM: Llama 3.1 8B (via LM Studio)
- Embeddings: sentence-transformers/all-MiniLM-L6-v2 (HuggingFace)
- Vector Store: ChromaDB
📋 Installation & Setup
- Clone the Repository:
  git clone https://github.com/okja88/heritage-bot.git
  cd heritage-bot
- Install Dependencies:
  pip install -r requirements.txt
- Configure LM Studio:
  - Load meta-llama-3.1-8b-instruct.
  - Start the Local Server at http://localhost:1234.
- Ingest Heritage Data:
  - Place your PDF and JSON metadata in the project root.
  - Run the ingestion script to build the vector database: python ingest.py
- Run the Application: streamlit run main.py
🛡️ Security & Compliance This bot is built with a Zero-Footprint strategy:
- Regex-based Masking: Automatically redacts Singaporean (NRIC) and foreigners' (Employment Pass and Work Pass) identifiers before they reach the LLM.
- Heuristic Injection Check: Blocks adversarial "jailbreak" attempts.
- Fair Dealing Compliance: All retrieved images include automated citations and Fair Dealing notices for educational use.
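A minimal sketch of the two guardrails above. The regex patterns and keyword list are illustrative assumptions, similar in spirit to (but not copied from) the project's implementation:

```python
import re

# Illustrative patterns: Singapore NRIC/FIN (prefix S/T/F/G/M, 7 digits,
# checksum letter) and local 8-digit phone numbers starting with 6, 8, or 9.
NRIC_RE = re.compile(r"\b[STFGM]\d{7}[A-Z]\b")
PHONE_RE = re.compile(r"(?:\+65[\s-]?)?\b[689]\d{7}\b")

# Hypothetical jailbreak markers for the heuristic injection check.
INJECTION_MARKERS = ("ignore previous instructions", "ignore all previous",
                     "you are now", "system prompt")

def scrub_pii(text: str) -> str:
    """Mask NRIC/FIN and phone numbers before the text reaches the LLM."""
    text = NRIC_RE.sub("[NRIC REDACTED]", text)
    return PHONE_RE.sub("[PHONE REDACTED]", text)

def looks_like_injection(text: str) -> bool:
    """Heuristic jailbreak check: flag known adversarial phrasings."""
    lowered = text.lower()
    return any(marker in lowered for marker in INJECTION_MARKERS)

safe = scrub_pii("My NRIC is S1234567A, call me at 91234567.")
```

Running the scrub before retrieval and generation means sensitive identifiers never enter the prompt, the vector store, or any log file.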
🙏 Acknowledgements
- Ngee Ann Polytechnic: For guidance during the Specialist Diploma in Applied Generative AI.
- Research & Reference: Special thanks to Cheo Kim Ban, author of "A Baba Wedding", whose detailed documentation of Peranakan customs provided the historical foundation for this project's AI reconstructions.
📜 License This project is licensed under the MIT License - see the LICENSE file for details.
AI Reconstruction of Peranakan Heritage. Generated for educational/research purposes using Google Gemini 3 Flash. Note: All rights regarding historical accuracy and original visual references are reserved to the original archive sources.
📊 Project Background The project focuses on bridging the gap between historical print archives and modern conversational AI.
📊 Performance Benchmarks (Demo Results) Based on internal testing of the RAG pipeline:
- Average Query Latency: ~2.45s (local Llama 3.1 8B via LM Studio).
- PII Scrubbing: 100% success rate in redacting Singaporean NRIC and phone formats during demo trials.
- Multi-Modal Accuracy: Successfully retrieved and displayed archival images from "A Baba Wedding" via metadata matching.
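The latency figures above could be captured with a simple timing wrapper around the RAG call; the log format shown is an assumption, not the project's exact schema.

```python
import time
from datetime import datetime, timezone

def log_latency(log_path: str, query: str, fn, *args, **kwargs):
    """Time a call (e.g. the RAG chain) and append the latency to a log file."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    latency = time.perf_counter() - start
    stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(f"{stamp}\t{latency:.3f}s\t{query}\n")
    return result, latency

# Usage with a stand-in lambda for the real RAG chain invocation:
answer, secs = log_latency("performance_logs.txt", "What is chiu thau?",
                           lambda q: f"echo: {q}", "What is chiu thau?")
```

Appending one tab-separated line per interaction keeps the log trivially parseable for computing averages like the ~2.45s figure reported above.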