Skip to content

Okja88/heritage-bot

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

1 Commit
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

🏮 Bridging 19th-century Peranakan rituals with 21st-century Agentic AI. AI-Powered Insights from "A Baba Wedding" by Cheo Kim Ban

An advanced Retrieval-Augmented Generation (RAG) application designed to preserve and share Peranakan wedding rituals. This project utilizes a local-first AI architecture to ensure data privacy and cost-efficiency while providing high-fidelity historical information.

🚀 Key Features

  • Zero-Footprint Security: Built-in regex scrubbing for Singaporean NRIC (including FIN/WP/EP) and phone numbers to ensure data privacy before LLM processing.

  • Privacy-Centric RAG: Operates entirely on a local infrastructure using LM Studio and ChromaDB, ensuring no sensitive data leaves the local environment.

  • Intelligent Guardrails: Includes custom PII Scrubbing (specifically for Singaporean NRIC/Phone patterns and also Foreigners' WP and EP) and Prompt Injection detection.

  • Optimized Memory: Implements a Sliding Window context, passing only the last 5 messages to the LLM to minimize latency and token overhead. Leverages all-MiniLM-L6-v2 and LM Studio for a fully offline, cost-effective RAG architecture.

  • Multi-Modal Retrieval: Intelligent mapping of historical wedding rituals from "A Baba Wedding" to archival imagery via a custom metadata dictionary when a user asks for visuals and retrieves relevant historical images or AI-reconstructions from the metadata.

  • Performance Telemetry: Logs every interaction's latency and response time to a performance_logs.txt file for system benchmarking.

📂 Folder Structure heritage-bot/ ├── main.py # Primary Streamlit application and UI logic ├── ingest.py # Script to process PDF and build ChromaDB ├── requirements.txt # Python dependencies (LangChain, Streamlit, etc.) ├── README.md # Project documentation and setup guide ├── .gitignore # Critical for excluding local/private data ├── data_dictionary.json # Master metadata for image mapping and captions │ ├── data1/ # [LOCAL ONLY] Folder for archival images (not for upload) ├── chroma_db_v3/ # [LOCAL ONLY] Persisted vector database (not for upload) │ ├── history_*.json # [LOCAL ONLY] User-specific chat history files ├── performance_logs.txt # [LOCAL ONLY] Latency and benchmarking data └── telemetry_log.txt # [LOCAL ONLY] Image feedback and download logs

🛠️ Technical Stack

  • Frontend: Streamlit
  • Orchestration: LangChain (LCEL)
  • LLM: Llama 3.1 8B (via LM Studio)
  • Embeddings: sentence-transformers/all-MiniLM-L6-v2 (HuggingFace)
  • Vector Store: ChromaDB

📋 Installation & Setup

  1. Clone the Repository: git clone https://github.com/okja88/heritage-bot.git cd heritage-bot

  2. Install Dependencies: pip install -r requirements.txt

  3. Configure LM Studio:

  1. Ingest Heritage Data:
  • lace your PDF and JSON metadata in the project root.
  • Run the ingestion script to build the vector database: python ingest.py
  1. Run the Application: streamlit run main.py

🛡️ Security & Compliance This bot is built with a Zero-Footprint strategy:

  • Regex-based Masking: Automatically redacts Singaporean and Foreigners(Employment Pass and Work Pass)identifiers before they reach the LLM.
  • Heuristic Injection Check: Blocks adversarial "jailbreak" attempts.
  • Fair Dealing Compliance: All retrieved images include automated citations and Fair Dealing notices for educational use.

Acknowledgement & Attribution, Software License

Acknowledgement & Attribution

  • Ngee Ann Polytechnic: For guidance during the Specialist Diploma in Applied Generative AI.
  • Research & Reference: Special thanks to [Cheo Kim Ban], author of [A Baba Wedding], whose detailed documentation of Peranakan customs provided the historical foundation for this project's AI reconstructions.

Software License

This project is licensed under the MIT License - see the LICENSE file for details.

AI-Generated Assets

AI Reconstruction of Peranakan Heritage. Generated for educational/research purposes using Google Gemini 3 Flash. Note: All rights regarding historical accuracy and original visual references are reserved to the original archive sources.

📊 Project Background The project focuses on bridging the gap between historical print archives and modern conversational AI.

📊 Performance Benchmarks (Demo Results) Based on internal testing of the RAG Pipeline:

Average Query Latency: ~2.45s (Local Llama 3.1 8B via LM Studio).

PII Scrubbing: 100% success rate in redacting Singaporean NRIC and Phone formats during demo trials.

Multi-Modal Accuracy: Successfully retrieved and displayed archival images from "A Baba Wedding" using metadata-matching.

About

AI-powered Peranakan heritage assistant featuring Local RAG, multi-modal retrieval, and Singapore-specific PII guardrails. Built with LangChain, Streamlit, and Llama 3.1.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages