OpenPersistentMemory is a lightweight persistent-memory chatbot stack that works with vLLM-deployed OpenAI-compatible APIs (LLM + Embedding).
It adds long-term memory with FAISS, stores full conversations sequentially, and injects relevant history back into the model with a two-level retrieval strategy:
- Recent window (multi-turn): the latest turns from the current conversation thread
- Persistent memory (retrieved):
  - Replay conversations (multi-turn): selected past user/assistant exchanges inserted as chat turns
  - Summary-only memory (system): additional relevant summaries injected into the system prompt
It also includes a GPT-like Streamlit frontend with Chinese/English UI and a clear display of which memories were used each turn.
- ✅ OpenAI package compatible: the backend uses the `openai` Python client, pointed to your local vLLM endpoints via `base_url`
- ✅ Two types of retrieval:
  - Recent context: last N turns in the current conversation
  - Persistent memory: FAISS retrieval + Gate (filter) + optional Rerank
- ✅ Conversation Gate (filter, not ranker): drops irrelevant candidates before reranking
- ✅ Strict de-duplication: anything already in the recent window is excluded from replay/summary memory
- ✅ Per-user isolation: each `user_id` gets its own FAISS index and storage folder (no shared memory)
- ✅ Sequential conversation storage: every conversation is stored as append-only JSONL, easy to reload and rebuild the recent window
- ✅ Streamlit frontend: GPT-like chat UI, configurable memory knobs, live display of retrieved history per turn
- ✅ i18n-ready UI: all frontend text and descriptions are stored in `config.py` (the `UI_TEXT` dict), making it easy to add more languages
```
OpenPersistentMemory/
  chatbot_api.py           # FastAPI backend: OpenAI-like /v1/chat/completions
  memory.py                # Memory core: per-user FAISS + sequential logs + gate/rerank
  frontend_streamlit.py    # Streamlit frontend (GPT-like UI)
  config.py                # All configs: endpoints/models/prompts/storage + i18n UI texts
  storage/                 # Created at runtime
```

Runtime storage layout:

```
storage/
  users/
    <user_id_sanitized>/
      faiss.index
      meta.jsonl
      state.json
      conversations/
        <conv_id>.jsonl
```

`conversations/<conv_id>.jsonl` stores the full conversation sequentially, one record per turn:

```
{"turn_id": 12, "ts": 1700000000.0, "user": "...", "assistant": "..."}
```

`faiss.index` + `meta.jsonl` store the persistent memory (summaries + raw messages) used for retrieval.
Python 3.9+ recommended.
Install dependencies:
```
pip install -r requirements.txt
```

On some platforms you may prefer `conda install -c pytorch faiss-cpu`.
You need two OpenAI-compatible endpoints:
- LLM chat endpoint: `.../v1/chat/completions`
- Embedding endpoint: `.../v1/embeddings`
Typical setup:
- LLM server: `http://localhost:8000/v1`
- Embedding server: `http://localhost:8001/v1`
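The backend and memory core talk to these endpoints through the standard `openai` Python client, roughly like the following sketch (model names are the ones from the configuration section below; swap in whatever your servers expose):

```python
from openai import OpenAI

# Chat client pointed at the local vLLM LLM server
llm = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Embedding client pointed at the local vLLM embedding server
emb = OpenAI(base_url="http://localhost:8001/v1", api_key="EMPTY")

reply = llm.chat.completions.create(
    model="Qwen/Qwen3-4B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(reply.choices[0].message.content)

vec = emb.embeddings.create(
    model="Qwen/Qwen3-Embedding-8B",
    input=["some text to embed"],
)
print(len(vec.data[0].embedding))
```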
The system reads configuration from environment variables (also see config.py).
```
# Required by the OpenAI client (can be a dummy value for local vLLM)
export OPENAI_API_KEY="EMPTY"

# Your vLLM endpoints
export LLM_BASE_URL="http://localhost:8000/v1"
export EMB_BASE_URL="http://localhost:8001/v1"

# Model names exposed by your servers
export LLM_MODEL="Qwen/Qwen3-4B-Instruct"
export EMB_MODEL="Qwen/Qwen3-Embedding-8B"

# Storage
export STORAGE_DIR="./storage"
export EMBEDDING_DIMS=0   # 0 = infer from first embedding

# Frontend -> backend
export BACKEND_URL="http://localhost:9000/v1/chat/completions"
```

Start the backend:

```
uvicorn chatbot_api:app --host 0.0.0.0 --port 9000
```

Check that the backend and both endpoints are reachable:

```
curl http://localhost:9000/health
curl http://localhost:9000/health/embed
curl http://localhost:9000/health/llm
```

- If `health/embed` fails → the embedding endpoint configuration is wrong
- If `health/llm` fails → the LLM endpoint configuration is wrong
Start the frontend:

```
streamlit run frontend_streamlit.py
```

The frontend reads `BACKEND_URL` from the environment; it is not editable in the UI.
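For a quick smoke test without the UI, you can POST to the backend directly. This is a sketch only: the standard OpenAI-style fields are shown, but how `user_id` and `conv_id` are actually passed is defined in `chatbot_api.py`, so the two extra fields below are assumptions to adjust:

```python
import requests

payload = {
    # Standard OpenAI-style chat fields
    "model": "Qwen/Qwen3-4B-Instruct",
    "messages": [{"role": "user", "content": "What did we talk about last time?"}],
    # Hypothetical fields: check chatbot_api.py for the real names used to
    # select the per-user memory index and the conversation thread
    "user_id": "alice",
    "conv_id": "demo-conversation",
}

resp = requests.post("http://localhost:9000/v1/chat/completions", json=payload, timeout=120)
resp.raise_for_status()
# Assuming an OpenAI-like response shape
print(resp.json()["choices"][0]["message"]["content"])
```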
For each user message:

- Recent window: load the last `recent_turns` turns from `conversations/<conv_id>.jsonl`
- Retrieve candidates: FAISS retrieves `candidate_n` memory summaries
- De-dup: remove memories that correspond to turns already in the recent window
- Gate (filter): the LLM decides keep/drop for each candidate (a binary filter, not a ranking)
- Rerank (optional): the LLM selects the top `retrieve_k` memories to replay
- Build the prompt to the LLM (see the sketch after this list):
  - System prompt
  - Summary-only memory (system block)
  - Replay conversations (multi-turn)
  - Recent window (multi-turn)
  - Current user message
- After answering, the system:
  - Appends the turn to the sequential conversation log (JSONL)
  - Summarizes the turn and stores it (summary embedding + meta)
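To make the prompt-building step concrete, here is a minimal sketch of how such a message list could be assembled (illustrative only; the variable and function names are not the ones used in `chatbot_api.py`):

```python
def build_messages(system_prompt, summary_block, replay_turns, recent_turns, user_message):
    """Illustrative assembly order: system prompt (+ summary-only memories),
    replayed past turns, the recent window, then the current user message."""
    system = system_prompt + ("\n\nRelevant memories:\n" + summary_block if summary_block else "")
    messages = [{"role": "system", "content": system}]

    # Replayed past conversations, inserted as real chat turns
    for turn in replay_turns:
        messages.append({"role": "user", "content": turn["user"]})
        messages.append({"role": "assistant", "content": turn["assistant"]})

    # Recent window of the current conversation thread
    for turn in recent_turns:
        messages.append({"role": "user", "content": turn["user"]})
        messages.append({"role": "assistant", "content": turn["assistant"]})

    # Current user message comes last
    messages.append({"role": "user", "content": user_message})
    return messages
```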
These are controlled in the Streamlit sidebar, with Chinese/English descriptions sourced from config.py.
- Recent window (turns): how many latest turns of this conversation are included directly
- Vector candidates (N): FAISS top-N candidates to consider before filtering
- Replay top-K: how many past turns are replayed as full multi-turn context
- Gate keep max: max number of candidates kept after gate
- Enable gate: on/off for the relevance filter (binary keep/drop)
- Enable rerank: on/off for LLM reranking (select best K to replay)
- Temperature / Max tokens: generation controls
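For illustration, knobs like these typically appear as Streamlit sidebar widgets, roughly as below (a sketch only; the real labels, ranges, and defaults come from `frontend_streamlit.py` and `config.py`):

```python
import streamlit as st

with st.sidebar:
    # Ranges and defaults here are illustrative, not the project's defaults
    recent_turns = st.slider("Recent window (turns)", 0, 20, 6)
    candidate_n = st.slider("Vector candidates (N)", 1, 50, 20)
    retrieve_k = st.slider("Replay top-K", 0, 10, 3)
    gate_keep_max = st.slider("Gate keep max", 0, 20, 8)
    enable_gate = st.checkbox("Enable gate", value=True)
    enable_rerank = st.checkbox("Enable rerank", value=True)
    temperature = st.slider("Temperature", 0.0, 2.0, 0.7)
    max_tokens = st.number_input("Max tokens", min_value=64, max_value=8192, value=1024)
```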
- Different `user_id` → different memory index and storage folder → no shared memories
- Different `conv_id` under the same user → a different conversation thread for the recent window
In the frontend:
- `User ID` controls which user memory index you use
- `Conversation ID` controls the thread (recent window) and log file
- "New chat" generates a new `conv_id`
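The `<user_id_sanitized>` folder name is a filesystem-safe form of the user ID. A hypothetical sanitizer could look like this (the real one lives in `memory.py` and may behave differently):

```python
import re

def sanitize_user_id(user_id: str) -> str:
    """Hypothetical example: replace anything that is not filesystem-safe."""
    cleaned = re.sub(r"[^A-Za-z0-9._-]", "_", user_id.strip())
    return cleaned or "default"

# "alice@example.com" -> "alice_example.com"
```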
Check the backend console logs. The backend also returns a JSON traceback in error responses.
If you switch embedding models, old FAISS indices may become incompatible.
Fix by deleting the user’s index files:
```
rm -rf storage/users/<user_id_sanitized>/
```

(Or just delete `faiss.index` + `meta.jsonl` + `state.json` for that user.)
This repo uses widget-bound keys (`ui_user_id`, `ui_conv_id`) to avoid Streamlit key-mutation errors.
Edit `PromptConfig` in `config.py`:

- `CHAT_SYSTEM`
- `SUMMARY_SYSTEM`
- `GATE_SYSTEM`
- `RERANK_SYSTEM`
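For example, overriding one of these prompts might look roughly like the sketch below (the actual structure of `PromptConfig` in `config.py` may be a dataclass or plain constants; match whatever is already there):

```python
# config.py (sketch only; keep the structure that already exists)
class PromptConfig:
    CHAT_SYSTEM = "You are a helpful assistant with access to the user's long-term memory."
    SUMMARY_SYSTEM = "Summarize this user/assistant turn in one or two sentences."
    GATE_SYSTEM = "Decide whether this memory is relevant to the current question. Answer keep or drop."
    RERANK_SYSTEM = "Rank the candidate memories by relevance to the current question."
```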
All UI strings live in `UI_TEXT` in `config.py`. Add a new language by adding a new top-level key (e.g. `"JP"`) with the same structure.
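A new language entry mirrors the existing ones, for example (the key names inside each language dict below are placeholders; copy the real key set from the existing entries in `config.py`):

```python
# config.py (sketch; key names inside each language dict are placeholders)
UI_TEXT = {
    "EN": {
        "title": "OpenPersistentMemory",
        "new_chat": "New chat",
    },
    "ZH": {
        "title": "OpenPersistentMemory",
        "new_chat": "新对话",
    },
    # New language: same structure, translated values
    "JP": {
        "title": "OpenPersistentMemory",
        "new_chat": "新しいチャット",
    },
}
```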
This project is licensed under the MIT License.
See the LICENSE file for details.