A graph-primary memory system for LLM agents. Single binary, zero dependencies, RAM-resident, with LLM-driven consolidation that makes it smarter while you sleep.
Every existing memory system for LLMs is a thin wrapper over a vector database. They are flat, they degrade as they grow, and they treat memory as a passive store.
lattice0 makes three bets nobody else is making simultaneously:
- **RAM-first, not RAM-cached.** The entire working set -- vectors, graph, lexical indices, metadata -- lives resident in memory. Queries hit L3/RAM speeds, not SSD speeds.
- **Graph-primary, not vector-primary.** The knowledge graph is the core data model. Vectors and lexical indices are entry points into the graph. Retrieval returns connected subgraphs, not ranked chunk lists.
- **Consolidation is a real process, driven by an LLM.** When sessions end or go idle, lattice0 enters a sleep phase where it uses an LLM to merge duplicates, detect contradictions, promote episodic details into semantic facts, write summaries, and propose new graph edges. The system gets better while you are not using it.
Most LLM memory systems are vector stores with an API wrapper. They store embeddings, retrieve by similarity, and call it a day. That works for simple Q&A, but it falls apart when your agent needs to reason across sessions, track evolving knowledge, or operate without cloud dependencies.
lattice0 is structurally different:
| | mem0 | Memoripy | lattice0 |
|---|---|---|---|
| Data model | Flat vector store | Short/long-term vector split | Knowledge graph + vectors + lexical |
| Retrieval | Vector similarity | Vector + recency decay | Hybrid (BM25 + HNSW + RRF + graph expansion + cross-encoder reranking) |
| Relationships | None | None | Typed edges (mentions, contradicts, supersedes, derived_from, etc.) |
| Consolidation | None | None | LLM-driven: dedup, contradiction detection, fact promotion, summaries, edge proposals, decay |
| Infrastructure | Requires Qdrant/Postgres/Redis | Requires API keys | Single binary, zero runtime deps |
| Latency | Network-bound (cloud vector DB) | API-call-bound | Single-digit ms (RAM-resident) |
| LLM backend | Cloud-only (OpenAI) | Cloud-only (OpenAI/Gemini) | Claude Code, Anthropic API, Ollama, or any OpenAI-compatible endpoint |
| Privacy | Data goes through cloud APIs | Data goes through cloud APIs | Everything runs locally, LLM calls configurable |
Where lattice0 excels:
- Long-running agents that accumulate context across sessions. The graph means related memories surface together. The sleep cycle means stale information gets pruned and contradictions get flagged, not silently stored.
- Developer tooling (Claude Code, Cursor, Windsurf). Sub-5ms recall means zero perceptible latency in the coding loop. No external database to set up or maintain.
- Privacy-sensitive environments. Everything runs on your machine. Use Ollama with a local model and nothing leaves your hardware.
- Agents that need to track evolving knowledge. "The API key rotates every 90 days" supersedes "The API key is permanent." Most memory systems store both and hope the retriever picks the right one. lattice0 marks the contradiction with a `contradicts` edge and uses recency + access patterns to surface the current fact.
- Workloads where retrieval quality matters more than storage volume. The hybrid pipeline (lexical + vector + graph expansion + reranking) finds relevant memories that pure vector search misses, especially for exact-match queries and multi-hop reasoning.
Where lattice0 is not the right choice:
- Multi-user / multi-tenant systems (lattice0 is single-user, single-node)
- Massive document corpus RAG (lattice0 is optimized for agent memories, not bulk document retrieval)
- Cloud-native architectures that need managed infrastructure
Requirements: Rust toolchain (stable, 1.85+), Python 3 (for Claude Code hooks), Claude Code CLI installed and authenticated.
```sh
git clone https://github.com/neuralpunk/lattice0.git
cd lattice0
make release
sudo make install
```

This builds an optimised binary and copies it to `/usr/local/bin/lattice0`.
```sh
lattice0 setup
lattice0 models download
```

That's it. `setup` does everything in one command:
- Creates the data directory (`~/.local/share/lattice0/`)
- Writes a default config with Claude Code as the LLM provider
- Registers the MCP server with Claude Code
- Disables Claude Code's built-in auto-memory (lattice0 replaces it)
- Installs hooks for automatic memory recall and persistence
`models download` fetches the embedding model (~127MB, `bge-small-en-v1.5` via ONNX) and the cross-encoder reranker (~22MB, `ms-marco-MiniLM-L-6-v2`, quantized int8) from Hugging Face for semantic search and result reranking.
Restart Claude Code after setup. lattice0 is now your agent's memory.
```sh
# Store a memory
lattice0 remember "The API auth tokens rotate every 90 days"

# Store with tags
lattice0 remember -t deadlines "Project deadline is June 15"

# Pipe from stdin
cat meeting_notes.txt | lattice0 remember -t meetings -

# Search memories (hybrid: vector + lexical + graph expansion)
lattice0 recall "auth tokens"

# List all memories
lattice0 recall ""

# See system state
lattice0 status

# Run consolidation (merge duplicates, detect contradictions, write summaries)
lattice0 sleep

# Show retrieval diagnostics
lattice0 trace "auth tokens"

# Manually link two memories
lattice0 link <source_id> <target_id> --type mentions

# Forget a memory
lattice0 forget <id>

# Machine-readable output
lattice0 recall --json "auth tokens"
lattice0 status --json
```

All commands accept `--data-dir` to override the storage location. The default data directory follows XDG conventions (`$XDG_DATA_HOME/lattice0`), overridable via the `LATTICE0_DATA_DIR` environment variable.
```
lattice0 setup                        # Full setup: init + MCP + hooks (idempotent)
lattice0 init                         # Initialize data directory and config only
lattice0 remember [--tag X] <text>    # Store a memory (reads stdin if text is "-" or omitted)
lattice0 recall <query>               # Hybrid retrieval (returns subgraph)
lattice0 recall ""                    # List all memories
lattice0 forget <id> [--force]        # Mark a memory as forgotten
lattice0 link <src> <dst> --type X    # Create a typed edge between two memories
lattice0 sleep                        # Force a consolidation cycle
lattice0 status [--json]              # Show system state and last sleep report
lattice0 trace <query>                # Show the query plan and retrieval scores
lattice0 serve                        # Run as MCP server (stdio, JSON-RPC 2.0)
lattice0 server-status [--json]       # Show MCP server process info (PID, uptime, RSS)
lattice0 models download              # Download embedding + reranker models
lattice0 reindex                      # Re-embed memories and rebuild vector index
lattice0 archive list                 # List sleep generations
lattice0 archive diff <gen1> <gen2>   # Diff two archive generations
```
Claude Code owns the lattice0 server lifecycle. When Claude Code starts, it spawns `lattice0 serve` as a child process based on the MCP registration from `lattice0 setup`. When Claude Code exits, the server dies with it. You don't need to start it manually.
```sh
# Check if the server is running, see PID, uptime, and memory usage
lattice0 server-status

# Machine-readable version
lattice0 server-status --json

# Kill the server (Claude Code will respawn it next session)
pkill -f 'lattice0.*serve'

# Unregister lattice0 from Claude Code entirely
claude mcp remove lattice0

# Re-register after updating the binary
lattice0 setup
```

The server persists state to disk after every write operation (atomic rename via a `.tmp` file), so killing the process at any time is safe -- no data is lost.
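The write-then-rename pattern looks roughly like this (a standard-library sketch with illustrative file names; a fully durable version would also fsync before the rename). Because `rename` is atomic within a filesystem, readers see either the old state or the new one, never a torn write:

```rust
use std::{fs, io, path::Path};

/// Persist state atomically: write the new bytes to a sibling .tmp file,
/// then rename it over the target. A crash or kill at any point leaves
/// either the old file or the new one intact. Sketch only -- this is not
/// lattice0's actual persistence code.
fn save_state(path: &Path, bytes: &[u8]) -> io::Result<()> {
    let tmp = path.with_extension("json.tmp");
    fs::write(&tmp, bytes)?; // production code would fsync here before renaming
    fs::rename(&tmp, path)
}

fn main() -> io::Result<()> {
    save_state(Path::new("state.json"), br#"{"memories": []}"#)
}
```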
When you run `lattice0 remember "I prefer tabs over spaces"`, three things happen:
**Content addressing (BLAKE3).** Your text gets hashed into a unique 64-character hex ID. The same text always produces the same ID, so duplicates are automatically detected. BLAKE3 is a cryptographic hash function -- fast, collision-resistant, and deterministic.
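A minimal sketch of the content-addressing step, using the `blake3` crate (the function name and how lattice0 wires this internally are assumptions):

```rust
// Content addressing: a memory's ID is the BLAKE3 hash of its text.
// Sketch only -- lattice0's internal API is an assumption here.
fn memory_id(text: &str) -> String {
    // blake3::hash returns a 32-byte digest; to_hex() renders it as the
    // 64-character lowercase hex string used as the memory's ID.
    blake3::hash(text.as_bytes()).to_hex().to_string()
}

fn main() {
    let a = memory_id("I prefer tabs over spaces");
    let b = memory_id("I prefer tabs over spaces");
    assert_eq!(a, b); // same text, same ID -- duplicates detect themselves
    println!("{a}");
}
```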
**Embedding (ONNX + bge-small-en-v1.5).** Your text gets converted into a list of 384 numbers (a vector) that represents its meaning. The model `bge-small-en-v1.5` is a small neural network that does this conversion. ONNX is the file format the model is stored in, and the `ort` crate runs it natively in Rust without Python. Later, when you search for "indentation preferences", the query also gets converted to 384 numbers, and we find memories whose vectors are close in meaning -- even if they share no words with the query.
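"Close in meaning" is measured geometrically. Here is a dependency-free sketch of cosine similarity over those 384-dimensional vectors (whether lattice0 uses cosine or a raw dot product internally is an assumption; for normalized embeddings the two agree):

```rust
// Cosine similarity between two embedding vectors: values near 1.0 mean
// "same direction", i.e. similar meaning. Sketch; lattice0's actual
// distance metric is an assumption.
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (norm_a * norm_b)
}

fn main() {
    // Toy 3-dimensional stand-ins for the real 384-dimensional embeddings.
    println!("{:.3}", cosine(&[0.1, 0.9, 0.2], &[0.15, 0.85, 0.25])); // high
    println!("{:.3}", cosine(&[0.1, 0.9, 0.2], &[0.9, -0.1, 0.3]));  // low
}
```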
**Indexing.** The memory is inserted into three separate indices simultaneously:
- Lexical index (tantivy) -- a full-text search engine. Finds exact word matches using BM25 scoring.
- Vector index (HNSW) -- stores embeddings in a Hierarchical Navigable Small World graph. Instead of comparing against every memory, it navigates a network of connections to find the nearest neighbors fast. This is how meaning-based search works.
- Bitmap index (roaring) -- compressed bitsets tracking which memories have which tags. Can intersect millions of tag filters in microseconds; see the sketch after this list.
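A sketch of what that bitmap intersection looks like with the `roaring` crate (the tag names and the mapping from memories to integer IDs are illustrative assumptions):

```rust
use roaring::RoaringBitmap;

fn main() {
    // One bitmap per tag; each set bit is a memory's integer ID.
    // Illustrative only -- lattice0's internal ID mapping is an assumption.
    let mut meetings = RoaringBitmap::new();
    let mut deadlines = RoaringBitmap::new();
    meetings.extend([1u32, 2, 5, 9]);
    deadlines.extend([2u32, 3, 9]);

    // The filter "tagged meetings AND deadlines" is one bitset intersection.
    let both = &meetings & &deadlines;
    println!("{:?}", both.iter().collect::<Vec<u32>>()); // [2, 9]
}
```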
When you run `lattice0 recall "auth tokens"`:
Step 1: Parse. Extract keywords, remove stopwords.
Step 2: Parallel search. Two searches run simultaneously:
- Lexical search finds memories containing the words "auth" and "tokens"
- Vector search embeds your query into 384 numbers and finds the nearest memories by meaning
These find different things. Lexical catches exact matches. Vector catches semantic matches (e.g., "authentication credentials" matches "auth tokens" even though the words differ).
Step 3: Reciprocal Rank Fusion (RRF). The two ranked lists are merged. For each memory, its score = sum of 1/(rank + 60) across all lists it appears in. A memory ranked #1 in both lists scores high. A memory in only one list scores lower. This simple formula beats most learned combination methods in published benchmarks.
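A compact sketch of the fusion formula as stated (1-based ranks, k = 60; lattice0's exact constant and tie-breaking are assumptions):

```rust
use std::collections::HashMap;

/// Reciprocal Rank Fusion: each memory's score is the sum of 1/(rank + k)
/// over every ranked list it appears in, with k = 60 as described above.
fn rrf(lists: &[Vec<&str>], k: f64) -> Vec<(String, f64)> {
    let mut scores: HashMap<String, f64> = HashMap::new();
    for list in lists {
        for (i, id) in list.iter().enumerate() {
            let rank = (i + 1) as f64; // 1-based rank within this list
            *scores.entry(id.to_string()).or_insert(0.0) += 1.0 / (rank + k);
        }
    }
    let mut fused: Vec<_> = scores.into_iter().collect();
    fused.sort_by(|a, b| b.1.total_cmp(&a.1));
    fused
}

fn main() {
    let lexical = vec!["m1", "m7", "m3"]; // BM25 ranking
    let vector = vec!["m7", "m1", "m9"];  // HNSW ranking
    // m1 and m7 appear near the top of both lists, so they fuse highest.
    for (id, score) in rrf(&[lexical, vector], 60.0) {
        println!("{id}: {score:.4}");
    }
}
```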
Step 4: Graph expansion. The top results become seed nodes in the knowledge graph. We walk 1-2 hops outward along typed edges. If memory A mentions entity B, and B elaborates on C, we pull in B and C as additional context. The result is a connected subgraph of related memories, not a flat list.
Step 5: Cross-encoder reranking (optional). If a reranker model is available, the top candidates are re-scored by a cross-encoder that reads both the query and each memory together. This is slower than vector search (it processes each pair individually) but significantly more accurate for final ordering. If no reranker model is present, the RRF scores are used directly.
Step 6: Return results with scores, provenance, and relationships.
Every memory is a node. Edges between nodes have types:
| Edge type | Meaning |
|---|---|
| `mentions` | Memory A references entity/memory B |
| `elaborates` | A expands on B |
| `supersedes` | A replaces B (after consolidation merges duplicates) |
| `contradicts` | A and B disagree (explicitly marked) |
| `derived_from` | A is a summary or consolidation of B |
| `co_occurred` | A and B appeared in the same session |
| `caused` | A led to B |
| `about` | A is about entity B |
The graph is stored as a CSR (Compressed Sparse Row) adjacency -- all edges packed contiguously in memory so walking from node to node hits the CPU cache instead of jumping around RAM. Each hop is nanoseconds.
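A minimal sketch of that layout (struct and field names are illustrative, not lattice0's actual types). The neighbors of node `n` are one contiguous slice of the edge array, so a 1-2 hop expansion is a handful of cache-friendly slice scans:

```rust
/// Compressed Sparse Row adjacency: all edges packed into one contiguous
/// `targets` array; `offsets[n]..offsets[n + 1]` delimits node n's edges.
/// Illustrative sketch only -- not lattice0's actual structs.
struct CsrGraph {
    offsets: Vec<usize>, // length = node_count + 1
    targets: Vec<u32>,   // length = edge_count
}

impl CsrGraph {
    fn neighbors(&self, node: u32) -> &[u32] {
        let n = node as usize;
        &self.targets[self.offsets[n]..self.offsets[n + 1]]
    }

    /// Collect every node reachable within `hops` edges of the seeds,
    /// as in the graph-expansion step of recall.
    fn expand(&self, seeds: &[u32], hops: usize) -> Vec<u32> {
        let mut seen: Vec<u32> = seeds.to_vec();
        let mut frontier: Vec<u32> = seeds.to_vec();
        for _ in 0..hops {
            let mut next = Vec::new();
            for &node in &frontier {
                for &nb in self.neighbors(node) {
                    if !seen.contains(&nb) {
                        seen.push(nb);
                        next.push(nb);
                    }
                }
            }
            frontier = next;
        }
        seen
    }
}

fn main() {
    // Three nodes: 0 -> 1, 1 -> 2 (node 2 has no outgoing edges).
    let g = CsrGraph { offsets: vec![0, 1, 2, 2], targets: vec![1, 2] };
    println!("{:?}", g.expand(&[0], 2)); // [0, 1, 2]
}
```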
Edges are created manually (link command), automatically during ingestion, or by the LLM during sleep consolidation. As the graph grows, retrieval gets richer because you get context, not just matches.
Consolidation runs when you explicitly call `lattice0 sleep`, when the MCP server has been idle for 5 minutes (configurable via `sleep.idle_timeout_secs`), or when a client disconnects (session end). The system sends your memories to an LLM (Claude, via the `claude` CLI on your machine) and asks it to:
- Merge duplicates -- find memories that say the same thing, write one canonical version, link the old ones with `supersedes` edges. Originals are preserved -- nothing is deleted without explicit user intent.
- Detect contradictions -- "I use tabs" vs "I switched to spaces" get marked with a `contradicts` edge rather than silently storing both.
- Promote facts -- forty mentions of "works at Acme Corp" collapse into a single semantic fact with `derived_from` edges back to the episodes.
- Write summaries -- group memories by day or topic, generate concise summaries as new retrievable nodes.
- Propose edges -- the LLM suggests relationships the system missed, with justifications stored as edge metadata.
- Decay and forget -- memories that are old, rarely accessed, and poorly connected get pruned. The decay model combines exponential recency, access frequency (log-scaled), and graph connectivity; one plausible shape is sketched after this list. Pinned memories are never forgotten.
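A sketch of what such a retention score could look like (the weights, the 30-day time constant, and the exact combination are assumptions, not lattice0's tuned values):

```rust
/// Retention score combining exponential recency, log-scaled access
/// frequency, and graph connectivity. All constants here are illustrative
/// assumptions -- lattice0's actual decay parameters may differ.
fn retention_score(age_days: f64, access_count: u64, degree: usize, pinned: bool) -> f64 {
    if pinned {
        return f64::INFINITY; // pinned memories are never forgotten
    }
    let recency = (-age_days / 30.0).exp();           // exponential decay of recency
    let frequency = (1.0 + access_count as f64).ln(); // log-scaled access frequency
    let connectivity = (1.0 + degree as f64).ln();    // well-connected nodes resist pruning
    0.5 * recency + 0.3 * frequency + 0.2 * connectivity
}

fn main() {
    // A year-old, never-accessed, isolated memory scores near zero
    // (a pruning candidate); a fresh, well-linked one scores much higher.
    println!("{:.3}", retention_score(365.0, 0, 0, false));
    println!("{:.3}", retention_score(2.0, 8, 5, false));
}
```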
The sleep report is persisted and shown in `lattice0 status`.
Everything lives in `~/.local/share/lattice0/`:

- `config.toml` -- settings (LLM provider, model paths, decay thresholds)
- `state.json` -- all memories + graph edges (loaded into RAM on each command)
- `models/bge-small-en-v1.5/` -- the ONNX embedding model files
- `models/ms-marco-MiniLM-L-6-v2/` -- the ONNX cross-encoder reranker model files
The "RAM-first" design means that during any command, everything is deserialized into memory, indices are rebuilt, work is done at RAM speed, and state is saved back. The MCP server keeps everything resident for the duration of its process.
`lattice0 serve` runs an MCP server over stdio (JSON-RPC 2.0). The `setup` command registers it with Claude Code automatically.
The server exposes seven tools to the LLM: `remember`, `recall`, `forget`, `link`, `status`, `sleep`, and `trace`. State is persisted to disk after every write operation.
Claude Code hooks installed by setup:
- `UserPromptSubmit` -- automatically searches lattice0 and injects relevant memories into every prompt
- `PreToolUse` -- blocks writes to `MEMORY.md` files (all memory goes through lattice0)
- `PreCompact` -- prompts Claude to save critical context before context window compaction
The MCP server's `initialize` response includes instructions that tell Claude when and how to use each tool. Setup is idempotent -- run it again after updating the binary and it skips steps already configured.
Sleep uses an LLM for all editorial work. Supported backends:
- Claude Code (subprocess) -- default after `setup`. Uses the `claude` CLI already on your machine. No API key needed.
- Anthropic API -- direct API calls. Set `ANTHROPIC_API_KEY` in your environment.
- OpenAI-compatible endpoints -- any endpoint that speaks the OpenAI Chat Completions API: Ollama, vLLM, LM Studio, llama.cpp server, text-generation-inference, etc.
Example `config.toml` for Ollama:

```toml
[llm]
provider = "openai_compatible"
model = "llama3.1"
api_base = "http://localhost:11434"
```

No API key is needed for local servers. If your endpoint requires one, set the environment variable named in `api_key_env` (defaults to `ANTHROPIC_API_KEY`, but you can change it to e.g. `OPENAI_API_KEY`).
Planned but not yet implemented:
- Claude Desktop -- an MCP client provider targeting Claude Desktop.

Configure the provider in `config.toml` (written by `lattice0 setup`).
Watch lattice0 in real time from another terminal:
```sh
# Refresh status every 2 seconds
watch -n 2 'lattice0 status'

# Watch all memories
watch -n 2 'lattice0 recall ""'
```

| Metric | Target |
|---|---|
| Recall latency (hot tier, no reranking) | p50 < 5ms, p99 < 20ms |
| Recall latency (with reranking) | p50 < 30ms, p99 < 100ms |
| Ingest latency | p50 < 2ms (sync), async embedding |
| Hot tier capacity | ~10M memories on 64GB machine |
| Sleep cycle duration | Proportional to changes since last sleep, not total size |
These targets have been measured on working sets of up to 1M memories; they are not theoretical claims.
```
lattice0 status
-------------------------------
Data dir:        /home/user/.local/share/lattice0
Memories:        847
Pinned:          12
Graph nodes:     893
Graph edges:     2,141
Vectors indexed: 847
Lexical indexed: 847
Embedder:        loaded
Reranker:        loaded
Generation:      42
Hot tier (est):  2.1 MiB
Last sleep:      2026-04-14T02:11:03Z
  Duration:       47.2s
  Merged:         12
  Contradictions: 2
  Promoted:       4
  New edges:      31
  Forgotten:      7
  Summaries:      9
  Generation:     gen-0042 (parent gen-0041)
```
Working end-to-end:
- `init`, `setup`, `remember`, `recall`, `forget`, `link`, `status`, `trace`, `serve`, `reindex`, `models download`, `archive list`, `archive diff`
- Hybrid retrieval: tantivy BM25 + HNSW vector search + RRF fusion + graph expansion + cross-encoder reranking
- Full MCP server with seven tools, JSON-RPC 2.0 over stdio -- including live consolidation via `sleep`
- One-command Claude Code integration (MCP registration, hooks, auto-memory disable)
- ONNX embedding (bge-small-en-v1.5, 384-dim) with automatic model download
- ONNX cross-encoder reranking (ms-marco-MiniLM-L-6-v2, quantized int8) with automatic model download
- Full sleep consolidation pipeline: duplicate merging, contradiction detection, semantic promotion, summary writing, edge proposal, exponential decay
- Sleep works from CLI (`lattice0 sleep`), MCP (`sleep` tool), idle timeout (auto-trigger after inactivity), and session end (auto-trigger on client disconnect)
- Embedding backfill: memories stored before the embedder was available are automatically embedded during sleep
- Content-addressed storage with BLAKE3 hashing
- Roaring bitmap filters for metadata
- Archive browsing: list generations, diff between generations
- 41 unit tests across all crates
- Makefile for build/install/reset
Not yet wired:
- Warm tier mmap archives (archive format designed but not persisted to separate files)
- Sketch-based routing (the cuckoo filters, count-min sketches, and HyperLogLog counters are currently stubbed with exact data structures as placeholders -- correct, but less memory-efficient at scale)
- Entity extraction at ingest time (currently only during sleep via LLM)
- Claude Desktop MCP client provider (stub -- use `claude_code`, `anthropic_api`, or `openai_compatible` instead)
- PageRank importance scoring during consolidation
- Published benchmarks
This tool assumes you have a real machine. If you are running on a 4GB VPS or a Chromebook, use something else.
| | Minimum | Recommended |
|---|---|---|
| RAM | 16 GB | 32-64 GB DDR5 |
| Storage | NVMe SSD | PCIe 4.0+ NVMe |
| CPU | 4 cores, AVX2 or NEON | 8+ cores, AVX-512 or Apple Silicon |
lattice0 is a single-node system. No clustering, no sharding, no multi-tenancy. One user, one lattice.
```
crates/
  core/      Memory chunks, knowledge graph, edge types, config
  index/     Composite index: HNSW vectors, tantivy lexical, roaring bitmaps, sketches
  embed/     ONNX embedding (bge-small-en-v1.5) and cross-encoder reranking
  retrieve/  Query planning, RRF fusion, graph expansion, retrieval pipeline
  llm/       LLM provider trait + Claude Code, Anthropic API, OpenAI-compatible backends
  sleep/     Consolidation engine: duplicate merge, contradiction, promotion, decay
  mcp/       MCP server (JSON-RPC 2.0, stdio)
  cli/       Binary entry point and command implementations
```
MIT
