neuralpunk/lattice0

lattice0


A graph-primary memory system for LLM agents. Single binary, zero dependencies, RAM-resident, with LLM-driven consolidation that makes it smarter while you sleep.


The thesis

Every existing memory system for LLMs is a thin wrapper over a vector database. They are flat, they degrade as they grow, and they treat memory as a passive store.

lattice0 makes three bets nobody else is making simultaneously:

  1. RAM-first, not RAM-cached. The entire working set -- vectors, graph, lexical indices, metadata -- lives resident in memory. Queries hit L3/RAM speeds, not SSD speeds.

  2. Graph-primary, not vector-primary. The knowledge graph is the core data model. Vectors and lexical indices are entry points into the graph. Retrieval returns connected subgraphs, not ranked chunk lists.

  3. Consolidation is a real process, driven by an LLM. When sessions end or go idle, lattice0 enters a sleep phase where it uses an LLM to merge duplicates, detect contradictions, promote episodic details into semantic facts, write summaries, and propose new graph edges. The system gets better while you are not using it.

Why lattice0 over mem0, Memoripy, etc.

Most LLM memory systems are vector stores with an API wrapper. They store embeddings, retrieve by similarity, and call it a day. That works for simple Q&A, but it falls apart when your agent needs to reason across sessions, track evolving knowledge, or operate without cloud dependencies.

lattice0 is structurally different:

|                | mem0                           | Memoripy                     | lattice0                                                                 |
|----------------|--------------------------------|------------------------------|--------------------------------------------------------------------------|
| Data model     | Flat vector store              | Short/long-term vector split | Knowledge graph + vectors + lexical                                      |
| Retrieval      | Vector similarity              | Vector + recency decay       | Hybrid (BM25 + HNSW + RRF + graph expansion + cross-encoder reranking)   |
| Relationships  | None                           | None                         | Typed edges (mentions, contradicts, supersedes, derived_from, etc.)      |
| Consolidation  | None                           | None                         | LLM-driven: dedup, contradiction detection, fact promotion, summaries, edge proposals, decay |
| Infrastructure | Requires Qdrant/Postgres/Redis | Requires API keys            | Single binary, zero runtime deps                                         |
| Latency        | Network-bound (cloud vector DB)| API-call-bound               | Single-digit ms (RAM-resident)                                           |
| LLM backend    | Cloud-only (OpenAI)            | Cloud-only (OpenAI/Gemini)   | Claude Code, Anthropic API, Ollama, or any OpenAI-compatible endpoint    |
| Privacy        | Data goes through cloud APIs   | Data goes through cloud APIs | Everything runs locally, LLM calls configurable                          |

Where lattice0 excels:

  • Long-running agents that accumulate context across sessions. The graph means related memories surface together. The sleep cycle means stale information gets pruned and contradictions get flagged, not silently stored.
  • Developer tooling (Claude Code, Cursor, Windsurf). Sub-5ms recall means zero perceptible latency in the coding loop. No external database to set up or maintain.
  • Privacy-sensitive environments. Everything runs on your machine. Use Ollama with a local model and nothing leaves your hardware.
  • Agents that need to track evolving knowledge. "The API key rotates every 90 days" supersedes "The API key is permanent." Most memory systems store both and hope the retriever picks the right one. lattice0 marks the contradiction with a contradicts edge and uses recency + access patterns to surface the current fact.
  • Workloads where retrieval quality matters more than storage volume. The hybrid pipeline (lexical + vector + graph expansion + reranking) finds relevant memories that pure vector search misses, especially for exact-match queries and multi-hop reasoning.

Where lattice0 is not the right choice:

  • Multi-user / multi-tenant systems (lattice0 is single-user, single-node)
  • Massive document corpus RAG (lattice0 is optimized for agent memories, not bulk document retrieval)
  • Cloud-native architectures that need managed infrastructure

Install

Requirements: Rust toolchain (stable, 1.85+), Python 3 (for Claude Code hooks), Claude Code CLI installed and authenticated.

git clone https://github.com/neuralpunk/lattice0.git
cd lattice0
make release
sudo make install

This builds an optimized binary and copies it to /usr/local/bin/lattice0.

Setup

lattice0 setup
lattice0 models download

That's it. setup does everything in one command:

  • Creates the data directory (~/.local/share/lattice0/)
  • Writes a default config with Claude Code as the LLM provider
  • Registers the MCP server with Claude Code
  • Disables Claude Code's built-in auto-memory (lattice0 replaces it)
  • Installs hooks for automatic memory recall and persistence

models download fetches the embedding model (~127MB, bge-small-en-v1.5 via ONNX) and the cross-encoder reranker (~22MB, ms-marco-MiniLM-L-6-v2, quantized int8) from Hugging Face for semantic search and result reranking.

Restart Claude Code after setup. lattice0 is now your agent's memory.

Usage

# Store a memory
lattice0 remember "The API auth tokens rotate every 90 days"

# Store with tags
lattice0 remember -t deadlines "Project deadline is June 15"

# Pipe from stdin
cat meeting_notes.txt | lattice0 remember -t meetings -

# Search memories (hybrid: vector + lexical + graph expansion)
lattice0 recall "auth tokens"

# List all memories
lattice0 recall ""

# See system state
lattice0 status

# Run consolidation (merge duplicates, detect contradictions, write summaries)
lattice0 sleep

# Show retrieval diagnostics
lattice0 trace "auth tokens"

# Manually link two memories
lattice0 link <source_id> <target_id> --type mentions

# Forget a memory
lattice0 forget <id>

# Machine-readable output
lattice0 recall --json "auth tokens"
lattice0 status --json

All commands accept --data-dir to override the storage location. The default data directory follows XDG conventions ($XDG_DATA_HOME/lattice0) and can also be overridden via the LATTICE0_DATA_DIR environment variable.
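For illustration, that resolution order can be sketched as a pure function. This is not lattice0's actual code -- just the precedence described above, made explicit:

```rust
use std::path::PathBuf;

/// Resolution order sketch: --data-dir / LATTICE0_DATA_DIR first,
/// then $XDG_DATA_HOME/lattice0, then the ~/.local/share/lattice0 fallback.
fn resolve_data_dir(override_dir: Option<&str>, xdg_data_home: Option<&str>, home: &str) -> PathBuf {
    if let Some(dir) = override_dir {
        return PathBuf::from(dir); // explicit override wins
    }
    if let Some(xdg) = xdg_data_home {
        return PathBuf::from(xdg).join("lattice0"); // XDG convention
    }
    PathBuf::from(home).join(".local/share/lattice0") // last-resort default
}

fn main() {
    assert_eq!(
        resolve_data_dir(None, None, "/home/user"),
        PathBuf::from("/home/user/.local/share/lattice0")
    );
    assert_eq!(
        resolve_data_dir(None, Some("/data"), "/home/user"),
        PathBuf::from("/data/lattice0")
    );
}
```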

CLI reference

lattice0 setup                       # Full setup: init + MCP + hooks (idempotent)
lattice0 init                        # Initialize data directory and config only
lattice0 remember [--tag X] <text>   # Store a memory (reads stdin if text is "-" or omitted)
lattice0 recall <query>              # Hybrid retrieval (returns subgraph)
lattice0 recall ""                   # List all memories
lattice0 forget <id> [--force]       # Mark a memory as forgotten
lattice0 link <src> <dst> --type X   # Create a typed edge between two memories
lattice0 sleep                       # Force a consolidation cycle
lattice0 status [--json]             # Show system state and last sleep report
lattice0 trace <query>               # Show the query plan and retrieval scores
lattice0 serve                       # Run as MCP server (stdio, JSON-RPC 2.0)
lattice0 server-status [--json]      # Show MCP server process info (PID, uptime, RSS)
lattice0 models download             # Download embedding + reranker models
lattice0 reindex                     # Re-embed memories and rebuild vector index
lattice0 archive list                # List sleep generations
lattice0 archive diff <gen1> <gen2>  # Diff two archive generations

Process management

Claude Code owns the lattice0 server lifecycle. When Claude Code starts, it spawns lattice0 serve as a child process based on the MCP registration from lattice0 setup. When Claude Code exits, the server dies with it. You don't need to start it manually.

# Check if the server is running, see PID, uptime, and memory usage
lattice0 server-status

# Machine-readable version
lattice0 server-status --json

# Kill the server (Claude Code will respawn it next session)
pkill -f 'lattice0.*serve'

# Unregister lattice0 from Claude Code entirely
claude mcp remove lattice0

# Re-register after updating the binary
lattice0 setup

The server persists state to disk after every write operation (atomic rename via a .tmp file), so killing the process at any time is safe -- no data is lost.
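The write-then-rename pattern looks roughly like this (a sketch of the technique, not lattice0's actual code):

```rust
use std::fs;
use std::io::Write;
use std::path::Path;

/// Crash-safe persistence: write the full state to a sibling .tmp file,
/// flush it to disk, then atomically rename it over the real file.
/// Readers never observe a half-written state file.
fn persist_atomically(path: &Path, state: &str) -> std::io::Result<()> {
    let file_name = path.file_name().expect("path must name a file").to_string_lossy();
    let tmp = path.with_file_name(format!("{file_name}.tmp"));
    let mut f = fs::File::create(&tmp)?;
    f.write_all(state.as_bytes())?;
    f.sync_all()?; // ensure bytes are on disk before the rename
    fs::rename(&tmp, path) // atomic replace on POSIX filesystems
}

fn main() -> std::io::Result<()> {
    let path = std::env::temp_dir().join("lattice0_demo_state.json");
    persist_atomically(&path, "{\"memories\":[]}")?;
    assert_eq!(fs::read_to_string(&path)?, "{\"memories\":[]}");
    fs::remove_file(&path)
}
```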

How it works

Storing memories

When you run lattice0 remember "I prefer tabs over spaces", three things happen:

Content addressing (BLAKE3). Your text gets hashed into a unique 64-character hex ID. The same text always produces the same ID, so duplicates are automatically detected. BLAKE3 is a cryptographic hash function -- fast, collision-resistant, and deterministic.
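The dedup mechanics are a few lines of Rust. Note the loud caveat: lattice0 hashes with BLAKE3 to get 64-hex-character IDs; the standard library's DefaultHasher stands in below only so the sketch compiles without external crates, and it is NOT collision-resistant.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

// Stand-in for BLAKE3: deterministic, but not cryptographic.
fn content_id(text: &str) -> String {
    let mut h = DefaultHasher::new();
    text.hash(&mut h);
    format!("{:016x}", h.finish())
}

fn main() {
    let mut store: HashMap<String, String> = HashMap::new();
    // Storing the same text twice produces the same ID, so the
    // second insert replaces the first instead of duplicating it.
    for text in ["I prefer tabs over spaces", "I prefer tabs over spaces"] {
        store.insert(content_id(text), text.to_string());
    }
    assert_eq!(store.len(), 1); // identical text collapses to one entry
}
```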

Embedding (ONNX + bge-small-en-v1.5). Your text gets converted into a list of 384 numbers (a vector) that represents its meaning. The model bge-small-en-v1.5 is a small neural network that does this conversion. ONNX is the file format the model is stored in, and the ort crate runs it natively in Rust without Python. Later, when you search for "indentation preferences", the query also gets converted to 384 numbers, and we find memories whose vectors are close in meaning -- even if they share no words with the query.
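The "close in meaning" comparison is cosine similarity over those vectors. A toy sketch with made-up 3-dimensional values (real embeddings are 384-dimensional):

```rust
/// Cosine similarity: 1.0 means same direction (same meaning),
/// 0.0 means unrelated.
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (na * nb)
}

fn main() {
    let query = [0.9_f32, 0.1, 0.0]; // "indentation preferences" (toy values)
    let close = [0.8_f32, 0.2, 0.1]; // "I prefer tabs over spaces"
    let far   = [0.0_f32, 0.1, 0.9]; // "project deadline is June 15"
    // The semantically related memory scores higher despite sharing no words.
    assert!(cosine(&query, &close) > cosine(&query, &far));
}
```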

Indexing. The memory is inserted into three separate indices simultaneously:

  • Lexical index (tantivy) -- a full-text search engine. Finds exact word matches using BM25 scoring.
  • Vector index (HNSW) -- stores embeddings in a Hierarchical Navigable Small World graph. Instead of comparing against every memory, it navigates a network of connections to find the nearest neighbors fast. This is how meaning-based search works.
  • Bitmap index (roaring) -- compressed bitsets tracking which memories have which tags. Can intersect millions of tag filters in microseconds.
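The bitmap intersection idea can be sketched with plain u64 words standing in for roaring's compressed containers (illustrative only -- roaring is far more memory-efficient on sparse sets):

```rust
/// AND two bitsets word-by-word; bit i set means "memory i has this tag".
fn intersect(a: &[u64], b: &[u64]) -> Vec<u64> {
    a.iter().zip(b).map(|(x, y)| x & y).collect()
}

/// Decode set bits back into memory indices.
fn members(bits: &[u64]) -> Vec<usize> {
    let mut out = Vec::new();
    for (word_idx, word) in bits.iter().enumerate() {
        for bit in 0..64 {
            if word & (1 << bit) != 0 {
                out.push(word_idx * 64 + bit);
            }
        }
    }
    out
}

fn main() {
    // Memories 0, 2, 5 tagged "meetings"; memories 2, 3, 5 tagged "deadlines".
    let meetings  = [0b100101_u64];
    let deadlines = [0b101100_u64];
    // One AND per word answers "which memories carry both tags?".
    assert_eq!(members(&intersect(&meetings, &deadlines)), vec![2, 5]);
}
```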

Searching memories (the retrieval pipeline)

When you run lattice0 recall "auth tokens":

Step 1: Parse. Extract keywords, remove stopwords.

Step 2: Parallel search. Two searches run simultaneously:

  • Lexical search finds memories containing the words "auth" and "tokens"
  • Vector search embeds your query into 384 numbers and finds the nearest memories by meaning

These find different things. Lexical catches exact matches. Vector catches semantic matches (e.g., "authentication credentials" matches "auth tokens" even though the words differ).

Step 3: Reciprocal Rank Fusion (RRF). The two ranked lists are merged. For each memory, its score = sum of 1/(rank + 60) across all lists it appears in. A memory ranked #1 in both lists scores high. A memory in only one list scores lower. This simple formula beats most learned combination methods in published benchmarks.
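A minimal sketch of the fusion step, using 1-based ranks and the k = 60 constant above:

```rust
use std::collections::HashMap;

/// RRF: score(m) = sum over result lists of 1 / (rank + 60).
fn rrf_fuse(lists: &[Vec<&str>]) -> Vec<(String, f64)> {
    let mut scores: HashMap<String, f64> = HashMap::new();
    for list in lists {
        for (i, id) in list.iter().enumerate() {
            // enumerate is 0-based, so rank = i + 1.
            *scores.entry(id.to_string()).or_insert(0.0) += 1.0 / ((i + 1) as f64 + 60.0);
        }
    }
    let mut fused: Vec<_> = scores.into_iter().collect();
    fused.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    fused
}

fn main() {
    // Hypothetical memory IDs for illustration.
    let lexical = vec!["auth-rotation", "token-docs", "login-bug"];
    let vector  = vec!["credentials-note", "auth-rotation", "token-docs"];
    let fused = rrf_fuse(&[lexical, vector]);
    // "auth-rotation" appears high in both lists, so it wins.
    assert_eq!(fused[0].0, "auth-rotation");
}
```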

Step 4: Graph expansion. The top results become seed nodes in the knowledge graph. We walk 1-2 hops outward along typed edges. If memory A mentions entity B, and B elaborates on C, we pull in B and C as additional context. The result is a connected subgraph of related memories, not a flat list.
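The expansion is a bounded breadth-first walk. A sketch over a toy adjacency map (the real edges are typed and directed; edge types are omitted here for brevity):

```rust
use std::collections::{HashMap, HashSet, VecDeque};

/// Walk up to `max_hops` outward from the seed nodes and collect
/// the connected subgraph.
fn expand(seeds: &[&str], edges: &HashMap<&str, Vec<&str>>, max_hops: usize) -> HashSet<String> {
    let mut seen: HashSet<String> = seeds.iter().map(|s| s.to_string()).collect();
    let mut queue: VecDeque<(&str, usize)> = seeds.iter().map(|&s| (s, 0)).collect();
    while let Some((node, depth)) = queue.pop_front() {
        if depth == max_hops {
            continue; // don't walk past the hop budget
        }
        for &next in edges.get(node).into_iter().flatten() {
            if seen.insert(next.to_string()) {
                queue.push_back((next, depth + 1));
            }
        }
    }
    seen
}

fn main() {
    // A mentions B, B elaborates on C, C mentions D.
    let mut edges = HashMap::new();
    edges.insert("A", vec!["B"]);
    edges.insert("B", vec!["C"]);
    edges.insert("C", vec!["D"]);
    let subgraph = expand(&["A"], &edges, 2);
    assert!(subgraph.contains("C"));  // within 2 hops: pulled in as context
    assert!(!subgraph.contains("D")); // 3 hops away: excluded
}
```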

Step 5: Cross-encoder reranking (optional). If a reranker model is available, the top candidates are re-scored by a cross-encoder that reads both the query and each memory together. This is slower than vector search (it processes each pair individually) but significantly more accurate for final ordering. If no reranker model is present, the RRF scores are used directly.

Step 6: Return results with scores, provenance, and relationships.

The knowledge graph

Every memory is a node. Edges between nodes have types:

| Edge type    | Meaning                                              |
|--------------|------------------------------------------------------|
| mentions     | Memory A references entity/memory B                  |
| elaborates   | A expands on B                                       |
| supersedes   | A replaces B (after consolidation merges duplicates) |
| contradicts  | A and B disagree (explicitly marked)                 |
| derived_from | A is a summary or consolidation of B                 |
| co_occurred  | A and B appeared in the same session                 |
| caused       | A led to B                                           |
| about        | A is about entity B                                  |

The graph is stored as a CSR (Compressed Sparse Row) adjacency -- all edges packed contiguously in memory so walking from node to node hits the CPU cache instead of jumping around RAM. Each hop is nanoseconds.
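A sketch of the layout: each node's out-edges occupy a contiguous slice of one targets array, indexed by an offsets array. (Illustrative -- the field names are made up, not lattice0's.)

```rust
/// CSR adjacency: node n's neighbors are targets[offsets[n]..offsets[n+1]].
/// All edges live in one contiguous allocation, so hops stay cache-friendly.
struct Csr {
    offsets: Vec<usize>, // len = node_count + 1
    targets: Vec<u32>,   // all edges, packed back-to-back
}

impl Csr {
    fn neighbors(&self, node: usize) -> &[u32] {
        &self.targets[self.offsets[node]..self.offsets[node + 1]]
    }
}

fn main() {
    // Node 0 -> {1, 2}, node 1 -> {2}, node 2 -> {}.
    let g = Csr {
        offsets: vec![0, 2, 3, 3],
        targets: vec![1, 2, 2],
    };
    assert_eq!(g.neighbors(0), &[1, 2]); // one slice lookup per hop
    assert!(g.neighbors(2).is_empty());
}
```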

Edges are created manually (link command), automatically during ingestion, or by the LLM during sleep consolidation. As the graph grows, retrieval gets richer because you get context, not just matches.

Sleep / consolidation

Consolidation runs when you explicitly call lattice0 sleep, when the MCP server has been idle for 5 minutes (configurable via sleep.idle_timeout_secs), or when a client disconnects (session end). The system sends your memories to an LLM (Claude, via the claude CLI on your machine) and asks it to:

  1. Merge duplicates -- find memories that say the same thing, write one canonical version, link the old ones with supersedes edges. Originals are preserved -- nothing is deleted without explicit user intent.
  2. Detect contradictions -- "I use tabs" vs "I switched to spaces" get marked with a contradicts edge rather than silently storing both.
  3. Promote facts -- forty mentions of "works at Acme Corp" collapse into a single semantic fact with derived_from edges back to the episodes.
  4. Write summaries -- group memories by day or topic, generate concise summaries as new retrievable nodes.
  5. Propose edges -- the LLM suggests relationships the system missed, with justifications stored as edge metadata.
  6. Decay and forget -- memories that are old, rarely accessed, and poorly connected get pruned. The decay model combines exponential recency, access frequency (log-scaled), and graph connectivity. Pinned memories are never forgotten.
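The decay model in item 6 can be sketched as a retention score. The weights and the 30-day constant below are illustrative stand-ins, not lattice0's actual values:

```rust
/// Retention score combining the three signals named above:
/// exponential recency, log-scaled access frequency, graph connectivity.
/// Memories with the lowest scores are candidates for pruning.
fn retention_score(age_days: f64, access_count: u64, degree: usize, pinned: bool) -> f64 {
    if pinned {
        return f64::INFINITY; // pinned memories are never forgotten
    }
    let recency = (-age_days / 30.0).exp();           // exponential decay (30-day scale is made up)
    let frequency = (1.0 + access_count as f64).ln(); // log-scaled access frequency
    let connectivity = (1.0 + degree as f64).ln();    // well-connected nodes resist pruning
    recency + 0.5 * frequency + 0.5 * connectivity    // weights are illustrative
}

fn main() {
    let fresh = retention_score(1.0, 5, 3, false);
    let stale = retention_score(365.0, 0, 0, false);
    // Old, rarely accessed, poorly connected memories score lowest.
    assert!(fresh > stale);
    assert!(retention_score(9999.0, 0, 0, true).is_infinite());
}
```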

The sleep report is persisted and shown in lattice0 status.

What's on disk

Everything lives in ~/.local/share/lattice0/:

  • config.toml -- settings (LLM provider, model paths, decay thresholds)
  • state.json -- all memories + graph edges (loaded into RAM on each command)
  • models/bge-small-en-v1.5/ -- the ONNX embedding model files
  • models/ms-marco-MiniLM-L-6-v2/ -- the ONNX cross-encoder reranker model files

The "RAM-first" design means that during any command, everything is deserialized into memory, indices are rebuilt, work is done at RAM speed, and state is saved back. The MCP server keeps everything resident for the duration of its process.

MCP integration

lattice0 serve runs an MCP server over stdio (JSON-RPC 2.0). The setup command registers it with Claude Code automatically.

The server exposes seven tools to the LLM: remember, recall, forget, link, status, sleep, and trace. State is persisted to disk after every write operation.

Claude Code hooks installed by setup:

  • UserPromptSubmit -- automatically searches lattice0 and injects relevant memories into every prompt
  • PreToolUse -- blocks writes to MEMORY.md files (all memory goes through lattice0)
  • PreCompact -- prompts Claude to save critical context before context window compaction

The MCP server's initialize response includes instructions that tell Claude when and how to use each tool. Setup is idempotent -- run it again after updating the binary and it skips steps already configured.

LLM backends for consolidation

Sleep uses an LLM for all editorial work. Supported backends:

  • Claude Code (subprocess) -- default after setup. Uses the claude CLI already on your machine. No API key needed.
  • Anthropic API -- direct API calls. Set ANTHROPIC_API_KEY in your environment.
  • OpenAI-compatible endpoints -- any endpoint that speaks the OpenAI Chat Completions API: Ollama, vLLM, LM Studio, llama.cpp server, text-generation-inference, etc.

Example config.toml for Ollama:

[llm]
provider = "openai_compatible"
model = "llama3.1"
api_base = "http://localhost:11434"

No API key is needed for local servers. If your endpoint requires one, set the environment variable named in api_key_env (defaults to ANTHROPIC_API_KEY, but you can change it to e.g. OPENAI_API_KEY).

Planned but not yet implemented:

  • Claude Desktop -- MCP client to Claude Desktop.

Configure the provider in config.toml (written by lattice0 setup).

Monitoring

Watch lattice0 in real time from another terminal:

# Refresh status every 2 seconds
watch -n 2 'lattice0 status'

# Watch all memories
watch -n 2 'lattice0 recall ""'

Performance targets

| Metric                                  | Target                                                   |
|-----------------------------------------|----------------------------------------------------------|
| Recall latency (hot tier, no reranking) | p50 < 5ms, p99 < 20ms                                    |
| Recall latency (with reranking)         | p50 < 30ms, p99 < 100ms                                  |
| Ingest latency                          | p50 < 2ms (sync path), embedding runs async              |
| Hot tier capacity                       | ~10M memories on 64GB machine                            |
| Sleep cycle duration                    | Proportional to changes since last sleep, not total size |

These are targets measured against working sets up to 1M memories, not theoretical claims.

Status output

lattice0 status
-------------------------------
  Data dir:         /home/user/.local/share/lattice0
  Memories:         847
  Pinned:           12
  Graph nodes:      893
  Graph edges:      2,141
  Vectors indexed:  847
  Lexical indexed:  847
  Embedder:         loaded
  Reranker:         loaded
  Generation:       42
  Hot tier (est):   2.1 MiB

  Last sleep:       2026-04-14T02:11:03Z
    Duration:       47.2s
    Merged:         12
    Contradictions: 2
    Promoted:       4
    New edges:      31
    Forgotten:      7
    Summaries:      9
    Generation:     gen-0042 (parent gen-0041)

What is implemented (v0.3)

Working end-to-end:

  • init, setup, remember, recall, forget, link, status, trace, sleep, serve, server-status, reindex, models download, archive list, archive diff
  • Hybrid retrieval: tantivy BM25 + HNSW vector search + RRF fusion + graph expansion + cross-encoder reranking
  • Full MCP server with seven tools, JSON-RPC 2.0 over stdio -- including live consolidation via sleep
  • One-command Claude Code integration (MCP registration, hooks, auto-memory disable)
  • ONNX embedding (bge-small-en-v1.5, 384-dim) with automatic model download
  • ONNX cross-encoder reranking (ms-marco-MiniLM-L-6-v2, quantized int8) with automatic model download
  • Full sleep consolidation pipeline: duplicate merging, contradiction detection, semantic promotion, summary writing, edge proposal, exponential decay
  • Sleep works from CLI (lattice0 sleep), MCP (sleep tool), idle timeout (auto-trigger after inactivity), and session end (auto-trigger on client disconnect)
  • Embedding backfill: memories stored before the embedder was available are automatically embedded during sleep
  • Content-addressed storage with BLAKE3 hashing
  • Roaring bitmap filters for metadata
  • Archive browsing: list generations, diff between generations
  • 41 unit tests across all crates
  • Makefile for build/install/reset

Not yet wired:

  • Warm tier mmap archives (archive format designed but not persisted to separate files)
  • Sketch-based routing (cuckoo filters, count-min sketches, and HyperLogLog are currently backed by exact data structures as placeholders -- correct but less memory-efficient at scale)
  • Entity extraction at ingest time (currently only during sleep via LLM)
  • Claude Desktop MCP client provider (stub -- use claude_code, anthropic_api, or openai_compatible instead)
  • PageRank importance scoring during consolidation
  • Published benchmarks

Hardware requirements

This tool assumes you have a real machine. If you are running on a 4GB VPS or a Chromebook, use something else.

|         | Minimum                | Recommended                       |
|---------|------------------------|-----------------------------------|
| RAM     | 16 GB                  | 32-64 GB DDR5                     |
| Storage | NVMe SSD               | PCIe 4.0+ NVMe                    |
| CPU     | 4 cores, AVX2 or NEON  | 8+ cores, AVX-512 or Apple Silicon |

lattice0 is a single-node system. No clustering, no sharding, no multi-tenancy. One user, one lattice.

Project structure

crates/
  core/       Memory chunks, knowledge graph, edge types, config
  index/      Composite index: HNSW vectors, tantivy lexical, roaring bitmaps, sketches
  embed/      ONNX embedding (bge-small-en-v1.5) and cross-encoder reranking
  retrieve/   Query planning, RRF fusion, graph expansion, retrieval pipeline
  llm/        LLM provider trait + Claude Code, Anthropic API, OpenAI-compatible backends
  sleep/      Consolidation engine: duplicate merge, contradiction, promotion, decay
  mcp/        MCP server (JSON-RPC 2.0, stdio)
  cli/        Binary entry point and command implementations

License

MIT
