Releases: AutoCookies/pomaicache
Pomai Cache V1.3.1 Release
Remove Palloc
Pomai Cache V1.3.0
Add token cache
Pomai Cache V1.2.0 Release
Uses custom HTTP handling and palloc for memory allocation; single-threaded
Pomai Cache V1.1.0 Release
Pomai Cache v1.1.0 Release Notes
Release Date: February 28, 2026
Tag: pomaicache-v1.1.0
Codename: AI-First
Summary
Pomai Cache v1.1.0 transforms the project from a Redis-compatible KV cache into a purpose-built AI-first cache for inference pipelines. This release adds semantic similarity search, pipeline-aware cascade invalidation, token-economics-aware eviction, streaming response caching, a compression/quantization engine, a Python SDK, and comprehensive benchmarks -- all implemented as zero-dependency native C++.
New Features
1. Semantic Similarity Cache
- `VectorIndex` with brute-force ANN search supporting Cosine, L2, and DotProduct distance metrics
- SSE2 SIMD-accelerated distance computation (~0.7 GFLOP/s single-thread)
- Int8 vector quantization with 4x memory reduction (MAE < 0.0005/dim)
- Float16 quantization with 2x memory reduction (MAE < 0.00003/dim)
- New commands: `AI.SIM.PUT`, `AI.SIM.GET`
- Enables cache hits on paraphrased prompts -- benchmarks show 87-100% hit rates where exact-match caches get 0%
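The brute-force cosine scan that powers this kind of similarity lookup can be sketched in a few lines of NumPy. This is illustrative only -- the actual `VectorIndex` is SSE2-accelerated C++ with optional quantization:

```python
import numpy as np

def cosine_search(index: np.ndarray, query: np.ndarray, top_k: int = 1):
    """Brute-force ANN scan: normalize rows once, then a single
    matrix-vector product yields every cosine similarity at once."""
    norms = np.linalg.norm(index, axis=1, keepdims=True)
    sims = (index / norms) @ (query / np.linalg.norm(query))
    order = np.argsort(-sims)[:top_k]          # best matches first
    return [(int(i), float(sims[i])) for i in order]

# Toy index of three 2-d vectors; the query is closest to row 0.
index = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
print(cosine_search(index, np.array([0.9, 0.1])))  # top hit is row 0
```

A similarity cache accepts the top hit only when its score clears a threshold; below it, the request falls through to real inference.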
2. Pipeline-Aware Cascade Invalidation
- `DepGraph` (directed acyclic graph) tracks parent-child artifact dependencies
- Invalidating a parent embedding automatically cascades to all downstream responses, rerank buffers, and RAG chunks
- New commands: `AI.PUT ... DEPENDS_ON parent1 parent2`, `AI.INVALIDATE CASCADE <key>`
- Prevents stale derived artifacts from surviving source invalidation
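The cascade semantics amount to a breadth-first walk over the dependency DAG. A toy sketch (not the native `DepGraph` code; the key names are hypothetical, following the `emb`/`rsp`/`rrk` prefixes mentioned later in these notes):

```python
from collections import defaultdict, deque

class DepGraph:
    """Toy parent -> children DAG: invalidating a key dooms every
    artifact transitively derived from it."""
    def __init__(self):
        self.children = defaultdict(set)

    def depends_on(self, child: str, *parents: str) -> None:
        for p in parents:
            self.children[p].add(child)

    def invalidate_cascade(self, key: str) -> set:
        doomed, queue = {key}, deque([key])
        while queue:                       # BFS over downstream edges
            for child in self.children[queue.popleft()]:
                if child not in doomed:
                    doomed.add(child)
                    queue.append(child)
        return doomed

g = DepGraph()
g.depends_on("rsp:answer", "emb:doc1")     # response derived from embedding
g.depends_on("rrk:buffer", "rsp:answer")   # rerank buffer derived from response
print(sorted(g.invalidate_cascade("emb:doc1")))
# ['emb:doc1', 'rrk:buffer', 'rsp:answer']
```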
3. Token-Economics-Aware Eviction
- Extended `ArtifactMeta` with `inference_tokens`, `inference_latency_ms`, `dollar_cost`
- `PomaiCostPolicy` now factors inference cost into eviction benefit scoring
- Budget management: set a $/hour spend cap and the cache prioritizes high-cost artifacts
- New commands: `AI.COST.REPORT`, `AI.BUDGET <dollars_per_hour>`
- Cost tracking reports total dollars saved, tokens saved, and latency saved
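The notes don't publish the actual scoring formula, so the following is a hypothetical illustration of how inference cost might fold into a per-byte eviction benefit. The `$0.01` per second of saved latency is an invented weight, not a project constant:

```python
def eviction_benefit(dollar_cost: float, inference_latency_ms: float,
                     p_reuse: float, size_bytes: int) -> float:
    """Hypothetical benefit score: expected dollars (inference cost plus
    a latency penalty) saved per byte retained. Higher scores survive
    eviction longer."""
    latency_dollars = inference_latency_ms / 1000.0 * 0.01  # assumed weight
    return p_reuse * (dollar_cost + latency_dollars) / max(size_bytes, 1)

# A cheap, rarely reused artifact scores far below a costly, hot one
# of the same size, so the expensive one is kept.
cold = eviction_benefit(0.0001, 5, 0.05, 4096)
hot = eviction_benefit(0.02, 800, 0.6, 4096)
print(hot > cold)  # True
```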
4. Streaming & Chunked Response Caching
- Incrementally store LLM streaming responses as they arrive token-by-token
- Retrieve partial or completed streams
- New commands: `AI.STREAM.BEGIN`, `AI.STREAM.APPEND`, `AI.STREAM.END`, `AI.STREAM.GET`
- Prevents re-inference when identical streaming requests arrive concurrently
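The begin/append/end/get lifecycle can be illustrated with a minimal in-memory sketch (not the server implementation):

```python
class StreamCache:
    """Toy model of token-by-token response caching: chunks accumulate
    under a key, and readers may fetch a partial stream before it ends."""
    def __init__(self):
        self._chunks: dict[str, list[str]] = {}
        self._done: set[str] = set()

    def begin(self, key: str) -> None:
        self._chunks[key] = []

    def append(self, key: str, token: str) -> None:
        self._chunks[key].append(token)

    def end(self, key: str) -> None:
        self._done.add(key)

    def get(self, key: str) -> tuple[str, bool]:
        # Returns the stream so far and whether it is complete.
        return "".join(self._chunks.get(key, [])), key in self._done

cache = StreamCache()
cache.begin("rsp:abc")
cache.append("rsp:abc", "Hello, ")
print(cache.get("rsp:abc"))    # ('Hello, ', False) -- partial read
cache.append("rsp:abc", "world")
cache.end("rsp:abc")
print(cache.get("rsp:abc"))    # ('Hello, world', True)
```

A second identical request arriving mid-stream can attach to the partial entry instead of triggering a duplicate inference.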
5. Compression & Quantization Engine
- `CompressionEngine` with Run-Length Encoding (RLE) and Delta compression
- Auto-selects the best compression strategy per payload
- Float32-to-Float16 and Float32-to-Int8 embedding quantization
- Prefix deduplication for prompt families
- Compression ratio tracked in `AI.STATS`
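Two of the techniques above are simple enough to sketch: RLE and symmetric Int8 quantization. These are illustrative only -- the native `CompressionEngine` also handles Delta, Float16, and prefix deduplication:

```python
import numpy as np

def rle_encode(data: bytes) -> list[tuple[int, int]]:
    """Run-length encode as (byte, count) pairs -- wins on repetitive payloads."""
    out, i = [], 0
    while i < len(data):
        j = i
        while j < len(data) and data[j] == data[i] and j - i < 255:
            j += 1
        out.append((data[i], j - i))
        i = j
    return out

def quantize_int8(v: np.ndarray) -> tuple[np.ndarray, float]:
    """Symmetric float32 -> int8 quantization: 4x smaller; the scale
    factor is kept alongside the payload for dequantization."""
    scale = float(np.abs(v).max()) / 127.0 or 1.0
    return np.round(v / scale).astype(np.int8), scale

print(rle_encode(b"aaaabbc"))  # [(97, 4), (98, 2), (99, 1)]
q, s = quantize_int8(np.array([0.5, -1.0, 0.25], dtype=np.float32))
print(np.allclose(q * s, [0.5, -1.0, 0.25], atol=0.01))  # True
```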
6. Python SDK (pomai-cache)
- Synchronous (`PomaiCache`) and async (`AsyncPomaiCache`) clients
- Full coverage of all AI-first commands: `sim_put`, `sim_get`, `stream_begin/append/end/get`, `cost_report`, `budget`, `invalidate_cascade`
- Native NumPy and PyTorch vector conversion
- `@memoize` decorator for transparent function-level caching
- OpenTelemetry integration for distributed tracing
- Installable via pip: `pip install pomai-cache`
Ported & Rebranded Data Structures (from Dragonfly)
These were rewritten from scratch as zero-dependency Pomai Cache native code:
- `BloomFilter` / `ScalableBloomFilter` -- probabilistic membership testing for content-based deduplication fast paths
- `CountMinSketch` / `MultiWindowSketch` -- space-efficient frequency estimation integrated into the eviction policy's `p_reuse` calculation
- FNV-1a hashing -- custom hash function replacing all external hash dependencies
The third_party/dragonfly directory has been removed. All functionality is native.
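FNV-1a is small enough to show in full. A 64-bit Python version, for reference (the native implementation is C++, but the algorithm is identical):

```python
def fnv1a_64(data: bytes) -> int:
    """FNV-1a 64-bit: XOR each byte into the hash, then multiply by the
    FNV prime, keeping the result in 64 bits."""
    h = 0xcbf29ce484222325          # FNV-1a 64-bit offset basis
    for b in data:
        h ^= b
        h = (h * 0x100000001b3) & 0xFFFFFFFFFFFFFFFF  # FNV prime, mod 2^64
    return h

print(hex(fnv1a_64(b"")))   # 0xcbf29ce484222325 (empty input = offset basis)
print(hex(fnv1a_64(b"a")))  # 0xaf63dc4c8601ec8c (standard test vector)
```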
Benchmarks
New: Vector Cache Benchmark (pomai_cache_vector_bench)
Comprehensive benchmark suite comparing Pomai Cache against Redis+RediSearch, Milvus, Qdrant, Weaviate, and Pinecone across:
| Metric | dim128 (1K) | dim768 (10K) | dim1536 (10K) |
|---|---|---|---|
| Insert throughput | 28K vec/s | 3K vec/s | 3K vec/s |
| Search p50 latency | 410 us | 22 ms | 45 ms |
| Memory per vector | 615 B | 3.2 KB | 6.2 KB |
End-to-end similarity cache vs exact-match cache:
| Scenario | Exact-Match Hit% | Similarity Hit% | Boost |
|---|---|---|---|
| Identical prompts | 0% | 100% | +100% |
| Slight rephrase | 0% | 100% | +100% |
| Moderate rephrase | 0% | 100% | +100% |
| Heavy rephrase | 0% | 87.8% | +87.8% |
Outputs JSON results to vector_bench_results.json for CI integration.
Server Improvements
- `INFO` command now aggregates engine stats from all shards as a proper RESP bulk string
- `SET` command correctly parses `EX` (seconds) and `PX` (milliseconds) TTL options
- `SET` propagates engine errors (e.g., oversized values) instead of always returning `+OK`
- `CONFIG GET POLICY` returns the active policy name
- `AiArtifactCache` is now a persistent member of `EngineShard` (shared-nothing per-core), fixing statelessness bugs with `AI.STATS` and `AI.INVALIDATE`
- `AI.STATS` reports all new metrics: similarity queries/hits, cascade invalidations, stream counts, compression ratio, cost savings
Bug Fixes
- Fixed `sim_put` hardcoding the artifact type to "embedding", causing type-mismatch rejection when storing other artifact types via the similarity API
- Fixed integration test failures from missing/incorrect RESP formatting in the `INFO`, `CONFIG GET POLICY`, and `AI.GET` handlers
- Fixed `AI.INVALIDATE` not cleaning up vector index and dependency graph entries for removed keys
Test Coverage
- 35+ test cases across 4 test suites, all passing
- 21 new test cases for AI-first features covering:
- VectorIndex insert/search/remove, cosine/L2/dot-product metrics, Int8 quantization
- DepGraph edges, cascade traversal, node removal
- CompressionEngine RLE, Delta, Float16/Int8 quantization round-trips
- Streaming begin/append/end/get lifecycle
- Token economics cost tracking and budget enforcement
- Pipeline cascade invalidation end-to-end
- AI.STATS metric reporting completeness
File Summary
New files (17):
- `include/pomai_cache/bloom_filter.hpp`, `src/ds/bloom_filter.cpp`
- `include/pomai_cache/count_min_sketch.hpp`, `src/ds/count_min_sketch.cpp`
- `include/pomai_cache/vector_index.hpp`, `src/ds/vector_index.cpp`
- `include/pomai_cache/dep_graph.hpp`, `src/ds/dep_graph.cpp`
- `include/pomai_cache/compression.hpp`, `src/ds/compression.cpp`
- `bench/vector_cache_bench.cpp`
- `sdk/python/pyproject.toml`
- `sdk/python/pomai_cache/__init__.py`
- `sdk/python/pomai_cache/client.py`
- `sdk/python/pomai_cache/resp.py`
- `sdk/python/pomai_cache/decorators.py`
Modified files (7):
- `CMakeLists.txt` -- added new source files and vector benchmark target
- `include/pomai_cache/ai_cache.hpp` -- extended with similarity, streaming, cost, cascade APIs
- `include/pomai_cache/engine_shard.hpp` -- persistent `AiArtifactCache` per shard
- `include/pomai_cache/policy.hpp` -- frequency estimate in `CandidateView`
- `src/server/ai_cache.cpp` -- implemented all new AI-first methods
- `src/server/server_main.cpp` -- new command handlers, fixed RESP formatting
- `src/policy/policies.cpp` -- CMS frequency in benefit scoring
- `tests/test_ai_cache.cpp` -- 21 new test cases
Removed:
- `third_party/dragonfly/` -- entire directory removed; all useful code rewritten natively
Upgrade Notes
- Backward compatible -- all v1.0.0 RESP commands and AI commands work unchanged
- New commands are additive; existing clients require no changes
- Python SDK is a new optional component; install with `pip install pomai-cache`
- The `pomai_cost` eviction policy now uses frequency estimates from CountMinSketch; behavior is improved but slightly different from v1.0.0's pure heuristic scoring
Pomai Cache V1.0.0
Motivation
Provide a local, best-effort AI Artifact Cache layer for embeddings, prompts, RAG chunks, rerank buffers and responses with cache semantics (lossy, TTL/capacity-driven, warm restart).
Use deterministic canonical keys and a content-addressed blob indirection to enable deduplication and compact metadata handling.
Expose AI-native commands over RESP so existing clients like redis-cli can store, fetch, invalidate, and introspect AI artifacts.
Description
Added a new AiArtifactCache layer (include/pomai_cache/ai_cache.hpp, src/server/ai_cache.cpp) that stores typed ArtifactMeta + payloads with a blob:<content_hash> indirection, best-effort refcounts, per-epoch/model/prefix bounded indexes, and introspection (stats, top_hot, top_costly, explain).
Extended the RESP server (src/server/server_main.cpp) with AI commands: AI.PUT, AI.GET, AI.MGET, AI.EMB.PUT, AI.EMB.GET, AI.INVALIDATE EPOCH|MODEL|PREFIX, AI.STATS, AI.TOP HOT|COSTLY, and AI.EXPLAIN, and instantiated AiArtifactCache in the server loop.
Implemented deterministic canonical key helpers (emb/prm/rag/rrk/rsp) and owner TTL defaults + type-based miss_cost guidance; updated engine owner miss-cost defaults in src/engine/engine.cpp to bias policy for AI artifact owners.
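A hypothetical sketch of how deterministic canonical keys and content-addressed blob indirection fit together. The `emb:` and `blob:` prefixes come from these notes; the exact key layout and hash truncation below are illustrative assumptions:

```python
import hashlib

def canonical_emb_key(model: str, epoch: int, text: str) -> str:
    """Deterministic inputs -> deterministic key, so identical requests
    always resolve to the same cache entry. Layout is illustrative."""
    digest = hashlib.sha256(text.encode()).hexdigest()[:16]
    return f"emb:{model}:e{epoch}:{digest}"

def blob_key(payload: bytes) -> str:
    """Content-addressed indirection: identical payloads hash to the
    same blob key, so duplicate blobs are stored once and refcounted."""
    return "blob:" + hashlib.sha256(payload).hexdigest()[:16]

k1 = canonical_emb_key("minilm", 3, "hello world")
k2 = canonical_emb_key("minilm", 3, "hello world")
print(k1 == k2)  # True -- deterministic key
print(blob_key(b"\x00" * 8) == blob_key(b"\x00" * 8))  # True -- dedup
```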
Added AI benchmark bench/ai_artifact_bench.cpp and wired up CMake (CMakeLists.txt) plus tests (tests/test_ai_cache.cpp, extended tests/test_integration.cpp) and docs (docs/AI_CACHE.md, docs/AI_COMMANDS.md, docs/INVALIDATION.md, docs/BLOB_DEDUP.md), and updated README.md with quickstart examples and recommended AI config.
Testing
Built the project (cmake -S . -B build -DCMAKE_BUILD_TYPE=Debug && cmake --build build -j) and all test targets compiled successfully. ✅
Ran the full test suite (ctest --test-dir build --output-on-failure), and all tests including the new AI unit and integration tests passed. ✅
Built and executed the AI benchmark target (pomai_cache_ai_bench) which produced a JSON summary (ops/s, p50/p95/p99/p999, hit_rate) for the embedding workload; the bench target compiles and runs but longer runs may require tuning in constrained CI environments (a longer timeout was observed during an extended run).
What's Changed
- Codex-generated pull request by @AutoCookies in #1
- Phase 2 hardening: netbench, robust tests, latency controls, and CI upgrades by @AutoCookies in #2
- Codex-generated pull request by @AutoCookies in #4
- Add AI artifact cache layer and AI.* RESP commands (embeddings/prompts/RAG/dedup/invalidation) by @AutoCookies in #5
New Contributors
- @AutoCookies made their first contribution in #1
Full Changelog: https://github.com/AutoCookies/pomaicache/commits/pomaicache-v1.0.0