Catch hallucinations, drift, and unused context before your users do.
Groundedness scoring for RAG pipelines and AI coding agents, with a one-call path to upgrade data quality — from messy input files to fully generated markdown and knowledge graphs — plus a high-performance open-source retrieval engine.
Grade your RAG pipelines and harnesses on real data.
Quickstart • Trace • Upgrade Data Quality • Upgrade Retrieval • Trace Reference • Full Tutorial
```bash
pip install latence
export LATENCE_API_KEY="lat_..."
```

```python
from latence import Latence

client = Latence()  # reads LATENCE_API_KEY from the environment

r = client.experimental.trace.rag(
    response_text="Paris is the capital of France.",
    raw_context="France's capital city is Paris.",
)
print(r.score, r.band, r.context_coverage_ratio, r.context_unused_ratio)
```

That's it. You now know whether the answer was grounded, how much of your retrieved context was actually used, and whether to trust it.
Three lanes, one mental model. Pick the one that matches what your app is doing right now.
```python
from latence import Latence

client = Latence()

r = client.experimental.trace.rag(
    response_text="Paris is the capital of France.",
    raw_context="France's capital city is Paris.",
)
print(r.score)                   # 0.0 - 1.0
print(r.band)                    # "green" | "amber" | "red" | "unknown"
print(r.context_coverage_ratio)  # how much of the answer is grounded in context
print(r.context_unused_ratio)    # how much retrieved context was dead weight
```

Chain turns with the opaque `next_session_state` handoff. The SDK never forces you to track session internals.
```python
turn1 = client.experimental.trace.code(
    response_text="def add(a, b): return a + b",
    raw_context="# utils.py\ndef sub(a, b): return a - b",
    response_language_hint="python",
)
turn2 = client.experimental.trace.code(
    response_text="def mul(a, b): return a * b",
    raw_context="# utils.py\ndef sub(a, b): return a - b",
    response_language_hint="python",
    session_state=turn1.next_session_state,  # chain turns
)
print(turn2.band)
print(turn2.session_signals.recommendation)  # "continue" | "re_anchor" | "fresh_chat"
```

Hosted Trace pricing is $0.008/request by default. For the higher-cost quality mode, pass `profile="quality"` to `trace.rag(...)` or `trace.code(...)`; quality requests bill at $0.016/request.
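For example, to opt a single high-stakes call into quality mode (same call as the quickstart; only `profile` changes):

```python
# Same call as above; profile="quality" opts this one request into the
# higher-accuracy mode billed at $0.016/request.
r = client.experimental.trace.rag(
    response_text="Paris is the capital of France.",
    raw_context="France's capital city is Paris.",
    profile="quality",
)
```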
Stateless, CPU-only, sub-ms on the pod. Safe to call on every keystroke.
```python
rollup = client.experimental.trace.rollup(turns=[turn1, turn2])
print(rollup.noise_pct)              # fraction of turns flagged as noise
print(rollup.retrieval_waste_pct)    # fraction of retrieved context left unused
print(rollup.model_drift_pct)        # fraction of turns with drift
print(rollup.reason_code_histogram)  # why the turns failed, aggregated
print(rollup.risk_band_trail)        # per-turn band, chronological
print(rollup.recommendations)        # actionable session-level advice
```

The numbers above are not diagnostics. They are routing rules:
| Signal | Meaning | Next step |
|---|---|---|
| `band` amber/red, low `context_coverage_ratio` | The answer isn't grounded in what you retrieved. | Upgrade data quality — your upstream documents are the bottleneck. |
| High `context_unused_ratio`, `retrieval_waste_pct` > 30% | You retrieved the wrong chunks. | Upgrade retrieval — your retriever is the bottleneck. |
| `session_signals.recommendation` = `"re_anchor"` / `"fresh_chat"` on the code lane | Session drift is compounding. | Reset the agent's context on the next turn. |
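As a sketch, these rules can live in a small router in your session loop. The returned action names are placeholders for your own handlers, and the 0.30 threshold mirrors the 30% from the table:

```python
# Illustrative routing on Trace signals; the returned strings are placeholders
# for your own application logic, not SDK behavior.
def route(turn, rollup):
    if turn.session_signals.recommendation in ("re_anchor", "fresh_chat"):
        return "reset_agent_context"       # session drift is compounding
    if rollup.retrieval_waste_pct > 0.30:  # > 30% of retrieved context unused
        return "upgrade_retrieval"         # the retriever is the bottleneck
    if turn.band in ("amber", "red"):
        return "upgrade_data_quality"      # upstream documents are the bottleneck
    return "continue"
```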
Full reference: Trace docs and SDK tutorial §18.
Every method above has an awaitable twin under `AsyncLatence`:

```python
from latence import AsyncLatence

async with AsyncLatence() as client:
    r = await client.experimental.trace.rag(
        response_text="Paris is the capital of France.",
        raw_context="France's capital city is Paris.",
    )
```

Trace is showing low coverage or amber/red bands? The model is rarely the problem. It's usually the upstream data: un-OCR'd PDFs, missing entities, unresolved references. The Latence Data Intelligence Pipeline cleans that up in one call.
```python
job = client.pipeline.run(files=["contract.pdf"])
pkg = job.wait_for_completion()

print(pkg.document.markdown)                        # clean markdown
print(pkg.entities.summary)                         # {"total": 142, "by_type": {...}}
print(pkg.knowledge_graph.summary.total_relations)  # 87
pkg.download_archive("./results.zip")
```

Smart defaults: OCR → entity extraction → relation extraction. Configure any step explicitly:
```python
job = client.pipeline.run(
    files=["contract.pdf"],
    steps={
        "ocr": {"mode": "performance"},
        "redaction": {"mode": "balanced", "redact": True},
        "extraction": {"label_mode": "hybrid", "threshold": 0.3},
        "relation_extraction": {"resolve_entities": True},
    },
)
```

Every run returns a structured `DataPackage`:
- `pkg.document` — markdown + per-page layout (OCR)
- `pkg.entities` — entity list + summary (extraction)
- `pkg.knowledge_graph` — entities + relations + graph summary (relation extraction)
- `pkg.redaction` — cleaned text + PII list (redaction)
- `pkg.compression` — compressed text + ratio (compression)
- `pkg.quality` — per-stage confidence, latency, cost
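To close the loop, here's a sketch that re-scores the same answer against the cleaned markdown using only the calls shown earlier; `answer` and `messy_context` are placeholders for your real data:

```python
# Re-score the same answer before and after the pipeline upgrade.
# `answer` and `messy_context` are placeholders; pkg comes from the run above.
answer = "..."         # your model's answer
messy_context = "..."  # the raw pre-pipeline text

before = client.experimental.trace.rag(response_text=answer, raw_context=messy_context)
after = client.experimental.trace.rag(response_text=answer, raw_context=pkg.document.markdown)
print(before.context_coverage_ratio, "->", after.context_coverage_ratio)
```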
Power users: the typed PipelineBuilder accepts YAML and validates client-side. See docs/pipelines.md for the full orchestration reference (DAG execution, resumable jobs, progress callbacks).
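As a sketch only — the method names below (`from_yaml`, `validate`) are our assumptions, not confirmed API; docs/pipelines.md has the real surface:

```python
# Assumed API: the import path, from_yaml(), and validate() are illustrative
# names only; consult docs/pipelines.md for the actual PipelineBuilder surface.
from latence import PipelineBuilder

builder = PipelineBuilder.from_yaml("pipeline.yaml")  # load a YAML step spec
builder.validate()                                    # client-side validation
```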
Feed pipeline outputs into client.experimental.dataset_intelligence_service to build corpus-wide knowledge graphs, ontologies, and enriched feature spaces with incremental ingestion:
| Tier | Method | What it does |
|---|---|---|
| 1 | `di.enrich()` | Semantic feature vectors (CPU-only, fast) |
| 2 | `di.build_graph()` | Entity resolution, knowledge graph, link prediction |
| 3 | `di.build_ontology()` | Concept clustering, hierarchy induction |
| Full | `di.run()` | All three tiers sequentially |
See docs/dataset_intelligence.md.
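A minimal sketch of the tiers above; the method names come from the table, but the `documents=` parameter is our assumption, not a documented signature:

```python
# Method names are from the tier table; the documents= parameter is assumed.
di = client.experimental.dataset_intelligence_service

features = di.enrich(documents=[pkg.document.markdown])    # Tier 1: feature vectors
graph = di.build_graph(documents=[pkg.document.markdown])  # Tier 2: entity resolution + KG
everything = di.run(documents=[pkg.document.markdown])     # Full: all three tiers
```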
If Trace keeps flagging a high context_unused_ratio, or the session rollup shows retrieval_waste_pct > 30%, your model isn't the problem — your retrieval engine is shipping the wrong chunks.
→ ColSearch — high-performance late-interaction and multimodal search engine
ColSearch is our late-interaction retrieval engine: token-level ColBERT recall, native multimodal search over PDFs and images, and a drop-in replacement for the retrieval step in your RAG stack. Wire it in and `context_unused_ratio` collapses.
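One way to check that claim with calls you've already seen — `retrieve_old` and `retrieve_colsearch` below are placeholders for your current retriever and a ColSearch-backed one:

```python
# Placeholders: each retriever returns a list of context chunks for a query.
def retrieve_old(query): return []
def retrieve_colsearch(query): return []

for retrieve in (retrieve_old, retrieve_colsearch):
    chunks = retrieve("What is the capital of France?")
    r = client.experimental.trace.rag(
        response_text="Paris is the capital of France.",
        raw_context="\n\n".join(chunks),
    )
    print(retrieve.__name__, r.context_unused_ratio)  # lower is better
```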
```python
from latence import (
    LatenceError, AuthenticationError, InsufficientCreditsError,
    RateLimitError, JobError, JobTimeoutError, TransportError,
)

try:
    r = client.experimental.trace.rag(
        response_text="Paris is the capital of France.",
        raw_context="France's capital city is Paris.",
    )
except AuthenticationError:
    ...  # 401
except InsufficientCreditsError:
    ...  # 402
except RateLimitError as e:
    ...  # 429, retry after e.retry_after
except JobError as e:
    ...  # pipeline job failed; check e.is_resumable
except TransportError:
    ...  # network / DNS
```

The SDK retries on 429 and 5xx with exponential backoff (default 2 retries, respects `Retry-After`).
```bash
export LATENCE_API_KEY="lat_your_key"
```

```python
from latence import Latence
import latence

client = Latence(
    api_key="lat_...",       # or LATENCE_API_KEY env var
    base_url="https://...",  # or LATENCE_BASE_URL env var
    timeout=60.0,            # request timeout (default: 60s)
    max_retries=2,           # retry attempts (default: 2)
)

latence.setup_logging("DEBUG")  # logs every HTTP request/response
```

| | |
|---|---|
| Trace reference | docs/trace.md — parameters and full response schema |
| Full tutorial | SDK_TUTORIAL.md — every service, every parameter |
| API docs | docs.latence.ai |
| Portal | app.latence.ai |
MIT License • latence.ai