Gateframe


Behavioral validation for LLM outputs in production workflows.

Schema validation ("does this JSON have the right keys?") is a solved problem. Instructor, Pydantic AI, and similar tools handle it well. gateframe solves a different problem: does this output behave correctly given the context it was generated in? Does it stay within the decision boundaries this workflow requires? When it fails, does it fail in a way your system can recover from, or does it fail silently?

from pydantic import BaseModel
from gateframe import (
    ValidationContract,
    StructuralRule,
    BoundaryRule,
    ConfidenceRule,
    AllowedValues,
    FailureMode,
)

class TriageDecision(BaseModel):
    action: str
    priority: str
    confidence: float
    rationale: str

contract = ValidationContract(
    name="triage_decision",
    rules=[
        StructuralRule(schema=TriageDecision),
        BoundaryRule(
            check=AllowedValues("action", {"treat", "observe", "refer", "discharge"}),
            name="action_boundary",
            failure_message="Action must be one of: treat, observe, refer, discharge.",
        ),
        ConfidenceRule(field="confidence", minimum=0.7),
    ],
)

result = contract.validate({
    "action": "prescribe",       # not in allowed set -> HARD_FAIL
    "priority": "high",
    "confidence": 0.52,          # below 0.7 -> SOFT_FAIL
    "rationale": "...",
})

print(result.passed)             # False
for failure in result.failures:
    print(f"[{failure.failure_mode.value}] {failure.rule_name}: {failure.message}")
# [hard_fail] action_boundary: Action must be one of: treat, observe, refer, discharge.
# [soft_fail] confidence_check: Confidence 0.52 is below minimum threshold 0.7.

The problem

Most LLM pipelines validate outputs the same way: parse the JSON, check the schema, move on. That catches structural errors. It misses the errors that actually cause production incidents:

  • A model recommends an action that is structurally valid but outside its authorized scope
  • Confidence is low but the workflow proceeds as if it weren't
  • A soft failure in step 2 silently degrades the reliability of everything downstream
  • A validation failure gives you False and no context for debugging

gateframe makes these failures explicit, structured, and recoverable.


Failure modes

gateframe distinguishes four failure modes instead of a binary pass/fail.

HARD_FAIL: Stop. The output violates a hard constraint that cannot be auto-recovered.

# Model chose an action outside its authorized scope
BoundaryRule(
    check=AllowedValues("action", {"treat", "observe", "refer"}),
    failure_mode=FailureMode.HARD_FAIL,  # default for BoundaryRule
)

SOFT_FAIL: Flag and continue with degraded confidence. Something is off, but not critical enough to halt.

# Model confidence is low, continue but track the degradation
ConfidenceRule(
    field="confidence",
    minimum=0.7,
    failure_mode=FailureMode.SOFT_FAIL,  # default for ConfidenceRule
)

RETRY: Re-prompt with the failure context. The output is likely fixable by trying again.

# Malformed output that might parse correctly on a second attempt
StructuralRule(schema=MyOutput, failure_mode=FailureMode.RETRY)
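gateframe's own retry handling isn't shown here. As an illustration only, a generic retry loop that feeds failure messages back into the next prompt might look like the sketch below; the `generate` and `validate` callables are placeholders standing in for your model call and contract, not gateframe APIs.

```python
# Hypothetical retry loop. `generate(prompt)` produces an output;
# `validate(output)` returns (passed, failure_messages). Both are
# placeholders, not part of gateframe's API.
def retry_with_feedback(generate, validate, prompt, max_attempts=3):
    for attempt in range(max_attempts):
        output = generate(prompt)
        passed, messages = validate(output)
        if passed:
            return output
        # Append the failure context so the model can self-correct.
        prompt = prompt + "\nPrevious attempt failed: " + "; ".join(messages)
    raise RuntimeError(f"validation failed after {max_attempts} attempts")
```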

SILENT_FAIL: The most dangerous kind. The output looks valid but violates a semantic or boundary rule. gateframe makes these visible instead of letting them pass through undetected.

SemanticRule(
    check=lambda output, **ctx: output["severity"] != "low" or output["escalated"] is False,
    failure_mode=FailureMode.SILENT_FAIL,
    failure_message="Low-severity cases should not be auto-escalated.",
)

Multi-step workflow validation

Validation state carries forward across steps. A soft failure in step 2 degrades the confidence score that step 4 sees.

from gateframe import WorkflowContext, ValidationContract, EscalationRouter
from gateframe.audit.log import AuditLog

ctx = WorkflowContext(workflow_id="incident_response_001", escalation_threshold=0.5)
router = EscalationRouter()
audit = AuditLog()

# Step 1
result1 = contract_step1.validate(output1)
ctx.update(result1)
audit.record(result1, workflow_context=ctx)

# Step 2: ctx carries forward degraded confidence from step 1
result2 = contract_step2.validate(output2)
ctx.update(result2)
audit.record(result2, workflow_context=ctx)

print(ctx.confidence)           # degraded from 1.0 by soft failures
print(ctx.threshold_breached)   # True if confidence < escalation_threshold

if ctx.threshold_breached:
    escalation = router.route_threshold_breach(ctx)
    print(escalation.route.value)  # "human_review", "abort", etc.
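The exact degradation formula isn't documented above. One plausible scheme, shown here purely as an assumption and not necessarily gateframe's internal rule, multiplies the running confidence by a fixed penalty for each soft failure:

```python
# Sketch of multiplicative confidence degradation across workflow steps.
# The 0.8 penalty per soft failure is an illustrative assumption.
def degrade(confidence, soft_failures, penalty=0.8):
    return confidence * (penalty ** soft_failures)

confidence = 1.0
confidence = degrade(confidence, soft_failures=1)  # step 1 -> 0.8
confidence = degrade(confidence, soft_failures=2)  # step 2 -> 0.512
breached = confidence < 0.5  # still above the 0.5 escalation threshold
```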

Provider integrations

gateframe validates outputs from any provider. Integrations are thin wrappers; gateframe does not import any LLM SDK at the core level.

# OpenAI
from gateframe.integrations.openai import OpenAIValidator
validator = OpenAIValidator(contract, parse_json=True)
result = validator.validate(openai_completion)

# Anthropic
from gateframe.integrations.anthropic import AnthropicValidator
validator = AnthropicValidator(contract, parse_json=True)
result = validator.validate(anthropic_message)

# LiteLLM
from gateframe.integrations.litellm import LiteLLMValidator
validator = LiteLLMValidator(contract, parse_json=True)
result = validator.validate(litellm_response)

# LangChain
from gateframe.integrations.langchain import LangChainValidator
validator = LangChainValidator(contract, parse_json=False)
result = validator.validate(chain_output)

Install the integration you need:

pip install "gateframe[openai]"
pip install "gateframe[anthropic]"
pip install "gateframe[litellm]"
pip install "gateframe[langchain]"

Audit trail

Every validation event is logged with structured context. Use the built-in exporters or implement your own.

from gateframe.audit.log import AuditLog
from gateframe.audit.exporters import JsonFileExporter

audit = AuditLog(exporters=[JsonFileExporter("audit.jsonl")])
audit.record(result, workflow_context=ctx)
audit.flush()

Each entry includes: timestamp, contract name, rules applied, rules failed, failure details, workflow ID, and accumulated confidence score.
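The exporter interface isn't spelled out above. Assuming an exporter only needs a method that receives one structured entry at a time (a guess at the contract, not gateframe's documented API), a custom exporter might look like this:

```python
import json

# Hypothetical custom exporter. The `export(entry)` method name is an
# assumption about gateframe's exporter interface, not its documented API.
class StdoutExporter:
    def export(self, entry: dict) -> None:
        # Emit one JSON object per line, suitable for log shippers.
        print(json.dumps(entry, sort_keys=True))

exporter = StdoutExporter()
exporter.export({
    "contract": "triage_decision",
    "passed": False,
    "workflow_id": "incident_response_001",
})
```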


Trend analysis

ContractTrendAnalyzer reads an existing JSONL audit log and computes per-contract pass-rate trends across workflow runs. Use it to detect reliability regressions before they become incidents.

from pathlib import Path
from gateframe.audit.trend import ContractTrendAnalyzer

analyzer = ContractTrendAnalyzer(Path("audit.jsonl"), window=20)
report = analyzer.analyze()

if report.any_regression:
    for ct in report.regressions:
        print(f"{ct.contract_name}: {ct.direction} (slope={ct.slope:.4f})")

The window parameter controls how many most-recent workflow runs are included. A contract is flagged as regressing when its pass-rate slope falls below -0.02 (two percentage points per run) by default.
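The slope itself can be sketched as an ordinary least-squares fit over the windowed pass rates; this is an illustration of the idea, not gateframe's internal code:

```python
# Least-squares slope of per-run pass rates. A contract would be flagged
# when the slope drops below -0.02 (two percentage points per run).
def pass_rate_slope(pass_rates):
    n = len(pass_rates)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(pass_rates) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, pass_rates))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var

rates = [1.0, 0.95, 0.9, 0.85, 0.8]  # pass rate over the last 5 runs
slope = pass_rate_slope(rates)       # -0.05 per run
regressing = slope < -0.02           # True
```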


When to use gateframe

Use it when:

  • You need to validate LLM output behavior beyond schema checks: decision boundaries, scope enforcement, semantic constraints
  • You need structured, recoverable failure records rather than bare exceptions
  • You're running multi-step workflows where soft failures in early steps should affect confidence downstream
  • You need an audit trail for post-incident debugging

Don't use it when:

  • You only need schema extraction from LLM outputs: use Instructor or Pydantic AI
  • You need offline model evaluation or benchmarking: use DeepEval or RAGAS
  • You need content safety filtering: use a dedicated guardrails tool

Installation

pip install gateframe

For development:

git clone https://github.com/practicalmind-ai/gateframe.git
cd gateframe
pip install -e ".[dev]"
python -m pytest tests/ -v

Examples

triage_workflow: a 3-step medical triage pipeline demonstrating StructuralRule, BoundaryRule, ConfidenceRule, and WorkflowContext together. Step 2 has confidence below threshold, showing how a SOFT_FAIL degrades the workflow score without halting it.

rag_output: RAG answer validation with two scenarios. Scenario B demonstrates simultaneous soft failures (low confidence + an ungrounded answer) and how they accumulate in the workflow context.

agent_pipeline: a 4-step agent workflow with escalation. Demonstrates how multiple soft failures across steps push cumulative confidence below the escalation threshold.


CLI

# Inspect a contract file: lists all contracts and their rules
gateframe inspect contracts.py

# Replay an audit log
gateframe replay audit.jsonl

# Detect pass-rate regressions across workflow runs
gateframe trend audit.jsonl --window 20

License

MIT. See LICENSE.
