practicalmind-dev/gateframe
Gateframe

Behavioral validation for LLM outputs in production workflows.

Schema validation ("does this JSON have the right keys?") is a solved problem; Instructor, Pydantic AI, and similar tools handle it well. gateframe solves a different problem: does this output behave correctly given the context it was generated in? Does it stay within the decision boundaries this workflow requires? When it fails, does it fail in a way your system can recover from, or does it fail silently?

from pydantic import BaseModel
from gateframe import (
    ValidationContract,
    StructuralRule,
    BoundaryRule,
    ConfidenceRule,
    AllowedValues,
    FailureMode,
)

class TriageDecision(BaseModel):
    action: str
    priority: str
    confidence: float
    rationale: str

contract = ValidationContract(
    name="triage_decision",
    rules=[
        StructuralRule(schema=TriageDecision),
        BoundaryRule(
            check=AllowedValues("action", {"treat", "observe", "refer", "discharge"}),
            name="action_boundary",
            failure_message="Action must be one of: treat, observe, refer, discharge.",
        ),
        ConfidenceRule(field="confidence", minimum=0.7),
    ],
)

result = contract.validate({
    "action": "prescribe",       # not in allowed set -> HARD_FAIL
    "priority": "high",
    "confidence": 0.52,          # below 0.7 -> SOFT_FAIL
    "rationale": "...",
})

print(result.passed)             # False
for failure in result.failures:
    print(f"[{failure.failure_mode.value}] {failure.rule_name}: {failure.message}")
# [hard_fail] action_boundary: Action must be one of: treat, observe, refer, discharge.
# [soft_fail] confidence_check: Confidence 0.52 is below minimum threshold 0.7.

The problem

Most LLM pipelines validate outputs the same way: parse the JSON, check the schema, move on. That catches structural errors. It misses the errors that actually cause production incidents:

  • A model recommends an action that is structurally valid but outside its authorized scope
  • Confidence is low but the workflow proceeds as if it weren't
  • A soft failure in step 2 silently degrades the reliability of everything downstream
  • A validation failure gives you False, and no context for debugging

gateframe makes these failures explicit, structured, and recoverable.


Failure modes

gateframe distinguishes four failure types instead of binary pass/fail.

HARD_FAIL: Stop. The output violates a hard constraint; the failure cannot be auto-recovered.

# Model chose an action outside its authorized scope
BoundaryRule(
    check=AllowedValues("action", {"treat", "observe", "refer"}),
    failure_mode=FailureMode.HARD_FAIL,  # default for BoundaryRule
)

SOFT_FAIL: Flag and continue with degraded confidence. Something is off, but not critical enough to halt.

# Model confidence is low, continue but track the degradation
ConfidenceRule(
    field="confidence",
    minimum=0.7,
    failure_mode=FailureMode.SOFT_FAIL,  # default for ConfidenceRule
)

RETRY: Re-prompt with the failure context. The output is likely fixable by trying again.

# Malformed output that might parse correctly on a second attempt
StructuralRule(schema=MyOutput, failure_mode=FailureMode.RETRY)
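The README does not show a built-in retry driver, so the re-prompt loop lives in your workflow code. The sketch below illustrates the pattern independently of gateframe's API: the `validate` and `call_model` functions are hypothetical stand-ins for a gateframe contract and your LLM call, and prior failure messages are fed back into the next prompt.

```python
# Hypothetical retry loop: re-prompt the model with failure context.
# `validate` and `call_model` are illustrative stubs, not gateframe API.

def validate(output: dict) -> list[str]:
    """Stub contract check: return failure messages; empty list means pass."""
    if "action" not in output:
        return ["Missing required field: action"]
    return []

def call_model(prompt: str, attempt: int) -> dict:
    """Stub model that corrects itself once it sees the feedback."""
    return {} if attempt == 0 else {"action": "observe"}

def retry_with_feedback(prompt: str, max_attempts: int = 3) -> dict:
    failures: list[str] = []
    for attempt in range(max_attempts):
        # Append prior failure messages so the model can self-correct
        feedback = "\n".join(failures)
        output = call_model(prompt + ("\n" + feedback if feedback else ""), attempt)
        failures = validate(output)
        if not failures:
            return output
    raise RuntimeError(f"Validation failed after {max_attempts} attempts: {failures}")

print(retry_with_feedback("Triage this case."))  # {'action': 'observe'}
```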

SILENT_FAIL: The most dangerous kind. The output looks valid but violates a semantic or boundary rule. gateframe makes these visible instead of letting them pass through undetected.

SemanticRule(
    check=lambda output, **ctx: output["severity"] != "low" or output["escalated"] is False,
    failure_mode=FailureMode.SILENT_FAIL,
    failure_message="Low-severity cases should not be auto-escalated.",
)

Multi-step workflow validation

Validation state carries forward across steps. A soft failure in step 2 degrades the confidence score that step 4 sees.

from gateframe import WorkflowContext, ValidationContract, EscalationRouter
from gateframe.audit.log import AuditLog

ctx = WorkflowContext(workflow_id="incident_response_001", escalation_threshold=0.5)
router = EscalationRouter()
audit = AuditLog()

# Step 1
result1 = contract_step1.validate(output1)
ctx.update(result1)
audit.record(result1, workflow_context=ctx)

# Step 2, ctx carries forward degraded confidence from step 1
result2 = contract_step2.validate(output2)
ctx.update(result2)
audit.record(result2, workflow_context=ctx)

print(ctx.confidence)           # degraded from 1.0 by soft failures
print(ctx.threshold_breached)   # True if confidence < escalation_threshold

if ctx.threshold_breached:
    escalation = router.route_threshold_breach(ctx)
    print(escalation.route.value)  # "human_review", "abort", etc.
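The README does not document the exact degradation formula behind ctx.confidence, only that it starts at 1.0 and is lowered by soft failures. One common scheme, shown here purely as an illustration (the penalty factor and function are assumptions, not gateframe internals), is a multiplicative penalty per soft failure:

```python
# Illustrative confidence accumulation across workflow steps.
# gateframe's actual formula is not documented in this README; a
# multiplicative penalty is one plausible scheme that reproduces
# "degraded from 1.0 by soft failures".
def accumulate_confidence(soft_fails_per_step: list[int],
                          penalty: float = 0.8,
                          start: float = 1.0) -> float:
    confidence = start
    for count in soft_fails_per_step:
        confidence *= penalty ** count  # each soft failure shaves confidence
    return confidence

# Two steps: step 1 clean, step 2 records one soft failure
conf = accumulate_confidence([0, 1])
print(conf)         # 0.8
print(conf < 0.5)   # breached the 0.5 escalation threshold? False
```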

Provider integrations

gateframe validates outputs from any provider. Integrations are thin wrappers; gateframe does not import any LLM SDK at the core level.

# OpenAI
from gateframe.integrations.openai import OpenAIValidator
validator = OpenAIValidator(contract, parse_json=True)
result = validator.validate(openai_completion)

# Anthropic
from gateframe.integrations.anthropic import AnthropicValidator
validator = AnthropicValidator(contract, parse_json=True)
result = validator.validate(anthropic_message)

# LiteLLM
from gateframe.integrations.litellm import LiteLLMValidator
validator = LiteLLMValidator(contract, parse_json=True)
result = validator.validate(litellm_response)

# LangChain
from gateframe.integrations.langchain import LangChainValidator
validator = LangChainValidator(contract, parse_json=False)
result = validator.validate(chain_output)
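Because the core never touches an SDK, a provider without a wrapper reduces to the same three steps: pull the text out of the response object, parse it, and hand the resulting dict to contract.validate(). The response shape below mimics an OpenAI-style completion but is a stand-in built for illustration, not a real SDK object:

```python
# Validating a raw provider response without an integration wrapper.
# `completion` imitates an OpenAI-style object; it is a stub, not SDK output.
import json
from types import SimpleNamespace

completion = SimpleNamespace(choices=[SimpleNamespace(
    message=SimpleNamespace(content='{"action": "observe", "confidence": 0.91}')
)])

# Extract and parse the model's text, then validate the dict as usual
payload = json.loads(completion.choices[0].message.content)
print(payload["action"])                # observe
# result = contract.validate(payload)   # same contract as any integration
```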

Install the integration you need:

pip install "gateframe[openai]"
pip install "gateframe[anthropic]"
pip install "gateframe[litellm]"
pip install "gateframe[langchain]"

Audit trail

Every validation event is logged with structured context. Use the built-in exporters or implement your own.

from gateframe.audit.log import AuditLog
from gateframe.audit.exporters import JsonFileExporter

audit = AuditLog(exporters=[JsonFileExporter("audit.jsonl")])
audit.record(result, workflow_context=ctx)
audit.flush()

Each entry includes: timestamp, contract name, rules applied, rules failed, failure details, workflow ID, and accumulated confidence score.
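Since the log is JSON Lines, post-incident debugging can be done with nothing but the standard library. A minimal sketch of filtering an audit file for failed validations; the entry key names below are assumptions modeled on the fields listed above, not confirmed gateframe output:

```python
# Sketch: scanning an audit.jsonl stream for entries with failed rules.
# The dicts below simulate file contents; key names are assumed, not
# confirmed gateframe schema.
import io
import json

entries_written = [
    {"contract": "triage_decision", "workflow_id": "incident_response_001",
     "rules_failed": ["confidence_check"], "confidence": 0.8},
    {"contract": "triage_decision", "workflow_id": "incident_response_001",
     "rules_failed": [], "confidence": 0.8},
]
# Stand-in for open("audit.jsonl"): one JSON object per line
stream = io.StringIO("\n".join(json.dumps(e) for e in entries_written))

entries = [json.loads(line) for line in stream]
failed = [e for e in entries if e["rules_failed"]]
print(len(failed))                      # 1
print(failed[0]["rules_failed"])        # ['confidence_check']
```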


When to use gateframe

Use it when:

  • You need to validate LLM output behavior beyond schema checks: decision boundaries, scope enforcement, semantic constraints
  • You need structured, recoverable failure records rather than bare exceptions
  • You're running multi-step workflows where soft failures in early steps should affect confidence downstream
  • You need an audit trail for post-incident debugging

Don't use it when:

  • You only need schema extraction from LLM outputs; use Instructor or Pydantic AI
  • You need offline model evaluation or benchmarking; use DeepEval or RAGAS
  • You need content safety filtering; use a dedicated guardrails tool

Installation

pip install gateframe

For development:

git clone https://github.com/practicalmind-ai/gateframe.git
cd gateframe
pip install -e ".[dev]"
python -m pytest tests/ -v

Examples

triage_workflow: a 3-step medical triage pipeline. Demonstrates StructuralRule, BoundaryRule, ConfidenceRule, and WorkflowContext together. Step 2 has confidence below the threshold, showing how SOFT_FAIL degrades the workflow score without halting it.

rag_output: RAG answer validation with two scenarios. Scenario B demonstrates simultaneous soft failures (low confidence + ungrounded answer) and how they accumulate in the workflow context.

agent_pipeline: a 4-step agent workflow with escalation. Demonstrates how multiple soft failures across steps push cumulative confidence below the escalation threshold.


CLI

# Inspect a contract file, listing all contracts and their rules
gateframe inspect contracts.py

# Replay an audit log
gateframe replay audit.jsonl

License

MIT; see LICENSE.
