Stop prompting. Start engineering. A structured reference for taking AI agents into production.
A curated map of agentic AI systems — covering architectures, frameworks, memory, evaluation, and safety.
This is not a list of tools.
It is a structured guide to building reliable, observable, and production-grade agentic systems, with every entry evaluated against explicit engineering dimensions.
- 🧭 Thesis
- ⚖️ Architecture Decision Guide
- 🧩 Core Agentic Patterns
- 🏗️ Reference Architectures
- 🧠 Memory Systems
- 📊 Formal Evaluation Rubric
- Benchmark and Evidence Policy
- ⚙️ Orchestration Frameworks
- 📡 Protocols and Standards
- 🧪 Evaluation & Safety
- 🧠 Skills and Operating Principles
- 🚫 What NOT to Do
- 📊 Signals (How to Read This List)
- 🚀 Getting Started
- 🤝 Contributing
- 📌 Final Note
- 🌐 Browser and Desktop Agents
- 🎙 Voice Agents
- 🎨 Creative AI
- 💼 Customer Support and CRM Agents
- 🧠 Open-Source Models for Agents
- 📰 Newsletters and Communities
- 📚 Learning Resources
- ⚡ Fast-Moving Product Lists
| 📈 The Shift (Agentic systems are moving to) | 📉 The Challenge (Implementations suffer from) | 🎯 Our Focus (This repository prioritises) |
|---|---|---|
| • Stateful, multi-step reasoning<br>• Multi-agent collaboration & orchestration<br>• Feedback-driven learning loops<br>• Tool-augmented execution environments | • Fragility under iteration<br>• Poor observability & evaluation<br>• Weak memory & context management<br>• Limited safety & governance | • Reliability over novelty<br>• Evaluation over intuition<br>• Architecture over tooling<br>• Systems thinking over prompt engineering |
| If your task is... | Start with... | Escalate to... | Avoid... |
|---|---|---|---|
| bounded, tool-using, low-risk | single-agent + tools | typed state, retries | multi-agent teams |
| long-running, inspectable, enterprise | graph/workflow orchestration | approval gates, persistence | opaque emergent loops |
| open-ended research | planner/executor or supervisor | critique loops, memory | rigid pipelines only |
| high-reliability extraction | prompt chains + strict schemas | validator feedback loops | unconstrained conversational agents |
| complex parallel execution | modular multi-agent setups | shared workspace/memory | treating LLMs as deterministic |
These patterns underpin most production-grade agentic systems.
| Pattern | Description | Key Characteristic |
|---|---|---|
| Single-Agent + Tool Use | One reasoning loop with structured tool invocation | Suited to focused tasks with bounded scope |
| Supervisor / Router Agents | Central agent delegates tasks to specialised agents | Enables modularity and scalability |
| Multi-Agent Collaboration | Agents operate in parallel or sequence | Patterns: debate, critique, planning/execution split |
| Reflection / Critique Loops | Agents evaluate and refine their own outputs | Improves reliability over multiple iterations |
| Retrieval-Augmented Agents | External knowledge via vector search or APIs | Reduces hallucination and improves grounding |
| Event-Driven / Long-Running Agents | Persistent agents reacting to triggers over time | Requires memory, state, and orchestration |
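The first pattern in the table, single-agent + tool use, reduces to one bounded reasoning loop with structured tool dispatch. The sketch below is illustrative only: `call_llm` is a stub standing in for a real model call, and the tool registry is hypothetical.

```python
# Hypothetical tool registry: name -> callable. In a real system each tool
# would validate its arguments against a schema before executing.
TOOLS = {
    "add": lambda args: args["a"] + args["b"],
}

def call_llm(messages):
    """Stub standing in for a real model call. A real LLM would decide
    whether to invoke a tool or to return a final answer."""
    last = messages[-1]["content"]
    if last == "What is 2 + 3?":
        return {"tool": "add", "args": {"a": 2, "b": 3}}
    return {"final": last}

def run_agent(user_input, max_steps=5):
    """One reasoning loop with structured tool invocation and bounded scope."""
    messages = [{"role": "user", "content": user_input}]
    for _ in range(max_steps):  # hard step budget: avoid unbounded loops
        decision = call_llm(messages)
        if "final" in decision:
            return decision["final"]
        result = TOOLS[decision["tool"]](decision["args"])
        # Feed the tool result back as an observation for the next step.
        messages.append({"role": "user", "content": str(result)})
    raise RuntimeError("step budget exhausted")
```

The hard `max_steps` budget is the "bounded scope" characteristic from the table: the loop cannot run away even when the model never produces a final answer.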
Representative system designs for real-world use.
| Architecture | Ecosystem Maturity | Description | Architectural Strengths | Operational Constraints | Workload Suitability | Design Paradigm | Governance Fit |
|---|---|---|---|---|---|---|---|
| DeerFlow | Emerging | Is: Open-source orchestration system combining sub-agents, memory, and sandboxes.<br>Demonstrates: Workflow-oriented orchestration across agents with shared execution context. | Strong system-level reference for memory, sandbox, and skills composition. | Higher setup complexity and a heavier runtime surface than most teams need initially. | Strong fit for compound research/coding workflows and teams studying full-stack agent architectures. Poor fit for lightweight orchestration or narrowly scoped tasks. | Hierarchical multi-agent orchestration. | Requires explicit sandbox policy, tool boundaries, and operator oversight before untrusted code execution. |
| SWE-agent | Experimental | Is: Autonomous SWE system using a specialized Agent-Computer Interface (ACI).<br>Demonstrates: Narrow action spaces and interface design tuned for code-repair tasks. | Streamlined command space, compressed history handling, and a clear task boundary for patch workflows. | Benchmark-oriented design, high token cost, and long end-to-end fix latency on larger tasks. | Strong fit for isolated PRs and self-contained bug fixes. Poor fit for broad refactors or environments without standard build tooling. | Single agent with a highly specialized action space (ACI). | Needs tight repository scoping, review gates, and execution controls to reduce silent code regressions. |
Memory is a first-class concern in agentic systems. Rather than treating memory as a simple array of previous messages, production systems require structured approaches to state, persistence, and retrieval.
Different types of memory serve distinct functional roles in an agentic architecture:
| Type | Definition | Implementation Examples |
|---|---|---|
| Working Memory (Thread State) | Short-term context for the current execution loop or active conversation thread. Ephemeral. | Context window, LangGraph State, in-memory message lists. |
| Episodic Memory | Autobiographical history of past actions, inputs, and outcomes. Enables reflection on past mistakes. | Checkpoint logs, event stores, prompt / trajectory histories. |
| Procedural Memory | Reusable skills, system prompts, and tool configurations. Defines how the agent operates. | Static configuration, retrieved skill libraries, GitHub workflows. |
| Semantic Memory | Embedded, factual knowledge about the world, the user, or the domain. Defines what the agent knows. | Vector databases (FAISS, Pinecone), knowledge graphs, Letta core memory. |
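The four memory types above are easiest to keep honest when they are explicitly separate stores rather than one message array. A minimal sketch, with all names illustrative rather than taken from any framework:

```python
from dataclasses import dataclass, field

@dataclass
class AgentMemory:
    # Working memory: ephemeral thread state for the current execution loop.
    working: list = field(default_factory=list)
    # Episodic memory: append-only log of past actions and outcomes.
    episodic: list = field(default_factory=list)
    # Procedural memory: prompts, skills, tool configs (how the agent operates).
    procedural: dict = field(default_factory=dict)
    # Semantic memory: facts about the world/user (what the agent knows).
    semantic: dict = field(default_factory=dict)

    def end_turn(self, outcome: str) -> None:
        """Demote the finished working context into episodic history."""
        self.episodic.append({"messages": list(self.working), "outcome": outcome})
        self.working.clear()

mem = AgentMemory(procedural={"system_prompt": "You are a helpful agent."})
mem.working.append({"role": "user", "content": "hi"})
mem.end_turn(outcome="greeted user")
```

The point of the separation is that each store can then get its own lifecycle policy (TTL for working memory, summarisation for episodic, validation for semantic writes), as discussed below.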
In multi-agent systems, memory boundaries are architectural decisions:
- Private Agent Memory: Each agent maintains its own semantic and episodic stores. Prevents context leakage and maintains strong role boundaries.
- Shared Workspace (Global Memory): A common blackboard or shared state where multiple agents read and write. Requires collision management and strict typing.
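A shared workspace can start as a typed blackboard with serialised writes; this sketch (all names hypothetical) shows the collision management and strict typing that the bullet above calls for:

```python
import threading

class Blackboard:
    """Shared global memory for multiple agents, with strict typing per key
    and a lock to manage write collisions."""

    def __init__(self, schema: dict):
        self._schema = schema          # key -> expected Python type
        self._data: dict = {}
        self._lock = threading.Lock()

    def write(self, agent: str, key: str, value):
        if key not in self._schema:
            raise KeyError(f"unknown key: {key}")
        if not isinstance(value, self._schema[key]):
            raise TypeError(f"{key} expects {self._schema[key].__name__}")
        with self._lock:  # collision management: serialise writers
            self._data[key] = {"value": value, "written_by": agent}

    def read(self, key: str):
        entry = self._data.get(key)
        return None if entry is None else entry["value"]

board = Blackboard(schema={"plan": str, "step_count": int})
board.write("planner", "plan", "1. search  2. summarise")
board.write("executor", "step_count", 2)
```

Recording `written_by` alongside each value keeps an audit trail of which agent last touched each key, which matters as soon as two agents disagree about shared state.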
Managing the memory lifecycle is critical for long-running agents.
| Mechanism | Description | Best Practices & Risks |
|---|---|---|
| Checkpointing | Saving the exact thread state at a specific point in time (e.g., node transitions). | Enables "time travel" (rewind and replay) and human-in-the-loop approvals. |
| Write Policies | Rules defining when and how an agent commits data to long-term storage. | Prefer explicit SaveMemory tool calls over passive auto-saving to maintain control. |
| Retrieval Triggers | Determining when to query past memory (e.g., pre-fetch vs. just-in-time). | Use vector search for semantic recall, but use explicit graph keys for structured state. |
| Summarisation / Compression | Reducing token counts of episodic histories. | Summarise older interactions into a rolling summary while preserving recent exact messages. |
| Pruning / Decay | Deleting or archiving old or irrelevant memories. | Implement TTL (time-to-live) for working memory to prevent context exhaustion. |
| Contamination / Poisoning | Malicious or incorrect data persisting in long-term memory. | Risk: Once poisoned, an agent's future logic breaks. Require validation or bounds on semantic writes. |
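The summarisation/compression row above (a rolling summary of older turns, with recent messages preserved verbatim) can be sketched like this; `summarise` is a stub where a real system would call a model:

```python
def summarise(messages):
    # Stub: a real implementation would ask an LLM for an abstractive summary.
    return f"[summary of {len(messages)} earlier messages]"

def compress_history(messages, keep_recent=4):
    """Summarise older interactions into a rolling summary while
    preserving the most recent messages exactly."""
    if len(messages) <= keep_recent:
        return messages
    older, recent = messages[:-keep_recent], messages[-keep_recent:]
    summary_msg = {"role": "system", "content": summarise(older)}
    return [summary_msg] + recent

history = [{"role": "user", "content": f"turn {i}"} for i in range(10)]
compressed = compress_history(history, keep_recent=4)
# compressed: one summary message plus the last 4 turns verbatim
```

`keep_recent` is the tuning knob: too small and the agent loses the exact wording it needs for the current task; too large and compression stops saving tokens.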
Specialised infrastructure for managing agent memory.
| System | Role | Description |
|---|---|---|
| LangGraph Persistence | Thread-level state | Built-in checkpointers (SQLite, Postgres) for graph-based execution loops, enabling interrupt/resume. |
| LangMem | Long-term memory extraction | LangChain's framework for extracting user preferences and entity profiles in the background. |
| Letta (formerly MemGPT) | OS-level memory abstraction | Advanced core memory management with explicit paging (read/write limits) to mimic virtual memory. |
| Mem0 | Personalized memory layer | Managed memory API focusing on user contexts, interactions, and entity relationships. |
| Zep / Graphiti | Enterprise memory & graphs | Fast, long-term memory for AI assistants; uses temporal knowledge graphs to map entity relationships over time. |
| MCP (Model Context Protocol) | Interoperability fabric | While not a DB itself, MCP provides a standard protocol to expose memory stores and file systems universally across tools and agents. |
Every major framework and architecture in this repository is judged against the following Required Scoring Dimensions. We evaluate systems based on engineering rigor, not marketing copy.
| Dimension | Evaluation Criteria |
|---|---|
| Control flow explicitness | How observable and deterministic is the execution path? |
| State model | How is agent state typed, managed, and persisted? |
| Memory support | Are there built-in primitives for short-term, episodic, and semantic memory? |
| Observability / tracing | Is it easy to trace intermediate reasoning steps and tool calls? |
| Human-in-the-loop support | Does it natively support interrupt-and-resume or approval gates? |
| Type safety / structured outputs | Are outputs guaranteed against strict schemas? |
| Provider portability | How tightly coupled is it to one specific LLM provider? |
| Security posture | Are there built-in mechanisms for sandboxing, access control, or guardrails? |
| Architectural strengths | Which design choices materially improve decomposition, control, state handling, or interface clarity? |
| Operational constraints | What deployment burden, runtime cost, debugging friction, or failure modes does it introduce? |
| Ecosystem maturity | How stable are the APIs, docs, integrations, and operator knowledge base? |
| Governance fit | Does it support auditability, approval gates, access boundaries, policy enforcement, and regulated environments? |
| Workload suitability | Which workflows, task shapes, and team contexts does it fit well or poorly? |
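Two of the rubric rows, human-in-the-loop support and control-flow explicitness, reduce in practice to an interrupt point before side-effecting actions. A framework-agnostic sketch, with all names illustrative:

```python
# Hypothetical set of tools that must never run without human sign-off.
SENSITIVE_TOOLS = {"send_email", "delete_records"}

def execute_with_approval(tool_name, args, approver):
    """Approval gate: sensitive tool calls are suspended until a human
    (the `approver` callback) explicitly allows them."""
    if tool_name in SENSITIVE_TOOLS and not approver(tool_name, args):
        return {"status": "rejected", "tool": tool_name}
    # ... actual tool dispatch would happen here; stubbed for the sketch
    return {"status": "executed", "tool": tool_name}

# A real system would persist state at the gate and resume after review;
# this stand-in approver simply rejects everything.
result = execute_with_approval("send_email", {"to": "x@example.com"},
                               approver=lambda tool, args: False)
```

Frameworks with native interrupt-and-resume implement exactly this gate as a persisted checkpoint instead of a blocking callback, which is why the rubric treats persistence and HITL support as linked dimensions.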
Canonical resources are trusted here because they define what counts as evidence. Prefer official docs, architecture guides, papers, benchmark repos, and first-party repositories when establishing capabilities, methodology, or interface details.
| Evidence Tag | Use For |
|---|---|
| [official] | Official docs, architecture guides, specifications, benchmark documentation, or first-party repositories. |
| [benchmark] | Published benchmark runs, evaluation papers, or benchmark repos tied to a named workload. |
| [field report] | Production write-ups, incident reports, engineering blogs, or operator notes about real deployments. |
| [author assessment] | This repository's synthesis after reviewing the sources above and applying the rubric. |
- Do not treat marketing copy, launch-day demos, or GitHub stars as sufficient evidence for production claims.
- Separate benchmark performance from production maturity. A benchmark result can support workload fit, but it does not by itself prove reliability, governance fit, cost control, or operational maturity.
- Record `Last reviewed: Month YYYY` in rapidly changing sections such as product lists, vendor capability summaries, and release-sensitive guidance.
- See appendix/benchmark-and-evidence-policy.md for the full policy.
| Framework | Ecosystem Maturity | Description | Architectural Strengths | Operational Constraints | Workload Suitability | Design Paradigm | Governance Fit |
|---|---|---|---|---|---|---|---|
| LangGraph | Production-ready | Is: Stateful orchestration framework for building directed graph workflows, including cycles.<br>Demonstrates: Deterministic execution control mixed with LLM reasoning. | Explicit state management, persistence, and support for complex multi-actor workflows. | Verbose abstractions, steep learning curve, and graph sprawl if the workflow is over-modeled. | Strong fit for multi-step, stateful, and interruptible agent systems. Poor fit for simple single-prompt completions or linear chains. | Graph-based state machine. | Good fit for auditable workflows and approval gates, but graph edges must be tightly constrained to avoid runaway loops. |
| CrewAI | Emerging | Is: Multi-agent collaboration framework where agents are assigned roles, goals, and tools.<br>Demonstrates: Role-based agentic workflows. | Simple mental model and fast team-based decomposition for prototypes. | Less control for highly complex or non-standard systems. | Strong fit for rapid prototyping of agent teams. Poor fit for deterministic execution, rigorous type safety, or custom orchestration loops. | Role-based sequential or hierarchical process execution. | Requires added guardrails and observability to manage emergent loops and inconsistent agent behaviour. |
| OpenAI Assistants / Agents APIs | Production-ready | Is: Hosted orchestration and state management by OpenAI.<br>Demonstrates: Managed state and tool execution. | Integrated tools, simplified operations, and reduced infrastructure ownership. | Limited transparency and control, with strong provider coupling. | Strong fit for managed environments and teams optimizing for delivery speed. Poor fit for provider portability, local models, or complex multi-agent setups. | Hosted black-box orchestration. | Viable for hosted approval flows, but bounded by vendor policy, uptime, and data-handling constraints. |
| Pydantic AI | Production-ready | Is: Framework built directly on Pydantic enforcing strict data validation and type-safe outputs from LLMs.<br>Demonstrates: Type-driven agentic execution and dependency injection. | Strong type-system integration, schema enforcement, dependency injection, and retry support. | Smaller surrounding ecosystem than older orchestration stacks; retry loops can increase latency and cost. | Strong fit for production systems needing strict type safety and predictable parsing. Poor fit for open-ended generative writing or weakly structured tasks. | Strongly typed, schema-first LLM interactions. | Good fit where schema validation and dependency control matter, but retry policies need explicit cost and failure bounds. |
| Smolagents | Emerging | Is: Minimalist framework using CodeAgents (Python logic code generation over JSON calling).<br>Demonstrates: Code-first model execution bounds. | Lightweight core and direct execution model that stays close to Python control flow. | Weak typed-state enforcement and high exposure if generated code runs with broad permissions. | Strong fit for fast prototyping and Python-native experimentation. Poor fit for regulated networks or systems that need strict sandboxing and observability. | Python-native logic execution via LLM generation. | Requires strong sandboxing, network controls, and review boundaries before production use. |
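The schema-first pattern that Pydantic AI embodies can be approximated with the standard library alone. This hedged sketch shows the validate-and-retry loop behind "validator feedback loops"; `call_llm` is a stub whose two canned responses simulate a malformed first attempt and a corrected retry:

```python
import json

# Illustrative target schema: field name -> expected Python type.
EXPECTED = {"name": str, "age": int}

def validate(payload: dict) -> list:
    """Return a list of schema errors (empty means valid)."""
    errors = [f"missing field: {k}" for k in EXPECTED if k not in payload]
    errors += [
        f"{k} should be {t.__name__}"
        for k, t in EXPECTED.items()
        if k in payload and not isinstance(payload[k], t)
    ]
    return errors

def call_llm(prompt: str) -> str:
    # Stub: the first attempt returns a malformed age, the retry fixes it.
    if "should be int" in prompt:
        return '{"name": "Ada", "age": 36}'
    return '{"name": "Ada", "age": "thirty-six"}'

def extract(prompt: str, max_retries: int = 2) -> dict:
    """Validator feedback loop: re-prompt with the errors until valid."""
    for _ in range(max_retries + 1):
        payload = json.loads(call_llm(prompt))
        errors = validate(payload)
        if not errors:
            return payload
        prompt = f"{prompt}\nFix these errors: {'; '.join(errors)}"
    raise ValueError("could not produce schema-valid output")

record = extract("Extract the person as JSON.")
```

The retry-with-errors step is where real frameworks spend cost and latency, which is why the table above flags retry policies as needing explicit cost and failure bounds.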
| Framework | Lang | Description |
|---|---|---|
| LangChain | Py/JS | Modular framework with chains, tools, memory, and broad integration coverage. |
| LangGraph | Py/JS | Graph-based orchestration. Stateful directed graphs. |
| LlamaIndex | Py/JS | Data-centric framework for retrieval-heavy and RAG-oriented agent systems. |
| Haystack | Py | Pipeline-based. Search and retrieval. |
| Semantic Kernel | C#/Py/Java | Microsoft enterprise. Azure integration. |
| Pydantic AI | Py | Type-safe. Clean Pythonic API. Production-ready. |
| DSPy | Py | Stanford. Programming not prompting. Auto-optimizes. |
| Mastra | TS | TypeScript-first. Observational Memory. Apache 2.0. |
| Anthropic SDK | Py/TS | Official Claude SDK. Tool use, computer control, streaming. |
| Framework | Lang | Description |
|---|---|---|
| AutoGen | Py | Microsoft multi-agent conversations. |
| CrewAI | Py | Role-based crew members with goals and tools. |
| MetaGPT | Py | PM, architect, engineer roles. Software company sim. |
| OpenAI Agents SDK | Py | Official. Multi-step agents with handoffs. |
| Google ADK | Py | Native Gemini. Multi-agent orchestration. |
| Strands Agents | Py | AWS-backed. Model-driven tool use. |
| CAMEL | Py | Role-based simulation. Collaborative reasoning. |
| AutoGPT | Py | Pioneer. Now full platform with visual builder. |
| AgentScope | Py | Alibaba multi-agent framework. |
| DeerFlow | Py | ByteDance orchestration system for planning, tools, memory, and execution. |
| Framework | Lang | Description |
|---|---|---|
| Smolagents | Py | HuggingFace minimal agents. ~1000 lines. |
| Agno | Py | Lightweight, model-agnostic. |
| Upsonic | Py | MCP support. Minimal setup. |
| Portia AI | Py | Reliable agents in production. |
| MicroAgent | Py | Self-editing prompts and code. |
| Protocol | Description |
|---|---|
| MCP (Model Context Protocol) | Open standard for exposing tools, memory, and file systems to agents. |
| A2A (Agent-to-Agent) | Google protocol for inter-agent communication. |
| OpenAI Function Calling | OpenAI native tool-use. JSON schema. |
| Tool Use (Anthropic) | Claude native tool-use. Structured JSON. |
| OpenAPI | Industry-standard API spec. Foundation for agent tools. |
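As a concrete anchor for the table above, a tool definition in the JSON-Schema parameter style used by OpenAI function calling looks roughly like this. The envelope fields can differ across API versions, so treat the layout as illustrative:

```python
# Illustrative tool definition; the "parameters" value is plain JSON Schema.
get_weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["city"],
        },
    },
}

# The same schema can gate arguments locally before dispatching the tool.
def check_required(fn_schema: dict, args: dict) -> bool:
    return all(k in args for k in fn_schema["parameters"].get("required", []))
```

Because the parameter block is standard JSON Schema, the same definition can be validated client-side, exposed over MCP, or converted from an OpenAPI operation.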
This section covers frameworks and operational tooling for testing agent quality, correctness, task completion, regressions, and system behaviour, as well as security scanning, red teaming, policy testing, and misalignment research.
- Output correctness
- Reasoning quality
- Tool-use accuracy
- Latency and cost
- Robustness under adversarial input
| Framework | Description | Methodology / Workload Suitability |
|---|---|---|
| OpenAI Evals | Core framework for testing and improving AI systems. | Foundational evaluation framework and methodology. |
| DeepEval | Dedicated open-source LLM evaluation framework with metrics for hallucination, answer relevance, task completion, etc. | Application-level evaluation and regression testing. |
| promptfoo | CLI and library for evaluation and red teaming of LLM apps. | Regression testing, prompt/application evals, adversarial testing. |
| Inspect | UK AI Security Institute's framework for rigorous LLM evals covering coding, reasoning, agent behavior, and model-graded scoring. | Rigorous research-grade and agent-task evaluation. |
- Golden datasets
- Regression testing
- Adversarial / red-team inputs
- Continuous evaluation pipelines
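The operational practices above (golden datasets, regression testing, continuous evaluation) can start as nothing more than a fixture list and a threshold check. A minimal sketch, with `run_agent` stubbed in place of the real agent under test:

```python
# Golden dataset: frozen input/expected pairs checked on every change.
GOLDEN = [
    {"input": "2 + 2", "expected": "4"},
    {"input": "capital of France", "expected": "Paris"},
]

def run_agent(prompt: str) -> str:
    # Stub standing in for the real agent under evaluation.
    return {"2 + 2": "4", "capital of France": "Paris"}.get(prompt, "")

def regression_check(threshold: float = 1.0) -> float:
    """Fail the pipeline if the pass rate drops below the threshold."""
    passed = sum(run_agent(c["input"]) == c["expected"] for c in GOLDEN)
    rate = passed / len(GOLDEN)
    if rate < threshold:
        raise AssertionError(f"pass rate {rate:.0%} below {threshold:.0%}")
    return rate

rate = regression_check()
```

Exact-match scoring is only a starting point; frameworks like DeepEval and promptfoo replace the equality check with model-graded or metric-based scorers while keeping this same dataset-plus-threshold shape.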
| Tool | Description |
|---|---|
| Langfuse | OSS LLM observability. Traces, evals, prompts. |
| LangSmith | LangChain platform. Tracing, testing, evaluation. |
| Braintrust | Eval-driven development. Experiment tracking. |
| Arize Phoenix | OSS AI observability. Traces, evals, embeddings. |
| Helicone | OSS LLM observability. One-line integration. |
| Weights and Biases Weave | Trace and evaluate LLM apps. |
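Before adopting one of the platforms above, tool-call tracing can be bootstrapped with a thin wrapper; this sketch is framework-agnostic, records spans to an in-memory list, and all names are illustrative:

```python
import functools
import time

TRACE: list = []  # in-memory trace sink; a real system would export spans

def traced(fn):
    """Record name, arguments, duration, and outcome of each tool call."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        status = "error"
        try:
            result = fn(*args, **kwargs)
            status = "ok"
            return result
        finally:
            TRACE.append({
                "tool": fn.__name__,
                "args": args,
                "duration_s": time.perf_counter() - start,
                "status": status,
            })
    return wrapper

@traced
def search(query: str) -> str:
    return f"results for {query}"

search("agent memory")
```

Even this crude span log answers the question observability tooling exists for: which tool was called, with what, how long it took, and whether it failed.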
| Benchmark | Description |
|---|---|
| SWE-bench | Coding-agent benchmark grounded in real GitHub issues and patches. |
| AgentBench | 8-environment LLM agent benchmark. |
| Terminal-Bench | Evaluates terminal-agent execution on shell-based tasks. |
| GAIA | General AI Assistant. Real-world tasks. |
| WebArena | Web agent benchmark. Real websites. |
| ⚠️ Threat | 🛡️ Mitigation Strategies |
|---|---|
| Prompt injection (direct & indirect) | Input validation and filtering |
| Tool misuse | Tool permissioning and sandboxing |
| Data exfiltration | Human-in-the-loop approval gates |
| Memory poisoning | Audit logs and traceability |
| Unbounded autonomous behaviour | Policy-driven execution |
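The tool-misuse row maps directly to an allowlist at the dispatch boundary: deny by default, grant per agent. A minimal sketch, with the policy format purely illustrative:

```python
# Per-agent tool allowlist: deny by default, grant explicitly.
POLICY = {
    "researcher": {"web_search", "read_file"},
    "writer": {"read_file"},
}

def dispatch(agent: str, tool: str, args: dict):
    """Deny-by-default tool dispatch: unlisted agent/tool pairs are refused."""
    if tool not in POLICY.get(agent, set()):
        raise PermissionError(f"{agent} may not call {tool}")
    # ... sandboxed execution would go here; stubbed for the sketch
    return {"tool": tool, "args": args, "status": "executed"}

dispatch("researcher", "web_search", {"q": "agent safety"})  # allowed
```

Raising rather than silently skipping matters: a refused call should surface in traces and audit logs, which is how the permissioning row connects to the traceability row above it.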
| Resource | Description | Workload Suitability | Official Link |
|---|---|---|---|
| garak | LLM vulnerability scanner probing for hallucination, leakage, injection, toxicity, and jailbreaks. | Automated red teaming & vulnerability scanning | GitHub |
| OWASP GenAI Security Project | Governance and mitigation framework for safety risks in LLMs and agentic systems. | Governance, controls, and secure-design reference | Project Home |
| Anthropic Alignment Stress-Testing | Research and operational approach for deliberately stress-testing alignment evals and oversight. | Research-driven safety evaluation methodology | Post |
| Model Organisms of Misalignment | In-vitro demonstrations of alignment failures so they can be studied empirically. | Advanced safety research and methodology | Post |
| AI Safety via Debate | Alignment framework for cases where direct human supervision is too hard. | Alignment and scalable oversight resource | Paper |
| Concrete Problems in AI Safety | Foundational framing paper for safety problems (side effects, reward hacking, safe exploration, shift). | Foundational safety resource | Paper |
| Anthropic Agentic Misalignment | Grounds safety concerns in concrete behaviours (blackmail, espionage) in simulated settings. | Applied safety & threat-modelling reference | Research Post |
| Tool | Description |
|---|---|
| Guardrails AI | Structural, type, quality guarantees for LLM outputs. |
| NeMo Guardrails | NVIDIA. Programmable conversation guardrails. |
| LLM Guard | Security toolkit. Input/output scanning. |
| Rebuff | Prompt injection detection. |
| Lakera Guard | Real-time protection. Prompt injection, data leakage, toxicity. |
Building agentic systems requires a shift in skillset:
- Problem decomposition
- System design and orchestration
- Tool and interface design
- Memory modelling
- Evaluation design
- Failure mode analysis
- Safety and governance thinking
To keep this repository genuinely opinionated, we advocate against these common anti-patterns:
- Do not begin with multi-agent systems when a single agent plus tools will do. Escalate to multi-agent only when task decomposition requires it.
- Do not add memory before defining what deserves persistence. Avoid "state bloat" by being intentional about what is stored and why.
- Do not treat tracing as optional for long-running systems. Observability is the only way to debug non-deterministic agentic failures.
- Do not confuse benchmark wins with production readiness. Real-world reliability requires evaluation on your specific data and edge cases.
- Do not use framework abstractions as a substitute for architecture. Understand your control flow before outsourcing it to a library.
- ⭐ Production-grade
- 🧪 Experimental
- ⚠️ Early-stage / unstable
- Choose a core pattern (e.g. single-agent + tools)
- Add structured tool use
- Introduce evaluation early
- Layer in memory only when needed
- Expand into multi-agent systems with clear roles
- Add observability and safety constraints
Contributions are welcome! Please read the CONTRIBUTING.md for full details before submitting a pull request.
At a high level, submissions must meet the following criteria:
- Clear description of purpose
- Architectural strengths and operational constraints
- Governance fit and workload suitability
- Evidence of ecosystem maturity or real-world usage (preferred)
- Evidence tags and `Last reviewed` markers where claims are time-sensitive or likely to change
This is a curated list, not an exhaustive one.
See appendix/benchmark-and-evidence-policy.md for the sourcing, evidence-tagging, and `Last reviewed` policy.
The shift to agentic systems is not about more tools.
It is about:
- Designing systems that can reason, act, evaluate, and improve
- Ensuring those systems are reliable, observable, and safe
Build accordingly.