Stop prompting. Start engineering. A structured reference for taking AI agents into production.
A curated map of agentic AI systems — covering architectures, frameworks, memory, evaluation, and safety.
This is not a list of tools.
It is a structured guide to building reliable, observable, and production-grade agentic systems, with every entry evaluated against explicit engineering dimensions.
- 🧭 Thesis
- ⚖️ Architecture Decision Guide
- 🧩 Core Agentic Patterns
- 🏗️ Reference Architectures
- 🧠 Memory Systems
- 📊 Formal Evaluation Rubric
- Benchmark and Evidence Policy
- ⚙️ Orchestration Frameworks
- 📡 Protocols and Standards
- 🧪 Evaluation & Safety
- 🧠 Skills and Operating Principles
- 🚫 What NOT to Do
- 📊 Signals (How to Read This List)
- 🚀 Getting Started
- 🤝 Contributing
- 📌 Final Note
- 🌐 Browser and Desktop Agents
- 🎙 Voice Agents
- 🎨 Creative AI
- 💼 Customer Support and CRM Agents
- 🧠 Open-Source Models for Agents
- 📰 Newsletters and Communities
- 📚 Learning Resources
- ⚡ Fast-Moving Product Lists
| 📈 The Shift (Agentic systems are moving to) | 📉 The Challenge (Implementations suffer from) | 🎯 Our Focus (This repository prioritises) |
|---|---|---|
| • Stateful, multi-step reasoning<br>• Multi-agent collaboration & orchestration<br>• Feedback-driven learning loops<br>• Tool-augmented execution environments | • Fragility under iteration<br>• Poor observability & evaluation<br>• Weak memory & context management<br>• Limited safety & governance | • Reliability over novelty<br>• Evaluation over intuition<br>• Architecture over tooling<br>• Systems thinking over prompt engineering |
| If your task is... | Start with... | Escalate to... | Avoid... |
|---|---|---|---|
| bounded, tool-using, low-risk | single-agent + tools | typed state, retries | multi-agent teams |
| long-running, inspectable, enterprise | graph/workflow orchestration | approval gates, persistence | opaque emergent loops |
| open-ended research | planner/executor or supervisor | critique loops, memory | rigid pipelines only |
| high-reliability extraction | prompt chains + strict schemas | validator feedback loops | unconstrained conversational agents |
| complex parallel execution | modular multi-agent setups | shared workspace/memory | treating LLMs as deterministic |
These patterns underpin most production-grade agentic systems.
| Pattern | Description | Key Characteristic |
|---|---|---|
| Single-Agent + Tool Use | One reasoning loop with structured tool invocation | Suited to focused tasks with bounded scope |
| Supervisor / Router Agents | Central agent delegates tasks to specialised agents | Enables modularity and scalability |
| Multi-Agent Collaboration | Agents operate in parallel or sequence | Patterns: debate, critique, planning/execution split |
| Reflection / Critique Loops | Agents evaluate and refine their own outputs | Improves reliability over multiple iterations |
| Retrieval-Augmented Agents | External knowledge via vector search or APIs | Reduces hallucination and improves grounding |
| Event-Driven / Long-Running Agents | Persistent agents reacting to triggers over time | Requires memory, state, and orchestration |
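The first pattern in the table, single-agent + tool use, reduces to one bounded reasoning loop with structured tool dispatch. The sketch below is illustrative only: `call_llm` is a stub standing in for a real model call, and the tool registry is hypothetical.

```python
# Hypothetical tool registry: name -> callable. In a real system each tool
# would validate its arguments against a schema before executing.
TOOLS = {
    "add": lambda args: args["a"] + args["b"],
}

def call_llm(messages):
    """Stub standing in for a real model call. A real LLM would decide
    whether to invoke a tool or to return a final answer."""
    last = messages[-1]["content"]
    if last == "What is 2 + 3?":
        return {"tool": "add", "args": {"a": 2, "b": 3}}
    return {"final": last}

def run_agent(user_input, max_steps=5):
    """One reasoning loop with structured tool invocation and bounded scope."""
    messages = [{"role": "user", "content": user_input}]
    for _ in range(max_steps):  # hard step budget: avoid unbounded loops
        decision = call_llm(messages)
        if "final" in decision:
            return decision["final"]
        result = TOOLS[decision["tool"]](decision["args"])
        # Feed the tool result back as an observation for the next step.
        messages.append({"role": "user", "content": str(result)})
    raise RuntimeError("step budget exhausted")
```

The hard `max_steps` budget is the "bounded scope" characteristic from the table: the loop cannot run away even when the model never produces a final answer.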
Representative system designs for real-world use.
| Architecture | Ecosystem Maturity | Description | Architectural Strengths | Operational Constraints | Workload Suitability | Design Paradigm | Governance Fit |
|---|---|---|---|---|---|---|---|
| DeerFlow | Emerging | Is: Open-source orchestration system combining sub-agents, memory, and sandboxes.<br>Demonstrates: Workflow-oriented orchestration across agents with shared execution context. | Strong system-level reference for memory, sandbox, and skills composition. | Higher setup complexity and a heavier runtime surface than most teams need initially. | Strong fit for compound research/coding workflows and teams studying full-stack agent architectures. Poor fit for lightweight orchestration or narrowly scoped tasks. | Hierarchical multi-agent orchestration. | Requires explicit sandbox policy, tool boundaries, and operator oversight before untrusted code execution. |
| SWE-agent | Experimental | Is: Autonomous SWE system using a specialized Agent-Computer Interface (ACI).<br>Demonstrates: Narrow action spaces and interface design tuned for code-repair tasks. | Streamlined command space, compressed history handling, and a clear task boundary for patch workflows. | Benchmark-oriented design, high token cost, and long end-to-end fix latency on larger tasks. | Strong fit for isolated PRs and self-contained bug fixes. Poor fit for broad refactors or environments without standard build tooling. | Single agent with a highly specialized action space (ACI). | Needs tight repository scoping, review gates, and execution controls to reduce silent code regressions. |
Memory is a first-class concern in agentic systems. Rather than treating memory as a simple array of previous messages, production systems require structured approaches to state, persistence, and retrieval.
Different types of memory serve distinct functional roles in an agentic architecture:
| Type | Definition | Implementation Examples |
|---|---|---|
| Working Memory (Thread State) | Short-term context for the current execution loop or active conversation thread. Ephemeral. | Context window, LangGraph State, in-memory message lists. |
| Episodic Memory | Autobiographical history of past actions, inputs, and outcomes. Enables reflection on past mistakes. | Checkpoint logs, event stores, prompt / trajectory histories. |
| Procedural Memory | Reusable skills, system prompts, and tool configurations. Defines how the agent operates. | Static configuration, retrieved skill libraries, GitHub workflows. |
| Semantic Memory | Embedded, factual knowledge about the world, the user, or the domain. Defines what the agent knows. | Vector databases (FAISS, Pinecone), knowledge graphs, Letta core memory. |
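The four memory types above are easiest to keep honest when they are explicitly separate stores rather than one message array. A minimal sketch, with all names illustrative rather than taken from any framework:

```python
from dataclasses import dataclass, field

@dataclass
class AgentMemory:
    # Working memory: ephemeral thread state for the current execution loop.
    working: list = field(default_factory=list)
    # Episodic memory: append-only log of past actions and outcomes.
    episodic: list = field(default_factory=list)
    # Procedural memory: prompts, skills, tool configs (how the agent operates).
    procedural: dict = field(default_factory=dict)
    # Semantic memory: facts about the world/user (what the agent knows).
    semantic: dict = field(default_factory=dict)

    def end_turn(self, outcome: str) -> None:
        """Demote the finished working context into episodic history."""
        self.episodic.append({"messages": list(self.working), "outcome": outcome})
        self.working.clear()

mem = AgentMemory(procedural={"system_prompt": "You are a helpful agent."})
mem.working.append({"role": "user", "content": "hi"})
mem.end_turn(outcome="greeted user")
```

The point of the separation is that each store can then get its own lifecycle policy (TTL for working memory, summarisation for episodic, validation for semantic writes), as discussed below.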
In multi-agent systems, memory boundaries are architectural decisions:
- Private Agent Memory: Each agent maintains its own semantic and episodic stores. Prevents context leakage and maintains strong role boundaries.
- Shared Workspace (Global Memory): A common blackboard or shared state where multiple agents read and write. Requires collision management and strict typing.
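A shared workspace can start as a typed blackboard with serialised writes; this sketch (all names hypothetical) shows the collision management and strict typing that the bullet above calls for:

```python
import threading

class Blackboard:
    """Shared global memory for multiple agents, with strict typing per key
    and a lock to manage write collisions."""

    def __init__(self, schema: dict):
        self._schema = schema          # key -> expected Python type
        self._data: dict = {}
        self._lock = threading.Lock()

    def write(self, agent: str, key: str, value):
        if key not in self._schema:
            raise KeyError(f"unknown key: {key}")
        if not isinstance(value, self._schema[key]):
            raise TypeError(f"{key} expects {self._schema[key].__name__}")
        with self._lock:  # collision management: serialise writers
            self._data[key] = {"value": value, "written_by": agent}

    def read(self, key: str):
        entry = self._data.get(key)
        return None if entry is None else entry["value"]

board = Blackboard(schema={"plan": str, "step_count": int})
board.write("planner", "plan", "1. search  2. summarise")
board.write("executor", "step_count", 2)
```

Recording `written_by` alongside each value keeps an audit trail of which agent last touched each key, which matters as soon as two agents disagree about shared state.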
Managing the memory lifecycle is critical for long-running agents.
| Mechanism | Description | Best Practices & Risks |
|---|---|---|
| Checkpointing | Saving the exact thread state at a specific point in time (e.g., node transitions). | Enables "time travel" (rewind and replay) and human-in-the-loop approvals. |
| Write Policies | Rules defining when and how an agent commits data to long-term storage. | Prefer explicit SaveMemory tool calls over passive auto-saving to maintain control. |
| Retrieval Triggers | Determining when to query past memory (e.g., pre-fetch vs. just-in-time). | Use vector search for semantic recall, but use explicit graph keys for structured state. |
| Summarisation / Compression | Reducing token counts of episodic histories. | Summarise older interactions into a rolling summary while preserving recent exact messages. |
| Pruning / Decay | Deleting or archiving old or irrelevant memories. | Implement TTL (time-to-live) for working memory to prevent context exhaustion. |
| Contamination / Poisoning | Malicious or incorrect data persisting in long-term memory. | Risk: Once poisoned, an agent's future logic breaks. Require validation or bounds on semantic writes. |
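The summarisation/compression row above (a rolling summary of older turns, with recent messages preserved verbatim) can be sketched like this; `summarise` is a stub where a real system would call a model:

```python
def summarise(messages):
    # Stub: a real implementation would ask an LLM for an abstractive summary.
    return f"[summary of {len(messages)} earlier messages]"

def compress_history(messages, keep_recent=4):
    """Summarise older interactions into a rolling summary while
    preserving the most recent messages exactly."""
    if len(messages) <= keep_recent:
        return messages
    older, recent = messages[:-keep_recent], messages[-keep_recent:]
    summary_msg = {"role": "system", "content": summarise(older)}
    return [summary_msg] + recent

history = [{"role": "user", "content": f"turn {i}"} for i in range(10)]
compressed = compress_history(history, keep_recent=4)
# compressed: one summary message plus the last 4 turns verbatim
```

`keep_recent` is the tuning knob: too small and the agent loses the exact wording it needs for the current task; too large and compression stops saving tokens.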
Specialised infrastructure for managing agent memory.
| System | Role | Description |
|---|---|---|
| LangGraph Persistence | Thread-level state | Built-in checkpointers (SQLite, Postgres) for graph-based execution loops, enabling interrupt/resume. |
| LangMem | Long-term memory extraction | LangChain's framework for extracting user preferences and entity profiles in the background. |
| Letta (formerly MemGPT) | OS-level memory abstraction | Advanced core memory management with explicit paging (read/write limits) to mimic virtual memory. |
| Mem0 | Personalized memory layer | Managed memory API focusing on user contexts, interactions, and entity relationships. |
| Zep / Graphiti | Enterprise memory & graphs | Fast, long-term memory for AI assistants; uses temporal knowledge graphs to map entity relationships over time. |
| MCP (Model Context Protocol) | Interoperability fabric | While not a DB itself, MCP provides a standard protocol to expose memory stores and file systems universally across tools and agents. |
Every major framework and architecture in this repository is judged against the following Required Scoring Dimensions. We evaluate systems based on engineering rigor, not marketing copy.
| Dimension | Evaluation Criteria |
|---|---|
| Control flow explicitness | How observable and deterministic is the execution path? |
| State model | How is agent state typed, managed, and persisted? |
| Memory support | Are there built-in primitives for short-term, episodic, and semantic memory? |
| Observability / tracing | Is it easy to trace intermediate reasoning steps and tool calls? |
| Human-in-the-loop support | Does it natively support interrupt-and-resume or approval gates? |
| Type safety / structured outputs | Are outputs guaranteed against strict schemas? |
| Provider portability | How tightly coupled is it to one specific LLM provider? |
| Security posture | Are there built-in mechanisms for sandboxing, access control, or guardrails? |
| Architectural strengths | Which design choices materially improve decomposition, control, state handling, or interface clarity? |
| Operational constraints | What deployment burden, runtime cost, debugging friction, or failure modes does it introduce? |
| Ecosystem maturity | How stable are the APIs, docs, integrations, and operator knowledge base? |
| Governance fit | Does it support auditability, approval gates, access boundaries, policy enforcement, and regulated environments? |
| Workload suitability | Which workflows, task shapes, and team contexts does it fit well or poorly? |
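Two of the rubric rows, human-in-the-loop support and control-flow explicitness, reduce in practice to an interrupt point before side-effecting actions. A framework-agnostic sketch, with all names illustrative:

```python
# Hypothetical set of tools that must never run without human sign-off.
SENSITIVE_TOOLS = {"send_email", "delete_records"}

def execute_with_approval(tool_name, args, approver):
    """Approval gate: sensitive tool calls are suspended until a human
    (the `approver` callback) explicitly allows them."""
    if tool_name in SENSITIVE_TOOLS and not approver(tool_name, args):
        return {"status": "rejected", "tool": tool_name}
    # ... actual tool dispatch would happen here; stubbed for the sketch
    return {"status": "executed", "tool": tool_name}

# A real system would persist state at the gate and resume after review;
# this stand-in approver simply rejects everything.
result = execute_with_approval("send_email", {"to": "x@example.com"},
                               approver=lambda tool, args: False)
```

Frameworks with native interrupt-and-resume implement exactly this gate as a persisted checkpoint instead of a blocking callback, which is why the rubric treats persistence and HITL support as linked dimensions.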
Canonical resources are trusted here because they define what counts as evidence. Prefer official docs, architecture guides, papers, benchmark repos, and first-party repositories when establishing capabilities, methodology, or interface details.
| Evidence Tag | Use For |
|---|---|
| [official] | Official docs, architecture guides, specifications, benchmark documentation, or first-party repositories. |
| [benchmark] | Published benchmark runs, evaluation papers, or benchmark repos tied to a named workload. |
| [field report] | Production write-ups, incident reports, engineering blogs, or operator notes about real deployments. |
| [author assessment] | This repository's synthesis after reviewing the sources above and applying the rubric. |
- Do not treat marketing copy, launch-day demos, or GitHub stars as sufficient evidence for production claims.
- Separate benchmark performance from production maturity. A benchmark result can support workload fit, but it does not by itself prove reliability, governance fit, cost control, or operational maturity.
- Record `Last reviewed: Month YYYY` in rapidly changing sections such as product lists, vendor capability summaries, and release-sensitive guidance.
- See appendix/benchmark-and-evidence-policy.md for the full policy.
| Framework | Ecosystem Maturity | Description | Architectural Strengths | Operational Constraints | Workload Suitability | Design Paradigm | Governance Fit |
|---|---|---|---|---|---|---|---|
| LangGraph | Production-ready | Is: Stateful orchestration framework for building directed graph workflows, including cycles.<br>Demonstrates: Deterministic execution control mixed with LLM reasoning. | Explicit state management, persistence, and support for complex multi-actor workflows. | Verbose abstractions, steep learning curve, and graph sprawl if the workflow is over-modeled. | Strong fit for multi-step, stateful, and interruptible agent systems. Poor fit for simple single-prompt completions or linear chains. | Graph-based state machine. | Good fit for auditable workflows and approval gates, but graph edges must be tightly constrained to avoid runaway loops. |
| CrewAI | Emerging | Is: Multi-agent collaboration framework where agents are assigned roles, goals, and tools.<br>Demonstrates: Role-based agentic workflows. | Simple mental model and fast team-based decomposition for prototypes. | Less control for highly complex or non-standard systems. | Strong fit for rapid prototyping of agent teams. Poor fit for deterministic execution, rigorous type safety, or custom orchestration loops. | Role-based sequential or hierarchical process execution. | Requires added guardrails and observability to manage emergent loops and inconsistent agent behaviour. |
| OpenAI Assistants / Agents APIs | Production-ready | Is: Hosted orchestration and state management by OpenAI.<br>Demonstrates: Managed state and tool execution. | Integrated tools, simplified operations, and reduced infrastructure ownership. | Limited transparency and control, with strong provider coupling. | Strong fit for managed environments and teams optimizing for delivery speed. Poor fit for provider portability, local models, or complex multi-agent setups. | Hosted black-box orchestration. | Viable for hosted approval flows, but bounded by vendor policy, uptime, and data-handling constraints. |
| Pydantic AI | Production-ready | Is: Framework built directly on Pydantic enforcing strict data validation and type-safe outputs from LLMs.<br>Demonstrates: Type-driven agentic execution and dependency injection. | Strong type-system integration, schema enforcement, dependency injection, and retry support. | Smaller surrounding ecosystem than older orchestration stacks; retry loops can increase latency and cost. | Strong fit for production systems needing strict type safety and predictable parsing. Poor fit for open-ended generative writing or weakly structured tasks. | Strongly typed, schema-first LLM interactions. | Good fit where schema validation and dependency control matter, but retry policies need explicit cost and failure bounds. |
| Smolagents | Emerging | Is: Minimalist framework using CodeAgents (Python logic code generation over JSON calling).<br>Demonstrates: Code-first model execution bounds. | Lightweight core and direct execution model that stays close to Python control flow. | Weak typed-state enforcement and high exposure if generated code runs with broad permissions. | Strong fit for fast prototyping and Python-native experimentation. Poor fit for regulated networks or systems that need strict sandboxing and observability. | Python-native logic execution via LLM generation. | Requires strong sandboxing, network controls, and review boundaries before production use. |
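The schema-first pattern that Pydantic AI embodies can be approximated with the standard library alone. This hedged sketch shows the validate-and-retry loop behind "validator feedback loops"; `call_llm` is a stub whose two canned responses simulate a malformed first attempt and a corrected retry:

```python
import json

# Illustrative target schema: field name -> expected Python type.
EXPECTED = {"name": str, "age": int}

def validate(payload: dict) -> list:
    """Return a list of schema errors (empty means valid)."""
    errors = [f"missing field: {k}" for k in EXPECTED if k not in payload]
    errors += [
        f"{k} should be {t.__name__}"
        for k, t in EXPECTED.items()
        if k in payload and not isinstance(payload[k], t)
    ]
    return errors

def call_llm(prompt: str) -> str:
    # Stub: the first attempt returns a malformed age, the retry fixes it.
    if "should be int" in prompt:
        return '{"name": "Ada", "age": 36}'
    return '{"name": "Ada", "age": "thirty-six"}'

def extract(prompt: str, max_retries: int = 2) -> dict:
    """Validator feedback loop: re-prompt with the errors until valid."""
    for _ in range(max_retries + 1):
        payload = json.loads(call_llm(prompt))
        errors = validate(payload)
        if not errors:
            return payload
        prompt = f"{prompt}\nFix these errors: {'; '.join(errors)}"
    raise ValueError("could not produce schema-valid output")

record = extract("Extract the person as JSON.")
```

The retry-with-errors step is where real frameworks spend cost and latency, which is why the table above flags retry policies as needing explicit cost and failure bounds.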
| Framework | Lang | Description |
|---|---|---|
| LangChain | Py/JS | Modular framework with chains, tools, memory, and broad integration coverage. |
| LangGraph | Py/JS | Graph-based orchestration. Stateful directed graphs. |
| LlamaIndex | Py/JS | Data-centric framework for retrieval-heavy and RAG-oriented agent systems. |
| Haystack | Py | Pipeline-based. Search and retrieval. |
| Semantic Kernel | C#/Py/Java | Microsoft enterprise. Azure integration. |
| Pydantic AI | Py | Type-safe. Clean Pythonic API. Production-ready. |
| DSPy | Py | Stanford. Programming not prompting. Auto-optimizes. |
| Mastra | TS | TypeScript-first. Observational Memory. Apache 2.0. |
| Anthropic SDK | Py/TS | Official Claude SDK. Tool use, computer control, streaming. |
| Framework | Lang | Description |
|---|---|---|
| AutoGen | Py | Microsoft multi-agent conversations. |
| CrewAI | Py | Role-based crew members with goals and tools. |
| MetaGPT | Py | PM, architect, engineer roles. Software company sim. |
| OpenAI Agents SDK | Py | Official. Multi-step agents with handoffs. |
| Google ADK | Py | Native Gemini. Multi-agent orchestration. |
| Strands Agents | Py | AWS-backed. Model-driven tool use. |
| CAMEL | Py | Role-based simulation. Collaborative reasoning. |
| AutoGPT | Py | Pioneer. Now full platform with visual builder. |
| AgentScope | Py | Alibaba multi-agent framework. |
| DeerFlow | Py | ByteDance orchestration system for planning, tools, memory, and execution. |
| Framework | Lang | Description |
|---|---|---|
| Smolagents | Py | HuggingFace minimal agents. ~1000 lines. |
| Agno | Py | Lightweight, model-agnostic. |
| Upsonic | Py | MCP support. Minimal setup. |
| Portia AI | Py | Reliable agents in production. |
| MicroAgent | Py | Self-editing prompts and code. |
| Protocol | Description |
|---|---|
| MCP (Model Context Protocol) | Open standard for exposing tools, memory, and file systems to agents. |
| A2A (Agent-to-Agent) | Google protocol for inter-agent communication. |
| OpenAI Function Calling | OpenAI native tool-use. JSON schema. |
| Tool Use (Anthropic) | Claude native tool-use. Structured JSON. |
| OpenAPI | Industry-standard API spec. Foundation for agent tools. |
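As a concrete anchor for the table above, a tool definition in the JSON-Schema parameter style used by OpenAI function calling looks roughly like this. The envelope fields can differ across API versions, so treat the layout as illustrative:

```python
# Illustrative tool definition; the "parameters" value is plain JSON Schema.
get_weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["city"],
        },
    },
}

# The same schema can gate arguments locally before dispatching the tool.
def check_required(fn_schema: dict, args: dict) -> bool:
    return all(k in args for k in fn_schema["parameters"].get("required", []))
```

Because the parameter block is standard JSON Schema, the same definition can be validated client-side, exposed over MCP, or converted from an OpenAPI operation.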
This section covers frameworks and operational tooling for testing agent quality, correctness, task completion, regressions, and system behaviour, as well as security scanning, red teaming, policy testing, and misalignment research.
- Output correctness
- Reasoning quality
- Tool-use accuracy
- Latency and cost
- Robustness under adversarial input
| Framework | Description | Methodology / Workload Suitability |
|---|---|---|
| OpenAI Evals | Core framework for testing and improving AI systems. | Foundational evaluation framework and methodology. |
| DeepEval | Dedicated open-source LLM evaluation framework with metrics for hallucination, answer relevance, task completion, etc. | Application-level evaluation and regression testing. |
| promptfoo | CLI and library for evaluation and red teaming of LLM apps. | Regression testing, prompt/application evals, adversarial testing. |
| Inspect | UK AI Security Institute's framework for rigorous LLM evals covering coding, reasoning, agent behavior, and model-graded scoring. | Rigorous research-grade and agent-task evaluation. |
- Golden datasets
- Regression testing
- Adversarial / red-team inputs
- Continuous evaluation pipelines
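The operational practices above (golden datasets, regression testing, continuous evaluation) can start as nothing more than a fixture list and a threshold check. A minimal sketch, with `run_agent` stubbed in place of the real agent under test:

```python
# Golden dataset: frozen input/expected pairs checked on every change.
GOLDEN = [
    {"input": "2 + 2", "expected": "4"},
    {"input": "capital of France", "expected": "Paris"},
]

def run_agent(prompt: str) -> str:
    # Stub standing in for the real agent under evaluation.
    return {"2 + 2": "4", "capital of France": "Paris"}.get(prompt, "")

def regression_check(threshold: float = 1.0) -> float:
    """Fail the pipeline if the pass rate drops below the threshold."""
    passed = sum(run_agent(c["input"]) == c["expected"] for c in GOLDEN)
    rate = passed / len(GOLDEN)
    if rate < threshold:
        raise AssertionError(f"pass rate {rate:.0%} below {threshold:.0%}")
    return rate

rate = regression_check()
```

Exact-match scoring is only a starting point; frameworks like DeepEval and promptfoo replace the equality check with model-graded or metric-based scorers while keeping this same dataset-plus-threshold shape.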
| Tool | Description |
|---|---|
| Langfuse | OSS LLM observability. Traces, evals, prompts. |
| LangSmith | LangChain platform. Tracing, testing, evaluation. |
| Braintrust | Eval-driven development. Experiment tracking. |
| Arize Phoenix | OSS AI observability. Traces, evals, embeddings. |
| Helicone | OSS LLM observability. One-line integration. |
| Weights and Biases Weave | Trace and evaluate LLM apps. |
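Before adopting one of the platforms above, tool-call tracing can be bootstrapped with a thin wrapper; this sketch is framework-agnostic, records spans to an in-memory list, and all names are illustrative:

```python
import functools
import time

TRACE: list = []  # in-memory trace sink; a real system would export spans

def traced(fn):
    """Record name, arguments, duration, and outcome of each tool call."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        status = "error"
        try:
            result = fn(*args, **kwargs)
            status = "ok"
            return result
        finally:
            TRACE.append({
                "tool": fn.__name__,
                "args": args,
                "duration_s": time.perf_counter() - start,
                "status": status,
            })
    return wrapper

@traced
def search(query: str) -> str:
    return f"results for {query}"

search("agent memory")
```

Even this crude span log answers the question observability tooling exists for: which tool was called, with what, how long it took, and whether it failed.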
| Benchmark | Description |
|---|---|
| SWE-bench | Coding-agent benchmark grounded in real GitHub issues and patches. |
| AgentBench | 8-environment LLM agent benchmark. |
| Terminal-Bench | Evaluates terminal-agent execution on shell-based tasks. |
| GAIA | General AI Assistant. Real-world tasks. |
| WebArena | Web agent benchmark. Real websites. |
| ⚠️ Threat | 🛡️ Mitigation Strategies |
|---|---|
| Prompt injection (direct & indirect) | Input validation and filtering |
| Tool misuse | Tool permissioning and sandboxing |
| Data exfiltration | Human-in-the-loop approval gates |
| Memory poisoning | Audit logs and traceability |
| Unbounded autonomous behaviour | Policy-driven execution |
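The tool-misuse row maps directly to an allowlist at the dispatch boundary: deny by default, grant per agent. A minimal sketch, with the policy format purely illustrative:

```python
# Per-agent tool allowlist: deny by default, grant explicitly.
POLICY = {
    "researcher": {"web_search", "read_file"},
    "writer": {"read_file"},
}

def dispatch(agent: str, tool: str, args: dict):
    """Deny-by-default tool dispatch: unlisted agent/tool pairs are refused."""
    if tool not in POLICY.get(agent, set()):
        raise PermissionError(f"{agent} may not call {tool}")
    # ... sandboxed execution would go here; stubbed for the sketch
    return {"tool": tool, "args": args, "status": "executed"}

dispatch("researcher", "web_search", {"q": "agent safety"})  # allowed
```

Raising rather than silently skipping matters: a refused call should surface in traces and audit logs, which is how the permissioning row connects to the traceability row above it.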
| Resource | Description | Workload Suitability | Official Link |
|---|---|---|---|
| garak | LLM vulnerability scanner probing for hallucination, leakage, injection, toxicity, and jailbreaks. | Automated red teaming & vulnerability scanning | GitHub |
| OWASP GenAI Security Project | Governance and mitigation framework for safety risks in LLMs and agentic systems. | Governance, controls, and secure-design reference | Project Home |
| Anthropic Alignment Stress-Testing | Research and operational approach for deliberately stress-testing alignment evals and oversight. | Research-driven safety evaluation methodology | Post |
| Model Organisms of Misalignment | In-vitro demonstrations of alignment failures so they can be studied empirically. | Advanced safety research and methodology | Post |
| AI Safety via Debate | Alignment framework for cases where direct human supervision is too hard. | Alignment and scalable oversight resource | Paper |
| Concrete Problems in AI Safety | Foundational framing paper for safety problems (side effects, reward hacking, safe exploration, shift). | Foundational safety resource | Paper |
| Anthropic Agentic Misalignment | Grounds safety concerns in concrete behaviours (blackmail, espionage) in simulated settings. | Applied safety & threat-modelling reference | Research Post |
| Tool | Description |
|---|---|
| Guardrails AI | Structural, type, quality guarantees for LLM outputs. |
| NeMo Guardrails | NVIDIA. Programmable conversation guardrails. |
| LLM Guard | Security toolkit. Input/output scanning. |
| Rebuff | Prompt injection detection. |
| Lakera Guard | Real-time protection. Prompt injection, data leakage, toxicity. |
Building agentic systems requires a shift in skillset:
- Problem decomposition
- System design and orchestration
- Tool and interface design
- Memory modelling
- Evaluation design
- Failure mode analysis
- Safety and governance thinking
To keep this repository genuinely opinionated, we advocate against these common anti-patterns:
- Do not begin with multi-agent systems when a single agent plus tools will do. Escalate to multi-agent only when task decomposition requires it.
- Do not add memory before defining what deserves persistence. Avoid "state bloat" by being intentional about what is stored and why.
- Do not treat tracing as optional for long-running systems. Observability is the only way to debug non-deterministic agentic failures.
- Do not confuse benchmark wins with production readiness. Real-world reliability requires evaluation on your specific data and edge cases.
- Do not use framework abstractions as a substitute for architecture. Understand your control flow before outsourcing it to a library.
- ⭐ Production-grade
- 🧪 Experimental
- ⚠️ Early-stage / unstable
- Choose a core pattern (e.g. single-agent + tools)
- Add structured tool use
- Introduce evaluation early
- Layer in memory only when needed
- Expand into multi-agent systems with clear roles
- Add observability and safety constraints
Contributions are welcome! Please read the CONTRIBUTING.md for full details before submitting a pull request.
At a high level, submissions must meet the following criteria:
- Clear description of purpose
- Architectural strengths and operational constraints
- Governance fit and workload suitability
- Evidence of ecosystem maturity or real-world usage (preferred)
- Evidence tags and `Last reviewed` markers where claims are time-sensitive or likely to change
This is a curated list, not an exhaustive one.
See appendix/benchmark-and-evidence-policy.md for the sourcing, evidence-tagging, and `Last reviewed` policy.
The shift to agentic systems is not about more tools.
It is about:
- Designing systems that can reason, act, evaluate, and improve
- Ensuring those systems are reliable, observable, and safe
Build accordingly.