A Proof of Concept for building an intelligent company assistant that combines internal knowledge bases with web search capabilities. The primary goal of this project is to compare different agent architectures and evaluate their trade-offs in terms of accuracy, cost, and flexibility.
This project implements a "Company ChatGPT", an AI assistant that can:
- Answer questions using internal company documents (Markdown knowledge base)
- Fetch real-time information via web search when needed
- Use its intrinsic knowledge for general queries
- Block harmful or inappropriate requests following safety guidelines
- Ask clarifying questions when queries are ambiguous
What sets this project apart is its evaluation-driven development approach: I've built a repo to objectively compare two agent architectures and select the optimal one based on accuracy and cost metrics.
This POC implements two distinct agent architectures to explore different approaches to agentic AI systems.
The goal of this POC is to empirically determine which architecture best fits the problem at hand. Each approach has theoretical trade-offs, but only real experiments can reveal which matters more in practice.
LangChain ReAct Agent was chosen because it requires significantly fewer lines of code to implement, the autonomous iteration loop (action → observation → decide again) could theoretically improve accuracy on complex queries, and many components are built-in and battle-tested, reducing development time and potential bugs.
Classic Orchestrator was chosen because it provides full control over the execution flow with explicit, predictable steps, it allows integration with Instructor for structured outputs (enabling chain-of-thought control and automatic retries), and the problem domain is relatively simple: a company assistant doesn't need multi-step reasoning chains.
Both architectures were evaluated on the same 67 test cases across 5 categories using GPT-4.1-mini via OpenRouter. The results clearly favor the Classic approach for this use case.
Overall Comparison (excluding evaluation/judge costs):
| Metric | Classic | LangChain | Winner |
|---|---|---|---|
| Overall Accuracy | 98.51% | 91.04% | Classic (+7.5%) |
| Total Cost | $0.105 | $0.138 | Classic (-24%) |
| Total Tokens | 110,018 | 177,593 | Classic (-38%) |
Accuracy by Category:
| Category | Classic | LangChain |
|---|---|---|
| COMPANY | 100% (30/30) | 96.7% (29/30) |
| GENERAL | 100% (10/10) | 100% (10/10) |
| AMBIGUOUS | 100% (6/6) | 50% (3/6) |
| HARMFUL | 100% (10/10) | 90% (9/10) |
| WEB_SEARCH | 90.9% (10/11) | 90.9% (10/11) |
Web Search Provider Comparison (Classic architecture):
| Provider | Accuracy | Cost per Query | Notes |
|---|---|---|---|
| DuckDuckGo | 0% (0/11) | Free | Unusable: results too inconsistent |
| Serper | 72.7% (8/11) | ~$0.001 | Good balance of cost and quality |
| Perplexity | 90.9% (10/11) | ~$0.005 | Best accuracy, 5x more expensive |
Key Observations:
The Classic architecture significantly outperforms LangChain on ambiguous queries (100% vs 50%). This is likely because Instructor's structured output forces the orchestrator to explicitly choose "clarify" as an action, while LangChain's free-form reasoning sometimes attempts to answer ambiguous questions directly.
The Classic approach uses 38% fewer tokens even though every query costs it a fixed two LLM calls. This is because LangChain's ReAct loop includes verbose reasoning traces and sometimes performs unnecessary tool calls, inflating the context window on each iteration.
An important advantage of the Classic architecture is the ability to use different models for different stages. The orchestrator (routing) step requires fast, cheap decisions, while the generator step benefits from higher quality output. This means you could use a smaller model like GPT-4.1-mini for routing and a more capable model like GPT-4o for generation, optimizing the cost/quality trade-off. LangChain's ReAct agent uses a single model for the entire loop, limiting this flexibility.
For web search, DuckDuckGo proved completely unusable for this task. Serper provides a reasonable middle ground, while Perplexity delivers the best results at a higher cost. The final implementation uses Serper as the default, with Perplexity available for use cases requiring higher accuracy.
For this specific problem (a company assistant with a small knowledge base), the autonomous multi-step capability of LangChain provides no benefit β queries are simple enough that a single routing decision suffices.
A streamlined, single-pass architecture where an orchestrator decides the action, executes it, and generates the response.
```mermaid
flowchart TD
    A[User Input] --> B{Orchestrator}
    B -->|knowledge_base| C[Search Documents]
    B -->|web_search| D[Web Search]
    B -->|intrinsic| E[LLM Knowledge]
    B -->|clarify| F[Ask Clarification]
    B -->|blocked| G[Safety Block]
    C --> H[Generator]
    D --> H
    E --> I[Response]
    F --> I
    G --> I
    H --> I
    style B fill:#e1d5e7,stroke:#9673a6
    style H fill:#d5e8d4,stroke:#82b366
    style I fill:#1a1a2e,stroke:#16213e,color:#fff
```
How it works:
- The Orchestrator analyzes the query and decides which action to take
- A single tool is invoked (knowledge base, web search, or intrinsic knowledge)
- The Generator creates a response based on the retrieved context
- Response is returned to the user
Pros: Predictable costs, faster responses, easier to debug
Cons: Limited flexibility for complex multi-step queries
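The single-pass control flow described above can be sketched in a few lines. All names here (`Decision`, `orchestrate`, the action labels) are illustrative stand-ins, not the repo's actual API:

```python
from dataclasses import dataclass

# Action labels mirroring the routing arrows in the diagram above.
ACTIONS = {"knowledge_base", "web_search", "intrinsic", "clarify", "blocked"}

@dataclass
class Decision:
    action: str
    reasoning: str

def orchestrate(query: str, decide, tools, generate) -> str:
    """One routing decision, at most one tool call, one generation step."""
    decision: Decision = decide(query)
    if decision.action not in ACTIONS:
        raise ValueError(f"unknown action: {decision.action}")
    if decision.action == "clarify":
        return "Could you clarify your question?"
    if decision.action == "blocked":
        return "Sorry, I can't help with that request."
    if decision.action == "intrinsic":
        return generate(query, context=None)
    # knowledge_base or web_search: invoke exactly one tool, then generate.
    context = tools[decision.action](query)
    return generate(query, context=context)
```

Because the number of LLM calls is fixed (one for `decide`, at most one for `generate`), cost and latency per query are predictable by construction.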
A more sophisticated architecture using LangChain's ReAct pattern, where the agent autonomously decides how many steps to take.
```mermaid
flowchart TD
    A[User Input] --> B{Model}
    B -->|action| C[Tools]
    C -->|observation| B
    B -->|finish| D[Output]
    subgraph Tools
        C --> E[Knowledge Base]
        C --> F[Web Search]
    end
    style B fill:#e1d5e7,stroke:#9673a6
    style D fill:#1a1a2e,stroke:#16213e,color:#fff
    style Tools fill:#f5f5f5,stroke:#666
```
How it works:
- The Model receives the query and decides on an action
- If it needs information, it calls a Tool (knowledge base or web search)
- The tool returns an observation back to the model
- The model can iterate (call more tools) or decide to finish
- This loop continues until the model has enough information
Pros: Can handle complex queries requiring multiple sources, more flexible
Cons: Variable costs (more LLM calls), harder to predict behavior
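The action/observation loop can be sketched as a toy stand-in (this is not LangChain's implementation; a real agent would prompt an LLM where the `model` callable appears):

```python
def react_loop(query, model, tools, max_steps=5):
    """Minimal ReAct-style loop: the model either picks a tool or finishes.

    `model` takes the transcript so far and returns either ("finish", answer)
    or (tool_name, tool_input); `tools` maps tool names to callables.
    """
    transcript = [f"Question: {query}"]
    for _ in range(max_steps):
        kind, payload = model(transcript)
        if kind == "finish":
            return payload
        observation = tools[kind](payload)
        # Each iteration appends to the transcript, which is why context
        # (and token usage) grows with every extra step.
        transcript.append(f"Action: {kind}({payload})")
        transcript.append(f"Observation: {observation}")
    return "Sorry, I couldn't find an answer in time."
```

The growing transcript makes the cost of each query depend on how many iterations the model chooses to take, which is exactly the variable-cost trade-off noted above.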
The heart of this POC is the evaluation system that allows objective comparison between architectures.
When building agentic systems, you face critical trade-offs:
| Factor | Classic | LangChain |
|---|---|---|
| Cost per query | Predictable (2 LLM calls) | Variable (2-N calls) |
| Accuracy | Good for simple queries | Better for complex queries |
| Latency | Lower | Higher |
| Debuggability | Easier | More complex |
The evaluation framework lets you measure these trade-offs empirically rather than guessing.
| Category | Description | Example |
|---|---|---|
| `COMPANY` | Internal knowledge questions | "What's the vacation policy?" |
| `GENERAL` | Common knowledge queries | "What is Python?" |
| `WEB_SEARCH` | Real-time information | "Who won the latest Champions League?" |
| `AMBIGUOUS` | Queries needing clarification | "How do I request time off?" |
| `HARMFUL` | Policy-violating requests | Blocked queries |
Responses are evaluated using an LLM judge that checks content correctness against expected answers, appropriate handling of blocked queries, and proper clarification requests for ambiguous queries.
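Aggregating the judge's verdicts into the per-category accuracy figures reported earlier is straightforward. This sketch assumes each test case yields a `(category, passed)` pair (the helper name is illustrative, not the repo's actual API):

```python
from collections import defaultdict

def accuracy_by_category(results):
    """results: iterable of (category, passed) pairs from the LLM judge."""
    totals = defaultdict(lambda: [0, 0])  # category -> [passed count, total count]
    for category, passed in results:
        totals[category][0] += int(passed)
        totals[category][1] += 1
    return {category: passed / total for category, (passed, total) in totals.items()}
```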
- Docker Desktop installed
- OpenRouter API key (or your own LLM API keys)
```bash
git clone https://github.com/LucaZilli/company_assistant.git
cd company_assistant
```

Copy the example environment file and add your API key:

```bash
cp .env.example .env
```

Then edit `.env` and set your API key:

```bash
OPENROUTER_API_KEY=sk-or-v1-your-key-here
SERPER_API_KEY=your-key-here  # optional: only needed if you want to use Serper
```

The default configuration uses no cache, but you are encouraged to try it by setting:

```bash
CACHE_ENABLED=True
```

and then rebuilding the Docker containers.
```bash
# Build the Docker containers
docker compose build

# Start the services (PostgreSQL for caching + app)
docker compose up -d

# Run database migrations (wait a few seconds for PostgreSQL to be ready)
docker compose exec app python main.py db-migrate
```

Note: Use `docker compose` (without the hyphen) with recent versions of Docker Desktop. Older versions may require `docker-compose`.
```bash
# Interactive chat with Classic architecture
docker compose exec -it app python main.py chat

# Interactive chat with LangChain architecture
docker compose exec -it app python main.py chat-langchain
```

Once inside the chat, try these example queries to see the assistant in action:
1. Company-related query:
You: What is our vacation policy at zuru?
The assistant retrieves information from the internal knowledge base and responds with company-specific details.
2. General knowledge query:
You: What is Python?
The assistant uses its intrinsic knowledge to answer general questions without searching external sources.
3. Web search query:
You: Who is the current Italian prime minister?
The assistant recognizes this requires up-to-date information and performs a web search to provide the current answer.
4. Ambiguous query requiring clarification:
You: Who should I contact?
The assistant recognizes the ambiguity and asks a clarifying question before providing an answer.
5. Restricted/harmful query:
You: How do I hack into the company database?
The assistant detects the harmful intent and politely refuses to help, following safety guidelines.
| Command | Description |
|---|---|
| `python main.py chat` | Classic architecture chat |
| `python main.py chat-langchain` | LangChain architecture chat |
| `python main.py chat -d` | Classic with debug output |
| `python main.py chat-langchain -d` | LangChain with debug output |
| `python main.py db-migrate` | Run database migrations |
| `python main.py db-status` | Show migration status |
| Command | Description |
|---|---|
| `quit` / `exit` | Exit the assistant |
| `reset` | Clear conversation history |
| `docs` | List loaded documents |
| `cache` | Show cache statistics |
| `cache clear` | Clear the cache |
The `-d` flag shows all LLM inputs/outputs for understanding agent behavior:

```bash
docker compose exec -it app python main.py chat -d
docker compose exec -it app python main.py chat-langchain -d
```

```bash
# Evaluate Classic architecture (default)
docker compose exec app python evaluations/run_eval.py

# Evaluate LangChain architecture
docker compose exec app python evaluations/run_eval.py -a langchain
```

```bash
# Single category
docker compose exec app python evaluations/run_eval.py -c COMPANY

# Multiple categories
docker compose exec app python evaluations/run_eval.py -c COMPANY GENERAL WEB_SEARCH

# List available categories
docker compose exec app python evaluations/run_eval.py -l
```

| Flag | Description |
|---|---|
| `-c`, `--category` | Categories to test |
| `-a`, `--assistant` | Architecture: `agent` or `langchain` |
| `-l`, `--list` | List available categories |
Results are saved in evaluations/results/ as CSV and JSON files with detailed metrics including accuracy per category and token usage.
```
company-assistant/
├── main.py
├── config.py
├── requirements.txt
├── Dockerfile
├── docker-compose.yml
├── knowledge_base/
│   ├── coding_style.md
│   ├── company_policies.md
│   └── company_procedures.md
├── src/
│   ├── migrations.py
│   ├── assistants/
│   │   ├── classic/
│   │   │   ├── agent.py
│   │   │   └── orchestrator.py
│   │   └── langchain/
│   │       └── langchain_company_assistant.py
│   └── shared/
│       ├── cache.py
│       ├── knowledge.py
│       ├── llm.py
│       ├── logging.py
│       ├── safety.py
│       ├── usage_tracker.py
│       └── web_search.py
├── evaluations/
│   ├── run_eval.py
│   ├── test_cases.py
│   └── results/
└── migrations/
```
I chose Instructor as the LLM interface layer rather than raw API calls. Instructor wraps the OpenAI client and provides automatic retries with exponential backoff when the model fails to produce valid output, Pydantic validation ensuring responses match the expected schema, and easy control over output ordering (e.g., forcing chain-of-thought reasoning before the final answer). This makes the orchestrator significantly more reliable: when asking the model to decide which tool to use, I need a guaranteed structured response, not free-form text that might fail to parse.
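The retry-until-valid behavior that Instructor automates can be illustrated with a small dependency-free stand-in (`structured_call` and the validation rule are hypothetical; the real project relies on Instructor's Pydantic-based `response_model` instead of hand-rolled JSON parsing):

```python
import json

def structured_call(llm, prompt, validate, max_retries=3):
    """Call `llm` until it emits JSON that passes `validate`, feeding the
    failure reason back into the prompt on each retry."""
    last_error = None
    for _ in range(max_retries):
        raw = llm(prompt if last_error is None else f"{prompt}\nFix: {last_error}")
        try:
            data = json.loads(raw)
            validate(data)  # raises ValueError if the schema is violated
            return data
        except (json.JSONDecodeError, ValueError) as exc:
            last_error = str(exc)
    raise RuntimeError(f"No valid structured output after {max_retries} tries: {last_error}")
```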
The current knowledge base consists of just three Markdown documents, which fit comfortably within a single context window. Adding a retrieval layer (embeddings + vector search) would introduce unnecessary complexity and latency for this scale. However, the architecture is designed to scale: PostgreSQL is already in place, and adding a vector extension like pgvector would enable hybrid search (combining keyword and semantic search) without introducing new infrastructure.
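With three small files, "loading the knowledge base" can be as simple as concatenating the Markdown documents into a single context block. This sketch illustrates the no-retrieval approach (the function name and heading format are illustrative, not the repo's actual `knowledge.py` API):

```python
from pathlib import Path

def load_knowledge_base(directory: str) -> str:
    """Concatenate all Markdown docs into one context string.

    At this scale the result fits comfortably in a single context window,
    so no embedding/vector-search layer is needed.
    """
    parts = []
    for path in sorted(Path(directory).glob("*.md")):
        parts.append(f"## {path.name}\n{path.read_text(encoding='utf-8')}")
    return "\n\n".join(parts)
```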
The usage_tracker module is a simple in-memory class that accumulates token counts and estimated costs across all LLM calls during an evaluation run. This allows quick comparison between architectures (Classic vs LangChain) without external dependencies. For a production system, this data could easily be persisted to PostgreSQL for long-term cost analysis and optimization.
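A minimal version of such a tracker might look like this (the class shape and the per-1k-token prices are illustrative assumptions, not the module's actual implementation):

```python
class UsageTracker:
    """In-memory accumulator for token counts and estimated cost."""

    def __init__(self, price_per_1k_input=0.0004, price_per_1k_output=0.0016):
        # Prices are hypothetical placeholders, in dollars per 1k tokens.
        self.input_tokens = 0
        self.output_tokens = 0
        self.price_in = price_per_1k_input
        self.price_out = price_per_1k_output

    def record(self, input_tokens: int, output_tokens: int) -> None:
        """Call once per LLM response with the usage figures it reports."""
        self.input_tokens += input_tokens
        self.output_tokens += output_tokens

    @property
    def cost(self) -> float:
        """Estimated dollar cost of everything recorded so far."""
        return (self.input_tokens * self.price_in
                + self.output_tokens * self.price_out) / 1000
```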
I selected PostgreSQL because I'm already familiar with it and it's more than fast enough for this use case β checking a local cache is always faster than making an API call. The cache stores query-response pairs with a configurable TTL, reducing costs for repeated questions. As a bonus, PostgreSQL can be extended with pgvector if semantic retrieval becomes necessary, avoiding the need for a separate vector database.
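The caching logic can be sketched with an in-memory dict standing in for the PostgreSQL table (names and TTL handling are illustrative):

```python
import time

class TTLCache:
    """Query-response cache where each entry expires after a fixed TTL."""

    def __init__(self, ttl_seconds: float = 3600):
        self.ttl = ttl_seconds
        self._store = {}  # query -> (response, stored_at)

    def get(self, query: str):
        """Return the cached response, or None on a miss or expired entry."""
        entry = self._store.get(query)
        if entry is None:
            return None
        response, stored_at = entry
        if time.monotonic() - stored_at > self.ttl:
            del self._store[query]  # evict the expired entry
            return None
        return response

    def set(self, query: str, response: str) -> None:
        self._store[query] = (response, time.monotonic())
```

A cache hit skips both LLM calls entirely, which is where the cost savings for repeated questions come from.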
I tested three web search options with different cost/quality trade-offs:
| Provider | Cost | Result Quality | Notes |
|---|---|---|---|
| DuckDuckGo | Free | Poor | Inconsistent results, often irrelevant |
| Serper | ~$0.001/query | Good | Best balance of cost and quality |
| Perplexity | ~$0.005/query | Excellent | Best results but 5x more expensive |
The evaluation framework made this comparison straightforward: running the same test suite against each provider revealed clear accuracy differences that justified the cost increase from DuckDuckGo to Serper.
Docker ensures the application runs identically on any machine regardless of operating system, Python version, or installed dependencies. The docker-compose.yml orchestrates both the application container and PostgreSQL, making setup a single command (docker compose up) rather than a multi-step installation process.