AI Engineering Take‑Home -- Research Assistant

(Example inspiration: TODO/plan executors such as Cursor — plan → execute → log)

Build a small AI agent that helps users tackle complex goals by breaking them into actionable steps and executing them.

Demo video: https://drive.google.com/file/d/1GfZFA9w_loGDRyLv-4hhla3V6mYl2jlR/view?usp=sharing

Run instructions

Requirements

  • Python 3.12+
  • uv package manager

Installation

git clone <repo-url>
cd testtask_aiagent
uv sync

Configuration

Copy the example env file and add your API keys:

cp .env.example .env

Required keys:

  • OPENAI_API_KEY — OpenAI API key
  • TAVILY_API_KEY — Tavily API key for web search

Usage

uv run python -m agent

Development

uv run pytest              # run tests
uv run ruff check src/     # lint

General

  • What type of goals or domain to focus on - General Research Assistant
  • How the AI interaction works (chat, CLI, minimal UI, etc.) - Rich chat in CLI
  • What level of automation vs. user confirmation you provide:
    • User states the task
    • The plan is built
    • User rejects the plan and adds clarifications
    • The new plan is built and accepted
    • The research is done (by calling assistants internally)
    • The final report is provided to the user
    • The user adds some new demands/questions/instructions
    • The new plan is built based on the previous context and new demands
    • ...
    • Persistence - the main user conversation can be resumed from JSON (a minimal sketch follows below)
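
A minimal sketch of the JSON persistence idea (the helper names and file name here are illustrative, not the repository's actual API):

import json
from pathlib import Path


def save_conversation(messages: list[dict], path: Path) -> None:
    """Dump the main conversation (OpenAI-style message dicts) to disk."""
    path.write_text(json.dumps(messages, ensure_ascii=False, indent=2))


def load_conversation(path: Path) -> list[dict]:
    """Resume a previous session, or start fresh if no file exists."""
    if path.exists():
        return json.loads(path.read_text())
    return []


# Usage: load on startup, append new turns, save after each exchange
history = load_conversation(Path("conversation.json"))
history.append({"role": "user", "content": "Continue the previous research"})
save_conversation(history, Path("conversation.json"))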

Please tell us how you spent your time and what trade‑offs you made.

I spent about 2 days in total:

  • 2-3 hours planning
  • 1h finalizing the blueprint with Claude
  • 5h running Claude to implement the basic structures
  • 5h cleaning up and making it work
  • 2h adding a token counter and context handling
  • 3h running final demos (adjusted one prompt) and writing the report

What has been done:

1. Context & Prompt Engineering (35%)

  • Clear prompt structure and instructions - prompts/templates
  • Thoughtful context selection (what to keep vs. drop) - assistant contexts are separate; only reports are shared throughout the system
  • Basic handling of longer conversations or state / Avoiding prompt bloat - see task_executor, plan_executor, and __main__; for handling of a very short context window, see the demo transcript demo_transcripts/1_eu_diesel_ban_5k_context_window_85e9aed3
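
A rough sketch of this "share only the reports" context strategy (function and field names are assumptions; the actual message layout lives in prompts/templates and the executors):

def build_task_context(goal: str, prior_reports: list[str], task: str) -> list[dict]:
    """Each sub-assistant sees the main goal, the reports of finished tasks,
    and its own task, but not the other assistants' raw tool-call transcripts."""
    context = [{"role": "system", "content": f"Overall research goal: {goal}"}]
    for i, report in enumerate(prior_reports, start=1):
        context.append({"role": "user",
                        "content": f"Report from completed task {i}:\n{report}"})
    context.append({"role": "user", "content": f"Your task: {task}"})
    return context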

2. Agent Loop & Tool Use (45%)

  • High‑level goal → structured TODO list - planner
  • Simple execution loop (select task → execute → update status) - plan_executor (a rough sketch follows after this list)
  • Integration of at least one real tool (web search, document reading, API call, vector search, etc.) - web search & extraction (Tavily API) - tools
  • Transparent logging of what the agent is doing - full debug logging to a file, rich-formatted transcripts of the main conversation and task flows (see demo_transcripts), and basic MLflow tracing
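
A rough sketch of the select → execute → update loop mentioned above (class and function names are illustrative; the real implementation is in planner and plan_executor):

from dataclasses import dataclass, field


@dataclass
class Task:
    description: str
    status: str = "pending"   # pending -> running -> done
    report: str = ""


@dataclass
class Plan:
    tasks: list[Task] = field(default_factory=list)


def execute_plan(plan: Plan, run_task) -> list[str]:
    """Take the next pending task, run it, record its report, update its status."""
    reports: list[str] = []
    for task in plan.tasks:
        task.status = "running"
        task.report = run_task(task.description, reports)  # sub-assistant call
        task.status = "done"
        reports.append(task.report)
    return reports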

3. Evaluation & Communication (20%)

  • Clear explanation of how you would test or evaluate the system - that is probably a bigger task than the programming itself:
  1. Basic Python unit tests; some are already in tests
  2. A gold-standard dataset with a) basic rule-based checks and b) LLM-based evaluations (a rough sketch follows below)
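
As an illustration only (this harness does not exist in the repo), one gold-standard example could combine both check types; the OpenAI chat-completions call is standard, everything else here is hypothetical:

from dataclasses import dataclass


@dataclass
class GoldExample:
    question: str
    must_mention: list[str]   # facts the answer is required to contain
    reference_answer: str     # used by the LLM judge


def rule_check(answer: str, example: GoldExample) -> bool:
    """Cheap deterministic check: every required fact appears in the answer."""
    return all(term.lower() in answer.lower() for term in example.must_mention)


def llm_judge(answer: str, example: GoldExample, client, judge_model: str) -> float:
    """Ask a grading model to score factual agreement with the reference (0..1)."""
    prompt = (
        "Score the answer from 0 to 1 for factual agreement with the reference. "
        "Return only the number.\n"
        f"Question: {example.question}\n"
        f"Reference: {example.reference_answer}\n"
        f"Answer: {answer}"
    )
    response = client.chat.completions.create(
        model=judge_model,
        messages=[{"role": "user", "content": prompt}],
    )
    return float(response.choices[0].message.content.strip())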

Design and trade‑offs

  1. The interaction with the user is described above, and it is an important part of the design, too

  2. Within the assistant:

    • tasks are executed sequentially as planned
    • reports of previous tasks are shared, so each assistant sees the main goal, work done so far, and its own task
    • all reports are then passed to the main researcher to produce the final answer
  3. Within a task:

    • tools are called until the model stops requesting more tool calls and gives the final text answer
    • if the tool-call limit or the context window limit is reached, the model is asked to provide the final report
    • the context window limit is handled with an offset that leaves space for the model to write the final report (see the sketch after this list)
  4. In general, see demo_transcripts; they show the whole process clearly
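
A schematic version of the per-task loop from point 3 (names and limit values are illustrative; the real logic is in task_executor):

MAX_TOOL_CALLS = 10               # assumed limit, not the repo's actual value
CONTEXT_LIMIT_TOKENS = 5_000      # e.g. the 5k context window used in one demo
FINAL_REPORT_OFFSET = 1_000       # space reserved for writing the final report


def run_task_loop(messages, call_model, run_tool, count_tokens):
    """Call tools until the model answers in plain text, or until a limit is hit."""
    tool_calls_used = 0
    while True:
        over_budget = (
            tool_calls_used >= MAX_TOOL_CALLS
            or count_tokens(messages) >= CONTEXT_LIMIT_TOKENS - FINAL_REPORT_OFFSET
        )
        if over_budget:
            messages.append({"role": "user",
                             "content": "Limits reached: write the final report now."})
        reply = call_model(messages, allow_tools=not over_budget)
        messages.append(reply)
        if not reply.get("tool_calls"):   # plain text answer: the task is finished
            return reply["content"]
        for call in reply["tool_calls"]:  # execute requested tools, feed results back
            messages.append(run_tool(call))
            tool_calls_used += 1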

What I would improve

  1. First of all, a minimal gold-standard dataset should be gathered, so that improvements can be made more or less reliably

  2. The main flaw of the current design is the fixed plan. That may be fine for a coding assistant, but it is inappropriate for researching new information. I would make assistant creation dynamic, depending on what has been found so far. Of course, that would require context handling in the main researcher itself

  3. Web page extraction should be wrapped into a separate summarization/extraction LLM call, so that more web pages can be extracted within a single assistant run without hitting the context limit (a sketch of this idea follows below)

  4. A better model than gpt-5-mini should be used as the main researcher
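
A possible shape for improvement 3 (a hypothetical helper; the exact fields of the Tavily extract response may differ between SDK versions):

def extract_and_summarize(url: str, question: str, tavily_client, openai_client, model: str) -> str:
    """Extract one page with Tavily, then condense it in a separate LLM call so the
    researching assistant only receives a short, relevant digest instead of the full page."""
    extracted = tavily_client.extract(urls=[url])                  # Tavily extract endpoint
    page_text = extracted["results"][0]["raw_content"][:20_000]    # crude cap on input size
    response = openai_client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": (
                f"Summarize only the parts of this page that are relevant to: {question}\n\n"
                f"{page_text}"
            ),
        }],
    )
    return response.choices[0].message.content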

About

PoC of an AI research assistant with OpenAI API and Tavily Search (test task)
