ChatGen is an automated platform for generating high-quality fine-tuning datasets for conversational AI agents. It creates training data systematically by simulating multi-agent conversations between a testing agent and the main agent under test, making it well suited for preparing datasets for model fine-tuning.
- 🤖 Multi-Agent Simulation: Automated conversation generation between testing and main agents.
- 📊 Fine-tuning Ready: Generates datasets in the exact format required for model fine-tuning.
- 🎯 Scenario-Based Generation: Support for multiple conversation scenarios with custom prompts.
- 🔧 Dynamic Tool Schema: Automatically generates tool schemas from Pydantic models.
- 📈 Complete Message History: Captures system prompts, user messages, assistant responses, tool calls, and tool responses.
- OS: Linux, macOS, or Windows
- Python: 3.11+
- Services:
- OpenAI-compatible LLM inference endpoints (OpenAI, OpenRouter, vLLM, Ollama, etc.)
1. Clone the repository

   ```bash
   git clone https://github.com/taresh18/chatgen.git
   cd chatgen
   ```
2. Create a virtual environment (recommended)

   ```bash
   python -m venv venv
   source venv/bin/activate  # Linux/macOS
   # venv\Scripts\activate   # Windows
   ```
3. Install dependencies

   ```bash
   pip install -e .
   ```
4. Configure environment variables

   ```bash
   cp .env.example .env
   nano .env  # Add your API keys and endpoints
   ```
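   The variable names below are purely illustrative; check `.env.example` for the keys the project actually reads. A typical OpenAI-compatible setup needs an API key and a base URL:

   ```bash
   # Illustrative only — the real variable names are defined in .env.example
   OPENAI_API_KEY=sk-...
   OPENAI_BASE_URL=https://api.openai.com/v1
   ```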
5. Set up conversation scenarios

   - Create conversation scenarios in `test_suite/scenarios/`
   - Each scenario needs a `test_agent.txt` (testing agent prompt)
   - Configure the main agent prompt in `test_suite/main_agent.txt`
6. Configure your environment

   Ensure all API keys and endpoints are set in your `.env` file.
7. Generate fine-tuning datasets

   ```bash
   python -m chatgen.main
   ```
8. View generated datasets

   - Fine-tuning datasets are saved to timestamped directories in `outputs/`
   - Each transcript contains the complete conversation history in fine-tuning format
   - Datasets include the tools schema and message arrays ready for model training
```text
chatgen/
├── test_suite/
│   ├── main_agent.txt             # Main agent system prompt
│   └── scenarios/
│       ├── account_status_check/
│       │   └── test_agent.txt     # Testing agent prompt
│       └── payment_confirmation/
│           └── test_agent.txt
├── chatgen/
│   ├── core/                      # Orchestration and agent factory
│   ├── agents/                    # Agent service implementations
│   ├── models/                    # Data models and schemas
│   ├── tools.py                   # Tool functions with Pydantic schemas
│   └── utils/                     # Logger, settings, and utilities
├── outputs/                       # Generated fine-tuning datasets
├── .env.example                   # Template for environment variables
├── pyproject.toml                 # Project dependencies
├── .gitignore
└── README.md
```
To add a new conversation scenario:

1. Create the scenario directory

   ```bash
   mkdir test_suite/scenarios/your_scenario
   ```
2. Add a testing agent prompt (`test_agent.txt`)

   - Define the user persona and conversation objectives
   - Specify the conversation flow and user behavior
   - Use template variables such as `{{first_name}}` and `{{phone_number}}` for dynamic content
3. Configure the main agent (`main_agent.txt`)

   - Set the system prompt for the agent under test
   - This prompt is shared across all scenarios
   - Include tool definitions and conversation guidelines
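As a rough illustration (the wording below is invented; only the `{{first_name}}` and `{{phone_number}}` placeholders come from the project itself), a `test_agent.txt` might read:

```text
You are {{first_name}}, a customer calling about your credit account.
Your phone number is {{phone_number}}.
Objective: ask the agent to check your account status and confirm what they find.
Stay in character, answer follow-up questions briefly, and end the call once your question is resolved.
```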
ChatGen generates fine-tuning ready datasets with the following structure:
Each generated transcript contains:

- `tools`: JSON schema of available tools (auto-generated from Pydantic models)
- `messages`: Complete conversation history with proper roles:
  - `system`: Main agent system prompt
  - `user`: Testing agent messages
  - `assistant`: Main agent responses
  - `tool_call`: Tool invocations with arguments
  - `tool_response`: Tool execution results
```json
{
  "tools": "[{\"type\": \"function\", \"function\": {\"name\": \"check_customer_db_tool\", ...}}]",
  "messages": [
    {"role": "system", "content": "You are Jess, a credit services assistant..."},
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Thank you for calling..."},
    {"role": "tool_call", "content": "{\"name\": \"check_customer_db_tool\", \"arguments\": {...}}"},
    {"role": "tool_response", "content": "{\"success\": true, \"customer\": {...}}"}
  ]
}
```
Running `python -m chatgen.main` produces console output like this:

```text
Starting ChatGen...
==================================================
DATASET GENERATION RESULTS
==================================================
Overall Result: PASS
Success Rate: 100.0%
Output Directory: outputs/20251011_150948_gpt-4.1
Summary:
  Total Scenarios: 1
  Passed: 1
  Failed: 0
  Errors: 0
Scenario Results:
  [PASS] account_status_check_no_reference
Transcripts saved to: outputs/20251011_150948_gpt-4.1
==================================================
```
ChatGen generates datasets in the exact format required for fine-tuning conversational AI models:
- Tool schemas are generated automatically from the Pydantic models in `tools.py` (see the sketch below)
- Each schema includes the function name, description, and parameter schema
- Schemas update automatically when tool functions change
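As a minimal sketch of the idea, assuming Pydantic v2 (this is not the project's actual `tools.py`; the `CheckCustomerDbArgs` model and `to_tool_schema` helper are hypothetical, while `check_customer_db_tool` is the tool name shown in the example transcript above):

```python
# Hypothetical sketch of deriving an OpenAI-style tool schema from a Pydantic model.
import json
from pydantic import BaseModel, Field


class CheckCustomerDbArgs(BaseModel):
    """Arguments for looking up a customer record (hypothetical model)."""
    phone_number: str = Field(description="Customer phone number to look up")


def to_tool_schema(model: type[BaseModel], name: str, description: str) -> dict:
    """Wrap a Pydantic model's JSON schema in a {"type": "function"} tool entry."""
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": description,
            "parameters": model.model_json_schema(),
        },
    }


tools = [
    to_tool_schema(
        CheckCustomerDbArgs,
        "check_customer_db_tool",
        "Look up a customer record in the customer database",
    )
]
print(json.dumps(tools))  # serialized tools string, as stored in each transcript
```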
Message roles captured in each transcript:

- System: Complete agent system prompt
- User: Testing agent messages (customer interactions)
- Assistant: Main agent responses
- Tool Call: Function invocations with arguments
- Tool Response: Function execution results
The generated datasets are immediately usable for:
- OpenAI fine-tuning API
- Custom model training pipelines
- Conversational AI model development
- Tool-calling model training
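As a minimal sketch of consuming a run (the per-scenario `*.json` file naming inside the timestamped `outputs/` directory is an assumption based on the example above):

```python
# Hypothetical sketch: load generated transcripts from the latest run and summarize them.
import json
from collections import Counter
from pathlib import Path

run_dir = sorted(Path("outputs").iterdir())[-1]          # latest timestamped run directory
for transcript_path in sorted(run_dir.glob("*.json")):   # one transcript per scenario (assumed)
    data = json.loads(transcript_path.read_text())
    tools = json.loads(data["tools"])                     # tools are stored as a JSON string
    role_counts = Counter(msg["role"] for msg in data["messages"])
    print(transcript_path.name, f"{len(tools)} tools", dict(role_counts))
```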
- Pydantic AI: https://github.com/pydantic/pydantic-ai
This project is released under the Apache License 2.0. See the LICENSE file for details.