ChatGen 💬 ✨

ChatGen is an automated platform for generating fine-tuning datasets for conversational AI agents. It simulates conversations between a testing agent and a main agent, then records each exchange as training data ready for model fine-tuning.

✨ Key Features

- 🤖 Multi-Agent Simulation: Automated conversation generation between testing and main agents.
- 📊 Fine-tuning Ready: Generates datasets in the exact format required for model fine-tuning.
- 🎯 Scenario-Based Generation: Support for multiple conversation scenarios with custom prompts.
- 🔧 Dynamic Tool Schema: Automatically generates tool schemas from Pydantic models.
- 📈 Complete Message History: Captures system prompts, user messages, assistant responses, tool calls, and tool responses.

⚙️ Prerequisites

OS: Linux, macOS, or Windows
Python: 3.11+
Services:
- OpenAI-compatible LLM inference endpoints (OpenAI, OpenRouter, vLLM, Ollama, etc.)

🛠️ Installation

Clone the repository

```
git clone https://github.com/taresh18/chatgen.git
cd chatgen
```

Create a virtual environment (recommended)

```
python -m venv venv
source venv/bin/activate    # Linux/macOS
# venv\Scripts\activate   # Windows
```

Install dependencies
```
pip install -e .
```

Configure environment variables

```
cp .env.example .env
nano .env  # Add your API keys and endpoints
```
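
Before a run, it can help to confirm that the `.env` file has a value for every setting you rely on. The sketch below is a minimal pre-flight check using only the standard library. The key names `LLM_API_KEY` and `LLM_BASE_URL` are hypothetical placeholders, not taken from this project; the real names are listed in `.env.example`.

```python
# Hypothetical variable names; replace them with the keys from .env.example.
REQUIRED_KEYS = ("LLM_API_KEY", "LLM_BASE_URL")


def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and comments."""
    values: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def missing_keys(text: str, required=REQUIRED_KEYS) -> list[str]:
    """Return the required keys that are absent or empty."""
    values = parse_env(text)
    return [key for key in required if not values.get(key)]
```

Calling `missing_keys(Path(".env").read_text())` (with `Path` from `pathlib`) returns an empty list when every required key has a value.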

Set up conversation scenarios
- Create conversation scenarios in test_suite/scenarios/
- Each scenario needs: test_agent.txt (testing agent prompt)
- Configure main agent prompt in test_suite/main_agent.txt
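
The layout above can be checked with a short script before generating anything. This is a sketch that assumes only the directory conventions described here (one `test_agent.txt` per scenario folder, one shared `main_agent.txt`). The function names are illustrative and are not part of ChatGen's API.

```python
from pathlib import Path


def discover_scenarios(suite_dir: Path) -> dict[str, str]:
    """Map each scenario name to the text of its testing agent prompt."""
    scenarios: dict[str, str] = {}
    scenarios_root = suite_dir / "scenarios"
    if not scenarios_root.is_dir():
        return scenarios
    for scenario_dir in sorted(p for p in scenarios_root.iterdir() if p.is_dir()):
        prompt_file = scenario_dir / "test_agent.txt"
        # Folders without a test_agent.txt are skipped rather than treated as errors.
        if prompt_file.is_file():
            scenarios[scenario_dir.name] = prompt_file.read_text(encoding="utf-8")
    return scenarios


def load_main_prompt(suite_dir: Path) -> str:
    """Read the main agent system prompt shared by all scenarios."""
    return (suite_dir / "main_agent.txt").read_text(encoding="utf-8")
```

For example, `discover_scenarios(Path("test_suite"))` lists the scenarios that have a prompt file in place.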

🏃 Running the Application

Configure your environment

Ensure all API keys and endpoints are set in your .env file.

Generate fine-tuning datasets
```
python -m chatgen.main
```

View generated datasets
- Fine-tuning datasets are saved to timestamped directories in outputs/
- Each transcript contains complete conversation history in fine-tuning format
- Datasets include tools schema and message arrays ready for model training

🏗️ Project Structure

```
chatgen/
├── test_suite/
│   ├── main_agent.txt          # Main agent system prompt
│   └── scenarios/
│       ├── account_status_check/
│       │   └── test_agent.txt      # Testing agent prompt
│       └── payment_confirmation/
│           └── test_agent.txt
├── chatgen/
│   ├── core/                   # Orchestration and agent factory
│   ├── agents/                 # Agent service implementations
│   ├── models/                 # Data models and schemas
│   ├── tools.py                # Tool functions with Pydantic schemas
│   └── utils/                  # Logger, settings, and utilities
├── outputs/                    # Generated fine-tuning datasets
├── .env.example                # Template for environment variables
├── pyproject.toml              # Project dependencies
├── .gitignore
└── README.md
```

🎯 Creating Conversation Scenarios

Create scenario directory

```
mkdir test_suite/scenarios/your_scenario
```

Add testing agent prompt (test_agent.txt)
- Define the user persona and conversation objectives
- Specify the conversation flow and user behavior
- Use template variables like {{first_name}}, {{phone_number}} for dynamic content

Configure main agent (main_agent.txt)
- Set the system prompt for the agent under test
- This prompt is shared across all scenarios
- Include tool definitions and conversation guidelines
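
The `{{first_name}}`-style template variables mentioned above can be previewed with a small renderer. How ChatGen itself performs the substitution is not documented here, so treat this as an illustrative sketch. `render_prompt` is a hypothetical helper, and raising on a missing value is a design choice of this example only.

```python
import re

# Matches {{name}} with optional spaces inside the braces.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render_prompt(template: str, values: dict[str, str]) -> str:
    """Replace {{name}} placeholders, raising KeyError if a value is missing."""

    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in values:
            raise KeyError(f"no value provided for placeholder {name!r}")
        return values[name]

    return PLACEHOLDER.sub(substitute, template)
```

Failing loudly on an unknown placeholder keeps a literal `{{phone_number}}` from leaking into a generated conversation.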

📊 Understanding Generated Datasets

ChatGen generates fine-tuning ready datasets with the following structure:

Dataset Format

Each generated transcript contains:

- tools: JSON schema of available tools (auto-generated from Pydantic models)
- messages: Complete conversation history with proper roles:
  - system: Main agent system prompt
  - user: Testing agent messages
  - assistant: Main agent responses
  - tool_call: Tool invocations with arguments
  - tool_response: Tool execution results

Example Output

```
{
  "tools": "[{\"type\": \"function\", \"function\": {\"name\": \"check_customer_db_tool\", ...}}]",
  "messages": [
    {"role": "system", "content": "You are Jess, a credit services assistant..."},
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Thank you for calling..."},
    {"role": "tool_call", "content": "{\"name\": \"check_customer_db_tool\", \"arguments\": {...}}"},
    {"role": "tool_response", "content": "{\"success\": true, \"customer\": {...}}"}
  ]
}
```
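
In the example, `tools` and the `content` of `tool_call` and `tool_response` messages are JSON-encoded strings rather than nested objects, so they need a second decoding step before use. The sketch below shows one way to do that. It assumes every transcript follows the shape of the example, and `decode_transcript` is an illustrative name, not a ChatGen function.

```python
import json


def decode_transcript(raw: str) -> dict:
    """Parse a transcript and decode the JSON strings nested inside it."""
    data = json.loads(raw)
    # "tools" is stored as a JSON-encoded string in the example output.
    tools = json.loads(data["tools"]) if isinstance(data["tools"], str) else data["tools"]
    messages = []
    for message in data["messages"]:
        content = message["content"]
        # Tool calls and tool responses carry a JSON document as their content.
        if message["role"] in ("tool_call", "tool_response"):
            content = json.loads(content)
        messages.append({"role": message["role"], "content": content})
    return {"tools": tools, "messages": messages}
```

After decoding, the tool name is available as `message["content"]["name"]` on each `tool_call` entry.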

Generation Results

```
Starting ChatGen...
==================================================

DATASET GENERATION RESULTS
==================================================
Overall Result: PASS
Success Rate: 100.0%
Output Directory: outputs/20251011_150948_gpt-4.1

Summary:
   Total Scenarios: 1
   Passed: 1
   Failed: 0
   Errors: 0

Scenario Results:
   [PASS] account_status_check_no_reference

Transcripts saved to: outputs/20251011_150948_gpt-4.1
==================================================
```
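
The summary figures relate to each other in a simple way, which the sketch below spells out. It rests on assumptions that the report above does not confirm: the success rate is taken to be passed scenarios over total scenarios, the overall result is taken to be PASS only when every scenario passes, and the status labels `FAIL` and `ERROR` are guesses modelled on `[PASS]`.

```python
from collections import Counter


def summarize(results: dict[str, str]) -> dict:
    """Aggregate per-scenario statuses into the figures shown in a run summary."""
    counts = Counter(results.values())
    total = len(results)
    passed = counts["PASS"]
    return {
        "total": total,
        "passed": passed,
        "failed": counts["FAIL"],
        "errors": counts["ERROR"],
        # Assumed definition: percentage of scenarios that passed.
        "success_rate": round(100.0 * passed / total, 1) if total else 0.0,
        # Assumed rule: the run passes only if every scenario passed.
        "overall": "PASS" if total and passed == total else "FAIL",
    }
```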

🎯 Fine-tuning Dataset Format

ChatGen generates datasets structured for fine-tuning conversational AI models. Each dataset has two parts, a tools schema and a message array:

Tools Schema

- Automatically generated from Pydantic models in tools.py
- Includes function names, descriptions, and parameter schemas
- Supports dynamic updates when tool functions change
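
As a rough illustration of how a tool schema can be derived from a Pydantic model, the sketch below wraps a model's JSON schema in a function-tool entry shaped like the one in the example output. It assumes Pydantic v2 (`model_json_schema()`). The `CheckCustomerArgs` model, its `phone_number` field, and the `tool_schema` helper are hypothetical; the real definitions live in tools.py and may differ.

```python
from pydantic import BaseModel, Field


class CheckCustomerArgs(BaseModel):
    """Hypothetical argument model, used only for this illustration."""

    phone_number: str = Field(description="Customer phone number")


def tool_schema(name: str, description: str, args_model: type[BaseModel]) -> dict:
    """Wrap a Pydantic model's JSON schema in a function-tool entry."""
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": description,
            # Pydantic v2 generates the parameter schema from the model fields.
            "parameters": args_model.model_json_schema(),
        },
    }
```

Because the schema is computed from the model, adding or renaming a field changes the generated schema without any manual edits.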

Message Structure

- System: Complete agent system prompt
- User: Testing agent messages (customer interactions)
- Assistant: Main agent responses
- Tool Call: Function invocations with arguments
- Tool Response: Function execution results

Ready for Training

The generated datasets are intended for:

- OpenAI fine-tuning API
- Custom model training pipelines
- Conversational AI model development
- Tool-calling model training
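
Some training pipelines expect tool calls as `tool_calls` on an assistant message and tool results as `tool` messages, rather than as separate `tool_call` and `tool_response` roles. If your target needs that layout, a conversion along these lines may help. It is a sketch, not part of ChatGen: the target shape follows the commonly used OpenAI-style chat message layout and should be checked against your provider's documentation, and the `call_N` ids are synthetic values invented by this example.

```python
import json


def to_openai_messages(messages: list[dict]) -> list[dict]:
    """Map tool_call / tool_response roles onto assistant tool_calls and tool messages."""
    converted: list[dict] = []
    call_index = 0
    last_call_id = None
    for message in messages:
        role, content = message["role"], message["content"]
        if role == "tool_call":
            call = json.loads(content)
            call_index += 1
            last_call_id = f"call_{call_index}"  # synthetic id, not produced by ChatGen
            converted.append({
                "role": "assistant",
                "tool_calls": [{
                    "id": last_call_id,
                    "type": "function",
                    "function": {
                        "name": call["name"],
                        # Arguments are re-encoded as a JSON string.
                        "arguments": json.dumps(call["arguments"]),
                    },
                }],
            })
        elif role == "tool_response":
            # Pair the response with the most recent tool call.
            converted.append({"role": "tool", "tool_call_id": last_call_id, "content": content})
        else:
            converted.append({"role": role, "content": content})
    return converted
```

The function assumes that each `tool_response` directly answers the most recent `tool_call`, as in the example output.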

📚 References

Pydantic AI: https://github.com/pydantic/pydantic-ai

📜 License

This project is released under the Apache License 2.0. See the LICENSE file for details.
