taresh18/chatgen

ChatGen 💬 ✨

ChatGen is an automated platform for generating fine-tuning datasets for conversational AI agents. It creates training data systematically by simulating conversations between multiple agents, and saves each conversation as a transcript ready for model fine-tuning.

✨ Key Features

  • 🤖 Multi-Agent Simulation: Automated conversation generation between testing and main agents.
  • 📊 Fine-tuning Ready: Saves each transcript as a tools schema plus a message array, ready for model fine-tuning.
  • 🎯 Scenario-Based Generation: Support for multiple conversation scenarios with custom prompts.
  • 🔧 Dynamic Tool Schema: Automatically generates tool schemas from Pydantic models.
  • 📈 Complete Message History: Captures system prompts, user messages, assistant responses, tool calls, and tool responses.

⚙️ Prerequisites

  • OS: Linux, macOS, or Windows
  • Python: 3.11+
  • Services:
    • OpenAI-compatible LLM inference endpoints (OpenAI, OpenRouter, vLLM, Ollama, etc.)
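
As a rough illustration of what "OpenAI-compatible" means here, the sketch below builds (but does not send) a chat completions request using only the standard library. How ChatGen itself talks to the endpoint is not shown in this README; the base URL, key, and model name are placeholders chosen for the example.

```python
import json
import urllib.request


def build_chat_request(base_url: str, api_key: str, model: str,
                       messages: list) -> urllib.request.Request:
    """Build a POST request for an OpenAI-compatible /chat/completions route."""
    url = base_url.rstrip("/") + "/chat/completions"
    body = json.dumps({"model": model, "messages": messages}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


# Placeholder values: a local server exposing the API under /v1.
req = build_chat_request(
    "http://localhost:11434/v1",
    "placeholder-key",
    "example-model",
    [{"role": "user", "content": "Hello"}],
)
```

Any service that accepts this request shape under its own base URL should be usable as an endpoint.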

🛠️ Installation

  1. Clone the repository

    git clone https://github.com/taresh18/chatgen.git
    cd chatgen
  2. Create a virtual environment (recommended)

    python -m venv venv
    source venv/bin/activate    # Linux/macOS
    # venv\Scripts\activate   # Windows
  3. Install dependencies

    pip install -e .
  4. Configure environment variables

    cp .env.example .env
    nano .env  # Add your API keys and endpoints
  5. Set up conversation scenarios

    • Create conversation scenarios in test_suite/scenarios/
    • Each scenario needs a test_agent.txt file (the testing agent prompt)
    • Configure main agent prompt in test_suite/main_agent.txt
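
The layout above can be checked with a short script. This is a minimal sketch that assumes only the directory convention described here (one folder per scenario containing test_agent.txt, plus a shared main_agent.txt); the function name `discover_scenarios` is hypothetical and not part of ChatGen.

```python
import tempfile
from pathlib import Path


def discover_scenarios(suite_dir: Path) -> dict:
    """Map scenario name -> testing agent prompt for each folder that has test_agent.txt."""
    scenarios = {}
    for folder in sorted((suite_dir / "scenarios").iterdir()):
        prompt_file = folder / "test_agent.txt"
        if folder.is_dir() and prompt_file.is_file():
            scenarios[folder.name] = prompt_file.read_text(encoding="utf-8")
    return scenarios


# Demo on a throwaway directory that mimics test_suite/.
with tempfile.TemporaryDirectory() as tmp:
    suite = Path(tmp) / "test_suite"
    scenario = suite / "scenarios" / "account_status_check"
    scenario.mkdir(parents=True)
    (scenario / "test_agent.txt").write_text("You are a caller.", encoding="utf-8")
    (suite / "scenarios" / "incomplete_scenario").mkdir()  # no prompt file, so skipped
    (suite / "main_agent.txt").write_text("You are the main agent.", encoding="utf-8")

    found = discover_scenarios(suite)
    main_prompt = (suite / "main_agent.txt").read_text(encoding="utf-8")
```

Folders without a test_agent.txt are skipped rather than treated as errors, which is a design choice of this sketch.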

🏃 Running the Application

  1. Configure your environment

    Ensure all API keys and endpoints are set in your .env file.

  2. Generate fine-tuning datasets

    python -m chatgen.main
  3. View generated datasets

    • Fine-tuning datasets are saved to timestamped directories in outputs/
    • Each transcript contains complete conversation history in fine-tuning format
    • Datasets include tools schema and message arrays ready for model training
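
To find the most recent run programmatically, the sketch below assumes output directories are named like the one in the sample log further down (`20251011_150948_gpt-4.1`, that is `YYYYMMDD_HHMMSS_<model>`). The helper names are hypothetical, and the file names inside each run directory are not assumed.

```python
import tempfile
from datetime import datetime
from pathlib import Path
from typing import Optional


def parse_run_dir(name: str):
    """Split 'YYYYMMDD_HHMMSS_<model>' into a timestamp and a model label."""
    stamp, model = name[:15], name[16:]
    return datetime.strptime(stamp, "%Y%m%d_%H%M%S"), model


def latest_run(outputs: Path) -> Optional[Path]:
    """Return the run directory with the newest timestamp, or None if there is none."""
    runs = []
    for entry in outputs.iterdir():
        if not entry.is_dir():
            continue
        try:
            timestamp, _ = parse_run_dir(entry.name)
        except ValueError:
            continue  # not a timestamped run directory
        runs.append((timestamp, entry))
    return max(runs, key=lambda run: run[0])[1] if runs else None


# Demo on a throwaway outputs/ directory.
with tempfile.TemporaryDirectory() as tmp:
    outputs = Path(tmp) / "outputs"
    for name in ("20251011_150948_gpt-4.1", "20251010_090000_gpt-4.1", "notes"):
        (outputs / name).mkdir(parents=True)
    newest = latest_run(outputs).name
```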

🏗️ Project Structure

chatgen/
├── test_suite/
│   ├── main_agent.txt          # Main agent system prompt
│   └── scenarios/
│       ├── account_status_check/
│       │   └── test_agent.txt      # Testing agent prompt
│       └── payment_confirmation/
│           └── test_agent.txt
├── chatgen/
│   ├── core/                   # Orchestration and agent factory
│   ├── agents/                 # Agent service implementations
│   ├── models/                 # Data models and schemas
│   ├── tools.py               # Tool functions with Pydantic schemas
│   └── utils/                  # Logger, settings, and utilities
├── outputs/                    # Generated fine-tuning datasets
├── .env.example                # Template for environment variables
├── pyproject.toml              # Project dependencies
├── .gitignore
└── README.md

🎯 Creating Conversation Scenarios

  1. Create scenario directory

    mkdir test_suite/scenarios/your_scenario
  2. Add testing agent prompt (test_agent.txt)

    • Define the user persona and conversation objectives
    • Specify the conversation flow and user behavior
    • Use template variables like {{first_name}}, {{phone_number}} for dynamic content
  3. Configure main agent (main_agent.txt)

    • Set the system prompt for the agent under test
    • This prompt is shared across all scenarios
    • Include tool definitions and conversation guidelines
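
The template variables mentioned in step 2 can be filled with a simple substitution pass. This is a sketch of the idea rather than ChatGen's actual rendering code, which is not shown here; `render_prompt` and the sample values are hypothetical.

```python
import re

# Matches {{name}} with optional whitespace inside the braces.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render_prompt(template: str, values: dict) -> str:
    """Replace every {{variable}} in the template, failing loudly on missing values."""
    def substitute(match):
        key = match.group(1)
        if key not in values:
            raise KeyError(f"no value supplied for template variable '{key}'")
        return values[key]

    return PLACEHOLDER.sub(substitute, template)


prompt = render_prompt(
    "You are {{first_name}}, calling from {{phone_number}}.",
    {"first_name": "Alex", "phone_number": "555-0100"},
)
print(prompt)
```

Raising on a missing value, instead of leaving the placeholder in place, keeps unrendered `{{...}}` markers out of the generated dataset.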

📊 Understanding Generated Datasets

ChatGen generates fine-tuning ready datasets with the following structure:

Dataset Format

Each generated transcript contains:

  • tools: JSON schema of available tools (auto-generated from Pydantic models)
  • messages: Complete conversation history with proper roles:
    • system: Main agent system prompt
    • user: Testing agent messages
    • assistant: Main agent responses
    • tool_call: Tool invocations with arguments
    • tool_response: Tool execution results
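
The role mapping above follows from how the simulation alternates between the two agents. The sketch below shows that loop with plain Python callables standing in for the LLM-backed agents. It is an assumption about the general shape, not ChatGen's orchestration code: tool calls are left out, and the empty-string stop signal is a convention invented for this example.

```python
def simulate(system_prompt, test_agent, main_agent, max_turns=4):
    """Alternate testing agent (user) and main agent (assistant), recording every message."""
    messages = [{"role": "system", "content": system_prompt}]
    for _ in range(max_turns):
        user_text = test_agent(messages)
        if user_text == "":  # assumed convention: empty reply means the caller is done
            break
        messages.append({"role": "user", "content": user_text})
        messages.append({"role": "assistant", "content": main_agent(messages)})
    return messages


def scripted_caller(lines):
    """Stand-in testing agent that replays fixed lines, then signals it is done."""
    remaining = iter(lines)
    return lambda history: next(remaining, "")


def echo_agent(history):
    """Stand-in main agent that echoes the latest user message."""
    return f"You said: {history[-1]['content']}"


transcript = simulate(
    "You are a helpful assistant.",
    scripted_caller(["Hello", "Check my account"]),
    echo_agent,
)
roles = [message["role"] for message in transcript]
```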

Example Output

{
  "tools": "[{\"type\": \"function\", \"function\": {\"name\": \"check_customer_db_tool\", ...}}]",
  "messages": [
    {"role": "system", "content": "You are Jess, a credit services assistant..."},
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Thank you for calling..."},
    {"role": "tool_call", "content": "{\"name\": \"check_customer_db_tool\", \"arguments\": {...}}"},
    {"role": "tool_response", "content": "{\"success\": true, \"customer\": {...}}"}
  ]
}
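
Note that in the example the `tools` field and the `tool_call` / `tool_response` contents are JSON strings, not nested objects, so they need a second decoding step. A minimal sketch of that step follows; the sample record fills the elided parts with made-up values (the `phone_number` argument and the customer payload are hypothetical).

```python
import json

ALLOWED_ROLES = {"system", "user", "assistant", "tool_call", "tool_response"}


def decode_transcript(record: dict) -> dict:
    """Decode the JSON-string fields of a transcript and check every role."""
    tools = record["tools"]
    if isinstance(tools, str):
        tools = json.loads(tools)
    messages = []
    for message in record["messages"]:
        role, content = message["role"], message["content"]
        if role not in ALLOWED_ROLES:
            raise ValueError(f"unexpected role: {role!r}")
        if role in {"tool_call", "tool_response"}:
            content = json.loads(content)  # stored as a JSON string
        messages.append({"role": role, "content": content})
    return {"tools": tools, "messages": messages}


# Hypothetical record shaped like the example output.
sample = {
    "tools": json.dumps([{"type": "function",
                          "function": {"name": "check_customer_db_tool"}}]),
    "messages": [
        {"role": "system", "content": "You are Jess, a credit services assistant..."},
        {"role": "user", "content": "Hello"},
        {"role": "tool_call", "content": json.dumps(
            {"name": "check_customer_db_tool",
             "arguments": {"phone_number": "555-0100"}})},
        {"role": "tool_response", "content": json.dumps({"success": True})},
    ],
}
decoded = decode_transcript(sample)
```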

Generation Results

Starting ChatGen...
==================================================

DATASET GENERATION RESULTS
==================================================
Overall Result: PASS
Success Rate: 100.0%
Output Directory: outputs/20251011_150948_gpt-4.1

Summary:
   Total Scenarios: 1
   Passed: 1
   Failed: 0
   Errors: 0

Scenario Results:
   [PASS] account_status_check_no_reference

Transcripts saved to: outputs/20251011_150948_gpt-4.1
==================================================

🎯 Fine-tuning Dataset Format

ChatGen writes every dataset in a single layout aimed at fine-tuning conversational AI models, made up of a tools schema and a message array:

Tools Schema

  • Automatically generated from Pydantic models in tools.py
  • Includes function names, descriptions, and parameter schemas
  • Supports dynamic updates when tool functions change
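
To make the idea of deriving a tool schema from a typed model concrete, here is a dependency-free sketch. It swaps in standard-library dataclasses for Pydantic, so it is a stand-in and not how tools.py works; with Pydantic v2 the parameters object would typically come from `model_json_schema()` on the model class. The argument class and its fields are hypothetical.

```python
import dataclasses
from dataclasses import dataclass, field

# Only a few primitive types are mapped in this sketch.
JSON_TYPES = {str: "string", int: "integer", float: "number", bool: "boolean"}


@dataclass
class CheckCustomerArgs:
    """Look up a customer record."""  # hypothetical tool description

    phone_number: str = field(metadata={"description": "Caller phone number"})
    include_history: bool = field(
        default=False, metadata={"description": "Also return past payments"}
    )


def tool_schema(name: str, args_cls) -> dict:
    """Build a function-style tool entry from a dataclass describing the arguments."""
    properties, required = {}, []
    for f in dataclasses.fields(args_cls):
        properties[f.name] = {
            "type": JSON_TYPES[f.type],
            "description": f.metadata.get("description", ""),
        }
        has_default = (f.default is not dataclasses.MISSING
                       or f.default_factory is not dataclasses.MISSING)
        if not has_default:
            required.append(f.name)
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": (args_cls.__doc__ or "").strip(),
            "parameters": {"type": "object", "properties": properties,
                           "required": required},
        },
    }


schema = tool_schema("check_customer_db_tool", CheckCustomerArgs)
```

Because the schema is computed from the class each time, adding or renaming a field changes the generated entry without any hand-edited JSON.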

Message Structure

  • System: Complete agent system prompt
  • User: Testing agent messages (customer interactions)
  • Assistant: Main agent responses
  • Tool Call: Function invocations with arguments
  • Tool Response: Function execution results

Ready for Training

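The generated datasets are intended for:

  • OpenAI fine-tuning API (the tool_call and tool_response roles and the stringified tools field shown in the example output differ from the API's own chat message layout, so a conversion step may be needed)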
  • Custom model training pipelines
  • Conversational AI model development
  • Tool-calling model training
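
For pipelines that expect the chat-completions-style layout (assistant messages carrying `tool_calls`, results under the `tool` role), a conversion along these lines may be needed. This is a sketch under that assumption: the generated `call_N` ids and the sample record are invented for illustration, and the target provider's current documentation should be checked before uploading anything.

```python
import json


def to_chat_style(record: dict) -> dict:
    """Convert tool_call / tool_response messages into assistant tool_calls and tool messages."""
    tools = record["tools"]
    if isinstance(tools, str):
        tools = json.loads(tools)
    converted, call_id, counter = [], None, 0
    for message in record["messages"]:
        role, content = message["role"], message["content"]
        if role == "tool_call":
            call = json.loads(content)
            counter += 1
            call_id = f"call_{counter}"  # synthetic id linking call and result
            converted.append({
                "role": "assistant",
                "tool_calls": [{
                    "id": call_id,
                    "type": "function",
                    "function": {"name": call["name"],
                                 "arguments": json.dumps(call["arguments"])},
                }],
            })
        elif role == "tool_response":
            converted.append({"role": "tool", "tool_call_id": call_id,
                              "content": content})
        else:
            converted.append({"role": role, "content": content})
    return {"messages": converted, "tools": tools}


# Hypothetical record in the layout described above.
record = {
    "tools": json.dumps([{"type": "function",
                          "function": {"name": "check_customer_db_tool"}}]),
    "messages": [
        {"role": "user", "content": "Hello"},
        {"role": "tool_call", "content": json.dumps(
            {"name": "check_customer_db_tool", "arguments": {"phone_number": "555-0100"}})},
        {"role": "tool_response", "content": json.dumps({"success": True})},
    ],
}
chat_record = to_chat_style(record)
```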

📜 License

This project is released under the Apache License 2.0. See the LICENSE file for details.
