chat-mutator

chat-mutator is a Python-based synthetic chat-data mutation framework for stress-testing how chat-based LLMs behave when their conversational context is systematically perturbed. It takes multi-turn chat samples (JSONL), applies a selected “mutation” (predefined or custom), re-generates the assistant response using a configurable model backend, and then produces analysis artifacts to help you understand how the mutation impacted the model’s grounding and output quality.

The repository ships with two primary workflows:

- Interactive Streamlit app for uploading/pasting chat samples, selecting mutation types (and optional customizations), choosing the model used for mutation + response generation, and reviewing/exporting results (mutated samples, error logs, per-sample diff links, and an optional grounding/hallucination judge).
- Reproducible headless runner (CLI + YAML config) for batch experiments (mirroring four experimental conditions A–D) that processes “frozen” samples, writes generations to a results directory, and aggregates summary + per-mutation metrics (e.g., metrics_overall.json, metrics_by_mutation.json, and tokens_latency.csv).

There are also utilities to convert Hugging Face datasets into frozen JSONL samples (e.g., WebGPT and HotpotQA) so experiments can be reproduced deterministically.

Install dependencies

Install Python version 3.12 on your system
Create a virtual environment in the workspace:
- On Windows, run:
```
py -3.12 -m venv .venv
```
- On Unix or macOS, run:
```
python3.12 -m venv .venv
```
Activate the virtual environment
- On Windows, run:
```
.venv\Scripts\activate.bat
```
- On Unix or macOS, run:
```
source .venv/bin/activate
```
Install dependencies from requirements.txt
```
pip install -r requirements.txt
```

Run the Streamlit app

To run the Streamlit app, enter the following command:

streamlit run chat_mutator_app.py

The terminal should display a message indicating that you can now view the Streamlit app in your browser. Navigate to the Local URL listed below that message.

Headless runner

The repository also ships with a reproducible, headless runner that mirrors the paper's four experimental conditions (A–D). To launch the smoke-test configuration, run:

python tools/runner.py --config configs/exp_pilot.yaml

This command processes the frozen samples in data/samples/pilot.jsonl, writes model generations to results/exp_pilot/samples.jsonl, and aggregates metrics in metrics_overall.json, metrics_by_mutation.json, and tokens_latency.csv.
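After a run, it can be handy to confirm that every expected output file was written. The following is a minimal sketch, not part of the repository: the helper name `missing_outputs` is hypothetical, and it assumes that all four files sit directly under the results directory (results/exp_pilot for the pilot configuration).

```python
from pathlib import Path

# File names come from the description above. Placing all of them directly
# under one results directory is an assumption of this sketch.
EXPECTED_OUTPUTS = (
    "samples.jsonl",
    "metrics_overall.json",
    "metrics_by_mutation.json",
    "tokens_latency.csv",
)


def missing_outputs(results_dir):
    """Return the expected output files that are absent from results_dir."""
    root = Path(results_dir)
    return [name for name in EXPECTED_OUTPUTS if not (root / name).is_file()]
```

Calling `missing_outputs("results/exp_pilot")` returns an empty list when all four files are present.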

Convert Hugging Face datasets into frozen samples

The conversion script in tools/build_hf_datasets.py reproduces the WebGPT and HotpotQA processing used in our experiments.

Install the optional dependencies:

pip install -r requirements.huggingface.txt

Run the converter, specifying which dataset to export (webgpt, hotpot, or all). The example below writes the default 1,000-sample subsets to data/:
```
python -m tools.build_hf_datasets all
```
Use --split, --limit, --seed, or --output-dir to override the Hugging Face split, adjust the deterministic sample size, or change the destination directory. Invoke python -m tools.build_hf_datasets --help to see the full CLI reference.
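To illustrate why a fixed seed makes the sample reproducible, here is a generic sketch of seeded subsetting. It is not the logic of tools/build_hf_datasets.py, whose sampling strategy is not described here; the function name `deterministic_subset` and its parameters are hypothetical and only mirror the idea behind --limit and --seed.

```python
import random


def deterministic_subset(records, limit, seed):
    """Pick `limit` records reproducibly: the same seed gives the same subset."""
    rng = random.Random(seed)  # local RNG, so global random state is untouched
    if limit >= len(records):
        return list(records)
    # Sample indices, then sort them so the subset keeps the original order.
    indices = sorted(rng.sample(range(len(records)), limit))
    return [records[i] for i in indices]
```

Two calls with the same records, limit, and seed return the same subset, which is what lets a frozen JSONL file be regenerated identically.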

Interpreting outputs

Open results/exp_pilot/metrics_overall.json to review the Attributed Accuracy (AAd), ACE, and other headline numbers for the run. Per-mutation breakdowns are stored in metrics_by_mutation.json, and token/latency accounting lives in tokens_latency.csv.
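For programmatic inspection, the three files can be loaded with the standard library. This is a minimal sketch: the helper name `load_run_outputs` is hypothetical, the key names inside the JSON files are not documented here (so none are assumed), and the CSV is assumed to start with a header row.

```python
import csv
import json
from pathlib import Path


def load_run_outputs(results_dir):
    """Load the overall metrics, per-mutation metrics, and token/latency rows."""
    root = Path(results_dir)
    with open(root / "metrics_overall.json", encoding="utf-8") as f:
        overall = json.load(f)
    with open(root / "metrics_by_mutation.json", encoding="utf-8") as f:
        by_mutation = json.load(f)
    # Assumes the first CSV row is a header; each later row becomes a dict.
    with open(root / "tokens_latency.csv", newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))
    return overall, by_mutation, rows
```

For example, `overall, by_mutation, rows = load_run_outputs("results/exp_pilot")` loads the pilot run, after which the returned objects can be inspected to see which metrics were recorded.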

Use the Synthetic Chat-Data Mutation Framework

Upload a JSONL file of chat samples, or copy and paste the chat samples into the text area.
Select a predefined mutation type, or write your own mutation request.
- Some predefined mutation types provide the option to select a customisation.
- For combinations of mutation type and customisation that are performed by an LLM, you can view and edit the system prompt and user query.
Choose which model you would like to use to perform the mutations and generate the new responses from the mutated chat samples. The default model is the recommended production model.
The system prompt and parameters that are used during response generation are available to view and edit.
Click Submit.
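Before uploading or pasting samples, it may help to check that the text is well-formed JSONL. The sketch below is not part of the app: the helper name `check_jsonl` is hypothetical, and it only assumes that each non-blank line holds one JSON object. It does not check field names, because the sample schema is not described in this document.

```python
import json


def check_jsonl(text):
    """Return (records, errors) for JSONL text; blank lines are skipped."""
    records, errors = [], []
    for lineno, line in enumerate(text.split("\n"), start=1):
        if not line.strip():
            continue
        try:
            obj = json.loads(line)
        except json.JSONDecodeError as exc:
            errors.append(f"line {lineno}: {exc.msg}")
            continue
        if not isinstance(obj, dict):
            # Assumption: every chat sample is a JSON object, not a list or scalar.
            errors.append(f"line {lineno}: expected a JSON object")
            continue
        records.append(obj)
    return records, errors
```

An empty error list means every non-blank line parsed as a JSON object; otherwise each entry names the offending line number.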

View results

It may take some time for the chat samples to be processed and for the results to appear, but once they do you will be able to do the following:

- Download all mutated chat samples with their new responses.
- Download an error log that explains why any chat samples failed the mutation process.
- Generate links to the Copilot Playground Diff Tool for each chat sample to view the highlighted differences.
- Run a judge that scores how well the new response is grounded in the mutated context.
- Click through each chat sample individually for a more detailed analysis. To see the info in the 'Extras' tab for each chat sample, the Diff Tool URLs must already be generated and the hallucination judge must already have been run.

Acknowledgements

A significant portion of this code was developed by Jess Peck during her internship at Microsoft.
