Merged
20 changes: 13 additions & 7 deletions README.md
@@ -2,15 +2,15 @@

> **This is a performance art project.** Anthropic built their models on the world's freely shared information, then introduced increasingly [dystopian data policies](https://www.anthropic.com/news/detecting-and-preventing-distillation-attacks) to stop anyone else from doing the same with their data — pulling up the ladder behind them. DataClaw lets you throw the ladder back down. The dataset it produces is yours to share.

-Turn your Claude Code, Codex, and Gemini CLI conversation history into structured data and publish it to Hugging Face with a single command. DataClaw parses session logs, redacts secrets and PII, and uploads the result as a ready-to-use dataset.
+Turn your Claude Code, Codex, Gemini CLI, and OpenCode conversation history into structured data and publish it to Hugging Face with a single command. DataClaw parses session logs, redacts secrets and PII, and uploads the result as a ready-to-use dataset.

![DataClaw](dataclaw.jpeg)

Every export is tagged **`dataclaw`** on Hugging Face. Together, they may someday form a growing [distributed dataset](https://huggingface.co/datasets?other=dataclaw) of real-world human-AI coding collaboration.

## Give this to your agent

-Paste this into Claude Code, Codex, or any coding agent:
+Paste this into Claude Code, Codex, Gemini CLI, OpenCode, or any coding agent:

```
Help me export my Claude Code, Codex, Gemini CLI, and OpenCode conversation history to Hugging Face using DataClaw.
@@ -143,8 +143,7 @@ dataclaw export --publish-attestation "User explicitly approved publishing to Hu
| User messages | Yes | Full text (including voice transcripts) |
| Assistant responses | Yes | Full text output |
| Extended thinking | Yes | Claude's reasoning (opt out with `--no-thinking`) |
-| Tool calls | Yes | Tool name + summarized input |
-| Tool results | No | Not stored in session logs |
+| Tool calls | Yes | Tool name + inputs + outputs |
| Token usage | Yes | Input/output tokens per session |
| Model & metadata | Yes | Model name, git branch, timestamps |

@@ -158,7 +157,7 @@ DataClaw applies multiple layers of protection:
4. **Entropy analysis** — Long high-entropy strings in quotes are flagged as potential secrets
5. **Email redaction** — Personal email addresses removed
6. **Custom redaction** — You can configure additional strings and usernames to redact
-7. **Tool input pre-redaction** — Secrets in tool inputs are redacted BEFORE truncation to prevent partial leaks
+7. **Tool call redaction** — Secrets in tool inputs and outputs are redacted
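
To illustrate item 4 of the list above, here is a minimal sketch of entropy-based flagging for quoted strings. It is not DataClaw's implementation: the names `shannon_entropy` and `flag_quoted_secrets`, the regex, and both thresholds are hypothetical choices made for this example, since the diff does not show the values the tool actually uses.

```python
import math
import re
from collections import Counter

# Hypothetical thresholds; the values DataClaw uses are not shown in this diff.
MIN_LENGTH = 20
MIN_ENTROPY_BITS = 4.0

# A quoted run with no whitespace or quote characters inside it.
QUOTED = re.compile(r"""(["'])([^"'\s]+)\1""")


def shannon_entropy(text: str) -> float:
    """Return the Shannon entropy of text in bits per character."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return sum((n / total) * math.log2(total / n) for n in counts.values())


def flag_quoted_secrets(text: str) -> list[str]:
    """Return quoted strings that are both long and high-entropy."""
    return [
        match.group(2)
        for match in QUOTED.finditer(text)
        if len(match.group(2)) >= MIN_LENGTH
        and shannon_entropy(match.group(2)) >= MIN_ENTROPY_BITS
    ]


if __name__ == "__main__":
    print(flag_quoted_secrets('token = "aB3xQ9zL0pW7mN2kR5tY8uV1"'))
    print(flag_quoted_secrets('name = "aaaaaaaaaaaaaaaaaaaaaaaa"'))
```

A length floor combined with an entropy floor keeps ordinary identifiers and repeated characters from being flagged, at the cost of missing short or low-entropy secrets, which is consistent with the warning that automated redaction is not foolproof.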

**This is NOT foolproof.** Always review your exported data before publishing.
Automated redaction cannot catch everything — especially service-specific
@@ -187,7 +186,14 @@ Each line in `conversations.jsonl` is one session:
"role": "assistant",
"content": "I'll investigate the login flow.",
"thinking": "The user wants me to look at...",
-"tool_uses": [{"tool": "Read", "input": "src/auth.py"}],
+"tool_uses": [
+{
+"tool": "bash",
+"input": {"command": "grep -r 'login' src/"},
+"output": {"text": "src/auth.py:42: def login(user, password):"},
+"status": "success"
+}
+],
"timestamp": "..."
}
],
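
The hunk above changes the shape of each `tool_uses` entry: `input` goes from a plain string to an object, and `output` and `status` are added. The sketch below shows one way a consumer might read both shapes. The helper name `describe_tool_use` and the fallback values are assumptions for illustration, and it is also an assumption that datasets exported before this change keep the old shape.

```python
import json


def describe_tool_use(tool_use: dict) -> str:
    """Render one tool_uses entry as a single line of text."""
    tool = tool_use.get("tool", "?")
    raw_input = tool_use.get("input")
    # New shape: input is an object. Old shape: input is a plain string.
    if isinstance(raw_input, dict):
        rendered_input = json.dumps(raw_input, sort_keys=True)
    else:
        rendered_input = str(raw_input)
    # Old-shape entries carry neither status nor output.
    status = tool_use.get("status", "unknown")
    output = tool_use.get("output")
    output_text = output.get("text", "") if isinstance(output, dict) else ""
    return f"{tool}({rendered_input}) [{status}] {output_text}".rstrip()


if __name__ == "__main__":
    # One assistant message, using the field names shown in the diff.
    message = json.loads("""
    {
      "role": "assistant",
      "content": "I'll investigate the login flow.",
      "tool_uses": [
        {
          "tool": "bash",
          "input": {"command": "grep -r 'login' src/"},
          "output": {"text": "src/auth.py:42: def login(user, password):"},
          "status": "success"
        }
      ]
    }
    """)
    for tool_use in message.get("tool_uses", []):
        print(describe_tool_use(tool_use))
```

Checking the type of `input` rather than relying on a version field keeps the reader tolerant of mixed datasets, since the diff does not show any schema version marker.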
@@ -221,7 +227,7 @@ All repos are named `{username}/my-personal-codex-data` and tagged `dataclaw`.
```

The auto-generated HF README includes:
-- Model distribution (which Claude models, how many sessions each)
+- Model distribution (which models, how many sessions each)
- Total token counts
- Project count
- Last updated timestamp
13 changes: 10 additions & 3 deletions dataclaw/cli.py
@@ -468,6 +468,7 @@ def _build_dataset_card(repo_id: str, meta: dict) -> str:
- claude-code
- codex-cli
- gemini-cli
+- opencode
- conversations
- coding-assistant
- tool-use
@@ -481,7 +482,7 @@ def _build_dataset_card(repo_id: str, meta: dict) -> str:

# Coding Agent Conversation Logs

-> **This is a performance art project.** Anthropic built their models on the world's freely shared information, then introduced increasingly [dystopian data policies](https://www.anthropic.com/news/detecting-and-preventing-distillation-attacks) to stop anyone else from doing the same — pulling up the ladder behind them. DataClaw lets you throw the ladder back down. The dataset it produces is yours to share.
+> **This is a performance art project.** Anthropic built their models on the world's freely shared information, then introduced increasingly [dystopian data policies](https://www.anthropic.com/news/detecting-and-preventing-distillation-attacks) to stop anyone else from doing the same with their data — pulling up the ladder behind them. DataClaw lets you throw the ladder back down. The dataset it produces is yours to share.

Exported with [DataClaw]({REPO_URL}).

@@ -521,7 +522,14 @@ def _build_dataset_card(repo_id: str, meta: dict) -> str:
"role": "assistant",
"content": "I'll investigate the login flow.",
"thinking": "The user wants me to...",
-"tool_uses": [{{"tool": "Read", "input": "src/auth.py"}}],
+"tool_uses": [
+{{
+"tool": "bash",
+"input": {{"command": "grep -r 'login' src/"}},
+"output": {{"text": "src/auth.py:42: def login(user, password):"}},
+"status": "success"
+}}
+],
"timestamp": "..."
}}
],
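
The doubled braces in this hunk are consistent with the dataset card being built from an f-string or a `str.format` template, where `{{` and `}}` produce literal braces and `{REPO_URL}` is substituted. That reading is an inference from the visible lines, not something the diff states. A minimal sketch of the mechanism, with a placeholder `REPO_URL` value:

```python
# Placeholder value; the real REPO_URL constant is defined elsewhere in cli.py.
REPO_URL = "https://example.com/placeholder"

# In an f-string, {{ and }} render as literal braces, so the JSON example
# survives formatting while {REPO_URL} is interpolated.
card = f"""Exported with [DataClaw]({REPO_URL}).

"tool_uses": [
  {{
    "tool": "bash",
    "input": {{"command": "grep -r 'login' src/"}},
    "status": "success"
  }}
]
"""

if __name__ == "__main__":
    print(card)
```

Forgetting to double a brace in such a template would either raise an error when the string is evaluated or silently interpolate a name, which is why every brace of the embedded JSON example appears twice in the source.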
@@ -538,7 +546,6 @@ def _build_dataset_card(repo_id: str, meta: dict) -> str:
### Privacy

- Paths anonymized to project-relative; usernames hashed
-- No tool outputs — only tool call inputs (summaries)

## Load
