Merged
24 changes: 13 additions & 11 deletions README.md
@@ -2,7 +2,7 @@

> **This is a performance art project.** Anthropic built their models on the world's freely shared information, then introduced increasingly [dystopian data policies](https://www.anthropic.com/news/detecting-and-preventing-distillation-attacks) to stop anyone else from doing the same with their data — pulling up the ladder behind them. DataClaw lets you throw the ladder back down. The dataset it produces is yours to share.

-Turn your Claude Code and Codex conversation history into structured data and publish it to Hugging Face with a single command. DataClaw parses session logs, redacts secrets and PII, and uploads the result as a ready-to-use dataset.
+Turn your Claude Code, Codex, and Gemini CLI conversation history into structured data and publish it to Hugging Face with a single command. DataClaw parses session logs, redacts secrets and PII, and uploads the result as a ready-to-use dataset.

![DataClaw](dataclaw.jpeg)

@@ -13,7 +13,7 @@ Every export is tagged **`dataclaw`** on Hugging Face. Together, they may someda
Paste this into Claude Code, Codex, or any coding agent:

```
-Help me export my Claude Code and Codex conversation history to Hugging Face using DataClaw.
+Help me export my Claude Code, Codex, and Gemini CLI conversation history to Hugging Face using DataClaw.
Install it, set up the skill, then walk me through the process.

STEP 1 — INSTALL
@@ -29,12 +29,12 @@ STEP 3 — START
Every dataclaw command outputs next_steps in its JSON — follow them through the entire flow.

STEP 3A — CHOOSE SOURCE SCOPE (REQUIRED BEFORE EXPORT)
-Ask the user explicitly: Claude Code, Codex, or both?
-dataclaw config --source "claude|codex|both"
+Ask the user explicitly: Claude Code, Codex, Gemini CLI, or all?
+dataclaw config --source "claude|codex|gemini|all"
Do not export until source scope is explicitly confirmed.

STEP 3B — PRESENT ALL FOLDERS (REQUIRED BEFORE EXPORT)
-dataclaw list --source "claude|codex|both"
+dataclaw list --source "claude|codex|gemini|all"
Send the FULL project/folder list to the user in a message (name, source, sessions, size, excluded).
Ask which projects to exclude.
dataclaw config --exclude "project1,project2" OR dataclaw config --confirm-projects
@@ -69,8 +69,8 @@ huggingface-cli login --token YOUR_TOKEN

# See your projects
dataclaw prep
-dataclaw config --source both # REQUIRED: choose claude, codex, or both
-dataclaw list --source both # Present full list and confirm folder scope before export
+dataclaw config --source all # REQUIRED: choose claude, codex, gemini, or all
+dataclaw list --source all # Present full list and confirm folder scope before export

# Configure
dataclaw config --repo username/my-personal-codex-data
@@ -105,23 +105,25 @@ dataclaw export --publish-attestation "User explicitly approved publishing to Hu
|---------|-------------|
| `dataclaw status` | Show current stage and next steps (JSON) |
| `dataclaw prep` | Discover projects, check HF auth, output JSON |
-| `dataclaw prep --source both` | Prep with both Claude + Codex explicitly selected |
+| `dataclaw prep --source all` | Prep with Claude, Codex, and Gemini explicitly selected |
+| `dataclaw prep --source gemini` | Prep using only Gemini CLI sessions |
| `dataclaw prep --source codex` | Prep using only Codex sessions |
| `dataclaw prep --source claude` | Prep using only Claude Code sessions |
| `dataclaw list` | List all projects with exclusion status |
-| `dataclaw list --source both` | List both Claude and Codex projects |
+| `dataclaw list --source all` | List Claude, Codex, and Gemini projects |
| `dataclaw list --source codex` | List only Codex projects |
| `dataclaw config` | Show current config |
| `dataclaw config --repo user/my-personal-codex-data` | Set HF repo |
-| `dataclaw config --source both` | REQUIRED source scope selection (`claude`, `codex`, or `both`) |
+| `dataclaw config --source all` | REQUIRED source scope selection (`claude`, `codex`, `gemini`, or `all`) |
| `dataclaw config --exclude "a,b"` | Add excluded projects (appends) |
| `dataclaw config --redact "str1,str2"` | Add strings to always redact (appends) |
| `dataclaw config --redact-usernames "u1,u2"` | Add usernames to anonymize (appends) |
| `dataclaw config --confirm-projects` | Mark project selection as confirmed |
| `dataclaw export --no-push` | Export locally only (always do this first) |
-| `dataclaw export --source both --no-push` | Export Claude + Codex sessions locally |
+| `dataclaw export --source all --no-push` | Export Claude, Codex, and Gemini sessions locally |
| `dataclaw export --source codex --no-push` | Export only Codex sessions locally |
| `dataclaw export --source claude --no-push` | Export only Claude Code sessions locally |
+| `dataclaw export --source gemini --no-push` | Export only Gemini CLI sessions locally |
| `dataclaw confirm --full-name "NAME" --attest-full-name "..." --attest-sensitive "..." --attest-manual-scan "..."` | Scan for PII, run exact-name privacy check, verify review attestations, unlock pushing |
| `dataclaw confirm --skip-full-name-scan --attest-full-name "..." --attest-sensitive "..." --attest-manual-scan "..."` | Skip exact-name scan when user declines sharing full name (requires skip attestation) |
| `dataclaw export --publish-attestation "..."` | Export and push (requires `dataclaw confirm` first) |
49 changes: 30 additions & 19 deletions dataclaw/cli.py
@@ -10,7 +10,7 @@

from .anonymizer import Anonymizer
from .config import CONFIG_FILE, DataClawConfig, load_config, save_config
-from .parser import CLAUDE_DIR, CODEX_DIR, discover_projects, parse_project_sessions
+from .parser import CLAUDE_DIR, CODEX_DIR, GEMINI_DIR, discover_projects, parse_project_sessions
from .secrets import _has_mixed_char_types, _shannon_entropy, redact_session

HF_TAG = "dataclaw"
@@ -49,15 +49,15 @@

 SETUP_TO_PUBLISH_STEPS = [
     "Step 1/6: Run prep/list to review project scope: dataclaw prep && dataclaw list",
-    "Step 2/6: Explicitly choose source scope: dataclaw config --source <claude|codex|both>",
+    "Step 2/6: Explicitly choose source scope: dataclaw config --source <claude|codex|gemini|all>",
     "Step 3/6: Configure exclusions/redactions and confirm projects: dataclaw config ...",
     "Step 4/6: Export locally only: dataclaw export --no-push --output /tmp/dataclaw_export.jsonl",
     "Step 5/6: Review and confirm: dataclaw confirm ...",
     "Step 6/6: After explicit user approval, publish: dataclaw export --publish-attestation \"User explicitly approved publishing to Hugging Face.\"",
 ]

-EXPLICIT_SOURCE_CHOICES = {"claude", "codex", "both"}
-SOURCE_CHOICES = ["auto", "claude", "codex", "both"]
+EXPLICIT_SOURCE_CHOICES = {"claude", "codex", "gemini", "all", "both"}
+SOURCE_CHOICES = ["auto", "claude", "codex", "gemini", "all"]


 def _mask_secret(s: str) -> str:
@@ -81,11 +81,13 @@ def _source_label(source_filter: str) -> str:
         return "Claude Code"
     if source_filter == "codex":
         return "Codex"
-    return "Claude Code or Codex"
+    if source_filter == "gemini":
+        return "Gemini CLI"
+    return "Claude Code, Codex, or Gemini CLI"


 def _normalize_source_filter(source_filter: str) -> str:
-    if source_filter == "both":
+    if source_filter in ("all", "both"):
         return "auto"
     return source_filter
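
The alias handling in `_normalize_source_filter` can be tried in isolation. The following is a minimal standalone sketch, not the project's actual module: the function names drop the leading underscore, and `source_label` is assumed to normalize its argument before choosing a label, which the visible hunk does not confirm.

```python
# Standalone sketch; constants and names mirror the diff but are re-declared here.
EXPLICIT_SOURCE_CHOICES = {"claude", "codex", "gemini", "all", "both"}


def normalize_source_filter(source_filter: str) -> str:
    """Map the multi-source aliases ("all", legacy "both") onto the internal "auto" filter."""
    if source_filter in ("all", "both"):
        return "auto"
    return source_filter


def source_label(source_filter: str) -> str:
    """Return a human-readable label (assumption: input is normalized first)."""
    normalized = normalize_source_filter(source_filter)
    labels = {"claude": "Claude Code", "codex": "Codex", "gemini": "Gemini CLI"}
    return labels.get(normalized, "Claude Code, Codex, or Gemini CLI")


if __name__ == "__main__":
    for choice in sorted(EXPLICIT_SOURCE_CHOICES):
        print(f"{choice!r} -> {normalize_source_filter(choice)!r} ({source_label(choice)})")
```

Because `both` stays in `EXPLICIT_SOURCE_CHOICES` and normalizes to the same value as `all`, configs written before this change should keep working under this reading of the diff.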

@@ -102,7 +104,7 @@ def _resolve_source_choice(

     Returns:
         (source_choice, explicit) where source_choice is one of
-        "claude" | "codex" | "both" | "auto".
+        "claude" | "codex" | "gemini" | "all" | "auto".
     """
     if _is_explicit_source_choice(requested_source):
         return requested_source, True
@@ -119,7 +121,9 @@ def _has_session_sources(source_filter: str = "auto") -> bool:
         return CLAUDE_DIR.exists()
     if source_filter == "codex":
         return CODEX_DIR.exists()
-    return CLAUDE_DIR.exists() or CODEX_DIR.exists()
+    if source_filter == "gemini":
+        return GEMINI_DIR.exists()
+    return CLAUDE_DIR.exists() or CODEX_DIR.exists() or GEMINI_DIR.exists()
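
A rough sketch of the directory check above, under stated assumptions: the real `CLAUDE_DIR`, `CODEX_DIR`, and `GEMINI_DIR` constants are replaced by a hypothetical `dirs` mapping, so the logic can run against a temporary directory instead of the user's home directory.

```python
import tempfile
from pathlib import Path


def has_session_sources(source_filter: str, dirs: dict[str, Path]) -> bool:
    """Return True if the requested source, or any source for "auto", exists on disk."""
    if source_filter in dirs:
        return dirs[source_filter].exists()
    # "auto" (and any unrecognized value) falls through to "any source present".
    return any(path.exists() for path in dirs.values())


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        # Hypothetical stand-ins for the real session directories.
        dirs = {
            "claude": root / ".claude",
            "codex": root / ".codex",
            "gemini": root / ".gemini" / "tmp",
        }
        dirs["gemini"].mkdir(parents=True)  # only Gemini CLI data is present
        for scope in ("claude", "gemini", "auto"):
            print(scope, has_session_sources(scope, dirs))
```

Passing the directories in as a mapping is a choice made for this sketch only; the function in the diff reads module-level constants.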


def _filter_projects_by_source(projects: list[dict], source_filter: str) -> list[dict]:
@@ -204,14 +208,14 @@ def _build_status_next_steps(
     steps = []
     if not source_confirmed:
         steps.append(
-            "Ask the user to explicitly choose export source scope: Claude Code, Codex, or both. "
-            "Then set it: dataclaw config --source <claude|codex|both>. "
+            "Ask the user to explicitly choose export source scope: Claude Code, Codex, Gemini, or all. "
+            "Then set it: dataclaw config --source <claude|codex|gemini|all>. "
             "Do not run export until source scope is explicitly confirmed."
         )
     else:
         steps.append(
             f"Source scope is currently set to '{configured_source}'. "
-            "If the user wants a different scope, run: dataclaw config --source <claude|codex|both>."
+            "If the user wants a different scope, run: dataclaw config --source <claude|codex|gemini|all>."
         )
     if not projects_confirmed:
         steps.append(
@@ -456,6 +460,7 @@ def _build_dataset_card(repo_id: str, meta: dict) -> str:
 - dataclaw
 - claude-code
 - codex-cli
+- gemini-cli
 - conversations
 - coding-assistant
 - tool-use
@@ -1091,8 +1096,11 @@ def prep(source_filter: str = "auto") -> None:
             err = "~/.claude was not found."
         elif effective_source_filter == "codex":
             err = "~/.codex was not found."
+        elif effective_source_filter == "gemini":
+            from .parser import GEMINI_DIR
+            err = f"{GEMINI_DIR} was not found."
         else:
-            err = "Neither ~/.claude nor ~/.codex was found."
+            err = "None of ~/.claude, ~/.codex, or ~/.gemini/tmp were found."
         print(json.dumps({"error": err}))
         sys.exit(1)

@@ -1181,7 +1189,7 @@ def main() -> None:
     cfg = sub.add_parser("config", help="View or set config")
     cfg.add_argument("--repo", type=str, help="Set HF repo")
     cfg.add_argument("--source", choices=sorted(EXPLICIT_SOURCE_CHOICES),
-                     help="Set export source scope explicitly: claude, codex, or both")
+                     help="Set export source scope explicitly: claude, codex, gemini, or all")
     cfg.add_argument("--exclude", type=str, help="Comma-separated projects to exclude")
     cfg.add_argument("--redact", type=str,
                      help="Comma-separated strings to always redact (API keys, usernames, domains)")
@@ -1302,17 +1310,17 @@ def _run_export(args) -> None:
             "error": "Source scope is not confirmed yet.",
             "hint": (
                 "Explicitly choose one source scope before exporting: "
-                "`claude`, `codex`, or `both`."
+                "`claude`, `codex`, `gemini`, or `all`."
             ),
             "required_action": (
-                "Ask the user whether to export Claude Code, Codex, or both. "
-                "Then run `dataclaw config --source <claude|codex|both>` "
-                "or pass `--source <claude|codex|both>` on the export command."
+                "Ask the user whether to export Claude Code, Codex, Gemini, or all. "
+                "Then run `dataclaw config --source <claude|codex|gemini|all>` "
+                "or pass `--source <claude|codex|gemini|all>` on the export command."
             ),
             "allowed_sources": sorted(EXPLICIT_SOURCE_CHOICES),
             "blocked_on_step": "Step 2/6",
             "process_steps": SETUP_TO_PUBLISH_STEPS,
-            "next_command": "dataclaw config --source both",
+            "next_command": "dataclaw config --source all",
         }, indent=2))
         sys.exit(1)

@@ -1396,8 +1404,11 @@ def _run_export(args) -> None:
             print(f"Error: {CLAUDE_DIR} not found.", file=sys.stderr)
         elif source_filter == "codex":
             print(f"Error: {CODEX_DIR} not found.", file=sys.stderr)
+        elif source_filter == "gemini":
+            from .parser import GEMINI_DIR
+            print(f"Error: {GEMINI_DIR} not found.", file=sys.stderr)
         else:
-            print("Error: neither ~/.claude nor ~/.codex was found.", file=sys.stderr)
+            print("Error: none of ~/.claude, ~/.codex, or ~/.gemini/tmp were found.", file=sys.stderr)
         sys.exit(1)

     projects = _filter_projects_by_source(discover_projects(), source_filter)
2 changes: 1 addition & 1 deletion dataclaw/config.py
@@ -13,7 +13,7 @@ class DataClawConfig(TypedDict, total=False):
     """Expected shape of the config dict."""

     repo: str | None
-    source: str | None # "claude" | "codex" | "both"
+    source: str | None # "claude" | "codex" | "gemini" | "all"
     excluded_projects: list[str]
     redact_strings: list[str]
     redact_usernames: list[str]