Merged
53 commits
edb6867
docs: add plan for Common Voice dataset explorer tool
wavekat-eason Apr 2, 2026
52677d6
ci: add ephemeral Azure VM runner + dataset sync workflows
wavekat-eason Apr 2, 2026
5266cfd
ci: add workflow README, use vars for non-sensitive config
wavekat-eason Apr 2, 2026
3c0f7c5
ci: upgrade azure actions to v3 (Node.js 24), use v5 VM sizes
wavekat-eason Apr 2, 2026
aed54b0
ci: add workflow docs, locale/split options, gitignore .wrangler
wavekat-eason Apr 3, 2026
44dba6b
feat: add Common Voice dataset sync script
wavekat-eason Apr 3, 2026
5ddf1a1
docs: clarify R2 token setup in workflow README
wavekat-eason Apr 3, 2026
69daeaa
fix: wrap D1 batch insert body in object
wavekat-eason Apr 3, 2026
0aaed74
docs: add datasets table, auto-detect locale, split all
wavekat-eason Apr 3, 2026
bc69a1e
fix: clean up NSG, public IP, VNET after VM deletion
wavekat-eason Apr 3, 2026
8236e3e
ci: upgrade Azure VM sizes to v6
wavekat-eason Apr 3, 2026
7fd13c9
ci: disable cleanup job during debugging
wavekat-eason Apr 3, 2026
64259eb
ci: remove locale input, add split all option
wavekat-eason Apr 3, 2026
2b04adb
feat: auto-detect locale/version, datasets table, split all
wavekat-eason Apr 3, 2026
b668e6c
ci: make runner persistent for multiple jobs
wavekat-eason Apr 3, 2026
36d4c54
ci: switch runner to ubuntu-latest-m
wavekat-eason Apr 3, 2026
55c6a2d
fix: migrate D1 tables for missing columns
wavekat-eason Apr 3, 2026
00bab97
feat: preflight D1/R2 credential checks
wavekat-eason Apr 3, 2026
7f2ffb9
fix: R2 upload retry with backoff, lower concurrency
wavekat-eason Apr 3, 2026
4242224
fix: use Buffer instead of stream for R2 uploads
wavekat-eason Apr 3, 2026
e96355c
feat: track has_audio in D1 after R2 upload
wavekat-eason Apr 3, 2026
ef5445c
Runs on "cv-sync"
wavekat-eason Apr 3, 2026
cf93abc
feat: check sync status before download, add force option
wavekat-eason Apr 3, 2026
6efd245
feat: add cv-explorer worker and web app
wavekat-eason Apr 3, 2026
2a0316a
fix: isolate extraction dir per dataset ID
wavekat-eason Apr 3, 2026
21fbc8e
feat: add ETA to R2 upload progress and concurrency option
wavekat-eason Apr 3, 2026
2bada18
feat: add runner selection to cv-sync workflow
wavekat-eason Apr 3, 2026
e61d801
feat: add ubuntu-latest to runner options
wavekat-eason Apr 3, 2026
b1f7eaa
feat: add ETA to D1 insert progress
wavekat-eason Apr 3, 2026
494359b
perf: use multi-row INSERT for D1 batch inserts
wavekat-eason Apr 3, 2026
5f364e1
fix: reduce D1 params per statement to fit D1 limit
wavekat-eason Apr 3, 2026
a0ad824
docs: add auth implementation plan and rebrand to Common Voice Explorer
wavekat-eason Apr 3, 2026
de0069c
feat: show syncing datasets and add audio filter
wavekat-eason Apr 3, 2026
17930bc
feat: replace audio progress bar with waveform display
wavekat-eason Apr 3, 2026
eef0254
fix: flush has_audio updates to D1 during upload
wavekat-eason Apr 3, 2026
680e761
feat: add 128 and 256 R2 concurrency options
wavekat-eason Apr 3, 2026
84645a4
feat: add download button to audio player
wavekat-eason Apr 3, 2026
11a1136
fix: retry download with resume on connection drop
wavekat-eason Apr 3, 2026
6ab3fed
feat: add GitHub OAuth and terms acceptance gate
wavekat-eason Apr 3, 2026
9a87146
fix: use char length instead of word count for CJK languages
wavekat-eason Apr 3, 2026
9581410
feat: add v7 AMD VM size options for runner provisioning
wavekat-eason Apr 3, 2026
2b705c1
fix: stream TSV parsing to avoid string size limit
wavekat-eason Apr 3, 2026
dc6fc71
fix: handle 416 response when download is complete
wavekat-eason Apr 3, 2026
ec95ebf
feat: add CI deployment and static asset serving for cv-explorer
wavekat-eason Apr 3, 2026
d5a44bd
chore: add GitHub OAuth client ID to wrangler config
wavekat-eason Apr 4, 2026
85e007f
fix: add ASSETS binding to wrangler config for SPA routing
wavekat-eason Apr 4, 2026
0687daa
fix: disable assets html_handling to prevent SPA redirect
wavekat-eason Apr 4, 2026
70c8be0
fix: handle base64url encoding in JWT decode
wavekat-eason Apr 4, 2026
cb36bf3
chore: add CV Explorer to README and CI
wavekat-eason Apr 4, 2026
fdba759
chore: use Common Voice Explorer instead of CV
wavekat-eason Apr 4, 2026
32384de
feat: add Google Analytics to Common Voice Explorer
wavekat-eason Apr 4, 2026
d3a4f8f
feat: add GitHub repo link to Explorer header
wavekat-eason Apr 4, 2026
36b4887
feat: redesign CV Explorer login page with card layout
wavekat-eason Apr 4, 2026
244 changes: 244 additions & 0 deletions .github/workflows/README.md
@@ -0,0 +1,244 @@
# GitHub Actions Workflows

## CV: Provision Runner (`cv-runner-provision.yml`)

Spins up an ephemeral Azure VM as a self-hosted GitHub Actions runner.
The VM auto-shuts down after a configurable timeout. Zero idle cost.

**Inputs:**

| Input | Default | Description |
|-------|---------|-------------|
| `vm_size` | `Standard_D4s_v3` | Azure VM size (2/4/8 vCPU options) |
| `disk_size_gb` | `256` | OS disk size in GB |
| `max_hours` | `2` | Auto-shutdown after N hours |

### Setup guide — before first run

#### 1. Set your variables

```bash
# List available subscriptions — use the "id" field as SUBSCRIPTION_ID
az account list --output table

# Edit these to match your environment
SUBSCRIPTION_ID="xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
LOCATION="australiaeast" # australiaeast, eastasia, japaneast, southeastasia, westus2, etc.
RESOURCE_GROUP="github-runner-rg"
```

#### 2. Create Azure resource group + service principal

```bash
az group create --name "$RESOURCE_GROUP" --location "$LOCATION"

az ad sp create-for-rbac \
  --name "github-cv-runner" \
  --role Contributor \
  --scopes "/subscriptions/$SUBSCRIPTION_ID/resourceGroups/$RESOURCE_GROUP"
```

This outputs JSON like:

```json
{
  "appId": "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
  "displayName": "github-cv-runner",
  "password": "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
  "tenant": "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
}
```

For the `AZURE_CREDENTIALS` secret, build this JSON from the output:

```json
{
  "clientId": "<appId>",
  "clientSecret": "<password>",
  "subscriptionId": "<your SUBSCRIPTION_ID>",
  "tenantId": "<tenant>"
}
```
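
The mapping above can also be scripted. The following is a minimal sketch, not part of the workflows: `make_azure_credentials` is a hypothetical helper name, and it assumes none of the four values contains a double quote or backslash, since it does no JSON escaping.

```shell
# Hypothetical helper: print the AZURE_CREDENTIALS JSON from the fields
# returned by `az ad sp create-for-rbac` plus your subscription ID.
# Assumption: the values need no JSON escaping (no quotes or backslashes).
# Usage: make_azure_credentials <appId> <password> <subscriptionId> <tenant>
make_azure_credentials() {
  printf '{"clientId":"%s","clientSecret":"%s","subscriptionId":"%s","tenantId":"%s"}\n' \
    "$1" "$2" "$3" "$4"
}
```

If the GitHub CLI is installed and authenticated, the output could be piped straight into the secret, for example `make_azure_credentials "$APP_ID" "$PASSWORD" "$SUBSCRIPTION_ID" "$TENANT" | gh secret set AZURE_CREDENTIALS`, where `APP_ID`, `PASSWORD`, and `TENANT` are placeholder variables holding the values from the service principal output.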

#### 3. Create GitHub PAT

GitHub → Settings → Developer settings → Personal access tokens → Fine-grained tokens:
- **Repository access:** this repo only
- **Permissions:** Administration (read/write) — needed to generate runner registration tokens

#### 4. Configure secrets and variables

Repo → Settings → Secrets and variables → Actions:

**Secrets** (sensitive):

| Secret | Value |
|--------|-------|
| `AZURE_CREDENTIALS` | JSON with `clientId`, `clientSecret`, `subscriptionId`, `tenantId` |
| `GH_PAT` | GitHub fine-grained PAT with the Administration (read/write) permission from step 3 |

**Variables** (non-sensitive):

| Variable | Value |
|----------|-------|
| `AZURE_RESOURCE_GROUP` | Resource group name (e.g. `github-runner-rg`) |
| `AZURE_LOCATION` | Azure region — must match the resource group location |

---

## CV: Dataset Sync (`cv-sync.yml`)

Runs the Common Voice dataset sync on the `cv-sync` runner (the Azure VM).
After sync completes, automatically deletes the VM.

**Inputs:**

| Input | Default | Description |
|-------|---------|-------------|
| `locale` | `en` | Common Voice locale (e.g. `en`, `ja`, `zh-TW`) |
| `split` | `validated` | Dataset split (`validated`, `train`, `dev`, `test`) |
| `dataset_id` | *(required)* | Data Collective dataset ID (from the dataset URL) |

### Setup guide — before first sync

#### 1. Install Wrangler (Cloudflare CLI)

```bash
npm install -g wrangler

# Login to Cloudflare
wrangler login
```

This opens a browser to authenticate with your Cloudflare account.

#### 2. Create D1 database and R2 bucket

```bash
# Create the D1 database — note the database ID from the output
wrangler d1 create cv-explorer

# Create the R2 bucket
wrangler r2 bucket create cv-explorer
```

The `d1 create` output will look like:

```
✅ Successfully created DB 'cv-explorer'
database_id = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
```

Use that `database_id` as `CV_EXPLORER_D1_ID`.
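
If you would rather not copy the ID by hand, it can be extracted from the command output. This is a rough sketch that assumes the output contains a line in exactly the `database_id = "..."` form shown above (other Wrangler versions may print it differently); `extract_d1_id` is a hypothetical helper name.

```shell
# Hypothetical helper: read `wrangler d1 create` output on stdin and print
# the value of the first `database_id = "..."` line.
# Assumption: the output format matches the sample in this README.
extract_d1_id() {
  sed -n 's/^[[:space:]]*database_id = "\([^"]*\)".*$/\1/p' | head -n 1
}
```

For example, `wrangler d1 create cv-explorer | extract_d1_id` would print only the ID, which is the value expected for `CV_EXPLORER_D1_ID`.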

#### 3. Create Cloudflare API token

Cloudflare dashboard → My Profile → API Tokens → Create Token → **Create Custom Token**:

Permissions (add all three):

| Scope | Resource | Permission |
|-------|----------|------------|
| Account | D1 | Edit |
| Account | R2 Storage | Edit |
| Account | Workers Scripts | Edit |

- **Account resources:** Include → your account
- **Zone resources:** can leave empty (not needed)

#### 4. Create R2 API token (S3-compatible)

> **Note:** R2 S3-compatible tokens can only be created via the Cloudflare dashboard —
> there is no `wrangler` command for this.

Cloudflare dashboard → Storage & Databases → R2 → **Manage R2 API Tokens** → Create API token:
- **Token name:** `cv-explorer-sync`
- **Permissions:** Object Read & Write
- **Bucket:** Apply to specific bucket → `cv-explorer`

After creating, you'll see an **Access Key ID** and **Secret Access Key**. Copy both
immediately — the secret is only shown once. These are separate from the Cloudflare API
token and are used for S3-compatible uploads to R2.
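
R2's S3-compatible endpoint is derived from the Cloudflare account ID. As a small sketch for checking the new keys before a sync, the hypothetical helper `r2_endpoint` below builds that URL; how the sync script itself constructs the endpoint is not shown here.

```shell
# Hypothetical helper: print the S3-compatible R2 endpoint for an account.
# Usage: r2_endpoint <cloudflare-account-id>
r2_endpoint() {
  if [ -z "${1:-}" ]; then
    echo "usage: r2_endpoint <cloudflare-account-id>" >&2
    return 1
  fi
  printf 'https://%s.r2.cloudflarestorage.com\n' "$1"
}
```

Assuming the AWS CLI is available (this README does not require it), a quick credential check could be `aws s3 ls s3://cv-explorer --endpoint-url "$(r2_endpoint "$CLOUDFLARE_ACCOUNT_ID")"` with `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` set to the two R2 values and `AWS_DEFAULT_REGION=auto`.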

#### 5. Configure secrets and variables

**Secrets** (sensitive):

| Secret | Value |
|--------|-------|
| `DATACOLLECTIVE_API_KEY` | From datacollective.mozillafoundation.org → Profile → Credentials |
| `CLOUDFLARE_API_TOKEN` | The API token from step 3 (for D1) |
| `CV_EXPLORER_D1_ID` | Database ID from `wrangler d1 create` output |
| `R2_ACCESS_KEY_ID` | R2 API token Access Key ID from step 4 |
| `R2_SECRET_ACCESS_KEY` | R2 API token Secret Access Key from step 4 |

**Variables** (non-sensitive):

| Variable | Value |
|----------|-------|
| `CLOUDFLARE_ACCOUNT_ID` | Cloudflare dashboard → Overview → Account ID (right sidebar) |
| `CV_EXPLORER_R2_BUCKET` | Bucket name: `cv-explorer` |
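
When running the sync outside GitHub Actions, it helps to confirm that every value in the two tables is present before starting a long download. The sketch below is an illustration only, with `missing_vars` as a hypothetical helper name; it is not the preflight check used by the sync script.

```shell
# Hypothetical helper: print the names of unset or empty variables and
# return 1 if any are missing. Arguments must be valid variable names.
# Usage: missing_vars NAME [NAME...]
missing_vars() {
  missing=""
  for name in "$@"; do
    # Indirect lookup of the variable called $name (POSIX-compatible).
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      missing="$missing $name"
    fi
  done
  if [ -n "$missing" ]; then
    echo "${missing# }"
    return 1
  fi
}

# Example call, using the names from the tables above:
# missing_vars DATACOLLECTIVE_API_KEY CLOUDFLARE_API_TOKEN CV_EXPLORER_D1_ID \
#   R2_ACCESS_KEY_ID R2_SECRET_ACCESS_KEY CLOUDFLARE_ACCOUNT_ID CV_EXPLORER_R2_BUCKET
```

The function prints nothing and returns 0 when every variable is set, so it can gate a script with `missing_vars ... || exit 1`.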

---

## Typical usage

```
1. Trigger "CV: Provision Runner" (pick VM size, disk, max hours)
2. Wait ~2 min for VM to come online
3. Trigger "CV: Dataset Sync" (pick locale, split, dataset ID)
4. Sync runs on the Azure VM
5. VM is deleted automatically after sync (or shuts down after max_hours)
```
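
The same sequence can be started from a terminal with the GitHub CLI. This is a hedged sketch: `run` and `cv_sync_run` are hypothetical helper names, the input names and defaults come from the tables above, and the script prints the commands by default rather than executing them. It does not wait for the VM to come online, so step 2 remains manual.

```shell
# Print each command instead of running it unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Hypothetical wrapper around the two workflows.
# Usage: cv_sync_run <dataset_id> [locale] [split]
cv_sync_run() {
  if [ -z "${1:-}" ]; then
    echo "usage: cv_sync_run <dataset_id> [locale] [split]" >&2
    return 1
  fi
  # Step 1: provision the runner VM with the documented defaults.
  run gh workflow run cv-runner-provision.yml \
    -f vm_size=Standard_D4s_v3 -f disk_size_gb=256 -f max_hours=2
  # Step 3: trigger the sync with the chosen inputs.
  run gh workflow run cv-sync.yml \
    -f dataset_id="$1" -f locale="${2:-en}" -f split="${3:-validated}"
}
```

Calling `cv_sync_run <dataset_id> ja` prints the two `gh workflow run` commands for review; setting `DRY_RUN=0` executes them, which requires `gh auth login` and permission to dispatch workflows in the repository.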

---

## Useful commands

### Check runner VM status

```bash
# List all runner VMs
az vm list --resource-group github-runner-rg --output table

# Check power state of a specific VM
az vm get-instance-view \
  --resource-group github-runner-rg \
  --name cv-sync-1775172573 \
  --query "instanceView.statuses[1].displayStatus" \
  --output tsv
# Output: "VM running", "VM deallocated", "VM stopped", etc.

# Check the scheduled auto-shutdown inside the VM
az vm run-command invoke \
  --resource-group github-runner-rg \
  --name cv-sync-1775172573 \
  --command-id RunShellScript \
  --scripts "shutdown --show"
```

### Manually stop or delete a VM

```bash
# Stop (deallocate): stops billing for compute, keeps the disk
az vm deallocate \
  --resource-group github-runner-rg \
  --name cv-sync-1775172573

# Delete the VM; its OS disk and NIC are removed with it only if their
# delete option was set to "Delete" when the VM was created
az vm delete \
  --resource-group github-runner-rg \
  --name cv-sync-1775172573 \
  --yes --force-deletion true

# Delete ALL runner VMs in the resource group
az vm list --resource-group github-runner-rg --query "[?starts_with(name, 'cv-sync-')].name" -o tsv | \
  xargs -I {} az vm delete --resource-group github-runner-rg --name {} --yes --force-deletion true
```

### Check GitHub runner registration

```bash
# List registered runners (requires GH_TOKEN or gh auth login)
gh api /repos/{owner}/{repo}/actions/runners --jq '.runners[] | {name, status, labels: [.labels[].name]}'
```
24 changes: 24 additions & 0 deletions .github/workflows/ci.yml
@@ -55,3 +55,27 @@ jobs:
          node --version >> "$GITHUB_STEP_SUMMARY"
          npm --version >> "$GITHUB_STEP_SUMMARY"
          echo '```' >> "$GITHUB_STEP_SUMMARY"

  cv-explorer:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v6
      - uses: actions/setup-node@v4
        with:
          node-version: "22"
          cache: npm
          cache-dependency-path: |
            tools/cv-explorer/worker/package-lock.json
            tools/cv-explorer/web/package-lock.json
      - run: npm ci
        working-directory: tools/cv-explorer/worker
      - run: npm ci
        working-directory: tools/cv-explorer/web
      - run: make ci-cv-explorer
      - name: Job summary
        if: always()
        run: |
          echo "## Common Voice Explorer" >> "$GITHUB_STEP_SUMMARY"
          echo '```' >> "$GITHUB_STEP_SUMMARY"
          node --version >> "$GITHUB_STEP_SUMMARY"
          echo '```' >> "$GITHUB_STEP_SUMMARY"
43 changes: 43 additions & 0 deletions .github/workflows/cv-deploy.yml
@@ -0,0 +1,43 @@
name: "CV Explorer: Deploy"

on:
  push:
    branches: [main]
    paths:
      - "tools/cv-explorer/worker/**"
      - "tools/cv-explorer/web/**"
  workflow_dispatch:

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v6

      - uses: actions/setup-node@v4
        with:
          node-version: 22
          cache: npm
          cache-dependency-path: |
            tools/cv-explorer/worker/package-lock.json
            tools/cv-explorer/web/package-lock.json

      - name: Install dependencies
        run: |
          cd tools/cv-explorer/worker && npm ci
          cd ../web && npm ci

      - name: Run D1 migrations
        run: cd tools/cv-explorer/worker && npx wrangler d1 migrations apply cv-explorer --remote
        env:
          CLOUDFLARE_API_TOKEN: ${{ secrets.CLOUDFLARE_API_TOKEN }}
          CLOUDFLARE_ACCOUNT_ID: ${{ vars.CLOUDFLARE_ACCOUNT_ID }}

      - name: Build frontend
        run: cd tools/cv-explorer/web && npm run build

      - name: Deploy worker
        run: cd tools/cv-explorer/worker && npx wrangler deploy
        env:
          CLOUDFLARE_API_TOKEN: ${{ secrets.CLOUDFLARE_API_TOKEN }}
          CLOUDFLARE_ACCOUNT_ID: ${{ vars.CLOUDFLARE_ACCOUNT_ID }}