
Add TabArena single-table OpenML datasets#369

Open
pc0618 wants to merge 8 commits into snap-stanford:main from pc0618:tabarena-single-table

Conversation

@pc0618
Contributor

@pc0618 pc0618 commented Mar 2, 2026

This PR adds TabArena single-table translations to RelBench via a lightweight OpenML-backed integration.

  • New datasets: tabarena-<slug> (single records table).
  • New tasks: per-dataset fold-<N> tasks (OpenML CV folds; train/val/test tables).
  • Optional dependency: pip install relbench[tabarena] (adds openml).
  • Server downloads are skipped for tabarena-*; datasets/tasks are generated locally from OpenML and cached under ~/.cache/relbench/tabarena-*/.
  • Includes a small helper script: examples/translate_tabarena_to_relbench.py.
  • Adds offline unit tests that stub OpenML so CI doesn’t require network access.

Example usage:

```python
from relbench.datasets import get_dataset
from relbench.tasks import get_task

dataset = get_dataset("tabarena-credit-g", download=False)
task = get_task("tabarena-credit-g", "fold-0", download=False)
train = task.get_table("train")
```
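
The split tables wrap plain pandas DataFrames, so a quick sanity check of a split is straightforward. The sketch below is illustrative only: it assumes a binary `target` column and that the table's DataFrame is reachable via `.df` (column names are assumptions, not fixed by this PR).

```python
import pandas as pd

def majority_baseline(train_df: pd.DataFrame, test_df: pd.DataFrame,
                      target_col: str = "target") -> pd.Series:
    """Predict the most frequent training label for every test row."""
    majority = train_df[target_col].mode().iloc[0]
    return pd.Series(majority, index=test_df.index, name=target_col)

# Hypothetical usage with the tables loaded above:
# preds = majority_baseline(train.df, task.get_table("test").df)
```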

@pc0618
Contributor Author

pc0618 commented Mar 2, 2026

CI is green and the PR should be ready for review. Pinging @rishabh-ranjan and @matthiasf.

@rishabh-ranjan
Collaborator

@JustinGu32 please see that everything looks alright.

We should support downloading from server by uploading the zip files.

I don't understand what's going on with the folds. I will get Vignesh's opinion on it.

@ParthShroff

@rishabh-ranjan TabArena is built on OpenML tasks, which already come with predefined CV resampling splits (folds) instead of timestamp splits. `fold-N` names which OpenML train/test split we use. The underlying target is the same across folds; only the partitioning changes. The timestamps in the task tables are synthetic and exist only to fit the RelBench interface.
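
Concretely, each fold comes from OpenML's predefined resampling. A minimal sketch of fetching one fold's indices (assuming the `openml` package is installed and a task id is known; the wrapper name is illustrative):

```python
import numpy as np

def fold_indices(openml_task, fold: int):
    """Return (train_idx, test_idx) for one predefined OpenML CV fold.

    `openml_task` is any object exposing OpenML's
    get_train_test_split_indices(fold=...) API, e.g. the result of
    openml.tasks.get_task(task_id).
    """
    train_idx, test_idx = openml_task.get_train_test_split_indices(fold=fold)
    return np.asarray(train_idx), np.asarray(test_idx)

# Hypothetical usage (requires network access):
# import openml
# task = openml.tasks.get_task(31)  # assumed id for credit-g
# train_idx, test_idx = fold_indices(task, fold=0)
```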

@pc0618
Contributor Author

pc0618 commented Mar 4, 2026

Implemented requested refactor on this PR:

  • Renamed TabArena task names from fold-N to split-N in registration and docs/examples.
  • Refactored task class to TabArenaSplitEntityTask (kept TabArenaFoldEntityTask as a backward-compatible alias in code only).
  • Removed synthetic timestamps from TabArena task tables (time_col=None, task tables now contain only record_id + target before masking).
  • Added split-oriented dataset helpers (available_splits, get_openml_split_indices) while keeping fold-named aliases for backward compatibility.
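
A minimal sketch of what an index-based task table could look like after this change: only the record key and the target, selected by predefined split indices, with no timestamps. The helper and column names are illustrative, not the PR's actual code.

```python
import pandas as pd

def make_split_table(records: pd.DataFrame, indices, target_col: str,
                     pkey_col: str = "record_id") -> pd.DataFrame:
    """Build a task table holding only the record key and the target,
    row-selected by predefined OpenML split indices (time_col=None)."""
    table = records.iloc[list(indices)][[pkey_col, target_col]]
    return table.reset_index(drop=True)
```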

On the AutoCompleteTask point: I kept a custom task class because AutoCompleteTask is time-window based, while TabArena uses predefined OpenML split indices; this keeps split semantics exact.

Branch is updated (tabarena-single-table) with commits f8557b6 + f796cba (and pre-commit.ci formatting commit c7d62a9).

@pc0618
Contributor Author

pc0618 commented Mar 4, 2026

Follow-up update pushed to tabarena-single-table (commit 790b5db):

  • TabArena task tables are now edge-free (fkey_col_to_pkey_table = {}) to keep single-table context single-node for RT/rustler sampling.
  • Added custom masking in TabArena task to keep record_id available for inference while staying edge-free.
  • Added a detailed PluRel-16B runbook at examples/tabarena_plurel16b_inference.md with:
    • exact pinned SHAs
    • split-name compatibility notes (split-*)
    • random-sampling behavior notes
    • seq_len=2048 and seq_len=4096 command templates
    • no-FK and single-node context sanity checks.

Comment thread README.md Outdated

TabArena datasets are generated locally (from OpenML) and cached under `~/.cache/relbench/tabarena-*/`. Passing `download=True` will skip the RelBench server download step for these datasets/tasks.

For an end-to-end PluRel-16B TabArena inference runbook (including `split-*` task naming, random sampling behavior, and `seq_len=2048/4096` commands), see:
Collaborator


please remove both these files from examples. you can make a PR to the internal RT repo for these files.

Contributor Author


Done. I removed the internal RT/PluRel runbook from examples and removed the README reference to it. The PR now only contains public-facing example scripts.

Comment thread README.md
```


**Using TabArena datasets**
Collaborator


please rewrite this section to strictly follow the other integration sections (e.g. add citation) in the readme and remove any extra material (e.g. the line about download=True).

Contributor Author


Updated. The TabArena README section now follows the same style as the other integrations: install line, short description, links to public example scripts, and a citation block. I also removed the extra download=True note.

Comment on lines +6 to +8
- task names are `split-N` (not `fold-N`)
- task tables are edge-free (`fkey_col_to_pkey_table = {}`)
- no synthetic task timestamps (`time_col=None`)


these seem to be internal notes that can be removed



This seems to be like an internal exploration file with inference experiment details. This can be removed from the public code

Comment thread README.md Outdated
TabArena datasets are generated locally (from OpenML) and cached under `~/.cache/relbench/tabarena-*/`. Passing `download=True` will skip the RelBench server download step for these datasets/tasks.

For an end-to-end PluRel-16B TabArena inference runbook (including `split-*` task naming, random sampling behavior, and `seq_len=2048/4096` commands), see:
[`examples/tabarena_plurel16b_inference.md`](examples/tabarena_plurel16b_inference.md)


The example file needs to be rewritten based only on publicly available data and code.

Collaborator


I asked above to simply remove these example files entirely



We should include examples of how to use these tabular datasets, especially since we have the splits involved.
Reproducing the results with tabpfn on our dataset will help ensure people can trust the data.

Contributor Author


Done. The old internal example was replaced with public examples built only around OpenML data and standard public Python packages. examples/translate_tabarena_to_relbench.py now compares the original OpenML task with the relbenchified records and split-* tables.

Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added. examples/translate_tabarena_to_relbench.py now shows how the source rows map to records, how the RelBench test split matches the OpenML test split, and how RelBench train+val partition the OpenML train side. I also added examples/validate_tabarena_baseline.py, which trains a public baseline on the original OpenML rows and verifies that the relbenchified view produces identical features, labels, predictions, and metrics. I did not add an RT example, in line with the later feedback that this does not belong in RelBench.
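
The consistency property being validated here can be stated as a small check. This is a sketch, not the PR's code; it treats each split as a set of row indices:

```python
def check_split_consistency(openml_train, openml_test,
                            rb_train, rb_val, rb_test) -> bool:
    """Verify that the RelBench test split equals the OpenML test split,
    and that RelBench train+val exactly partition the OpenML train side."""
    openml_train, openml_test = set(openml_train), set(openml_test)
    rb_train, rb_val, rb_test = set(rb_train), set(rb_val), set(rb_test)
    return (
        rb_test == openml_test                 # test sides match exactly
        and rb_train | rb_val == openml_train  # train+val cover the train side
        and not (rb_train & rb_val)            # train and val are disjoint
    )
```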

@kvignesh1420

It would be great to have the following:

  • An example script showing how relbench-tabarena data can be loaded and explored. This script should also load data from the original tabarena data and compare. Show how the splits differ.
  • Next, in a different script, show how one can load TabPFN/XGBoost models and infer on relbench-tabarena data as well as original-tabarena data to show that our data and evaluation is correct.
  • Then, in a third script, show how RT can be trained and tested on relbench-tabarena.

@rishabh-ranjan
Collaborator

the RT script does not belong here (in RelBench). agree with the other two scripts.

@pc0618
Contributor Author

pc0618 commented Mar 31, 2026

Pushed an update to pc0618/relbench-fork:tabarena-single-table at commit 4e607bd.

This addresses the review feedback as follows:

  • removed the internal RT/PluRel runbook from examples
  • rewrote the TabArena README section to match the style of the other integrations
  • added a proper TabArena citation
  • removed the extra README material about download=True
  • rewrote the public example to compare the original OpenML task with the relbenchified records and split-* tables
  • added a second public example that validates the relbenchified view against the original OpenML data using a standard baseline

I did not add an RT example, in line with the feedback that this does not belong in RelBench.

Validation run locally before pushing:

  • pytest test/datasets/test_tabarena.py -> 3 passed
  • smoke-tested both new example scripts on credit-g
  • the baseline validation script shows identical features, labels, predictions, and AUROC between the original OpenML view and the relbenchified view

