🧹 OpenEnv: Data Clean Environment

title: Data Clean Env
emoji: 🧹
colorFrom: blue
colorTo: green
sdk: docker
pinned: false
app_port: 8000
base_path: /web

Real-World Benchmarking for Agentic Data Engineering

🌟 Overview

Data Clean Env is a high-fidelity OpenEnv implementation for training and evaluating reinforcement learning (RL) agents on the messy, complex reality of data cleaning.

Unlike "toy" environments, this project simulates the workflow of a data engineer: identifying schema inconsistencies, handling missing values, casting types, and pruning noise from real-world datasets with pandas.

🛠️ Environment Architecture

🧠 Action Space

The agent interacts with the environment through atomic, high-level data operations defined in models.py:

Action	Parameters	Description
`fill_na`	`column_name`, `value`	Replaces missing values with a specific constant.
`drop_na`	`column_name`	Removes rows containing missing data in the target column.
`drop_column`	`column_name`	Deletes irrelevant or noisy features from the dataset.
`rename_column`	`column_name`, `value`	Fixes naming inconsistencies to match target schemas.
`change_type`	`column_name`, `value`	Casts columns to `int`, `float`, or `str` for downstream compatibility.
`submit`	-	Finalizes the cleaning process and triggers the programmatic grader.
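
As a rough illustration of how the actions in the table could map onto pandas calls, here is a minimal sketch. The function name `apply_action`, the `CASTS` mapping, and the feedback strings are assumptions made for this example; the actual implementation in models.py and the server may differ.

```python
import pandas as pd

# Hypothetical mapping from the `value` of change_type to a Python type.
CASTS = {"int": int, "float": float, "str": str}


def apply_action(df, action, column_name=None, value=None):
    """Return (new_df, feedback). The input frame is never mutated."""
    if action == "submit":
        return df, "submitted for grading"
    if column_name not in df.columns:
        return df, f"invalid: no column named {column_name!r}"
    if action == "fill_na":
        # A dict restricts the fill to the target column only.
        return df.fillna({column_name: value}), f"filled NaN in {column_name!r}"
    if action == "drop_na":
        return df.dropna(subset=[column_name]), f"dropped NaN rows for {column_name!r}"
    if action == "drop_column":
        return df.drop(columns=[column_name]), f"dropped {column_name!r}"
    if action == "rename_column":
        return df.rename(columns={column_name: value}), f"renamed {column_name!r} to {value!r}"
    if action == "change_type":
        if value not in CASTS:
            return df, f"invalid: unsupported type {value!r}"
        try:
            return df.astype({column_name: CASTS[value]}), f"cast {column_name!r} to {value}"
        except (ValueError, TypeError) as exc:
            # For example, casting a column that still holds NaN to int.
            return df, f"invalid: cast failed ({exc})"
    return df, f"invalid: unknown action {action!r}"
```

Returning a new frame together with a feedback string keeps each action atomic: a failed operation leaves the data unchanged and only reports why it failed.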

👁️ Observation Space

Each observation describes the current state of the data through four fields:

df_schema: Real-time dictionary of column data types.
missing_values: Current counts of NaN values per column.
head: A preview of the first 5 rows to identify formatting patterns.
feedback: Semantic descriptions of the impact of the last action.
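
The four fields above could be derived from a pandas DataFrame roughly as follows. This is a sketch only: the helper name `build_observation` and the plain-dict return type are assumptions, whereas the project itself uses a typed Pydantic Observation model.

```python
import pandas as pd


def build_observation(df, feedback=""):
    """Summarize a DataFrame into the four observation fields (illustrative)."""
    return {
        # Column name -> dtype rendered as a string, e.g. "float64".
        "df_schema": {col: str(dtype) for col, dtype in df.dtypes.items()},
        # Column name -> number of NaN values, as plain ints for JSON output.
        "missing_values": {col: int(n) for col, n in df.isna().sum().items()},
        # First 5 rows as a list of row dictionaries.
        "head": df.head(5).to_dict(orient="records"),
        "feedback": feedback,
    }
```

Converting dtypes and counts to plain strings and ints keeps the observation serializable without any pandas-specific types.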

📈 Task Progression & Grading

Each task is evaluated by a deterministic programmatic grader that compares the agent's output against a "Gold Standard" target and produces a score strictly between 0.0 and 1.0.

🟢 Easy (easy_clean):
- Goal: Basic imputation.
- Challenge: Fill missing 'age' values.
🟡 Medium (medium_clean):
- Goal: Noise reduction.
- Challenge: Handle missing values across multiple columns and remove "junk" features.
🔴 Hard (hard_clean):
- Goal: Full schema alignment.
- Challenge: Rename columns, perform safe type casting on dirty strings, and handle complex missing value fallbacks.
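
To make the idea of a deterministic grader concrete, the sketch below scores a cleaned dataset against a target schema and clamps the result so it stays strictly inside (0.0, 1.0). The specific checks, their equal weighting, and the names `grade` and `clamp_score` are assumptions for illustration; the project's real graders are not shown in this README. The clamp bounds 0.01 and 0.99 follow the range quoted under Reward Shaping.

```python
def clamp_score(raw, low=0.01, high=0.99):
    """Keep the score strictly between 0.0 and 1.0."""
    return max(low, min(high, raw))


def grade(schema, missing, target_schema):
    """Score a result against a gold-standard schema (hypothetical checks).

    schema / target_schema map column name -> dtype string;
    missing maps column name -> count of NaN values.
    """
    checks = []
    for col, dtype in target_schema.items():
        checks.append(col in schema)             # expected column is present
        checks.append(schema.get(col) == dtype)  # dtype matches the target
        checks.append(missing.get(col) == 0)     # no missing values remain
    # One extra check: no leftover "junk" columns outside the target.
    checks.append(set(schema) <= set(target_schema))
    return clamp_score(sum(checks) / len(checks))
```

Because the score depends only on the final schema and missing-value counts, the same submission always receives the same grade.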

🚀 Quick Start

🐳 Run with Docker

```bash
# Build the production image
docker build -t openenv_data_clean:latest -f server/Dockerfile .

# Start the environment server
docker run -p 8000:8000 openenv_data_clean:latest
```

🧪 Baseline Inference

We provide a deterministic, zero-temperature baseline script using the OpenAI client:

```bash
export HF_TOKEN="your_huggingface_token"
export IMAGE_NAME="openenv_data_clean:latest"
python inference.py
```

⚖️ Reward Shaping

Our reward function is designed for efficient RL convergence:

Incremental Progress: +0.1 for every valid schema improvement.
Penalization: -0.05 for invalid operations (e.g., targeting non-existent columns).
Completion Bonus: A final reward that scales with the total grader score (0.01 to 0.99).
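
The three rules above can be combined into a single per-step reward, sketched below. The function name, its boolean arguments, and the choice to add the grader score unscaled (a factor of 1.0) as the completion bonus are assumptions; the README states only that the final reward scales with the grader score.

```python
STEP_BONUS = 0.1         # valid action that improves the schema
INVALID_PENALTY = -0.05  # invalid operation, e.g. a non-existent column


def step_reward(valid, improved, done, grader_score=0.0):
    """Reward for one step under the assumed shaping rules."""
    if not valid:
        return INVALID_PENALTY
    reward = STEP_BONUS if improved else 0.0
    if done:
        # Assumed scale factor of 1.0 on the grader score.
        reward += grader_score
    return reward
```

For example, a valid `submit` that changes nothing but ends the episode with a grader score of 0.99 yields a reward of 0.99, while an action on a missing column yields -0.05.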

🎯 Meta Hackathon Compliance

✅ Typed Models: Fully Pydantic-powered Observation and Action.
✅ API Standard: Implements step(), reset(), and state().
✅ Strict Logs: Emits [START], [STEP], and [END] traces exactly as required.
✅ Robustness: Handles network timeouts and invalid JSON gracefully.

Built with ❤️ for the Meta & Hugging Face OpenEnv Hackathon.
