Ship faster. Stay compliant. Scale to Data Mesh.
Quick Start • Features • Documentation • Contributing
floe is an open platform for building internal data platforms.
Platform teams choose their stack from 12 plugin types:
- Compute: DuckDB, Snowflake, Databricks, Spark, BigQuery
- Orchestrator: Dagster, Airflow 3.x
- Catalog: Polaris, AWS Glue, Unity Catalog
- Observability: Split into TelemetryBackend (Jaeger, Datadog) + LineageBackend (Marquez, Atlan)
- [... 8 more plugin types]
Data teams get opinionated workflows:
- ✅ 30 lines replaces 300+ lines of boilerplate
- ✅ Same config works everywhere (dev/staging/prod parity)
- ✅ Standards enforced automatically (compile-time validation)
- ✅ Full composability (swap DuckDB → Snowflake without pipeline changes)
Batteries included. Fully customizable. Production-ready.
Platform engineers supporting 50+ data teams face:
- Integration hell: Stitching together 15+ tools that don't talk to each other
- Exception management: Every team has a "unicorn use case" that breaks your framework
- RBAC sprawl: Managing 1200+ credentials across teams, environments, services
- Security whack-a-mole: Someone always finds a way to hardcode production secrets
Data engineers shipping data products face:
- Governance theater: 3 meetings to approve a pipeline (64% struggle to embed governance in workflows)
- Platform dependency: Blocked for 2 weeks because "platform team is busy" (63% say leaders don't understand their pain)
- Framework limitations: Can't do what you need → shadow IT or a 6-month wait
- Unclear requirements: "I thought 80% test coverage was optional?"
Result: Governance blocks teams instead of enabling them.
For platform teams:
- Get a pre-integrated stack (DuckDB + Dagster + Polaris + dbt tested together)
- Say "yes" to edge cases with plugin architecture (add Spark? Swap ComputePlugin. Need Kafka? Add IngestionPlugin)
- Automatic credential vending (SecretReference pattern, manage 1 OAuth config instead of 1200 secrets)
- Enforce at compile-time (violations caught before deployment, not in production)
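The SecretReference pattern mentioned above can be sketched in a few lines of Python. This is illustrative only: the class name comes from the text, but the fields, the store identifiers, and the environment-variable fallback are assumptions, not floe's actual API.

```python
# Illustrative sketch of the SecretReference pattern (not floe's real API):
# pipeline configs carry an opaque reference; the platform layer resolves it
# at deploy time, so no secret value ever appears in floe.yaml or in git.
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class SecretReference:
    store: str  # e.g. "k8s" or "vault" -- hypothetical store identifiers
    key: str    # lookup key inside that store

    def resolve(self) -> str:
        # A real implementation would call the secret store's API; an
        # environment variable stands in here for demonstration.
        value = os.environ.get(self.key)
        if value is None:
            raise KeyError(f"secret {self.key!r} not vended from {self.store!r}")
        return value


# Simulate the platform vending a credential at deploy time:
os.environ["SNOWFLAKE_PASSWORD"] = "demo-value"
ref = SecretReference(store="k8s", key="SNOWFLAKE_PASSWORD")
resolved = ref.resolve()  # the data team only ever handled the reference
```

The point of the pattern: the reference is safe to commit and diff, while the value exists only in the deployed environment.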
For data teams:
- Governance = automatic (compile checks replace meetings)
- Get capabilities instantly (platform adds plugin, you use it immediately)
- Escape hatches built-in (plugin system extensible for your unicorn use case)
- Requirements explicit (minimum_test_coverage: 80 in manifest.yaml, not tribal knowledge)
If it compiles, it's compliant.
Composable architecture: Mix and match from 12 plugin types
```yaml
# manifest.yaml (50 lines supports 200 pipelines)
compute:
  approved:
    - name: duckdb     # Cost-effective analytics
    - name: spark      # Heavy processing
    - name: snowflake  # Enterprise warehouse
  default: duckdb      # Used when a transform doesn't specify
orchestrator: dagster  # Or: airflow
catalog: polaris       # Or: glue, unity-catalog
governance:
  naming_pattern: medallion   # bronze/silver/gold layers
  minimum_test_coverage: 80   # Explicit, not ambiguous
  block_on_failure: true      # Enforced, not suggested
```

Declarative config: Same across all 50 teams. Select compute per step from the approved list.
```yaml
# floe.yaml (30 lines replaces 300 lines of boilerplate)
name: customer-analytics
version: "0.1.0"
transforms:
  - type: dbt
    path: ./dbt/staging
    compute: spark   # Heavy processing on Spark
  - type: dbt
    path: ./dbt/marts
    compute: duckdb  # Analytics on DuckDB
schedule:
  cron: "0 6 * * *"
```

Compilation phase (2 seconds, catches violations before deployment):
```shell
$ floe compile
[1/3] Loading platform policies
  ✓ Platform: acme-data-platform v1.2.3
[2/3] Validating pipeline
  ✓ Naming: bronze_customers (compliant)
  ✓ Test coverage: 85% (>80% required)
[3/3] Generating artifacts
  ✓ Dagster assets (Python)
  ✓ dbt profiles (YAML)
  ✓ Kubernetes manifests (YAML)
  ✓ Credentials (vended automatically)
Compilation SUCCESS - ready to deploy
```

What's auto-generated:
- ✅ Database connection configs (dbt profiles.yml)
- ✅ Orchestration code (Dagster assets or Airflow DAGs)
- ✅ Kubernetes manifests (Jobs, Services, ConfigMaps)
- ✅ Environment-specific settings (dev/staging/prod)
- ✅ Credential vending (SecretReference pattern, no hardcoded secrets)
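As a rough illustration of the artifact-generation step, here is what rendering a dbt profile entry for a DuckDB compute target could look like. This is a sketch: `render_dbt_profile` and the path convention are invented for illustration and are not floe's real codegen, though the output structure follows dbt's documented profiles.yml format for the dbt-duckdb adapter.

```python
# Hypothetical sketch of generating a dbt profiles.yml entry from the
# resolved compute plugin. Helper name and path convention are invented;
# the output keys mirror dbt's profiles.yml format (dbt-duckdb adapter).
def render_dbt_profile(project: str, compute: str, target: str) -> dict:
    if compute == "duckdb":
        outputs = {"type": "duckdb", "path": f"/data/{project}.duckdb"}
    else:  # other ComputePlugins would render their own adapter config
        raise NotImplementedError(f"no renderer for compute {compute!r}")
    return {project: {"target": target, "outputs": {target: outputs}}}


profile = render_dbt_profile("customer-analytics", "duckdb", "dev")
```

Because the profile is derived from the manifest, data engineers never hand-write (or mistype) connection configs.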
Same floe.yaml works across dev, staging, production.
Choose from 12 plugin types. Swap implementations without breaking pipelines.
Multi-compute pipelines: Platform teams approve N compute targets. Data engineers select per-step from the approved list. Different steps can use different engines:
```yaml
# manifest.yaml (Platform Team)
compute:
  approved:
    - name: spark      # Heavy processing
    - name: duckdb     # Cost-effective analytics
    - name: snowflake  # Enterprise warehouse
  default: duckdb
```

```yaml
# floe.yaml (Data Engineers)
transforms:
  - type: dbt
    path: models/staging/
    compute: spark   # Process 10TB raw data
  - type: dbt
    path: models/marts/
    compute: duckdb  # Build metrics on 100GB result
```

Environment parity preserved: Each step uses the SAME compute across dev/staging/prod. No "works in dev, fails in prod" surprises.
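The parity guarantee can be pictured with a toy resolver. This is illustrative, not floe's real implementation; all dictionary keys and paths below are invented for the example.

```python
# Toy sketch of two-tier resolution: the pipeline config is identical in
# every environment; only platform-owned connection details are injected.
# All names and keys here are illustrative, not floe's actual schema.
PIPELINE = {
    "name": "customer-analytics",
    "transforms": [{"type": "dbt", "compute": "duckdb"}],
}

PLATFORM = {  # owned by platform engineers, per environment
    "dev": {"duckdb": {"path": "/tmp/dev.duckdb"}},
    "prod": {"duckdb": {"path": "s3://warehouse/prod.duckdb"}},
}


def resolve(env: str) -> dict:
    compute = PIPELINE["transforms"][0]["compute"]
    return {**PIPELINE, "connection": PLATFORM[env][compute]}


# Same pipeline, same compute engine, different injected connection:
dev, prod = resolve("dev"), resolve("prod")
```

The pipeline half never changes between environments; only the platform-injected connection does, which is what rules out config drift.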
Real-world swap scenarios:
- DuckDB (embedded, cost-effective) ↔ Snowflake (managed, elastic)
- Dagster (asset-centric) ↔ Airflow 3.x (DAG-based)
- Jaeger (self-hosted) ↔ Datadog (managed SaaS)
Plugin types: Compute, Orchestrator, Catalog, Storage, TelemetryBackend, LineageBackend, DBT, SemanticLayer, Ingestion, DataQuality, Secrets, Identity
Two-tier YAML. Platform team defines infrastructure. Data teams define logic.
No code generation anxiety: Compiled artifacts are checked into git. Diff them. Review them. Trust them.
Catch errors before deployment. No runtime surprises.
Example:
```shell
$ floe compile
[FAIL] 'stg_payments' violates naming convention
       Expected: bronze_*, silver_*, gold_*
[FAIL] 'gold_revenue' missing required tests
       Required: [unique_pk, not_null_pk, documentation]
Compilation FAILED - fix violations before deployment
```

Not documentation governance. Computational governance.
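Checks like these are straightforward to express in code. A minimal sketch of the medallion naming rule and the coverage gate, assuming a simple regex-based check (this is not floe's actual validator, and its messages are invented):

```python
import re

# Medallion naming rule from manifest.yaml: bronze_* / silver_* / gold_*.
# Sketch only -- floe's real validator and its messages may differ.
MEDALLION = re.compile(r"^(bronze|silver|gold)_[a-z0-9_]+$")


def validate_model(name: str, coverage: float, minimum: float = 80.0) -> list[str]:
    violations = []
    if not MEDALLION.match(name):
        violations.append(f"'{name}' violates naming convention")
    if coverage < minimum:
        violations.append(f"'{name}' coverage {coverage}% < required {minimum}%")
    return violations
```

Running it against the two models from the example output: `bronze_customers` at 85% coverage passes cleanly, while `stg_payments` fails on naming alone.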
Layer boundaries enforce separation:
- Credentials in platform config → Data teams cannot access
- Automatic vending with SecretReference → No hardcoded secrets possible
- Layer architecture → Data teams cannot override platform policies
- Type-safe schemas → Catch errors at compile-time
Result: Manage 1 OAuth config instead of 1200 credentials.
Same pipeline config works everywhere:
| Environment | Platform Config | Pipeline Config |
|---|---|---|
| Dev | DuckDB (local cluster) | floe.yaml (no changes) |
| Staging | DuckDB (shared cluster) | floe.yaml (no changes) |
| Prod | DuckDB (production cluster) | floe.yaml (no changes) |
Or swap to Snowflake, Databricks, or Spark: the pipeline config stays identical.
Result: No "works on my machine" issues. No config drift. What you test is what you deploy.
Federated ownership with computational governance:
- Enterprise policies → Domain constraints → Data products (three-tier hierarchy)
- Data contracts as code (ODCS standard, auto-validated)
- Compile-time + runtime enforcement (not meetings)
- Domain teams have autonomy within guardrails
Scale from single platform to federated Data Mesh without rebuilding.
```mermaid
%%{init: {'theme':'base', 'themeVariables': { 'fontSize':'16px'}}}%%
flowchart TB
    L4["<b>Layer 4: DATA</b><br/>Ephemeral Jobs<br/><br/>Owner: Data Engineers<br/>• Write SQL transforms<br/>• Define schedules<br/>• INHERIT platform constraints"]
    L3["<b>Layer 3: SERVICES</b><br/>Long-lived Infrastructure<br/><br/>Owner: Platform Engineers<br/>• Orchestrator, Catalog<br/>• Observability services<br/>• Always running, health probes"]
    L2["<b>Layer 2: CONFIGURATION</b><br/>Immutable Policies<br/><br/>Owner: Platform Engineers<br/>• Plugin selection<br/>• Governance rules<br/>• ENFORCED at compile-time"]
    L1["<b>Layer 1: FOUNDATION</b><br/>Framework Code<br/><br/>Owner: floe Maintainers<br/>• Schemas, validation engine<br/>• Distributed via PyPI + Helm"]
    L4 -->|Connects to| L3
    L3 -->|Configured by| L2
    L2 -->|Built on| L1
    classDef dataLayer fill:#4A90E2,stroke:#2E5C8A,stroke-width:3px,color:#fff
    classDef serviceLayer fill:#F5A623,stroke:#D68910,stroke-width:3px,color:#fff
    classDef configLayer fill:#9013FE,stroke:#6B0FBF,stroke-width:3px,color:#fff
    classDef foundationLayer fill:#50E3C2,stroke:#2EB8A0,stroke-width:3px,color:#fff
    class L4 dataLayer
    class L3 serviceLayer
    class L2 configLayer
    class L1 foundationLayer
```
Key principle: Configuration flows downward only. Data teams cannot weaken platform policies.
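One way to picture "downward only" is a policy merge where a lower layer may tighten an inherited threshold but never weaken it. A sketch under that assumption (illustrative, not floe's implementation; the policy key comes from the manifest example, the merge semantics are invented):

```python
# Sketch of downward-only policy flow: a lower layer (domain / data team)
# may tighten an inherited threshold, but weakening it is rejected.
# Policy keys and merge semantics are illustrative, not floe's schema.
def merge_policies(parent: dict, child: dict) -> dict:
    merged = dict(parent)
    for key, value in child.items():
        if key == "minimum_test_coverage" and value < parent.get(key, 0):
            raise ValueError(f"cannot weaken {key}: {value} < {parent[key]}")
        merged[key] = value
    return merged


enterprise = {"minimum_test_coverage": 80}
domain = {"minimum_test_coverage": 90}  # tightening is allowed
effective = merge_policies(enterprise, domain)
```

A domain setting 90 succeeds; a domain trying 70 is rejected at merge time, before anything is deployed.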
```mermaid
%%{init: {'theme':'base', 'themeVariables': { 'fontSize':'16px'}}}%%
flowchart LR
    PM["<b>manifest.yaml</b><br/><br/>Platform Engineers<br/><br/>Infrastructure<br/>Credentials<br/>Governance policies"]
    FL["<b>floe.yaml</b><br/><br/>Data Engineers<br/><br/>Pipeline logic<br/>Transforms<br/>Schedules"]
    PM -->|Resolves to| FL
    classDef platformConfig fill:#F5A623,stroke:#D68910,stroke-width:3px,color:#fff
    classDef dataConfig fill:#4A90E2,stroke:#2E5C8A,stroke-width:3px,color:#fff
    class PM platformConfig
    class FL dataConfig
```
| File | Audience | Contains |
|---|---|---|
| `manifest.yaml` | Platform Engineers | Infrastructure, credentials, governance policies |
| `floe.yaml` | Data Engineers | Pipeline logic, transforms, schedules |
Benefit: Data teams never see credentials or infrastructure details. Platform team controls standards centrally.
floe provides batteries-included OSS defaults that run on any Kubernetes cluster:
- Apache Iceberg: Open table format with ACID transactions
- Apache Polaris: Iceberg REST catalog
- DuckDB: High-performance analytics engine
- dbt: SQL transformation framework
- Dagster: Asset-centric orchestration
- Cube: Semantic layer and headless BI
- OpenTelemetry + OpenLineage: Observability and lineage standards
Not "integration hell": Pre-configured, tested together, deployable with one command. Or swap any component for your cloud service of choice.
- Getting Started: Quick Start Guide
- Configuration: Configuration Contracts (manifest.yaml + floe.yaml)
- Architecture: Four-Layer Model • Platform Enforcement
- Development: Contributing Guide • Code Standards
- ADRs: Architecture Decision Records
We welcome contributions! See CONTRIBUTING.md for guidelines.
- Type safety: All code must pass `mypy --strict`
- Formatting: Black (100 char), enforced by ruff
- Testing: >80% coverage, 100% requirement traceability
- Security: No hardcoded secrets, Pydantic validation
- Architecture: Respect layer boundaries
Current (v0.1.0 - Pre-Alpha):
- Four-layer architecture
- Two-tier configuration
- Kubernetes-native deployment
- Compile-time validation
Next (v0.2.0 - Alpha):
- Complete K8s-native testing
- Plugin ecosystem docs
- CLI command suite
- External plugin support
Future (v1.0.0 - Production):
- Data Mesh extensions
- OCI registry integration
- Multi-environment workflows
Apache License 2.0 - See LICENSE for details.
- Issues: GitHub Issues
- Discussions: GitHub Discussions