| 1 | +<!-- |
| 2 | + ~ Licensed to the Apache Software Foundation (ASF) under one |
| 3 | + ~ or more contributor license agreements. See the NOTICE file |
| 4 | + ~ distributed with this work for additional information |
| 5 | + ~ regarding copyright ownership. The ASF licenses this file |
| 6 | + ~ to you under the Apache License, Version 2.0 (the |
| 7 | + ~ "License"); you may not use this file except in compliance |
| 8 | + ~ with the License. You may obtain a copy of the License at |
| 9 | + ~ |
| 10 | + ~ http://www.apache.org/licenses/LICENSE-2.0 |
| 11 | + ~ |
| 12 | + ~ Unless required by applicable law or agreed to in writing, |
| 13 | + ~ software distributed under the License is distributed on an |
| 14 | + ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY |
| 15 | + ~ KIND, either express or implied. See the License for the |
| 16 | + ~ specific language governing permissions and limitations |
| 17 | + ~ under the License. |
| 18 | +--> |
| 19 | + |
| 20 | +# RFC: Modularize `iceberg` Implementations |
| 21 | + |
| 22 | +## Background |
| 23 | + |
| 24 | +Issue #1819 highlighted that the current `iceberg` crate mixes the Iceberg protocol abstractions (catalog/table/plan/transaction) with concrete runtime, storage, and execution code (Tokio runtime wrappers, opendal-based `FileIO`, Arrow helpers, DataFusion glue, etc.). This coupling makes the crate heavy and blocks users from composing their own storage or execution stacks. |
| 25 | + |
| 26 | +Two principles have been agreed upon: |
| 27 | +1. The `iceberg` crate remains the single source of truth for all protocol traits and data structures. We will not create a separate “kernel” crate or facade layer. |
| 28 | +2. Concrete integrations (Tokio runtime, opendal `FileIO`, Arrow/DataFusion glue, catalog adapters, etc.) move out into dedicated companion crates. Users needing a ready path can depend on those crates (e.g., `iceberg-datafusion` or `integrations/local`), while custom stacks depend only on `iceberg`. |
| 29 | + |
| 30 | +This RFC focuses on modularizing implementations; detailed trait signatures (e.g., `FileIO`, `Runtime`) will be handled in separate RFCs. |
| 31 | + |
| 32 | +## Goals and Scope |
| 33 | + |
| 34 | +- Keep `iceberg` as the protocol crate (traits + metadata + planning), without bundling runtimes, storage adapters, or execution glue. |
| 35 | +- Relocate concrete code into companion crates under `crates/fileio/*`, `crates/runtime/*`, and `crates/integrations/*`. |
| 36 | +- Provide a staged plan for extracting Arrow-dependent APIs to avoid destabilizing file-format code. |
| 37 | +- Minimize breaking surfaces: traits stay in `iceberg`; downstream crates mainly adjust dependencies. |
| 38 | + |
| 39 | +Out of scope: changes to the Iceberg table specification or to the external behavior of catalog adapters; detailed trait method design (covered by follow-up RFCs). |
| 40 | + |
| 41 | +## Architecture Overview |
| 42 | + |
| 43 | +### Workspace Layout (target) |
| 44 | + |
| 45 | +``` |
| 46 | +crates/ |
| 47 | + iceberg/ # core traits, metadata, planning, transactions |
| 48 | + fileio/ |
| 49 | + opendal/ # e.g. `iceberg-fileio-opendal` |
| 50 | + fs/ # other FileIO implementations |
| 51 | + runtime/ |
| 52 | + tokio/ # e.g. `iceberg-runtime-tokio` |
| 53 | + smol/ |
| 54 | + catalog/* # catalog adapters (REST, HMS, Glue, etc.) |
| 55 | + integrations/ |
| 56 | + local/ # simple local/arrow-based helper crate |
| 57 | + datafusion/ # combines core + implementations for DF |
| 58 | + cache-moka/ |
| 59 | + playground/ |
| 60 | +``` |
| 61 | + |
| 62 | +- `crates/iceberg` drops its direct dependencies on opendal, Tokio, Arrow, and DataFusion. |
| 63 | +- Implementation crates depend on `iceberg` to implement the traits. |
| 64 | +- Higher-level crates (`integrations/local`, `iceberg-datafusion`) assemble the pieces for ready-to-use scenarios. |
| 65 | + |
| 66 | +### Core Trait Surfaces |
| 67 | + |
| 68 | +`FileIO`, `Runtime`, `Catalog`, `Table`, `Transaction`, `TableScan` (plan descriptors) all remain hosted in `iceberg`. Precise method signatures are deferred to dedicated RFCs to avoid locking details prematurely. |
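A minimal sketch of the split described above, assuming a heavily simplified `FileIO` trait. The method names `read` and `write`, the helper `metadata_len`, and the `MemoryFileIO` backend are all hypothetical and chosen only for illustration; the real signatures are deferred to the dedicated RFCs and will likely differ (async I/O, error types, and so on). Everything is kept in one file so it compiles on its own, with comments marking which crate each part would belong to.

```rust
use std::collections::HashMap;

// ---- Would live in `iceberg` (the protocol crate) ----

/// Hypothetical, simplified trait; not the signature proposed by this RFC.
pub trait FileIO {
    fn read(&self, path: &str) -> Option<Vec<u8>>;
    fn write(&mut self, path: &str, bytes: Vec<u8>);
}

/// Protocol-level code is written against the trait, never a concrete backend.
pub fn metadata_len(io: &dyn FileIO, path: &str) -> Option<usize> {
    io.read(path).map(|bytes| bytes.len())
}

// ---- Would live in a companion crate under `crates/fileio/*` ----

/// Illustrative in-memory backend (hypothetical, not part of the RFC).
#[derive(Default)]
pub struct MemoryFileIO {
    files: HashMap<String, Vec<u8>>,
}

impl FileIO for MemoryFileIO {
    fn read(&self, path: &str) -> Option<Vec<u8>> {
        // Return a copy of the stored bytes, or `None` if the path is unknown.
        self.files.get(path).cloned()
    }

    fn write(&mut self, path: &str, bytes: Vec<u8>) {
        // Overwrite any previous content stored under the same path.
        self.files.insert(path.to_string(), bytes);
    }
}
```

The dependency arrow points one way only: the backend names the trait, while `metadata_len` never names the backend. That is the property that lets a custom stack swap in its own storage without touching `iceberg`.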
| 69 | + |
| 70 | +### Usage Modes |
| 71 | + |
| 72 | +- **Custom stacks**: depend on `iceberg` and provide your own implementations. |
| 73 | +- **Pre-built stacks**: depend on `integrations/local` or `iceberg-datafusion`, which bundle `iceberg` with selected runtime/FileIO/Arrow helpers. |
| 74 | +- `iceberg` does not re-export companion crates; users compose explicitly. |
| 75 | + |
| 76 | +## Migration Plan (staged, with Arrow extraction phased) |
| 77 | + |
| 78 | +1. **Phase 1 – Confirm trait hosting, defer details** |
| 79 | + - Keep all protocol traits in `iceberg`; move detailed API design (FileIO, Runtime, etc.) to separate RFCs. |
| 80 | + - Add temporary shims/deprecations only when traits are finalized. |
| 81 | + |
| 82 | +2. **Phase 2 – First Arrow step: move `to_arrow()` out** |
| 83 | + - Relocate the public `to_arrow()` API to `integrations/local` (or another higher-level crate). Core no longer exposes Arrow entry points. |
| 84 | + - Keep internal Arrow-dependent helpers (e.g., `ArrowFileReader`) temporarily in `iceberg` to avoid breaking file-format flows. |
| 85 | + |
| 86 | +3. **Phase 3 – Gradual Arrow dependency removal** |
| 87 | + - Incrementally migrate/replace Arrow-dependent internals (`ArrowFileReader`, format-specific readers) into `integrations/local` or other helper crates. |
| 88 | + - Adjust file-format APIs as needed; expect this to be multi-release work. |
| 89 | + |
| 90 | +4. **Phase 4 – Dependency cleanup** |
| 91 | + - Ensure catalog and integration crates depend only on `iceberg` plus the specific runtime/FileIO/helper crates they need. |
| 92 | + - Verify build/test pipelines against the new dependency graph. |
| 93 | + |
| 94 | +5. **Phase 5 – Docs & release** |
| 95 | + - Publish migration guides: where `to_arrow()` moved, how to assemble local/DataFusion stacks. |
| 96 | + - Schedule deprecation windows for remaining Arrow helpers; target a breaking release once Arrow is fully removed from `iceberg`. |
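One way Phase 2 could look from a user's perspective is the extension-trait pattern: the core type keeps only planning, and a helper crate attaches `to_arrow()` from outside. The sketch below is an assumption, not the design this RFC commits to. `TableScanArrowExt` and `FakeBatch` are hypothetical names, `FakeBatch` stands in for real Arrow types so the example has no external dependencies, and the synchronous `plan_files` is a simplification.

```rust
// ---- Would live in `iceberg` (the protocol crate) ----

/// Hypothetical, simplified plan descriptor: it only knows which files to read.
#[derive(Debug, Clone, PartialEq)]
pub struct TableScan {
    pub data_files: Vec<String>,
}

impl TableScan {
    /// Planning stays in core and has no Arrow dependency.
    pub fn plan_files(&self) -> &[String] {
        &self.data_files
    }
}

// ---- Would live in a helper crate such as `integrations/local` ----

/// Placeholder for an Arrow record batch (hypothetical stand-in type).
#[derive(Debug, Clone, PartialEq)]
pub struct FakeBatch {
    pub source_file: String,
}

/// Extension trait that re-attaches `to_arrow()` to the core type from outside.
pub trait TableScanArrowExt {
    fn to_arrow(&self) -> Vec<FakeBatch>;
}

impl TableScanArrowExt for TableScan {
    fn to_arrow(&self) -> Vec<FakeBatch> {
        // One placeholder batch per planned file; real code would decode data here.
        self.plan_files()
            .iter()
            .map(|file| FakeBatch { source_file: file.clone() })
            .collect()
    }
}
```

Under this pattern, existing `scan.to_arrow()` call sites would keep compiling once the helper crate is added as a dependency and the extension trait is imported, which keeps the migration mostly a matter of dependencies and `use` lines.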
| 97 | + |
| 98 | +## Compatibility |
| 99 | + |
| 100 | +- Short term: users of `Table::scan().to_arrow()` must switch to `integrations/local` (or another crate that rehosts that API). Other Arrow types stay temporarily but will migrate in later phases. |
| 101 | +- Long term: `iceberg` will be Arrow-free; companion crates provide Arrow-based helpers. |
| 102 | +- Tests/examples move alongside the implementations they exercise. |
| 103 | + |
| 104 | +## Risks and Mitigations |
| 105 | + |
| 106 | +| Risk | Description | Mitigation | |
| 107 | +| ---- | ----------- | ---------- | |
| 108 | +| Arrow dependency unwinding is complex | File-format readers may rely on Arrow types | Phase the work; move `to_arrow()` first, then refactor readers; document interim state | |
| 109 | +| Discoverability | Users may not know where Arrow helpers went | Clear docs pointing to `integrations/local` and `iceberg-datafusion`; migration guide | |
| 110 | +| Trait churn | Future trait RFCs may break early adopters | Use deprecation shims and communicate timelines | |
| 111 | +| Duplicate impls | Multiple helper crates could overlap | Provide recommended combinations and feature guidance | |
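The "deprecation shims" mitigation for trait churn could be sketched with Rust's built-in `#[deprecated]` attribute: the old method stays as a default implementation that forwards to the new one, so existing implementors and callers keep compiling while seeing a warning. The method names `read_all` and `read_range`, and the `StaticFileIO` type, are hypothetical and do not reflect any agreed trait design.

```rust
/// Hypothetical trait after a follow-up RFC replaced `read_all` with `read_range`.
pub trait FileIO {
    /// New method that implementors must provide.
    fn read_range(&self, path: &str, offset: usize, len: usize) -> Vec<u8>;

    /// Old entry point kept as a shim: it forwards to the new method.
    #[deprecated(note = "use `read_range` instead")]
    fn read_all(&self, path: &str) -> Vec<u8> {
        self.read_range(path, 0, usize::MAX)
    }
}

/// Illustrative backend serving the same bytes for every path.
pub struct StaticFileIO {
    pub bytes: Vec<u8>,
}

impl FileIO for StaticFileIO {
    fn read_range(&self, _path: &str, offset: usize, len: usize) -> Vec<u8> {
        // Clamp the requested window to the available data.
        let start = offset.min(self.bytes.len());
        let end = start.saturating_add(len).min(self.bytes.len());
        self.bytes[start..end].to_vec()
    }
}

/// A downstream caller that has not migrated yet. It still compiles; the
/// `allow` attribute silences the warning it would otherwise receive.
#[allow(deprecated)]
pub fn legacy_read(io: &dyn FileIO) -> Vec<u8> {
    io.read_all("data.bin")
}
```

Because the shim is a default method, implementors only need to add `read_range`; the old method can then be deleted in the breaking release once the deprecation window closes.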
| 112 | + |
| 113 | +## Open Questions |
| 114 | + |
| 115 | +1. Versioning: align companion crate versions with `iceberg`, or allow independent versions plus a compatibility matrix? |
| 116 | +2. Deprecation schedule: how long do we keep interim Arrow helpers before full removal from `iceberg`? |
| 117 | + |
| 118 | +## Conclusion |
| 119 | + |
| 120 | +We will keep `iceberg` as the protocol crate while modularizing concrete implementations. Arrow removal will be phased: first relocating `to_arrow()` to `integrations/local`, then gradually moving Arrow-dependent readers and helpers. This keeps the core lean, lets users compose their preferred runtime/FileIO stacks, and still offers ready-to-use combinations via companion crates. |