Commit 052feaf

Xuanwo, kevinjqliu, alamb, and github-actions[bot] authored
rfc: Modularize iceberg Implementations (#1854)
## Which issue does this PR close?

- Part of #1819

## What changes are included in this PR?

Add RFC for iceberg-kernel

## Are these changes tested?

---------

Signed-off-by: Xuanwo <github@xuanwo.io>
Co-authored-by: Kevin Liu <kevinjqliu@users.noreply.github.com>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
1 parent 5724fc5 commit 052feaf

Lines changed: 120 additions & 0 deletions
@@ -0,0 +1,120 @@
<!--
  ~ Licensed to the Apache Software Foundation (ASF) under one
  ~ or more contributor license agreements. See the NOTICE file
  ~ distributed with this work for additional information
  ~ regarding copyright ownership. The ASF licenses this file
  ~ to you under the Apache License, Version 2.0 (the
  ~ "License"); you may not use this file except in compliance
  ~ with the License. You may obtain a copy of the License at
  ~
  ~   http://www.apache.org/licenses/LICENSE-2.0
  ~
  ~ Unless required by applicable law or agreed to in writing,
  ~ software distributed under the License is distributed on an
  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
  ~ KIND, either express or implied. See the License for the
  ~ specific language governing permissions and limitations
  ~ under the License.
-->

# RFC: Modularize `iceberg` Implementations

## Background

Issue #1819 highlighted that the current `iceberg` crate mixes the Iceberg protocol abstractions (catalog/table/plan/transaction) with concrete runtime, storage, and execution code (Tokio runtime wrappers, opendal-based `FileIO`, Arrow helpers, DataFusion glue, etc.). This coupling makes the crate heavy and blocks users from composing their own storage or execution stacks.

Two principles have been agreed on:

1. The `iceberg` crate remains the single source of truth for all protocol traits and data structures. We will not create a separate “kernel” crate or facade layer.
2. Concrete integrations (Tokio runtime, opendal `FileIO`, Arrow/DataFusion glue, catalog adapters, etc.) move out into dedicated companion crates. Users who want a ready-made path can depend on those crates (e.g., `iceberg-datafusion` or `integrations/local`), while custom stacks depend only on `iceberg`.

This RFC focuses on modularizing implementations; detailed trait signatures (e.g., `FileIO`, `Runtime`) will be handled in separate RFCs.

## Goals and Scope

- Keep `iceberg` as the protocol crate (traits + metadata + planning), without bundling runtimes, storage adapters, or execution glue.
- Relocate concrete code into companion crates under `crates/fileio/*`, `crates/runtime/*`, and `crates/integrations/*`.
- Provide a staged plan for extracting Arrow-dependent APIs to avoid destabilizing file-format code.
- Minimize breaking surfaces: traits stay in `iceberg`; downstream crates mainly adjust dependencies.

Out of scope: changes to the Iceberg table specification or catalog adapter external behavior; detailed trait method design (covered by follow-up RFCs).
40+
41+
## Architecture Overview
42+
43+
### Workspace Layout (target)
44+
45+
```
46+
crates/
47+
iceberg/ # core traits, metadata, planning, transactions
48+
fileio/
49+
opendal/ # e.g. `iceberg-fileio-opendal`
50+
fs/ # other FileIO implementations
51+
runtime/
52+
tokio/ # e.g. `iceberg-runtime-tokio`
53+
smol/
54+
catalog/* # catalog adapters (REST, HMS, Glue, etc.)
55+
integrations/
56+
local/ # simple local/arrow-based helper crate
57+
datafusion/ # combines core + implementations for DF
58+
cache-moka/
59+
playground/
60+
```

- `crates/iceberg` drops direct deps on opendal, Tokio, Arrow, and DataFusion.
- Implementation crates depend on `iceberg` to implement the traits.
- Higher-level crates (`integrations/local`, `iceberg-datafusion`) assemble the pieces for ready-to-use scenarios.

### Core Trait Surfaces

`FileIO`, `Runtime`, `Catalog`, `Table`, `Transaction`, `TableScan` (plan descriptors) all remain hosted in `iceberg`. Precise method signatures are deferred to dedicated RFCs to avoid locking details prematurely.
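
As a rough illustration of "traits hosted in core, implementations elsewhere", the sketch below uses two modules as stand-ins for two crates. The `FileIO` methods shown here, the `copy_file` helper, and the `MemoryFileIO` type are all hypothetical and exist only for this example; the real signatures are deferred to follow-up RFCs.

```rust
/// Stand-in for the `iceberg` crate: it hosts the trait and nothing concrete.
mod iceberg {
    use std::io;

    /// Hypothetical, simplified `FileIO` shape (an assumption for this sketch).
    pub trait FileIO {
        fn read(&self, path: &str) -> io::Result<Vec<u8>>;
        fn write(&mut self, path: &str, bytes: Vec<u8>) -> io::Result<()>;
    }

    /// Core logic is written against the trait only.
    pub fn copy_file(io: &mut dyn FileIO, from: &str, to: &str) -> io::Result<usize> {
        let bytes = io.read(from)?;
        let len = bytes.len();
        io.write(to, bytes)?;
        Ok(len)
    }
}

/// Stand-in for a companion crate under `crates/fileio/*`.
mod fileio_memory {
    use super::iceberg::FileIO;
    use std::collections::HashMap;
    use std::io;

    /// Hypothetical in-memory implementation, used here instead of opendal.
    #[derive(Default)]
    pub struct MemoryFileIO {
        files: HashMap<String, Vec<u8>>,
    }

    impl FileIO for MemoryFileIO {
        fn read(&self, path: &str) -> io::Result<Vec<u8>> {
            self.files
                .get(path)
                .cloned()
                .ok_or_else(|| io::Error::new(io::ErrorKind::NotFound, path.to_string()))
        }

        fn write(&mut self, path: &str, bytes: Vec<u8>) -> io::Result<()> {
            self.files.insert(path.to_string(), bytes);
            Ok(())
        }
    }
}

fn main() -> std::io::Result<()> {
    use iceberg::FileIO;

    let mut store = fileio_memory::MemoryFileIO::default();
    store.write("metadata/v1.json", b"{}".to_vec())?;
    let copied = iceberg::copy_file(&mut store, "metadata/v1.json", "metadata/v2.json")?;
    println!("copied {copied} bytes");
    Ok(())
}
```

The dependency arrow points one way only: `fileio_memory` imports from `iceberg`, and `iceberg` never names a concrete implementation.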

### Usage Modes

- **Custom stacks**: depend on `iceberg` and provide your own implementations.
- **Pre-built stacks**: depend on `integrations/local` or `iceberg-datafusion`, which bundle `iceberg` with selected runtime/FileIO/Arrow helpers.
- `iceberg` does not re-export companion crates; users compose explicitly.
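
The two usage modes might look roughly like the following. This is a sketch under stated assumptions: the `Stack` struct, the one-method `FileIO` and `Runtime` traits, `default_stack()`, and `MyObjectStore` are hypothetical names invented for the example, and modules again stand in for crates.

```rust
use std::sync::Arc;

/// Stand-in for `iceberg`: traits plus a type assembled from them.
mod iceberg {
    use std::sync::Arc;

    // Hypothetical trait shapes, for illustration only.
    pub trait FileIO: Send + Sync {
        fn scheme(&self) -> &'static str;
    }
    pub trait Runtime: Send + Sync {
        fn name(&self) -> &'static str;
    }

    /// Core never picks an implementation; callers pass both pieces in.
    pub struct Stack {
        pub file_io: Arc<dyn FileIO>,
        pub runtime: Arc<dyn Runtime>,
    }

    impl Stack {
        pub fn describe(&self) -> String {
            format!("{} on {}", self.file_io.scheme(), self.runtime.name())
        }
    }
}

/// Stand-in for a pre-built crate such as `integrations/local`.
mod integrations_local {
    use super::iceberg::{FileIO, Runtime, Stack};
    use std::sync::Arc;

    pub struct LocalFs;
    impl FileIO for LocalFs {
        fn scheme(&self) -> &'static str {
            "file"
        }
    }

    pub struct InlineRuntime;
    impl Runtime for InlineRuntime {
        fn name(&self) -> &'static str {
            "inline"
        }
    }

    /// Pre-built mode: the companion crate chooses the pieces.
    pub fn default_stack() -> Stack {
        Stack {
            file_io: Arc::new(LocalFs),
            runtime: Arc::new(InlineRuntime),
        }
    }
}

/// Custom mode: the user supplies an implementation of the core trait.
struct MyObjectStore;
impl iceberg::FileIO for MyObjectStore {
    fn scheme(&self) -> &'static str {
        "s3"
    }
}

fn custom_stack() -> iceberg::Stack {
    iceberg::Stack {
        file_io: Arc::new(MyObjectStore),
        runtime: Arc::new(integrations_local::InlineRuntime),
    }
}

fn main() {
    println!("{}", integrations_local::default_stack().describe());
    println!("{}", custom_stack().describe());
}
```

Because the core type only holds trait objects, a custom stack can still borrow individual pieces (here the runtime) from a companion crate without taking the whole bundle.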

## Migration Plan (staged, with Arrow extraction phased)

1. **Phase 1 – Confirm trait hosting, defer details**
   - Keep all protocol traits in `iceberg`; move detailed API design (FileIO, Runtime, etc.) to separate RFCs.
   - Add temporary shims/deprecations only when traits are finalized.

2. **Phase 2 – First Arrow step: move `to_arrow()` out**
   - Relocate the public `to_arrow()` API to `integrations/local` (or another higher-level crate). Core no longer exposes Arrow entry points.
   - Keep internal Arrow-dependent helpers (e.g., `ArrowFileReader`) temporarily in `iceberg` to avoid breaking file-format flows.

3. **Phase 3 – Gradual Arrow dependency removal**
   - Incrementally migrate/replace Arrow-dependent internals (`ArrowFileReader`, format-specific readers) into `integrations/local` or other helper crates.
   - Adjust file-format APIs as needed; expect this to be multi-release work.

4. **Phase 4 – Dependency cleanup**
   - Ensure catalog and integration crates depend only on `iceberg` plus the specific runtime/FileIO/helper crates they need.
   - Verify build/test pipelines against the new dependency graph.

5. **Phase 5 – Docs & release**
   - Publish migration guides: where `to_arrow()` moved, how to assemble local/DataFusion stacks.
   - Schedule deprecation windows for remaining Arrow helpers; target a breaking release once Arrow is fully removed from `iceberg`.
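
One way a deprecation window could be expressed in code is with Rust's built-in `#[deprecated]` attribute. The sketch below is an assumption about mechanics, not something the RFC prescribes, and `batch_sizes` is a hypothetical helper invented for the example. Since companion crates depend on `iceberg` and not the other way round, the core copy cannot forward to the new home; instead the new path forwards to the deprecated core copy until the code physically moves.

```rust
/// Stand-in for `iceberg` during the deprecation window.
mod iceberg {
    /// Interim home of a hypothetical helper: it still works, but every
    /// direct call site gets a compiler warning pointing at the new location.
    #[deprecated(note = "use the helper hosted in `integrations/local` instead")]
    pub fn batch_sizes(total_rows: usize, batch_size: usize) -> Vec<usize> {
        assert!(batch_size > 0, "batch_size must be non-zero");
        let mut sizes = vec![batch_size; total_rows / batch_size];
        if total_rows % batch_size != 0 {
            sizes.push(total_rows % batch_size);
        }
        sizes
    }
}

/// Stand-in for the companion crate that will own the helper.
mod integrations_local {
    /// New path. For now it delegates to the core copy; once the window
    /// closes, the body moves here and the core function is deleted.
    #[allow(deprecated)]
    pub fn batch_sizes(total_rows: usize, batch_size: usize) -> Vec<usize> {
        super::iceberg::batch_sizes(total_rows, batch_size)
    }
}

fn main() {
    // Callers that have migrated see no warning.
    println!("{:?}", integrations_local::batch_sizes(10, 4));
}
```

Users who switch to the new path early keep compiling unchanged when the core copy is finally removed in the breaking release.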

## Compatibility

- Short term: users of `Table::scan().to_arrow()` must switch to `integrations/local` (or another crate that rehosts that API). Other Arrow types stay temporarily but will migrate in later phases.
- Long term: `iceberg` will be Arrow-free; companion crates provide Arrow-based helpers.
- Tests/examples move alongside the implementations they exercise.
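
The RFC does not say how a companion crate would rehost `to_arrow()`. One common Rust idiom for this is an extension trait, sketched below as an assumption. Everything here is simplified and hypothetical: the `ToArrowExt` trait name, the `values()` accessor, the synchronous signature with a `batch_size` parameter, and the `RecordBatch` placeholder, which is a plain struct rather than the Arrow type.

```rust
/// Stand-in for an Arrow-free `iceberg` core.
mod iceberg {
    pub struct Table {
        values: Vec<i64>,
    }

    pub struct TableScan<'a> {
        table: &'a Table,
    }

    impl Table {
        pub fn new(values: Vec<i64>) -> Self {
            Table { values }
        }

        pub fn scan(&self) -> TableScan<'_> {
            TableScan { table: self }
        }
    }

    impl TableScan<'_> {
        /// Hypothetical engine-neutral accessor that stays in core.
        pub fn values(&self) -> &[i64] {
            &self.table.values
        }
    }
}

/// Stand-in for `integrations/local`, which rehosts the Arrow entry point.
mod integrations_local {
    use super::iceberg::TableScan;

    /// Placeholder for an Arrow record batch (not the real Arrow type).
    #[derive(Debug, PartialEq)]
    pub struct RecordBatch {
        pub column: Vec<i64>,
    }

    /// Extension trait: adds `to_arrow` to the core scan type from outside.
    pub trait ToArrowExt {
        /// `batch_size` must be non-zero.
        fn to_arrow(&self, batch_size: usize) -> Vec<RecordBatch>;
    }

    impl ToArrowExt for TableScan<'_> {
        fn to_arrow(&self, batch_size: usize) -> Vec<RecordBatch> {
            self.values()
                .chunks(batch_size)
                .map(|chunk| RecordBatch { column: chunk.to_vec() })
                .collect()
        }
    }
}

fn main() {
    // The call site keeps its shape; only this import is new.
    use integrations_local::ToArrowExt;

    let table = iceberg::Table::new(vec![1, 2, 3, 4, 5]);
    let batches = table.scan().to_arrow(2);
    println!("{} batches", batches.len());
}
```

With this approach the migration for existing callers is mostly an added dependency and a `use` line, which fits the goal of downstream crates mainly adjusting dependencies.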

## Risks and Mitigations

| Risk | Description | Mitigation |
| ---- | ----------- | ---------- |
| Arrow dependency unwinding is complex | File-format readers may rely on Arrow types | Phase the work; move `to_arrow()` first, then refactor readers; document interim state |
| Discoverability | Users may not know where Arrow helpers went | Clear docs pointing to `integrations/local` and `iceberg-datafusion`; migration guide |
| Trait churn | Future trait RFCs may break early adopters | Use deprecation shims and communicate timelines |
| Duplicate impls | Multiple helper crates could overlap | Provide recommended combinations and feature guidance |

## Open Questions

1. Versioning: align companion crate versions with `iceberg`, or allow independent versions plus a compatibility matrix?
2. Deprecation schedule: how long do we keep interim Arrow helpers before full removal from `iceberg`?

## Conclusion

We will keep `iceberg` as the protocol crate while modularizing concrete implementations. Arrow removal will be phased: first relocating `to_arrow()` to `integrations/local`, then gradually moving Arrow-dependent readers and helpers. This keeps the core lean, lets users compose their preferred runtime/FileIO stacks, and still offers ready-to-use combinations via companion crates.
