Commit 8d4851f
feat(reader): Add PartitionSpec support to FileScanTask and RecordBatchTransformer (#1821)
## Which issue does this PR close?
Partially address #1749.
## What changes are included in this PR?
This PR adds partition spec handling to `FileScanTask` and
`RecordBatchTransformer` to correctly implement the Iceberg spec's
"Column Projection" rules for fields "not present" in data files.
### Problem Statement
Prior to this PR, `iceberg-rust`'s `FileScanTask` had no mechanism to
pass partition information to `RecordBatchTransformer`, causing two
issues:
1. **Incorrect handling of bucket partitioning**: Couldn't distinguish
identity transforms (which should use partition metadata constants) from
non-identity transforms like bucket/truncate/year/month (which must read
from data file). For example, `bucket(4, id)` stores
`id_bucket = 2` (bucket number) in partition metadata, but actual `id`
values (100, 200, 300) are only in the data file. iceberg-rust was
incorrectly treating bucket-partitioned source columns as constants,
breaking runtime filtering and returning incorrect query results.
2. **Field ID conflicts in add_files scenarios**: When importing Hive
tables via `add_files`, partition columns could have field IDs
conflicting with Parquet data columns. Example: Parquet has
field_id=1→"name", but Iceberg expects field_id=1→"id" (partition). Per
spec, the
correct field is "not present" and requires name mapping fallback.
### Iceberg Specification Requirements
Per the Iceberg spec
(https://iceberg.apache.org/spec/#column-projection), when a field ID is
"not present" in a data file, it must be resolved using these rules:
1. Return the value from partition metadata if an **Identity Transform**
exists
2. Use `schema.name-mapping.default` metadata to map field id to columns
without field id
3. Return the default value if it has a defined `initial-default`
4. Return null in all other cases
**Why this matters:**
- **Identity transforms** (e.g., `identity(dept)`) store actual column
values in partition metadata that can be used as constants without
reading the data file
- **Non-identity transforms** (e.g., `bucket(4, id)`, `day(timestamp)`)
store transformed values in partition metadata (e.g., bucket number 2,
not the actual `id` values 100, 200, 300) and must read source columns
from the data file
### Changes Made
1. **Added partition fields to `FileScanTask`** (`scan/task.rs`):
- `partition: Option<Struct>` - Partition data from manifest entry
- `partition_spec: Option<Arc<PartitionSpec>>` - For transform-aware
constant detection
- `name_mapping: Option<Arc<NameMapping>>` - Name mapping from table
metadata
2. **Implemented `constants_map()` function**
(`arrow/record_batch_transformer.rs`):
- Replicates Java's `PartitionUtil.constantsMap()` behavior
- Only includes fields where transform is `Transform::Identity`
- Used to determine which fields use partition metadata constants vs.
reading from data files
3. **Enhanced `RecordBatchTransformer`**
(`arrow/record_batch_transformer.rs`):
- Added `build_with_partition_data()` method to accept partition spec,
partition data, and name mapping
- Implements all 4 spec rules for column resolution with
identity-transform awareness
- Detects field ID conflicts by verifying both field ID AND name match
- Falls back to name mapping when field IDs are missing/conflicting
(spec rule #2)
4. **Updated `ArrowReader`** (`arrow/reader.rs`):
- Uses `build_with_partition_data()` when partition information is
available
- Falls back to `build()` when not available
5. **Updated manifest entry processing** (`scan/context.rs`):
- Populates partition fields in `FileScanTask` from manifest entry data
### Tests Added
1. **`bucket_partitioning_reads_source_column_from_file`** - Verifies
that bucket-partitioned source columns are read from data files (not
treated as constants from partition metadata)
2. **`identity_partition_uses_constant_from_metadata`** - Verifies that
identity-transformed fields correctly use partition metadata constants
3. **`test_bucket_partitioning_with_renamed_source_column`** - Verifies
field-ID-based mapping works despite column rename
4. **`add_files_partition_columns_without_field_ids`** - Verifies name
mapping resolution for Hive table imports without field IDs (spec rule
#2)
5. **`add_files_with_true_field_id_conflict`** - Verifies correct field
ID conflict detection with name mapping fallback (spec rule #2)
6. **`test_all_four_spec_rules`** - Integration test verifying all 4
spec rules work together
## Are these changes tested?
Yes, there are 6 new unit tests covering all 4 Iceberg spec rules. This
also resolved approximately 50 Iceberg Java tests when running with
DataFusion Comet's experimental
apache/datafusion-comet#2528 PR.
---------
Co-authored-by: Renjie Liu <liurenjie2008@gmail.com>1 parent d1927db commit 8d4851f
File tree
8 files changed
+1222
-91
lines changed- crates/iceberg/src
- arrow
- scan
8 files changed
+1222
-91
lines changed| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
881 | 881 | | |
882 | 882 | | |
883 | 883 | | |
| 884 | + | |
| 885 | + | |
| 886 | + | |
884 | 887 | | |
885 | 888 | | |
886 | 889 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
20 | 20 | | |
21 | 21 | | |
22 | 22 | | |
23 | | - | |
| 23 | + | |
24 | 24 | | |
25 | 25 | | |
26 | 26 | | |
| |||
82 | 82 | | |
83 | 83 | | |
84 | 84 | | |
85 | | - | |
| 85 | + | |
86 | 86 | | |
87 | 87 | | |
88 | 88 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
341 | 341 | | |
342 | 342 | | |
343 | 343 | | |
| 344 | + | |
| 345 | + | |
| 346 | + | |
344 | 347 | | |
345 | 348 | | |
346 | 349 | | |
| |||
352 | 355 | | |
353 | 356 | | |
354 | 357 | | |
| 358 | + | |
| 359 | + | |
| 360 | + | |
355 | 361 | | |
356 | 362 | | |
357 | 363 | | |
| |||
0 commit comments