Tiering support commit by time #2185
Conversation
Force-pushed from 38395e1 to 46c293e.
Pull request overview
This PR implements tiering timeout support to prevent long-running table tiering operations from blocking other tables. The feature introduces a maximum duration for tiering a single table, after which it will be force-completed or skipped.
Key Changes
- Added a forceIgnore flag to TieringSplit and its subclasses to mark splits that should be skipped due to timeout (a sketch of this flag follows the list)
- Implemented periodic timeout checking in TieringSourceEnumerator with a configurable max duration and detection interval
- Introduced TieringReachMaxDurationEvent to notify readers when a table reaches its max tiering duration
- Updated split handling in TieringSplitReader to force-complete in-progress log splits and skip new splits when a timeout occurs
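For orientation, here is a minimal sketch of how such a skip flag can be carried on a split class. The real field, constructors, and serialization live in TieringSplit.java and its subclasses; everything here beyond the forceIgnore() method name is an assumption:

```java
// Sketch only: the actual TieringSplit carries more state (table path, bucket,
// partition name); isForceIgnore() is an assumed accessor name.
public abstract class TieringSplit {

    private boolean forceIgnore;

    /** Marks this split to be skipped by the reader because its table timed out. */
    public void forceIgnore() {
        this.forceIgnore = true;
    }

    /** Whether the reader should ignore this split instead of tiering it. */
    public boolean isForceIgnore() {
        return forceIgnore;
    }
}
```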
Reviewed changes
Copilot reviewed 16 out of 16 changed files in this pull request and generated 18 comments.
| File | Description |
|---|---|
| TieringSplit.java | Added forceIgnore field and methods to mark splits for skipping |
| TieringSnapshotSplit.java | Added constructors supporting forceIgnore parameter |
| TieringLogSplit.java | Added constructors supporting forceIgnore parameter |
| TieringSplitSerializer.java | Updated serialization to include forceIgnore field |
| TieringSourceEnumerator.java | Implemented periodic timeout checking, deadline tracking, and timeout event broadcasting |
| TieringSplitReader.java | Added logic to force-complete log splits and skip splits marked with forceIgnore |
| TieringSourceReader.java | Integrated timeout event handling and custom fetcher manager |
| TieringSourceFetcherManager.java | New class to manage timeout notifications to split readers |
| TieringReachMaxDurationEvent.java | New event class to signal table timeout to readers |
| TieringSourceOptions.java | Added configuration options for max duration and detection interval |
| TieringSource.java | Updated builders to support new timeout configuration parameters |
| LakeTieringJobBuilder.java | Wired up timeout configuration from Fluss config |
| TieringSplitGenerator.java | Removed hardcoded numberOfSplits parameter, changed log levels to DEBUG |
| TieringSplitSerializerTest.java | Added tests for forceIgnore serialization and updated string representations |
| TieringSourceEnumeratorTest.java | Added timeout test, refactored assertions to use containsExactlyInAnyOrderElementsOf, added helper methods |
| TieringSourceReaderTest.java | New test file for testing timeout event handling at reader level |
Comments suppressed due to low confidence (2)
fluss-flink/fluss-flink-common/src/main/java/org/apache/fluss/flink/tiering/source/split/TieringSplitGenerator.java:339
- The log level was changed from INFO to DEBUG for this message about skipping splits when offset conditions are met. This is appropriate as it reduces log verbosity for expected behavior. However, ensure this doesn't make it harder to diagnose why certain tables aren't being tiered, as this condition might occur frequently during normal operation and could be useful for troubleshooting.
            LOG.debug(
                    "The lastCommittedBucketOffset {} is equals or bigger than latestBucketOffset {}, skip generating split for bucket {}",
                    lastCommittedBucketOffset,
                    latestBucketOffset,
                    tableBucket);
            return Optional.empty();
        }
    }
}
private Optional<TieringSplit> generateSplitForLogTableBucket(
        TablePath tablePath,
        TableBucket tableBucket,
        @Nullable String partitionName,
        @Nullable Long lastCommittedBucketOffset,
        long latestBucketOffset) {
    if (latestBucketOffset <= 0) {
        LOG.debug(
                "The latestBucketOffset {} is equals or less than 0, skip generating split for bucket {}",
                latestBucketOffset,
                tableBucket);
        return Optional.empty();
    }
    // the bucket is never been tiered
    if (lastCommittedBucketOffset == null) {
        // the bucket is never been tiered, scan fluss log from the earliest offset
        return Optional.of(
                new TieringLogSplit(
                        tablePath,
                        tableBucket,
                        partitionName,
                        EARLIEST_OFFSET,
                        latestBucketOffset));
    } else {
        // the bucket has been tiered, scan remain fluss log
        if (lastCommittedBucketOffset < latestBucketOffset) {
            return Optional.of(
                    new TieringLogSplit(
                            tablePath,
                            tableBucket,
                            partitionName,
                            lastCommittedBucketOffset,
                            latestBucketOffset));
        }
    }
    LOG.debug(
            "The lastCommittedBucketOffset {} is equals or bigger than latestBucketOffset {}, skip generating split for bucket {}",
            lastCommittedBucketOffset,
            latestBucketOffset,
            tableBucket);
fluss-flink/fluss-flink-common/src/main/java/org/apache/fluss/flink/tiering/source/TieringSplitReader.java:168
- When a table reaches the max tiering duration, the timeout check only forces completion for log splits (lines 159-161 and 292). However, snapshot splits can also be in progress. If a table times out while processing snapshot splits, they won't be force-completed, leading to potential hangs or inconsistent behavior. Consider extending the timeout handling to snapshot splits as well.
// may read snapshot firstly
if (currentSnapshotSplitReader != null) {
    // for snapshot split, we don't force to complete it
    // since we rely on the log offset for the snapshot to
    // do next tiering, if force to complete, we can't get the log offset
    CloseableIterator<RecordAndPos> recordIterator = currentSnapshotSplitReader.readBatch();
    if (recordIterator == null) {
        LOG.info("Split {} is finished", currentSnapshotSplit.splitId());
        return finishCurrentSnapshotSplit();
    } else {
        return forSnapshotSplitRecords(
                currentSnapshotSplit.getTableBucket(), recordIterator);
    }
} else {
    if (currentLogScanner != null) {
        if (timeoutTables.contains(currentTableId)) {
            return forceCompleteTieringLogRecords();
        }
        ScanRecords scanRecords = currentLogScanner.poll(pollTimeout);
        // force to complete records
        return forLogRecords(scanRecords);
    } else {
        return emptyTableBucketWriteResultWithSplitIds();
    }
}
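The notification path referenced above is driven by the new TieringReachMaxDurationEvent. As a rough sketch, assuming the event follows Flink's SourceEvent convention (the actual fields and accessors in TieringReachMaxDurationEvent.java may differ):

```java
import org.apache.flink.api.connector.source.SourceEvent;

// Sketch: carries the id of the table whose tiering hit the max duration.
// The real class lives under org.apache.fluss.flink.tiering.event; field and
// accessor names here are assumptions.
public class TieringReachMaxDurationEvent implements SourceEvent {

    private final long tableId;

    public TieringReachMaxDurationEvent(long tableId) {
        this.tableId = tableId;
    }

    public long getTableId() {
        return tableId;
    }
}
```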
Resolved review thread: .../src/main/java/org/apache/fluss/flink/tiering/source/enumerator/TieringSourceEnumerator.java
for (TieringSplit tieringSplit : pendingSplits) {
    if (tieringSplit.getTableBucket().getTableId() == reachMaxDurationTable) {
        // force ignore this tiering split since the tiering for this table is timeout,
        // we have to force to set to ignore the tiering split so that the
        // tiering source reader can ignore them directly
        tieringSplit.forceIgnore();
    } else {
        // we can break directly, if found any one split's table id is not equal to the
        // timeout table, the following split must be not equal to the table id
        break;
Copilot AI, Dec 17, 2025:
The pendingSplits list is accessed both from the periodic timeout check (line 301) and the assignSplits method (line 350), which can be called from different threads via handleSplitRequest and addSplitsBack. While assignSplits synchronizes on readersAwaitingSplit, the timeout check doesn't use any synchronization, which could lead to a ConcurrentModificationException or inconsistent state when iterating and modifying pendingSplits concurrently.
Suggested change (wrap the loop in the same lock used by assignSplits):

    // Access to pendingSplits must be synchronized consistently with other paths
    // (e.g., assignSplits) to avoid concurrent modification.
    synchronized (readersAwaitingSplit) {
        for (TieringSplit tieringSplit : pendingSplits) {
            if (tieringSplit.getTableBucket().getTableId() == reachMaxDurationTable) {
                // force ignore this tiering split since the tiering for this table is
                // timeout, we have to force to set to ignore the tiering split so that the
                // tiering source reader can ignore them directly
                tieringSplit.forceIgnore();
            } else {
                // we can break directly, if found any one split's table id is not equal to
                // the timeout table, the following split must be not equal to the table id
                break;
            }
handleReachMaxTieringDurationTables, handleSplitRequest, and addSplitsBack are all executed in a single thread.
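This reply relies on Flink's enumerator threading model: callbacks scheduled through the SplitEnumeratorContext run on the single source-coordinator thread, so they cannot race with handleSplitRequest or addSplitsBack. A hedged sketch of how a periodic timeout check is typically scheduled with that context (everything except the callAsync signature is an assumed name, not the PR's actual code):

```java
// Inside a SplitEnumerator implementation; `context` is the SplitEnumeratorContext
// handed to the enumerator, and checkTieringTimeouts() is a hypothetical method.
context.callAsync(
        () -> null,                                   // no blocking work to do off-thread
        (ignored, failure) -> checkTieringTimeouts(), // handler runs on the coordinator thread
        detectIntervalMs,                             // initial delay (assumed variable)
        detectIntervalMs);                            // repeat period (assumed variable)
```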
private void assignSplits() {
    /* This method may be called from both addSplitsBack and handleSplitRequest, make it thread safe. */
    // todo: do we need to add lock?
Copilot AI, Dec 17, 2025:
The TODO comment at line 340 asks "do we need to add lock?" but this question is now more critical given that the timeout checking code (lines 301-312) accesses pendingSplits without synchronization while assignSplits() synchronizes on readersAwaitingSplit. This creates a potential race condition. The TODO should either be resolved by adding proper synchronization or be updated to reflect the new concurrency concerns.
Resolved review threads:
- ...link-common/src/test/java/org/apache/fluss/flink/tiering/source/TieringSourceReaderTest.java (outdated)
- ...uss-flink-common/src/main/java/org/apache/fluss/flink/tiering/source/TieringSplitReader.java (outdated)
- ...uss-flink-common/src/main/java/org/apache/fluss/flink/tiering/source/TieringSplitReader.java (outdated)
- ...ss-flink-common/src/main/java/org/apache/fluss/flink/tiering/source/TieringSourceReader.java
- ...common/src/main/java/org/apache/fluss/flink/tiering/source/split/TieringSplitSerializer.java
- ...uss-flink-common/src/main/java/org/apache/fluss/flink/tiering/source/TieringSplitReader.java (outdated)
- ...-common/src/main/java/org/apache/fluss/flink/tiering/event/TieringReachMaxDurationEvent.java
Force-pushed from 466629a to bbd9d38.
Pull request overview
Copilot reviewed 19 out of 19 changed files in this pull request and generated 7 comments.
Comments suppressed due to low confidence (1)
fluss-flink/fluss-flink-common/src/main/java/org/apache/fluss/flink/tiering/source/enumerator/TieringSourceEnumerator.java:350
- There's a potential race condition between reading and modifying pendingSplits. The handleReachMaxTieringDurationTables method (lines 301-313) iterates and modifies pendingSplits, while assignSplits (line 350) also modifies it. Both methods can be called concurrently from different async callbacks, but only readersAwaitingSplit is synchronized, not pendingSplits. This could lead to concurrent modification exceptions or inconsistent state.
// todo: do we need to add lock?
synchronized (readersAwaitingSplit) {
    if (!readersAwaitingSplit.isEmpty()) {
        final Integer[] readers = readersAwaitingSplit.toArray(new Integer[0]);
        for (Integer nextAwaitingReader : readers) {
            if (!context.registeredReaders().containsKey(nextAwaitingReader)) {
                readersAwaitingSplit.remove(nextAwaitingReader);
                continue;
            }
            if (!pendingSplits.isEmpty()) {
                TieringSplit tieringSplit = pendingSplits.remove(0);
 * will be force completed in the next fetch cycle.
 */
public void handleTableReachTieringDeadline(long tableId) {
    if ((currentTableId != null && currentTableId.equals(tableId)
Copilot AI, Dec 25, 2025:
The condition on line 537 omits parentheses around the OR operation. Due to operator precedence it already evaluates as (currentTableId != null && currentTableId.equals(tableId)) || pendingTieringSplits.containsKey(tableId), which appears to match the intent of checking whether the table ID is the current table or is in pending splits, but explicit parentheses would make that intent unmistakable: ((currentTableId != null && currentTableId.equals(tableId)) || pendingTieringSplits.containsKey(tableId)).
Suggested change:
-    if ((currentTableId != null && currentTableId.equals(tableId)
+    if (((currentTableId != null && currentTableId.equals(tableId))
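As a standalone illustration of the precedence rule in question (not the PR's code):

```java
public class PrecedenceDemo {
    public static void main(String[] args) {
        boolean a = true, b = false, c = true;
        // In Java, && binds tighter than ||, so both expressions parse identically.
        boolean withoutParens = a && b || c; // parsed as (a && b) || c
        boolean withParens = (a && b) || c;
        System.out.println(withoutParens == withParens); // prints: true
    }
}
```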
public long commit(
        TestPaimonCommittable committable, Map<String, String> snapshotProperties)
        throws IOException {
    // do nothing, and always return 1 as commited snapshot
Copilot AI, Dec 25, 2025:
The word "commited" should be spelled "committed" (with double 't').
Suggested change:
-    // do nothing, and always return 1 as commited snapshot
+    // do nothing, and always return 1 as committed snapshot
        Thread.sleep(writePause.toMillis());
    }
} catch (InterruptedException e) {
    Thread.currentThread().interrupt();
Copilot AI, Dec 25, 2025:
After catching InterruptedException, the thread's interrupted status is set via Thread.currentThread().interrupt() but the method continues execution normally. This could lead to unexpected behavior since the write operation completes normally despite being interrupted. Consider either propagating the interruption as an IOException or rethinking the interruption handling strategy.
Suggested change:
     Thread.currentThread().interrupt();
+    throw new IOException("Interrupted while pausing before write", e);
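Taken together, the pattern the suggestion describes looks roughly like this (a sketch; writePause comes from the quoted snippet, while the surrounding method is an assumed wrapper):

```java
import java.io.IOException;
import java.time.Duration;

class WritePauseSketch {
    /** Pauses before a write; propagates interruption instead of continuing silently. */
    static void pauseBeforeWrite(Duration writePause) throws IOException {
        try {
            Thread.sleep(writePause.toMillis());
        } catch (InterruptedException e) {
            // Restore the interrupt flag so callers can still observe it...
            Thread.currentThread().interrupt();
            // ...and abort the write instead of completing as if nothing happened.
            throw new IOException("Interrupted while pausing before write", e);
        }
    }
}
```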
| + "it will be force completed: the tiering will be finalized and committed to the data lake " | ||
| + "(e.g., Paimon) immediately, even if they haven't reached their desired stopping offsets."); |
Copilot AI, Dec 25, 2025:
The description mentions "even if they haven't reached their desired stopping offsets" but should clarify that this applies to log splits. For snapshot splits, the current implementation (as shown in TieringSplitReader lines 147-149) does not force complete them, which is intentional to preserve the log offset needed for the next tiering round. Consider adding a note in the documentation about this distinction between log and snapshot splits.
| + "it will be force completed: the tiering will be finalized and committed to the data lake " | |
| + "(e.g., Paimon) immediately, even if they haven't reached their desired stopping offsets."); | |
| + "it will be force completed: the tiering of log splits will be finalized and committed to the data lake " | |
| + "(e.g., Paimon) immediately, even if they haven't reached their desired stopping offsets. " | |
| + "Snapshot splits are not force completed in this way, so that the log offset needed for the next tiering round can be preserved."); |
if (lakeTieringConfig.get(LAKE_TIERING_TABLE_DURATION_DETECT_INTERVAL) != null) {
    tieringSourceBuilder.withTieringTableDurationDetectInterval(
            flussConfig.get(LAKE_TIERING_TABLE_DURATION_DETECT_INTERVAL).toMillis());
Copilot AI, Dec 25, 2025:
The code is reading configuration from lakeTieringConfig but using it from flussConfig when passing it to the builder. The LAKE_TIERING_TABLE_DURATION_DETECT_INTERVAL should be retrieved from lakeTieringConfig (consistent with the check on line 100), not from flussConfig.
Suggested change:
-            flussConfig.get(LAKE_TIERING_TABLE_DURATION_DETECT_INTERVAL).toMillis());
+            lakeTieringConfig
+                    .get(LAKE_TIERING_TABLE_DURATION_DETECT_INTERVAL)
+                    .toMillis());
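In other words, the null check and the read should target the same Configuration object. A minimal sketch of the corrected pattern, reading the value once (variable and option names are taken from the quoted snippet; the Duration type is assumed from the .toMillis() call):

```java
// Read the option once from the Configuration that was actually checked.
java.time.Duration detectInterval =
        lakeTieringConfig.get(LAKE_TIERING_TABLE_DURATION_DETECT_INTERVAL);
if (detectInterval != null) {
    tieringSourceBuilder.withTieringTableDurationDetectInterval(detectInterval.toMillis());
}
```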
LOG.info(
        "Split {} is forced to be finished due to tiering timeout.",
        split.splitId());
currentTieringSplitsIterator.remove();
Copilot AI, Dec 25, 2025:
When forcing completion due to timeout, the code only handles TieringLogSplit splits (line 296) but not TieringSnapshotSplit splits. If a table has pending snapshot splits when the deadline is reached, they will remain in currentTableSplitsByBucket and never be completed, potentially causing the tiering operation to hang or fail to finish the table properly.
Suggested change:
     currentTieringSplitsIterator.remove();
+} else if (split instanceof TieringSnapshotSplit) {
+    // Snapshot splits do not rely on log offsets here, but must be marked finished
+    // and removed so they do not block table completion when a timeout occurs.
+    finishedSplitIds.put(bucket, split.splitId());
+    LOG.info(
+            "Snapshot split {} is forced to be finished due to tiering timeout.",
+            split.splitId());
+    currentTieringSplitsIterator.remove();
Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.
Force-pushed from bbd9d38 to 9ae01ad.
Purpose
Linked issue: close #1893