ShivramSriramulu

Summary

This PR enhances MirrorMaker 2 (MM2) with fault-tolerance capabilities to address critical data loss scenarios in cross-cluster replication setups.

Problem Statement

Vanilla MM2 has two critical gaps:

  1. Silent Data Loss: Retention policies may purge messages before replication completes, creating undetectable gaps
  2. Service Disruption: Topic delete/recreate operations can cause replication failures or stalls

Solution

Added fault-tolerance enhancements to MirrorSourceTask:

Fail-Fast Truncation Detection

  • Catches OffsetOutOfRangeException during consumer polling
  • Logs detailed diagnostics with partition assignments and earliest offsets
  • Throws ConnectException so the task fails fast and operators are alerted immediately
  • Configurable via mirrorsource.fail.on.truncation=true (default)
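
The truncation check above can be sketched roughly as follows. This is a minimal illustration and not the code in this PR. It uses plain JDK types so it runs on its own. The class `TruncationCheck`, its methods, and the `TruncationException` stand-in for `ConnectException` are hypothetical names, and partitions are represented as simple strings rather than Kafka `TopicPartition` objects.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for the check a task could run once the consumer reports
// an out-of-range offset. Plain JDK types replace the Kafka client classes.
final class TruncationCheck {

    /** Stand-in for ConnectException: an unchecked error that fails the task. */
    static final class TruncationException extends RuntimeException {
        TruncationException(String message) {
            super(message);
        }
    }

    /**
     * Returns, per partition, how many records were purged before replication
     * reached them. A partition counts as truncated when the next offset to read
     * is lower than the earliest offset the source log still holds.
     */
    static Map<String, Long> findGaps(Map<String, Long> nextOffsets,
                                      Map<String, Long> earliestOffsets) {
        Map<String, Long> gaps = new LinkedHashMap<>();
        for (Map.Entry<String, Long> entry : nextOffsets.entrySet()) {
            Long earliest = earliestOffsets.get(entry.getKey());
            if (earliest != null && entry.getValue() < earliest) {
                gaps.put(entry.getKey(), earliest - entry.getValue());
            }
        }
        return gaps;
    }

    /** Throws when any partition is truncated and the fail-fast toggle is on. */
    static void check(Map<String, Long> nextOffsets,
                      Map<String, Long> earliestOffsets,
                      boolean failOnTruncation) {
        Map<String, Long> gaps = findGaps(nextOffsets, earliestOffsets);
        if (failOnTruncation && !gaps.isEmpty()) {
            throw new TruncationException(
                "Source log truncated before replication; missing records per partition: " + gaps);
        }
    }
}
```

With the toggle off, `check` returns normally and the caller decides how to proceed. The sketch does not model what the task does in that case.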

Graceful Topic Reset Handling

  • Uses AdminClient to track topic IDs and detect delete/recreate events
  • Automatically seeks to the beginning offset of reset topics
  • Handles UnknownTopicOrPartitionException with retry logic
  • Configurable via mirrorsource.auto.recover.on.reset=true (default)
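
The topic ID comparison behind reset detection could look something like the sketch below. It is an assumption-laden illustration, not the PR's implementation. `TopicResetTracker` and `observe` are hypothetical names, and `java.util.UUID` stands in for the topic ID that the task would obtain through AdminClient.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

// Hypothetical tracker: remembers the last topic ID seen for each topic name.
// A changed ID for the same name means the topic was deleted and recreated.
final class TopicResetTracker {
    private final Map<String, UUID> knownIds = new HashMap<>();

    /**
     * Records the current ID for the topic and reports whether it differs from
     * the ID recorded earlier. The first observation of a topic is never a reset.
     */
    boolean observe(String topic, UUID currentId) {
        UUID previous = knownIds.put(topic, currentId);
        return previous != null && !previous.equals(currentId);
    }
}
```

When `observe` returns true, the task would seek the affected partitions to the beginning, provided mirrorsource.auto.recover.on.reset is enabled.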

Technical Details

  • File Modified: connect/mirror/src/main/java/org/apache/kafka/connect/mirror/MirrorSourceTask.java
  • Lines Added: ~75 LOC (well under the 500 LOC requirement)
  • Backward Compatibility: Maintained - all changes are additive
  • Configuration: New properties with sensible defaults
  • Logging: Uses dedicated logger mm2.fault.tolerance for easy filtering

Testing

Impact

  • RPO Improvement: Makes data loss immediately visible instead of silent
  • RTO Improvement: Reduces manual intervention during maintenance
  • Operational: Clear error messages for troubleshooting
  • Production Ready: Minimal performance impact, configurable behavior

- Add fail-fast truncation detection with detailed error logging
- Add graceful topic reset handling with auto-recovery
- Add configuration toggles for fault tolerance features
- Add AdminClient-based topic ID tracking for reset detection
- Add seekToBeginning for topic reset recovery
- Maintain backward compatibility with existing MM2 behavior

Features:
- mirrorsource.fail.on.truncation=true (default)
- mirrorsource.auto.recover.on.reset=true (default)
- mirrorsource.topic.reset.retry.ms=5000 (default)
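
As a rough sketch of how these three properties and their defaults could be read, the snippet below uses `java.util.Properties`. It is for illustration only. MM2 defines its configuration through Kafka's own config classes, and the class `FaultToleranceSettings` and its field names are hypothetical. Only the property keys and default values come from the list above.

```java
import java.util.Properties;

// Hypothetical holder for the three fault-tolerance settings listed above.
final class FaultToleranceSettings {
    static final String FAIL_ON_TRUNCATION = "mirrorsource.fail.on.truncation";
    static final String AUTO_RECOVER_ON_RESET = "mirrorsource.auto.recover.on.reset";
    static final String TOPIC_RESET_RETRY_MS = "mirrorsource.topic.reset.retry.ms";

    final boolean failOnTruncation;
    final boolean autoRecoverOnReset;
    final long topicResetRetryMs;

    FaultToleranceSettings(Properties props) {
        // Each lookup falls back to the default stated in the list above.
        failOnTruncation = Boolean.parseBoolean(props.getProperty(FAIL_ON_TRUNCATION, "true"));
        autoRecoverOnReset = Boolean.parseBoolean(props.getProperty(AUTO_RECOVER_ON_RESET, "true"));
        topicResetRetryMs = Long.parseLong(props.getProperty(TOPIC_RESET_RETRY_MS, "5000"));
    }
}
```
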

This addresses silent data loss scenarios and improves resilience
during planned maintenance operations involving topic resets.

github-actions bot added the triage, connect, and mirror-maker-2 labels on Sep 9, 2025

A label of 'needs-attention' was automatically added to this PR in order to raise the
attention of the committers. Once this issue has been triaged, the triage label
should be removed to prevent this automation from happening again.