[SPARK-54305][SQL][PYTHON] Add admission control support to Python DataSource streaming API #53085
Open
jiteshsoni wants to merge 7 commits into apache:master from jiteshsoni:test-build-branch
+1,003 −166
Conversation
Force-pushed from 53f0fb1 to 526db14
[SPARK-54305][SQL][PYTHON] Add admission control support to Python DataSource streaming API

This PR adds admission control support to Python streaming data sources by:
1. Enhanced Python API: Modified DataSourceStreamReader.latestOffset() to accept optional start and limit parameters
2. Scala Bridge: Updated PythonMicroBatchStream to validate admission control options and throw IllegalArgumentException for invalid values
3. Python-JVM Communication: Added function IDs and handlers in PythonStreamingSourceRunner for admission control parameters
4. Python Worker: Implemented latest_offset_with_report_func() with backward compatibility fallback
5. Documentation: Added comprehensive guide in python_data_source.rst
6. Tests: Added validation tests for invalid admission control parameters
7. Example: Created structured_blockchain_admission_control.py demonstrating the full API

Key Benefits:
- Predictable performance with controlled batch sizes
- Rate limiting and backpressure support
- Feature parity with Scala DataSource capabilities
- Full backward compatibility (all parameters optional)
Force-pushed from 526db14 to a355818
…ibility

Change latestOffset() from throwing UnsupportedOperationException to bridging to latestOffset(Offset, ReadLimit) for backward compatibility. This fixes test failures in PythonStreamingDataSourceSuite, where the old latestOffset() signature is still called by Spark's streaming execution engine.
Fix null handling in latestOffset bridge method
- Handle null start offset in PythonStreamingSourceRunner by converting to empty string
- Update Python side to convert empty string back to None
- Fix error action name to be 'latestOffset' instead of 'latestOffsetWithReport'

All PythonStreamingDataSourceSuite tests now pass (8/8).
Add noqa: E501 comments to lines where Black and flake8 conflict. Black considers these lines correctly formatted, but they exceed 79 chars.
Force-pushed from 518b498 to ffb28fd
This commit addresses several issues flagged by the CI pipeline:
- Fixes Flake8 E501 (line too long) errors in structured_blockchain_admission_control.py by refactoring docstrings and long lines.
- Applies Black formatting to datasource_internal.py and python_streaming_source_runner.py to resolve pre-existing formatting inconsistencies.
Force-pushed from ffb28fd to 533ddb0
What changes were proposed in this pull request?
This PR adds admission control support to Python streaming data sources, enabling users to control microbatch sizes through the maxRecordsPerBatch, maxFilesPerBatch, and maxBytesPerBatch options (see the usage sketch below).

Changes:
- Enhanced DataSourceStreamReader.latestOffset() to accept optional start and limit parameters
- Updated PythonMicroBatchStream to validate admission control options (throws IllegalArgumentException for invalid values)
- Added function IDs and handlers in python_streaming_source_runner.py
- Added documentation and an example (structured_blockchain_admission_control.py)
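A minimal usage sketch. The format name "my_pulse_source" and the option values are illustrative, not from the PR; only the three option names come from this change:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("admission-control-demo").getOrCreate()

# "my_pulse_source" is a hypothetical Python data source registered
# beforehand via spark.dataSource.register(); the three options below
# are the ones this PR introduces.
stream = (
    spark.readStream.format("my_pulse_source")
    .option("maxRecordsPerBatch", "1000")   # at most 1000 records per microbatch
    .option("maxFilesPerBatch", "10")       # at most 10 files per microbatch
    .option("maxBytesPerBatch", "1048576")  # at most 1 MiB per microbatch
    .load()
)

query = stream.writeStream.format("console").start()
```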
Why are the changes needed?
Python streaming data sources previously could not control microbatch sizes because latestOffset() had no parameters to receive the configured limits. This forced Python sources either to process all available data (unpredictable resource usage) or to artificially limit offsets (risking data loss). Scala sources already have this capability via SupportsAdmissionControl.
Detailed Verification

✅ 1. Enhanced DataSourceStreamReader.latestOffset() API

Claim: Enhanced to accept optional start and limit parameters.
Verification: python/pyspark/sql/datasource.py (lines 717-761)
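A sketch of the enhanced reader interface, assuming offsets are plain dicts as in the existing Python DataSource API; the exact type of limit is simplified here:

```python
from typing import Optional, Tuple

from pyspark.sql.datasource import DataSourceStreamReader


class MyStreamReader(DataSourceStreamReader):
    def initialOffset(self) -> dict:
        return {"position": 0}

    # Before this PR the method took no arguments. After this PR, `start`
    # and `limit` are optional, so readers written against the old
    # zero-argument signature keep working unchanged.
    def latestOffset(
        self, start: Optional[dict] = None, limit: Optional[dict] = None
    ) -> Tuple[dict, dict]:
        # Returns (capped_offset, true_latest_offset); see item 6 below
        # for a worked example of the capping logic.
        ...
```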
✅ 2. Admission Control Validation in PythonMicroBatchStream

Claim: Validates admission control options and throws IllegalArgumentException for invalid values.
Verification: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/python/PythonMicroBatchStream.scala (lines 92-125)
- maxRecordsPerBatch (Long, > 0)
- maxFilesPerBatch (Int, > 0)
- maxBytesPerBatch (Long, > 0)
- Throws IllegalArgumentException for invalid values
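From the user's side the validation looks like this, reusing the spark session and the hypothetical source name from the earlier sketch (the exact point where the error surfaces is an assumption):

```python
# maxRecordsPerBatch must be a positive long; "0" fails validation in
# PythonMicroBatchStream and surfaces as an IllegalArgumentException.
bad_stream = (
    spark.readStream.format("my_pulse_source")
    .option("maxRecordsPerBatch", "0")  # invalid: must be > 0
    .load()
)
# bad_stream.writeStream.format("console").start()  # raises IllegalArgumentException
```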
✅ 3. Python-JVM Communication Handlers

Claim: Added Python-JVM communication handlers for admission control parameters.
Verification: python/pyspark/sql/streaming/python_streaming_source_runner.py
- New function ID: LATEST_OFFSET_WITH_REPORT_FUNC_ID = 890
- New handler: latest_offset_with_report_func() (lines 108-164)
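A simplified sketch of the handler pattern. The constant and handler name mirror the PR description, but the wire format shown here is an assumption; see python_streaming_source_runner.py for the real protocol:

```python
import json

from pyspark.serializers import UTF8Deserializer, write_with_length

utf8_deserializer = UTF8Deserializer()

LATEST_OFFSET_WITH_REPORT_FUNC_ID = 890  # function ID added by this PR


def latest_offset_with_report_func(reader, infile, outfile):
    # Read the start offset and read limit sent by the JVM; an empty
    # string stands in for null (see the null-handling commit above).
    start_json = utf8_deserializer.loads(infile)
    start = json.loads(start_json) if start_json else None
    limit_json = utf8_deserializer.loads(infile)
    limit = json.loads(limit_json) if limit_json else None

    capped, true_latest = reader.latestOffset(start, limit)

    # Send both offsets back as length-prefixed UTF-8 JSON.
    write_with_length(json.dumps(capped).encode("utf-8"), outfile)
    write_with_length(json.dumps(true_latest).encode("utf-8"), outfile)
```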
✅ 4. Backward Compatibility Fallback Logic

Claim: Implemented backward compatibility fallback logic.
Verification: python/pyspark/sql/streaming/python_streaming_source_runner.py (lines 140-150)
- Falls back to the old zero-argument latestOffset() on TypeError from old implementations
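The fallback itself is small; a sketch of the idea (function name is illustrative, the real code lives in the runner around lines 140-150):

```python
def call_latest_offset(reader, start, limit):
    # Try the new two-argument signature first. A reader written before
    # this PR only defines latestOffset(self), which raises TypeError
    # when called with arguments, so we fall back to the old form.
    try:
        return reader.latestOffset(start, limit)
    except TypeError:
        return reader.latestOffset()
```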
✅ 5. Comprehensive Documentation

Claim: Added comprehensive documentation and examples.
Verification: python/docs/source/tutorial/sql/python_data_source.rst
- Documents the maxRecordsPerBatch, maxFilesPerBatch, and maxBytesPerBatch options
- Covers the enhanced DataSourceStreamReader API and the IllegalArgumentException validation behavior
✅ 6. Example Implementation

Claim: Added comprehensive example (structured_blockchain_admission_control.py).
Verification: examples/src/main/python/sql/streaming/structured_blockchain_admission_control.py
- Complete DataSourceStreamReader implementation
- Implements latestOffset(start, limit) with admission control
- Returns (capped_offset, true_latest_offset)
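A condensed sketch in the spirit of the example, not the actual example code; the helper name and the shape of limit are assumptions, and partitions()/read() are omitted for brevity:

```python
from typing import Optional, Tuple

from pyspark.sql.datasource import DataSourceStreamReader


class BlockchainStreamReader(DataSourceStreamReader):
    def initialOffset(self) -> dict:
        return {"block": 0}

    def latestOffset(
        self, start: Optional[dict] = None, limit: Optional[dict] = None
    ) -> Tuple[dict, dict]:
        true_latest = self._chain_height()  # hypothetical RPC helper
        start_block = (start or {"block": 0})["block"]
        max_records = (limit or {}).get("maxRecords")  # assumed limit shape

        # Cap the batch at max_records blocks past the start offset, but
        # always report the true latest offset for progress/backpressure.
        if max_records is not None:
            capped = min(true_latest, start_block + int(max_records))
        else:
            capped = true_latest
        return {"block": capped}, {"block": true_latest}

    def _chain_height(self) -> int:
        return 100  # stand-in for a real chain-height query
```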
✅ 7. Test Coverage

Claim: Added Scala test cases validating IllegalArgumentException for invalid admission control values.
Verification: sql/core/src/test/scala/org/apache/spark/sql/execution/python/streaming/PythonStreamingDataSourceSuite.scala
- admission control: maxRecordsPerBatch with SimpleDataSourceStreamReader
- admission control: maxFilesPerBatch option
- admission control: maxBytesPerBatch option
- admission control: invalid maxRecordsPerBatch throws exception
- admission control: negative maxRecordsPerBatch throws exception
- admission control: zero maxRecordsPerBatch throws exception
- admission control: decimal maxRecordsPerBatch throws exception
- admission control: invalid maxFilesPerBatch throws exception
- admission control: negative maxFilesPerBatch throws exception
- admission control: invalid maxBytesPerBatch throws exception
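The suite itself is Scala, but the shape of the validation tests translates roughly like this (a Python rendering for illustration only; the exception type and the point where validation fires are assumptions, and "my_pulse_source" is the hypothetical source from the sketches above):

```python
import unittest

from pyspark.errors import IllegalArgumentException
from pyspark.sql import SparkSession


class AdmissionControlValidationTest(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        cls.spark = SparkSession.builder.master("local[1]").getOrCreate()

    def test_zero_max_records_per_batch_is_rejected(self):
        with self.assertRaises(IllegalArgumentException):
            (
                self.spark.readStream.format("my_pulse_source")
                .option("maxRecordsPerBatch", "0")
                .load()
                .writeStream.format("noop")
                .start()
            )
```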
Does this PR introduce any user-facing change?
Yes. Users can now implement admission control in custom Python streaming sources:
- Implement latestOffset(start, limit) for fine-grained control
How was this patch tested?
- Test cases validating IllegalArgumentException for invalid admission control values
- Example structured_blockchain_admission_control.py demonstrates the feature

Was this patch authored or co-authored using generative AI tooling?
No.
Closes #SPARK-54305