Conversation
📝 Walkthrough

Adds an adaptive batching system for metadata uploads: new config keys for min/initial/max batch sizes and target batch time, a new `upload_batching` module with growth/backoff heuristics, and updated upload loop and tests to use dynamic batch sizing with retry/backoff behavior.
Sequence Diagram

```mermaid
sequenceDiagram
    participant Uploader as Upload Loop
    participant API as Upload API
    participant Evaluator as Outcome Evaluator
    participant Calculator as Batch Size Calculator
    Uploader->>Calculator: request current_batch_size
    Calculator-->>Uploader: return current_batch_size
    Uploader->>API: upload_metadata(batch)
    API-->>Uploader: success or error
    alt success
        Uploader->>Evaluator: report elapsed_time
        alt elapsed_time < target_batch_time
            Evaluator-->>Calculator: mark_good_batch
            Calculator-->>Uploader: increase_batch_size
        else elapsed_time >= target_batch_time
            Evaluator-->>Calculator: mark_bad_batch
            Calculator-->>Uploader: decrease_batch_size
        end
        Uploader->>Uploader: continue with next batch
    else error
        Uploader->>Evaluator: classify_error
        alt error retryable and batch_size > min
            Evaluator-->>Calculator: request_smaller_batch
            Calculator-->>Uploader: provide_smaller_batch_size
            Uploader->>API: retry_with_smaller_batch (with backoff)
        else
            Uploader-->>Uploader: abort and surface error
        end
    end
```
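The calculator role in the diagram can be sketched as a small stateful class. This is a hypothetical illustration of the grow-on-fast-success / shrink-on-slow-or-failed-batch heuristic (the class name, defaults, and double/halve factors are assumptions, not the actual `upload_batching` implementation):

```python
class AdaptiveBatchSizer:
    """Grow the batch size after fast successes, shrink it after slow or failed batches."""

    def __init__(self, min_size=150, initial_size=150, max_size=15000, target_time=5.0):
        self.min_size = min_size
        self.max_size = max_size
        self.target_time = target_time
        self.size = initial_size

    def mark_good_batch(self):
        # Fast success: double, bounded by the configured maximum.
        self.size = min(self.max_size, self.size * 2)

    def mark_bad_batch(self):
        # Slow success or retryable failure: halve, bounded by the minimum.
        self.size = max(self.min_size, self.size // 2)

    def record(self, elapsed):
        # Mirrors the alt branches in the sequence diagram above.
        if elapsed < self.target_time:
            self.mark_good_batch()
        else:
            self.mark_bad_batch()
```

With the assumed defaults, a 1-second batch doubles the size (150 → 300), while a 10-second batch drops it back to the 150 floor.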
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ Passed checks (2 passed)
Actionable comments posted: 1
🧹 Nitpick comments (1)
tests/data_engine/test_datasource.py (1)
150-188: Add one test for the slow-success downshift path. These tests validate grow-on-success and shrink-on-exception, but they do not cover the `elapsed > target_batch_time` branch in `dagshub/data_engine/model/datasource.py` (lines 850-855). A dedicated test here would protect the time-based adaptation behavior from regressions.
ℹ️ Review info
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- `dagshub/common/config.py`
- `dagshub/data_engine/model/datasource.py`
- `tests/data_engine/test_datasource.py`
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: build (3.12)
- GitHub Check: build (3.13)
- GitHub Check: build (3.11)
- GitHub Check: build (3.10)
- GitHub Check: build (3.9)
🧰 Additional context used
🧬 Code graph analysis (3)
dagshub/common/config.py (2)
- dagshub/data_engine/model/datapoint.py (1): `get` (325-334)
- dagshub/data_engine/client/loaders/base.py (1): `get` (102-124)
tests/data_engine/test_datasource.py (1)
- dagshub/data_engine/model/datasource.py (3): `source` (180-181), `update_metadata` (1822-1857), `_upload_metadata` (751-859)
dagshub/data_engine/model/datasource.py (4)
- dagshub/data_engine/model/query_result.py (2): `entries` (110-114), `entries` (117-119)
- dagshub/data_engine/client/data_client.py (1): `update_metadata` (228-248)
- dagshub/data_engine/client/gql_mutations.py (1): `update_metadata` (32-47)
- dagshub/data_engine/voxel_plugin_server/routes/datasource.py (1): `update_metadata` (32-45)
🪛 Ruff (0.15.2)
tests/data_engine/test_datasource.py
[warning] 180-180: Avoid specifying long messages outside the exception class
(TRY003)
🔇 Additional comments (3)
dagshub/common/config.py (1)
63-74: Config knobs are integrated cleanly. The new min/initial/target metadata batch settings are wired consistently for adaptive upload behavior.
tests/data_engine/test_datasource.py (1)
11-24: Test scaffolding looks good. The direct config import plus the `_uploaded_batch_sizes` helper make the adaptive-batching assertions clear and robust.

dagshub/data_engine/model/datasource.py (1)

758-805: Adaptive sizing setup is well-structured. The min/initial/max normalization and bounded growth/shrink helpers are clean and easy to reason about.
Pull request overview
This PR updates Data Engine metadata uploads to use an adaptive batch size strategy (instead of a fixed batch size), with new configuration knobs and tests validating the growth and failure-retry behavior.
Changes:
- Added adaptive batch sizing logic to `Datasource._upload_metadata`, growing/shrinking based on success/failure and per-batch elapsed time.
- Introduced new config/env knobs for min/initial batch size and a target upload time per batch.
- Added tests covering batch-size growth and retry-with-smaller-batch on failure.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| dagshub/data_engine/model/datasource.py | Implements adaptive upload batching with growth/shrink heuristics and time-based targeting. |
| dagshub/common/config.py | Adds config variables/env keys for min/initial batch sizes and target batch time. |
| tests/data_engine/test_datasource.py | Adds tests asserting batch growth and failure retry behavior with reduced batch sizes. |
Pull request overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated no new comments.
Actionable comments posted: 1
🧹 Nitpick comments (2)
tests/data_engine/test_datasource.py (2)
155-158: Factor repeated config patch setup into a helper/fixture. The same patch block is repeated in multiple tests, which makes future tuning harder. A small fixture (e.g., `set_upload_batch_config(...)`) would reduce drift.

Also applies to: 170-174, 195-199, 211-214, 228-231
199-199: Make monotonic mocking less brittle. Line 199 hardcodes exactly four `time.monotonic()` values; minor internal instrumentation changes can break this test without behavior changes. Consider a generator/helper that safely serves extra values.
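One way to build the suggested generator/helper is a factory that serves the scripted values and then keeps ticking, so extra `time.monotonic()` calls never exhaust the mock. The name and signature here are illustrative, not from the codebase:

```python
import itertools


def scripted_monotonic(values, tail_step=0.001):
    """Return a callable that yields the scripted monotonic values in order,
    then strictly increasing filler values so surplus calls never raise."""
    last = values[-1] if values else 0.0
    # After the script runs out, keep time moving forward in small steps.
    tail = (last + tail_step * n for n in itertools.count(1))
    it = itertools.chain(values, tail)
    return lambda: next(it)
```

In a test it could be wired up as, e.g., `mocker.patch("time.monotonic", side_effect=scripted_monotonic([0.0, 1.0, 2.0, 3.0]))`, and the test survives the code under test adding an extra timing call.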
ℹ️ Review info
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- `dagshub/data_engine/model/datasource.py`
- `tests/data_engine/test_datasource.py`
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: build (3.9)
- GitHub Check: build (3.11)
- GitHub Check: build (3.13)
- GitHub Check: build (3.10)
- GitHub Check: build (3.12)
🧰 Additional context used
🧬 Code graph analysis (1)
tests/data_engine/test_datasource.py (4)
- tests/data_engine/conftest.py (1): `ds` (18-19)
- dagshub/data_engine/client/data_client.py (1): `update_metadata` (228-248)
- dagshub/data_engine/model/metadata/dtypes.py (1): `DatapointMetadataUpdateEntry` (23-31)
- dagshub/data_engine/dtypes.py (1): `MetadataFieldType` (20-36)
🪛 Ruff (0.15.2)
tests/data_engine/test_datasource.py
[warning] 180-180: Avoid specifying long messages outside the exception class
(TRY003)
[warning] 238-238: Avoid specifying long messages outside the exception class
(TRY003)
🔇 Additional comments (2)
tests/data_engine/test_datasource.py (1)
150-246: Great adaptive-batching test coverage. These cases validate the key behavior transitions (grow, shrink-on-failure, slow-success downshift, retryability classification, and partial-batch edge handling) and are well-targeted.
dagshub/data_engine/model/datasource.py (1)
807-821: Retryability classification is a solid improvement. The explicit non-retryable fail-fast path and narrowed retryable exception handling reduce noisy retries and surface real failures sooner.
Also applies to: 847-850
Actionable comments posted: 1
🧹 Nitpick comments (1)
tests/data_engine/test_datasource.py (1)
223-246: Add a regression test for failure at exact `min_batch_size`. Current coverage validates partial-batch-below-min retries, but it doesn't assert behavior when a retryable error occurs at exactly the configured minimum. Please add that case so min-floor behavior is locked in.
ℹ️ Review info
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- `dagshub/data_engine/model/datasource.py`
- `tests/data_engine/test_datasource.py`
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: build (3.9)
- GitHub Check: build (3.11)
- GitHub Check: build (3.12)
- GitHub Check: build (3.10)
- GitHub Check: build (3.13)
🧰 Additional context used
🧬 Code graph analysis (1)
dagshub/data_engine/model/datasource.py (1)
- dagshub/data_engine/model/errors.py (1): `DataEngineGqlError` (35-45)
🪛 Ruff (0.15.2)
tests/data_engine/test_datasource.py
[warning] 180-180: Avoid specifying long messages outside the exception class
(TRY003)
[warning] 238-238: Avoid specifying long messages outside the exception class
(TRY003)
[warning] 264-264: Avoid specifying long messages outside the exception class
(TRY003)
🔇 Additional comments (2)
dagshub/data_engine/model/datasource.py (1)
807-821: Retryable error classification + bounded backoff integration looks solid. Good call separating retryable/non-retryable failures and resetting backoff after successful uploads.
Also applies to: 860-863, 875-875
tests/data_engine/test_datasource.py (1)
150-271: Great adaptive batching test coverage expansion. The added scenarios (growth, retry downsizing, slow-batch reduction, non-retryable abort, and backoff reset) materially improve confidence in the new upload loop.
🧹 Nitpick comments (2)
dagshub/data_engine/model/datasource.py (1)
860-862: Backoff cap is effectively 4s, not 5s. The current exponent clamp prevents the expression from ever reaching the configured 5s ceiling. Consider adjusting the exponent cap so the 5s max is reachable.

Patch:

```diff
-retry_delay_sec = min(5.0, 0.25 * (2 ** min(consecutive_retryable_failures - 1, 4)))
+retry_delay_sec = min(5.0, 0.25 * (2 ** min(consecutive_retryable_failures - 1, 5)))
```

tests/data_engine/test_datasource.py (1)

165-188: Avoid real backoff sleeps in retry tests. These retry-path tests can incur actual sleep delays, which slows the suite. Mocking `time.sleep` here keeps tests fast and deterministic.

Patch:

```diff
 def test_upload_metadata_retries_with_smaller_batch_after_failure(ds, mocker):
+    mocker.patch("dagshub.data_engine.model.datasource.time.sleep")
     entries = [
         DatapointMetadataUpdateEntry(f"dp-{i}", "field", str(i), MetadataFieldType.INTEGER) for i in range(10)
     ]
@@
 def test_upload_metadata_retries_partial_batch_below_min(ds, mocker):
+    mocker.patch("dagshub.data_engine.model.datasource.time.sleep")
     entries = [
         DatapointMetadataUpdateEntry(f"dp-{i}", "field", str(i), MetadataFieldType.INTEGER) for i in range(10)
     ]
```

Also applies to: 223-246
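The effect of the exponent clamp discussed above is easy to check numerically. A small sketch of the delay expression (the helper name is illustrative):

```python
def retry_delay(failures, exponent_cap=4):
    # Exponential backoff: 0.25s base, doubling per consecutive retryable
    # failure, with the exponent clamped before the 5s ceiling applies.
    return min(5.0, 0.25 * (2 ** min(failures - 1, exponent_cap)))


# With the original cap of 4, delays for failures 1..6 are
# 0.25, 0.5, 1.0, 2.0, 4.0, 4.0 -- the 5.0s ceiling never binds.
# With the cap raised to 5, the sixth delay becomes min(5.0, 8.0) = 5.0.
```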
ℹ️ Review info
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- `dagshub/data_engine/model/datasource.py`
- `tests/data_engine/test_datasource.py`
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: build (3.12)
- GitHub Check: build (3.11)
- GitHub Check: build (3.10)
- GitHub Check: build (3.9)
- GitHub Check: build (3.13)
🧰 Additional context used
🧬 Code graph analysis (1)
dagshub/data_engine/model/datasource.py (3)
- dagshub/data_engine/model/errors.py (1): `DataEngineGqlError` (35-45)
- dagshub/data_engine/client/data_client.py (1): `update_metadata` (228-248)
- dagshub/data_engine/client/gql_mutations.py (1): `update_metadata` (32-47)
🪛 Ruff (0.15.2)
tests/data_engine/test_datasource.py
[warning] 180-180: Avoid specifying long messages outside the exception class
(TRY003)
[warning] 238-238: Avoid specifying long messages outside the exception class
(TRY003)
[warning] 264-264: Avoid specifying long messages outside the exception class
(TRY003)
🔇 Additional comments (2)
dagshub/data_engine/model/datasource.py (1)
807-821: Good retryability split and failure handling path. The non-retryable fast-fail branch plus targeted retryable exception filtering is a strong reliability improvement here.
Also applies to: 848-872
tests/data_engine/test_datasource.py (1)
150-290: Excellent adaptive-batching test coverage. These scenarios exercise core growth/shrink behavior, retry classification, min-bound handling, and backoff reset semantics well.
Pull request overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
Pull request overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
Pull request overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
Pull request overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
Pull request overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated no new comments.
```python
                next_batch_size = batch_size + max(1, (bad_batch_size - batch_size) // 2)
            else:
                next_batch_size = batch_size * 2
```
It will take at least 6 upload batches before things scale up from 150 (~2**7) to 15k (~2**13).

Considering how every batch will be a "checkpoint" in the metadata history, I think going for an "ask for forgiveness rather than permission" approach is better: scaling up much faster (say, x10), and then scaling down by dividing by two, will get us to a good bound quicker.

However, I honestly think this should be implemented in a way that actually checks the byte amount being sent over, instead of the number of metadata values being uploaded. We're always hitting problems whenever users upload blobs, so if someone puts a megabyte blob in 150 files, things will still break.
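A byte-based strategy like the one suggested could be sketched as a greedy packer that splits on estimated payload size rather than entry count. Everything here is a placeholder assumption (the `estimate` callable, the 4 MiB default, and the function name are illustrative, not actual wire-size accounting):

```python
def batches_by_bytes(entries, max_bytes=4 * 1024 * 1024, estimate=lambda e: len(str(e))):
    """Greedily pack entries into batches bounded by estimated payload bytes,
    so a handful of megabyte blobs can't blow up a single request."""
    batch, batch_bytes = [], 0
    for entry in entries:
        size = estimate(entry)
        # Start a new batch when adding this entry would exceed the cap;
        # an oversized single entry still goes out alone rather than never.
        if batch and batch_bytes + size > max_bytes:
            yield batch
            batch, batch_bytes = [], 0
        batch.append(entry)
        batch_bytes += size
    if batch:
        yield batch
```

This makes the "megabyte blob in 150 files" case degrade to many small batches instead of one failing request, and it could be layered under the existing count-based adaptation as a hard cap.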
Pull request overview
Copilot reviewed 4 out of 4 changed files in this pull request and generated no new comments.
Actionable comments posted: 1
🧹 Nitpick comments (1)
dagshub/common/config.py (1)
61-67: Clarify the max batch variable name for symmetry and readability.
`dataengine_metadata_upload_batch_size` now represents the max bound. Adding an explicit `..._max` variable (while keeping the old name as alias) will make config intent much clearer.

Proposed refactor:

```diff
 DATAENGINE_METADATA_UPLOAD_BATCH_SIZE_MAX_KEY = "DAGSHUB_DE_METADATA_UPLOAD_BATCH_SIZE_MAX"
-dataengine_metadata_upload_batch_size = int(
+dataengine_metadata_upload_batch_size_max = int(
     os.environ.get(
         DATAENGINE_METADATA_UPLOAD_BATCH_SIZE_MAX_KEY,
         os.environ.get(DATAENGINE_METADATA_UPLOAD_BATCH_SIZE_KEY, 15000),
     )
 )
+# Backwards compatibility alias
+dataengine_metadata_upload_batch_size = dataengine_metadata_upload_batch_size_max
```
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 515750c5-a1a0-4151-85b8-d61f5361d99e
📒 Files selected for processing (4)
- `dagshub/common/config.py`
- `dagshub/data_engine/model/datasource.py`
- `dagshub/data_engine/model/metadata/upload_batching.py`
- `tests/data_engine/test_datasource.py`
📜 Review details
🧰 Additional context used
🧬 Code graph analysis (3)
dagshub/data_engine/model/metadata/upload_batching.py (1)
- dagshub/data_engine/model/errors.py (1): `DataEngineGqlError` (35-45)

tests/data_engine/test_datasource.py (3)
- tests/data_engine/conftest.py (1): `ds` (18-19)
- dagshub/data_engine/client/data_client.py (1): `update_metadata` (228-248)
- dagshub/data_engine/dtypes.py (1): `MetadataFieldType` (20-36)

dagshub/data_engine/model/datasource.py (1)
- dagshub/data_engine/model/metadata/upload_batching.py (6): `AdaptiveUploadBatchConfig` (25-51), `get_retry_delay_seconds` (114-116), `is_retryable_metadata_upload_error` (97-111), `next_batch_after_retryable_failure` (78-94), `next_batch_after_success` (58-75), `from_values` (32-51)
🪛 Ruff (0.15.2)
tests/data_engine/test_datasource.py
[warning] 181-181: Avoid specifying long messages outside the exception class
(TRY003)
[warning] 207-207: Avoid specifying long messages outside the exception class
(TRY003)
[warning] 266-266: Avoid specifying long messages outside the exception class
(TRY003)
[warning] 292-292: Avoid specifying long messages outside the exception class
(TRY003)
🔇 Additional comments (3)
dagshub/data_engine/model/datasource.py (2)
56-62: Nice modularization of adaptive upload policy. Moving batching/retry policy into `dagshub.data_engine.model.metadata.upload_batching` keeps `_upload_metadata` focused and easier to maintain.
775-775: ⚠️ Potential issue | 🟡 Minor

Progress label is inaccurate when retries go below configured minimum. Line 775 advertises `min`-`max`, but retry logic can still drop below `min` (down to 1). Please adjust wording so users aren't misled during uploads.

Suggested text tweak:

```diff
-    f"Uploading metadata (adaptive batch {batch_config.min_batch_size}-{batch_config.max_batch_size})...",
+    (
+        "Uploading metadata "
+        f"(adaptive target {batch_config.min_batch_size}-{batch_config.max_batch_size}, "
+        "retries may go lower)..."
+    ),
```

> Likely an incorrect or invalid review comment.

tests/data_engine/test_datasource.py (1)

150-318: Great coverage for adaptive upload behavior. These tests exercise growth, retry shrink, known-bad-size avoidance, non-retryable aborts, and backoff reset paths well.
```python
def is_retryable_metadata_upload_error(exc: Exception) -> bool:
    if isinstance(exc, DataEngineGqlError):
        return isinstance(exc.original_exception, (TransportServerError, TransportConnectionFailed))
```
Wrapped retryable errors can be misclassified and skip retries.
When `exc` is a `DataEngineGqlError`, only two wrapped types are checked. If `original_exception` is another retryable type already recognized below (e.g., timeout/connection), upload aborts instead of retrying.
Proposed fix:

```diff
 def is_retryable_metadata_upload_error(exc: Exception) -> bool:
     if isinstance(exc, DataEngineGqlError):
-        return isinstance(exc.original_exception, (TransportServerError, TransportConnectionFailed))
+        original_exception = exc.original_exception
+        return isinstance(original_exception, Exception) and is_retryable_metadata_upload_error(original_exception)
     return isinstance(
         exc,
         (
             TransportServerError,
```
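The unwrap-and-recurse idea in the proposed fix can be demonstrated standalone. The exception classes below are stand-ins for illustration (the real code uses gql's transport errors and `DataEngineGqlError`; the retryable set here is an assumption):

```python
class TransportServerError(Exception):
    pass


class TransportConnectionFailed(Exception):
    pass


class WrappedError(Exception):
    """Stand-in for DataEngineGqlError: carries the underlying cause."""

    def __init__(self, original_exception):
        super().__init__(str(original_exception))
        self.original_exception = original_exception


RETRYABLE = (TransportServerError, TransportConnectionFailed, TimeoutError, ConnectionError)


def is_retryable(exc):
    # Unwrap wrapper errors recursively, so any retryable cause is
    # recognized -- not just a hardcoded subset of wrapped types.
    if isinstance(exc, WrappedError):
        inner = exc.original_exception
        return isinstance(inner, Exception) and is_retryable(inner)
    return isinstance(exc, RETRYABLE)
```

With this shape, a wrapped `TimeoutError` is classified as retryable instead of aborting the upload, while a wrapped `ValueError` still fails fast.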
Summary
Testing