feat(incremental): parallelize S3 prefix download [BLDX-1088]#1503
feat(incremental): parallelize S3 prefix download [BLDX-1088]#1503vaibhavatlan wants to merge 12 commits intomainfrom
Conversation
…rency [BLDX-1088] download_s3_prefix_with_structure was downloading files sequentially. For large prefixes with hundreds of files, this multiplied latency linearly. Now uses asyncio.gather with a semaphore (max 20 concurrent) matching the pattern used in storage/batch.py.
✅ Snyk checks have passed. No issues have been found so far.
💻 Catch issues earlier using the plugins for VS Code, JetBrains IDEs, Visual Studio, and Eclipse. |
📜 Docstring Coverage ReportRESULT: PASSED (minimum: 30.0%, actual: 80.0%) Detailed Coverage Report |
📦 Trivy Vulnerability Scan Results
Report SummaryCould not generate summary table (data length mismatch: 9 vs 8). Scan Result Detailsrequirements.txtuv.lock |
📦 Trivy Secret Scan Results
Report SummaryCould not generate summary table (data length mismatch: 9 vs 8). Scan Result Detailsrequirements.txtuv.lock |
☂️ Python Coverage
Overall Coverage
New FilesNo new covered files... Modified FilesNo covered modified files...
|
|
🛠 Full Test Coverage Report: https://k.atlan.dev/coverage/application-sdk/pr/1503 |
|
@sdk-review auto-complete |
|
🔄 SDK Review starting (auto-complete) — ~10 min, possibly longer with fix iterations. Watch live progress
|
|
@sdk-review auto-complete |
|
🔄 SDK Review starting (auto-complete) — ~10 min, possibly longer with fix iterations. Watch live progress
|
SDK Review: PR #1503 — feat(incremental): parallelize S3 prefix download [BLDX-1088]Verdict: NEEDS FIXES
Findings by File
Strengths
|
- Move `import asyncio` to top-level module imports - Add unit tests for download_s3_prefix_with_structure covering: - All files downloaded to correct local paths - Semaphore bounds concurrency to 20 - Empty file list produces no downloads - Files without matching prefix used as-is for relative path Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
@sdk-review auto-complete |
|
🔄 SDK Re-review starting (auto-complete) — ~10 min, possibly longer with fix iterations. Watch live progress
|
SDK Re-review: PR #1503 — feat(incremental): parallelize S3 prefix download [BLDX-1088]Verdict: READY TO MERGE
Delta from prior review
Findings by File
Strengths
|
atlan-ci
left a comment
There was a problem hiding this comment.
SDK Re-review: previous findings resolved; no new issues introduced by the latest changes. CI passing. Branch up to date. Approved.
Aryamanz29
left a comment
There was a problem hiding this comment.
Looks good - one minor suggestion below:
…wnloads Replaces hardcoded Semaphore(20) with configurable constant defaulting to 4, aligning with batch.py and transfer.py conventions. Exposes max_concurrency param for per-call override. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
@sdk-review |
|
🔄 SDK Re-review starting (review) — ~10 min. Watch live progress |
SDK Re-review: PR #1503 — feat(incremental): parallelize S3 prefix download [BLDX-1088]Verdict: NEEDS FIXES
Delta from prior review
Findings by File
Strengths
|
| class TestDownloadS3PrefixWithStructure: | ||
| """Tests for parallel S3 prefix download with structure preservation.""" | ||
|
|
||
| @pytest.mark.asyncio |
There was a problem hiding this comment.
Minor [TEST] — @pytest.mark.asyncio is redundant when asyncio_mode = "auto" is set in pyproject.toml. Remove the decorator; plain async def test_... is sufficient. (Same applies to L299, L335, L353.)
| @pytest.mark.asyncio | ||
| async def test_concurrency_bounded_by_semaphore(self): | ||
| """No more than 20 concurrent downloads run at once.""" | ||
| import asyncio |
There was a problem hiding this comment.
Minor [QUAL] — import asyncio inside test method body. Stdlib imports should be at module top level. Move this to the top of the file alongside the other stdlib imports.
|
|
||
| @pytest.mark.asyncio | ||
| async def test_concurrency_bounded_by_semaphore(self): | ||
| """No more than 20 concurrent downloads run at once.""" |
There was a problem hiding this comment.
Minor [TEST] — Docstring says "No more than 20 concurrent downloads" but the test asserts max_concurrent <= 4 (the actual MAX_CONCURRENT_STORAGE_TRANSFERS default). Update the docstring to match the real bound.
|
📋 SDK Review complete — NEEDS FIXES. See the review comment above for findings. Address the issues and comment |
|
v3 relevance check: ✅ Still relevant — parallelize S3 prefix download is a v3 storage optimization. Please rebase on latest main. |
- Remove redundant @pytest.mark.asyncio (asyncio_mode=auto) - Move import asyncio to module top-level - Fix stale '20' in test docstring to match default of 4 - Document max_concurrency in Args docstring Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
@sdk-review |
|
🔄 SDK Re-review starting (review) — ~10 min. Watch live progress |
SDK Re-review: PR #1503 — feat(incremental): parallelize S3 prefix download [BLDX-1088]Verdict: READY TO MERGE
Delta from prior review
Findings by FileNo findings. Strengths
|
atlan-ci
left a comment
There was a problem hiding this comment.
SDK Re-review: no new issues introduced by the latest changes. CI passing. Branch up to date. Approved.
What was found
Rule: PERF-010 | File:
application_sdk/common/incremental/helpers.py:241| Severity: CRITICALSequential
await download_file()in a for loop. For large prefixes (hundreds of files), latency scales linearly.What was fixed
Replaced sequential loop with
asyncio.gather+asyncio.Semaphore(20)for bounded concurrent downloads. Matches the pattern already used instorage/batch.py download_prefix.Linear
https://linear.app/atlan-epd/issue/BLDX-1088