feat(cli): Add --include-path selective filtering to download-output#1108
feat(cli): Add --include-path selective filtering to download-output#1108larrygao001 wants to merge 15 commits intoaws-deadline:mainlinefrom
Conversation
856e788 to
d7b27ac
Compare
98c3e8b to
51e68e8
Compare
crowecawcaw
left a comment
There was a problem hiding this comment.
Let's split up the download-input command from the filtering since the changes are independent and to make it easer to review. Suggest starting with the path filtering first and then adding download-input with the patterns we land on in the first PR.
51e68e8 to
d933830
Compare
|
Add selective download support for job output attachments: - Add --include-path and --include-path-stdin options to download-output for downloading specific files or directory prefixes - Add _filter_paths() and _matches_any_filter() in download.py - Add path_filters parameter to OutputDownloader - Add _validate_and_normalize_include_paths() with path traversal rejection, backslash-to-forward-slash conversion, and normalization - Fix bug in get_job_input_paths_by_asset_root() where S3 root prefix was not prepended to input manifest keys - Update job_attachments_guide.md to document --include-path - Add unit tests for path filtering and validation Signed-off-by: larrygao <larrygao@amazon.com>
ca11fcd to
2ee1e48
Compare
| if ".." in f: | ||
| raise click.BadParameter(f"Path filter must not contain '..': {f}") |
There was a problem hiding this comment.
I don't think we need to reject this, even though it's odd. This is a path filter so it chooses what files are downloaded. So if someone's trying to do something malicious with a weird path, at most they'll download all files for the job. If we were attaching files where .. could get the submitter access to unexpected files on the workstation, we'd need to think a bit more.
I think it's better to leave it out because it's simpler and so it doesn't suggest there's a security issue to defend against and which requires a robust control.
There was a problem hiding this comment.
Good call, I left this out and am no longer validating against it so .. would just be treated like any other non-existant path
| if output != "json": | ||
| try: | ||
| tty_path = "CON" if sys.platform == "win32" else "/dev/tty" | ||
| sys.stdin = open(tty_path) # noqa: SIM115 | ||
| except OSError: | ||
| pass # Non-interactive environment — no TTY to reopen |
There was a problem hiding this comment.
From what I can tell, this comes in to play if the caller sends in paths via stdin then later wants to send interactive commands (e.g. confirming something) also via stdin. So this code switches stdin back to TTY after the paths finish reading from stdin. I think this logic is confusing and unnecessary. Let's not mix the two. If someone sends in paths via stdin, they will not get any interactive prompts. If there are interactions required, we should either choose a default option that makes sense or fail with an error so the caller always needs to pass in the choice explicitly.
There was a problem hiding this comment.
Makes sense, fixed in latest commit.
crowecawcaw
left a comment
There was a problem hiding this comment.
We added a new set of testes for job attachment CLI commands in this PR: #1125
The idea to test CLI commands without any internal mocking by using mock HTTP servers for S3 and Deadline. This feature would be great to cover with this new test pattern. It'd look something like creating a temp directory, creating a manifest with some real files in S3, downloading the files with this filter, and asserting the files that are created in our temp/test directory are only the ones we expect.
- Remove path traversal (..) rejection from path filter normalization; filters only select which files to download so .. is harmless - Remove TTY stdin reopen hack; force auto_accept when --include-path-stdin is used so no interactive prompts are attempted - Rename _validate_and_normalize_include_paths to _normalize_include_paths - Add e2e tests using mock HTTP server pattern: directory prefix, exact file, multiple filters, no-match, stdin, and stdin+json (DCM pattern) - Extract _seed_output_job helper to reduce e2e test boilerplate Signed-off-by: Nathy MacKinlay <61921733+waninggibbon@users.noreply.github.com>
|
Added the following e2e tests in latest commit:
|
| raise click.UsageError("Missing option '--step-id' required with '--task-id'") | ||
|
|
||
| filters = list(include_path) | ||
| if include_path_stdin: |
There was a problem hiding this comment.
This is really odd - we did quite a bit of work before on how to source input glob filters. See the previous thinking in manifest snapshot.
Align download-output filtering with the manifest CLI pattern: - Replace --include-path/--include-path-stdin with -i/--include and -e/--exclude options supporting glob patterns via fnmatch - Add -ie/--include-exclude-config for JSON string or file input, reusing _process_glob_inputs from the manifest CLI - Default to include-all when only --exclude is provided - Rename OutputDownloader param path_filters to include_filters - Add e2e tests: exclude, glob patterns, config file, and large inline JSON blob (200 files, 150 selected) Signed-off-by: Nathy MacKinlay <61921733+waninggibbon@users.noreply.github.com>
|
Updated to mirror the manifest style arguments ( New tests:
|
Replace the --include/--exclude/--include-exclude-config options on download-output with --include, --include-config, and --submission-path. Key changes: - Match glob patterns against the full path (root + relative) instead of just the relative path, enabling patterns like "*/renders/*.png" - Filter against workstation paths by default; add --submission-path flag to filter against original submission paths instead - Remove --exclude and --include-exclude-config; rename to --include-config (include-only) - Normalize Windows backslash root paths for cross-OS glob matching - Add OutputDownloader.apply_include_filters() to avoid reaching into internals from the CLI layer - Extract _full_path() helper to DRY up path joining in filter functions Signed-off-by: Nathy MacKinlay <61921733+waninggibbon@users.noreply.github.com>
|
Discussed this extensively offline, here is a quick summary of the latest changes: What changedCLI options:
Filtering behavior:
Code quality:
DCM compatibilityThe --include-config option accepting a file path covers DCM's use case for large filter sets (avoids Windows CreateProcess 32KB argument limit). The deadline:// protocol path is TestsAll existing tests updated, plus new coverage for:
|
Signed-off-by: Nathy MacKinlay <61921733+waninggibbon@users.noreply.github.com>
Plain relative paths (no glob characters, not absolute) are now matched as a suffix against the full path. This allows filters like "renders/frame_001.exr" to match "/home/user/project/renders/frame_001.exr" without requiring glob prefixes. This supports the DCM use case where the frontend has relative paths from manifests and passes them directly as include filters. Signed-off-by: Nathy MacKinlay <61921733+waninggibbon@users.noreply.github.com>
da4d3f4 to
1c9f256
Compare
Remove --include-config and stdin (--include -) options, keeping only --include as the single filtering mechanism. This simplifies the CLI surface while covering all use cases: - Human CLI: --include "*.exr" or --include "renders/frame_001.exr" - DCM: multiple --include args via Tauri shell with args:true scope - Web copy/paste: repeated --include flags in the command The --include-config and stdin approaches can be added back later if the Windows 32KB CreateProcess limit becomes a practical concern. Signed-off-by: Nathy MacKinlay <61921733+waninggibbon@users.noreply.github.com>
Summary of latest changesSimplified the CLI to just two options: --include and --submission-path. --include (repeatable)Supports both glob patterns and plain relative paths: bash Glob patternsdeadline job download-output --include ".exr" Relative paths (matched as suffix against full path)deadline job download-output --include "renders/frame_001.exr" Multiple filters (OR'd)deadline job download-output -i "renders/frame_001.exr" -i "renders/frame_002.exr" -i "*.log" Patterns are matched against the full path (root + relative), so /renders/.png works across directory structures. Plain relative paths like renders/frame_001.exr match as a suffix — --submission-path flagBy default, filters match against the local workstation paths (after storage profile mapping). Pass --submission-path to filter against the original submission paths instead. Why this covers all our use cases
What was removed
Cross-OSWindows backslash root paths are normalized to forward slashes for matching without affecting download paths. A Windows user's --include "renders\frame.exr" works the same as a Mac |
| return normalized | ||
|
|
||
|
|
||
| def _parse_include_filters(include: tuple[str, ...]) -> Optional[list[str]]: |
There was a problem hiding this comment.
Nit: feels like an unnecessary function that could be rolled into _normalize_filters instead.
There was a problem hiding this comment.
Yeah pretty redundant, I went ahead and inlined it in the new implementation.
| for manifest in job_attachments_manifests: | ||
| root_path_format_mapping[manifest["rootPath"]] = manifest["rootPathFormat"] | ||
|
|
||
| # When --submission-path is set, filter against the submission (S3) paths. |
There was a problem hiding this comment.
Super nit: more like the relative job attachment paths. S3 is more of an implementation detail of job attachments
There was a problem hiding this comment.
Dropped "s3" in latest commit!
| task_id=task_id, | ||
| session_action_id=session_action_id, | ||
| session=queue_role_session, | ||
| include_filters=include_filters if submission_path else None, |
There was a problem hiding this comment.
Nit: could use better variable names to differentiate the two include_filters variables. There's probably a 3rd meaning elsewhere for workstation filters.
There was a problem hiding this comment.
Renamed to include_patterns when referring to user provided patterns, and include_filters when referring to the actual filter that gets applied to distinguish between the two!
- Remove _is_relative_filter and suffix matching; instead auto-prepend */ to relative patterns so all non-directory filters use fnmatch. This enables relative globs like "renders/*.exr" to work naturally. - Remove _parse_include_filters wrapper; inline _normalize_filters at the call site. - Rename include_filters to include_patterns in _download_job_output for clarity (OutputDownloader keeps include_filters as its API). - Fix unit tests to use full paths as file_path input, matching real usage through _full_path(). Signed-off-by: Nathy MacKinlay <61921733+waninggibbon@users.noreply.github.com>
Replace the --submission-path boolean flag with --match-paths-by that accepts SOURCE or WORKSTATION (default). This makes the option self-documenting about the two filtering modes. Also clean up comments to reference "source" instead of "S3". Signed-off-by: Nathy MacKinlay <61921733+waninggibbon@users.noreply.github.com>
Rename SOURCE to JOB and WORKSTATION to LOCAL for clarity: - JOB matches against the paths recorded at job submission - LOCAL matches against the local download paths (default) Also update help text to be more user-friendly. Signed-off-by: Nathy MacKinlay <61921733+waninggibbon@users.noreply.github.com>
| vfs_manager.start(session_dir=session_dir) | ||
|
|
||
|
|
||
| def _full_path(root: str, relative: str) -> str: |
There was a problem hiding this comment.
There's other code that joins the roots and manifest paths. Can we use the same mechanism here to make sure we don't have differences?
|
|
||
| for f in filters: | ||
| if f.endswith("/"): | ||
| if file_path.startswith(f): |
There was a problem hiding this comment.
Shouldn't this still use fnmatch?
There was a problem hiding this comment.
Yes, good catch. Will fix in next commit.
| ) | ||
|
|
||
|
|
||
| class TestMatchesAnyFilter: |
There was a problem hiding this comment.
Looks like this is only testing against posix paths, need windows paths with \ separators too.
There was a problem hiding this comment.
Adding in next commit!
| is_json_format: bool = False, | ||
| ignore_storage_profiles: bool = False, | ||
| include_patterns: Optional[list[str]] = None, | ||
| match_paths_by: str = "LOCAL", |
There was a problem hiding this comment.
Question, why a hard coded "LOCAL", not a string enum?
There was a problem hiding this comment.
I see this parameter is LOCAL or JOB?
There was a problem hiding this comment.
Would it cause the code to be brittle since it implies users to know this string has 2 explicit values.
There was a problem hiding this comment.
Just a foolish blunder, no more reasoning than that. Will fix!
| match anywhere under the root — e.g. 'renders/*.exr' matches '*/renders/*.exr'. | ||
| The file_path should be the full path (root + relative). | ||
| """ | ||
| from fnmatch import fnmatch |
There was a problem hiding this comment.
nit :) Kiro loves inline imports, can we move it up to the top.
| vfs_manager.start(session_dir=session_dir) | ||
|
|
||
|
|
||
| def _full_path(root: str, relative: str) -> str: |
There was a problem hiding this comment.
WARNING - we cannot merge this now.
There was a problem hiding this comment.
Thanks for the heads up, I'll raise a new PR against the new repo.
leongdl
left a comment
There was a problem hiding this comment.
WARNING - this collides with the JA separation.
- Add MatchPathsBy enum (JOB/LOCAL) replacing raw strings - Use posixpath.join in _full_path for consistency with _transform_manifests_to_absolute_paths - Use fnmatch for directory prefix filters instead of startswith - Handle Windows drive letter paths as absolute in _matches_any_filter - Move fnmatch import to top of download.py - Add Windows path separator tests for _matches_any_filter Signed-off-by: Nathy MacKinlay <61921733+waninggibbon@users.noreply.github.com>



Add selective download support for job output attachments:
--include-pathand--include-path-stdinoptions todownload-outputfor downloading specific files or directory prefixes_filter_paths()and_matches_any_filter()indownload.pypath_filtersparameter toOutputDownloader_validate_and_normalize_include_paths()with path traversal rejection, backslash-to-forward-slash conversion, and normalizationget_job_input_paths_by_asset_root()where S3 root prefix was not prepended to input manifest keys_run_download_ux(),_get_job_download_context(),_parse_filters_and_config(),_handle_download_error()job_attachments_guide.mdto document--include-pathA follow-up PR will add the
download-inputcommand using the patterns established here.Fixes:
What was the problem/requirement? (What/Why)
The existing
deadline job download-outputcommand downloads all output files for a job. Users have no way to download only specific files or folders from a job's output. This makes the CLI less useful for workflows where users only need a subset of files.What was the solution? (How)
--include-path(repeatable): accepts exact relative file paths or directory prefixes (trailing/). Multiple values are OR'd. Without--include-path, behavior is unchanged — all outputs are downloaded.--include-path-stdin: reads one path per line from stdin until EOF or an empty line (sentinel). After consuming stdin, reopens/dev/tty(orCONon Windows) so interactive prompts still work. This is needed for the Deadline Cloud Monitor desktop app (Tauri) integration, where the Monitor writes file paths to the child process's stdin to avoid shell argument limits and Tauri's scoped command argument validation._validate_and_normalize_include_paths()rejects..(path traversal), converts\to/(Windows compatibility), strips./, collapses//.get_job_input_paths_by_asset_root()was not prepending the S3 root prefix + Manifests folder to input manifest keys (unlikeasset_sync.pywhich does this correctly). Fixed to uses3_settings.add_root_and_manifest_folder_prefix()._run_download_ux(),_get_job_download_context(),_parse_filters_and_config(), and_handle_download_error()to prepare for the follow-updownload-inputPR.What is the impact of this change?
download-outputwithout--include-pathbehaves exactly as before — no breaking change.get_job_input_paths_by_asset_root()bug fix changes the S3 key used to fetch input manifests. This function was not previously called by any CLI command. The worker agent usesasset_sync.pywhich already had the correct prefix logic.How was this change tested?
See DEVELOPMENT.md for information on running tests.
download-output --include-path "output/output_0005.png"→ downloaded 1 of 25 filesdownload-outputwith 10--include-pathvalues → downloaded exactly 10 of 25 files--include-path "nonexistent.txt"→ "No output files available"--include-path "../../etc/passwd"→ rejected with "Path filter must not contain '..'"--include-path-stdinwithout piped data → reads until empty line, then continues to interactive promptsdownloadorasset_syncmodules? If so, then it is highly recommendedthat you ensure that the docker-based unit tests pass.
download.py. Docker-based tests should be run before merge.Was this change documented?
docs/cli_reference/deadline_job.md) is auto-generated from click decorators — new options will appear automatically.docs/job_attachments_guide.mdupdated to mention--include-path.Does this PR introduce new dependencies?
Is this a breaking change?
No.
download-outputwithout--include-pathbehaves identically to before. Theget_job_input_paths_by_asset_root()fix changes behavior of a function that was not previously called from any CLI command.Does this change impact security?
--include-pathoption introduces a new input vector. Path traversal is mitigated with defense in depth:_validate_and_normalize_include_paths()rejects any filter containing..at parse time._ensure_paths_within_directory()validates download destination paths stay within the asset root at download time (existing check).download-output.By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.