Update worker images to optimize IO performance using local data #675
Open
robertbartel wants to merge 21 commits into NOAA-OWP:master from […]
ea6992c to 2da36b8
12d97d6 to 1f50ce3
aaraney reviewed Aug 19, 2024
# https://dl.min.io/client/mc/release/linux-amd64/archive/mc.${MINIO_CLIENT_RELEASE}

# Setup minio client; also update path and make sure dataset directory is there
RUN curl -L -o /dmod/bin/mc https://dl.min.io/client/mc/release/linux-amd64/mc \
aaraney (Member)
Suggested change
- RUN curl -L -o /dmod/bin/mc https://dl.min.io/client/mc/release/linux-amd64/mc \
+ RUN ARCHITECTURE=`echo $(uname -s)-$(uname -m) | tr '[:upper:]' '[:lower:]'`; \
+     curl -L -o /dmod/bin/mc https://dl.min.io/client/mc/release/${ARCHITECTURE}/mc \
robertbartel (Contributor, Author)
Hmm ... good catch, but the suggestion won't work as needed with Linux on an Intel machine (it produces linux-x86_64 instead of linux-amd64). I will make an adjustment, but I'll need to think more about exactly how.
robertbartel (Contributor, Author)
That wasn't the only case that would have been a problem, but I've now made a change that I think should catch the platform types we can reasonably expect to need transforming for purposes of that URL, and it adjusts them accordingly.
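For illustration only, the kind of translation discussed in this thread could look like the sketch below. It is not the change that was actually committed. The function name `mc_platform` is hypothetical, and the set of platform directory names handled here (`amd64`, `arm64`) is an assumption about which builds are relevant; anything unrecognized is passed through unchanged.

```shell
#!/bin/sh
# Hypothetical helper (not from the PR): turn `uname -s` / `uname -m` values
# into the <os>-<arch> directory name used in the MinIO client download URL.
mc_platform() {
    # Lowercase the OS name, e.g. Linux -> linux, Darwin -> darwin
    os=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
    # Map machine names that differ from the names used in the URL
    case "$2" in
        x86_64|amd64)  arch=amd64 ;;
        aarch64|arm64) arch=arm64 ;;
        *)             arch=$2 ;;   # assumption: pass unknown values through
    esac
    printf '%s-%s\n' "$os" "$arch"
}

# Possible use in a build step (left commented out so the sketch has no side effects):
# platform=$(mc_platform "$(uname -s)" "$(uname -m)")
# curl -L -o /dmod/bin/mc "https://dl.min.io/client/mc/release/${platform}/mc"
```

With this mapping, Linux on an Intel machine yields `linux-amd64` rather than `linux-x86_64`, which is the case raised above.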
aaraney reviewed Aug 19, 2024
Updating usages of DataRequirement so that whenever the fulfilled_by attribute of an instance is set - creation time or otherwise - the new needs_data_local is also set.
Add 2 new directories - /dmod/local_volumes and /dmod/cluster_volumes - to ngen image directory structure, meant for mount points of different types of volumes containing necessary data for the job; also, adding README with some initial documentation on this directory structure.
Updating Launcher to prepare services with local volume mounts when some data requirements must be fulfilled by local data on the physical node, and to update the relevant other args for starting worker services so that one worker on each node makes sure data gets prepared in local volumes as needed as part of job startup.
Making MinIO CLI client available within ngen worker image and derivatives (e.g., calibration worker), though without a pre-configured alias for connecting to the object store service.
Adding functionality to py_funcs.py to support making DMOD dataset data local (not just be locally accessible from remote storage).
Updating main entrypoint scripts for ngen and calibration worker images for local data handling.
Fixing script so that GUI services do not get stopped and updated unless that is actually asked for with the available CLI option.
Moving call to this Python function so that it happens before sanity checks (at the entrypoint level) ensuring dataset directories exist, as they won't exist until any data is made local.
- Order minio client args properly (config dir must come first)
- Cleanup output handling during minio client subprocess
- Correct a few logical mistakes with how conditionals should behave
- Fix issue with path object creation when copying from cluster volume
- Adding some helpful logging messages
- Make sure we actually create symlinks
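As a rough illustration of the argument ordering mentioned in the first item, a small wrapper can guarantee that the client's global `--config-dir` option always precedes the subcommand. This is a sketch, not code from the PR: the function name `run_mc`, the `MC_BIN` override, and the example paths are all hypothetical.

```shell
#!/bin/sh
# Hypothetical wrapper (not from the PR): always put the global --config-dir
# option before the mc subcommand and its arguments.
run_mc() {
    config_dir=$1
    shift
    # MC_BIN is an assumed override so the wrapper can be exercised without
    # the real client installed; it defaults to `mc` on the PATH.
    "${MC_BIN:-mc}" --config-dir "$config_dir" "$@"
}

# Possible use (commented out; alias and bucket names are placeholders):
# run_mc /tmp/mc-config cp --recursive some_alias/some_dataset /dmod/local_volumes/some_dataset
```

Setting `MC_BIN=echo` prints the command line that would be run, which is a cheap way to check the ordering.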
- Fixing handling of symlink for output dataset so it points to cluster volume as needed (i.e., so output can actually make it out of the worker)
- Fixing some issues with keyword args coming in from CLI that certain functions weren't set up to disregard properly
- Adding a bit more helpful logging in places
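The symlink arrangement described in the first item can be sketched as follows, under stated assumptions. The helper name `link_dataset` and the dataset names (`forcing`, `results`) are hypothetical, and a scratch directory stands in for `/dmod` so the sketch can run anywhere. Input data that was made local is linked from the local volume, while the output dataset is linked to the cluster volume so results leave the worker.

```shell
#!/bin/sh
# Hypothetical helper (not from the PR): make the path a job expects point at
# the directory that actually holds the data.
link_dataset() {
    # $1 = directory holding the data, $2 = path the job will read or write
    mkdir -p "$(dirname "$2")"
    ln -sfn "$1" "$2"
}

# Scratch root standing in for /dmod; directory names below are assumptions.
root=$(mktemp -d)
mkdir -p "$root/local_volumes/forcing" "$root/cluster_volumes/results"

# Input dataset: served from the local, per-node volume.
link_dataset "$root/local_volumes/forcing" "$root/datasets/input/forcing"
# Output dataset: points at the cluster volume so output persists.
link_dataset "$root/cluster_volumes/results" "$root/datasets/output/results"
```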
Adding logic and reordering certain steps so that, given job outputs are initially written locally, the process that then moves the results to backing dataset storage works properly and does not run into permissions issues.
Update dependencies on core and scheduler to 0.21.0 and 0.14.0 respectively.
Update dependencies on core and scheduler to 0.21.0 and 0.14.0 respectively.
Updating dependency on core to 0.21.0.
Updating dependencies on core and scheduler to 0.21.0 and 0.14.0 respectively.
Account for building in environments other than Linux X86_64 when downloading the MinIO client for the ngen worker images.
f85ea7b to 25dbf8e
Note: do not review until #671 is complete.
Note: blocked by #673, as testing cannot yet be completed.
Note: blocked by #678, as testing still cannot be completed.
Note: testing now blocked by a bug being addressed in #697.
Optimizing job execution by copying DMOD dataset data, when needed, to local, per-node Docker volumes as jobs are starting.
Changes
- […] needs_data_local attribute
- needs_data_local is set whenever fulfilled_by is set (i.e., de facto couple this as a part of fulfillment details)
- […] /dmod/datasets/**) at startup
Testing
Jobs started via the CLI execute successfully and in a reasonable amount of time. The specific test was for a month of VPU 1 and took less than 5 minutes from start to finish.