…options Add CLI flag parsing (--gpu, --docker, --mirror, --mirror-pip, --mirror-npm) as alternatives to environment variables. Flags can appear in any position and override corresponding env vars.
feat(installer): support --key=value CLI flags
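The flag-parsing behaviour above can be sketched as a small bash loop. The flag names come from the commit message, but the function name, variable names, and structure here are assumptions, not the installer's actual code:

```shell
# Sketch: accept --key=value flags in any position, each overriding the
# corresponding environment variable. Names are illustrative.
parse_flags() {
  local arg
  for arg in "$@"; do
    case "$arg" in
      --gpu=*)        GPU_TARGET="${arg#*=}" ;;
      --mirror=*)     MIRROR="${arg#*=}" ;;
      --mirror-pip=*) MIRROR_PIP="${arg#*=}" ;;
      --docker)       USE_DOCKER=1 ;;
    esac
  done
}

# Usage: env vars provide defaults; a flag, if present, wins:
#   MIRROR=https://default.example ./install.sh --mirror=https://mirror.example
```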
Add gfx1201/rdna4/dgpu to resolve_gpu_config mapping (GPU_TARGET=gfx120x). Add gfx120x to CI build-config.json (pytorch_whl: gfx120X-all). ROCm tarball and PyTorch wheels confirmed available at repo.amd.com.
Add 'pack' command to create self-contained offline deployment bundles. Support all 4 image source × deploy target combinations:
- `install` — build locally + deploy (existing)
- `install --pull` — pull from GHCR + deploy (new)
- `pack` — pull from GHCR + save to bundle (new)
- `pack --local` — build locally + save to bundle (new)
Offline bundles include the K3s binary/images, Helm, K9s, the ROCm device plugin manifest, all container images, and the Helm chart + values. Bundles are auto-detected via manifest.json when running from a bundle directory. Add pack-bundle.yml CI workflow for manual bundle creation.
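A minimal sketch of the manifest-based bundle auto-detection; the manifest.json file name is from the commit, while the function name and everything else is illustrative:

```shell
# Sketch: treat a directory as an offline bundle iff it carries a
# manifest.json at its root. No keys inside the manifest are assumed here.
is_offline_bundle() {
  [ -f "$1/manifest.json" ]
}

# Usage: if is_offline_bundle "$SCRIPT_DIR"; then load images from the
# bundle instead of pulling; otherwise fall through to the online path.
```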
- Add IMAGE_REGISTRY env var (default: ghcr.io/amdresearch) for a configurable image source in pack and install --pull
- Pack now exits with an error if any custom or external image fails to pull, preventing incomplete bundles
- Add image_registry input to the pack-bundle CI workflow
- Read IMAGE_REGISTRY from the bundle manifest for offline installs
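The registry default and fail-fast pull can be sketched as follows. The variable name and default value are from the commit; the loop and function name are assumptions:

```shell
# IMAGE_REGISTRY falls back to the documented default when unset.
IMAGE_REGISTRY="${IMAGE_REGISTRY:-ghcr.io/amdresearch}"

# Abort packing on the first failed pull rather than producing an
# incomplete bundle.
pull_all_or_fail() {
  local img
  for img in "$@"; do
    docker pull "$img" || { echo "ERROR: failed to pull $img" >&2; return 1; }
  done
}
```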
Support pulling images with non-default tag prefixes (e.g. develop-gfx1151 instead of latest-gfx1151). The IMAGE_TAG is stored in the bundle manifest and restored on offline install. Default remains "latest".
…stry/tag restore out of gpu_target guard
… in K3s airgap bundle
…erlay hub.image was incorrectly nested inside the custom.resources.images block, causing the metadata to be misinterpreted as a hub.image property and triggering a Helm schema validation failure.
… for consistency
Both pull and local-build modes now save hub/default images with
:latest and :${IMAGE_TAG} tags, matching GPU image behavior.
This ensures values.local.yaml references always resolve regardless
of which IMAGE_TAG was used during pack.
A silent warning on image-import failure could leave the cluster with missing images that cause pod failures at runtime. The installer now exits immediately, so the user sees a clear error instead of a mysteriously broken install.
- Remove redundant CUSTOM_IMAGES/IMAGES arrays; GPU_CUSTOM_NAMES and PLAIN_CUSTOM_NAMES are the single source of truth for image lists
- Fix typo: deply_aup_learning_cloud_runtime -> deploy_aup_learning_cloud_runtime
- Remove duplicate generate_values_overlay call in the deploy function (orchestration is now handled exclusively by callers)
- Remove unused check_root function; inline the root check at the entry points of deploy_all_components and pack_bundle
- Add missing section headers for the Runtime Management group
- rt install/reinstall and the legacy install-runtime now correctly call detect/get_paths/generate_overlay before deploy
- Merge gpu_target + gpu_type into a single gpu_type choice; the installer derives GPU_TARGET internally via resolve_gpu_config
- Add rdna4 option (gfx120x) to match upstream installer support
- image_tag now defaults to the current branch name (github.ref_name), so develop-branch packs use the 'develop' tag automatically
- Use an env: block instead of inline var prefixes for cleaner CI syntax
- Remove the root check from pack_bundle; pack only needs docker/wget, not root access (install still requires root)
github.ref_name for feature branches contains '/' (e.g. feature/offline-pack) which is invalid in Docker tags. Replace '/' with '-' when using branch name as default IMAGE_TAG.
Branch names like 'feature/offline-pack' are invalid Docker tags. Both the workflow and pack_bundle now auto-replace '/' with '-' so no manual sanitization is needed by the caller.
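The sanitization is a one-line bash substitution; the function name here is illustrative:

```shell
# Replace every '/' with '-' so branch names form valid Docker tags,
# e.g. feature/offline-pack -> feature-offline-pack.
sanitize_tag() {
  printf '%s\n' "${1//\//-}"
}
```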
- Add workflow_run trigger: fires after 'Build Docker Images' completes, ensuring all images (hub, base, courses) are built before packing starts
- pack-release job: matrix over all 4 GPU types; only runs on v* tags pushed to AMDResearch/aup-learning-cloud (main-repo guard)
- pack-release attaches bundles to the existing GitHub Release
- pack-manual job: unchanged workflow_dispatch flow for manual testing
- Fix tar SIGPIPE false error in the verify step (2>/dev/null)
… layers Course images (cv/dl/llm/physim) all share auplc-base layers. Saving them separately caused those layers to be written N times. A single docker save call with all image refs deduplicates shared layers automatically, reducing bundle size significantly.
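A dry-run sketch of composing the single `docker save` call; the image names are illustrative and the real list comes from the installer's image arrays:

```shell
# Compose one `docker save` invocation covering every image ref, so
# layers shared between images (e.g. a common base) are written once.
build_save_cmd() {
  local out="$1"; shift
  printf 'docker save -o %s' "$out"
  printf ' %s' "$@"     # append each image ref to the same command
  printf '\n'
}

# Usage (illustrative image names):
#   eval "$(build_save_cmd images.tar auplc-cv:latest auplc-dl:latest)"
```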
Ensures any v* tag (semver or not) gets pushed with the exact tag name. Previously non-semver tags (e.g. v0.1-test) would only get sha-based tags, causing course image builds to fail when looking for the base image by tag. Also removes main repo restriction from pack-release trigger condition.
- Remove the unused ensure_system_group() function (replaced by load_groups)
- Add a Release Protection button for protected groups in EditGroupModal
- Add release_protection PATCH support in handlers.py
- Move RESERVED_KEYS to a module-level constant in EditGroupModal
- Update the info banner with sync timing and release protection docs
- Improve the lazy backfill comment in the admin groups API handler
- Add release_protection to the updateGroup API type signature
- Pre-create the native-users group via load_groups at startup
…mment
- Add try-except for json.loads in PATCH/POST/DELETE group handlers, returning 400 instead of 500 on malformed request bodies
- Clarify the native-user resource fallback as defensive code
GitHub-team groups now allow admins to add/remove members manually. This lets native users be added to GitHub-team groups to grant them the same resources. Synced GitHub members are still auto-managed (re-added/removed on next login).
- is_readonly_group() now only returns true for system groups
- Update UI placeholders and info banners to explain sync behavior
- Include unit tests for group protection and sync logic
Clarify that team data is captured at login and group membership is updated at spawn time. Both banners now use consistent wording.
- Add a POST /admin/api/groups/sync endpoint that fetches fresh GitHub teams for all users and syncs group memberships immediately
- Add a Sync Now button in the admin UI with a loading state and result summary
- Update the info banner to mention Sync Now as an alternative to waiting for user login/spawn
… filtering
The custom spawn.html template referenced a non-existent `spawner_options`
variable, so `window.AVAILABLE_RESOURCES` was never set. The React spawn
app then showed all resources to every user regardless of group membership.
Fix: `options_form()` now returns a `<script>` tag that injects
`AVAILABLE_RESOURCES` and `SINGLE_NODE_MODE`, and `spawn.html` renders
it via `{{ spawner_options_form | safe }}`.
Also includes UI polish:
- Dark mode: use outline-secondary for Sync Now / Manage Teams buttons
- Sync result alert auto-dismisses after 5 seconds
- Info banner is dismissible (persisted via localStorage) with a
toggle button to re-open it
- Replace db: object with db: Session (TYPE_CHECKING import)
- Add assert user.orm_user is not None before accessing .groups
- Add type: ignore[assignment] on Group.properties assignments (the SQLAlchemy Column type is not compatible with a plain dict)
- Add type: ignore[union-attr] on Group.properties.get() calls
Previously teams were cached in auth_state at login and reused on every spawn, so removing a user from a GitHub team would not take effect until they logged out and back in. Now auth_state_hook fetches fresh teams from GitHub at each spawn and updates the cache. refresh_user() also refreshes teams when proactively refreshing an expiring token.
…gration feat: unify GitHub Teams and JupyterHub Groups with resource isolation
k3s-uninstall.sh only handles its embedded containerd runtime. When k3s is configured with --docker, Pod containers appear in `docker ps` with a k8s_ prefix and are silently skipped by the uninstall script.

Add remove_k3s_docker_containers() to stop and remove all k8s_* Docker containers after k3s-uninstall.sh runs. The function lists the affected containers and prompts for confirmation before removing them.

Behaviour in non-interactive environments (CI/CD, pipes):
- stdin is not a TTY: skip automatically and print the manual cleanup command
- --yes / -y flag or AUPLC_YES=1: remove without prompting

Upstream issue: k3s-io/k3s#1469
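The non-interactive handling can be sketched as a standalone confirmation helper. The AUPLC_YES variable is from the commit; the function name, messages, and docker filter shown are illustrative:

```shell
# Decide whether container removal may proceed, honouring the
# non-interactive rules: AUPLC_YES=1 skips the prompt, a non-TTY stdin
# skips removal and prints a manual cleanup command.
confirm_removal() {
  if [ "${AUPLC_YES:-0}" = "1" ]; then
    return 0                                  # --yes / -y / AUPLC_YES=1
  fi
  if [ ! -t 0 ]; then                         # stdin is not a TTY
    echo "Non-interactive session; skipping. Clean up manually with:" >&2
    echo '  docker rm -f $(docker ps -aq --filter "name=k8s_")' >&2
    return 1
  fi
  read -r -p "Remove the listed containers? [y/N] " ans
  [ "$ans" = "y" ] || [ "$ans" = "Y" ]
}
```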
…ntainers Replace the 'name=k8s_' substring filter with a label-based filter on 'io.kubernetes.pod.name', which kubelet stamps on every container it creates via dockershim/cri-dockerd. This prevents accidentally matching user-created containers that happen to have 'k8s_' in their names.
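In bash, the label-based selection might look like this; the label key is from the commit, while wrapping it in a function is an assumption:

```shell
# List containers created by kubelet, matched by label rather than by
# name, so user containers with 'k8s_' in their names are never touched.
k8s_container_ids() {
  docker ps -aq --filter 'label=io.kubernetes.pod.name'
}

# Usage: k8s_container_ids | xargs -r docker rm -f
```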
…rs-cleanup fix(installer): clean up Docker containers left by k3s on uninstall
…ocket origins Add a native YAML-based configuration field `custom.allowedOrigins` to control allowed origins for notebook server WebSocket connections, replacing the need for raw Python injection via `hub.extraConfig`.
- ParsedConfig: add `allowedOrigins: list[str]` field
- HubConfig: expose `allowed_origins` property
- RemoteLabKubeSpawner: inject `--ServerApp.allow_origin_pat` (and `--ServerApp.allow_origin=*` when wildcard is set) into notebook server startup args at spawn time
- chart/values.yaml: document the new field under `custom`
- chart/values.schema.yaml + values.schema.json: add schema definition
- runtime/values.yaml: add commented usage examples
Replace the flat `custom.allowedOrigins` field with two clearly scoped fields:
- `custom.hub.allowedOrigins`: sets Access-Control-Allow-Origin on Hub HTTP responses via JupyterHub tornado_settings
- `custom.notebook.allowedOrigins`: injected into each notebook server's startup args via --ServerApp.allow_origin_pat, targeting kernel WebSocket connections
The naming avoids confusion with the z2jh top-level `singleuser` section. Schema updated in values.schema.yaml and values.schema.json.
feat(config): add hub.allowedOrigins and notebook.allowedOrigins for origin policy
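A hypothetical values.yaml excerpt using the two fields above; the field paths follow the commit message, while the origin value and comments are illustrative:

```yaml
custom:
  hub:
    allowedOrigins:
      - "https://lab.example.com"   # Access-Control-Allow-Origin on Hub responses
  notebook:
    allowedOrigins:
      - "https://lab.example.com"   # fed into --ServerApp.allow_origin_pat
```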
- Add Dashboard page to the Admin UI with Recharts line/pie charts and a Bootstrap table showing daily usage, resource distribution, and a top-users ranking
- Add Tailwind v4 with a tw: prefix to the Admin UI for modern card layouts without conflicting with Bootstrap 5
- Add a NavBar component shared across the Users/Groups/Dashboard pages
- Add shared TypeScript types and API helpers for the stats endpoints
- Add three stats API handlers: overview, usage time series, distribution breakdown
- Decouple UsageSession writes from the quota system: sessions are now always recorded regardless of quota_enabled, so all deployments get dashboard data by default
- New Dashboard page: date range picker, usage trend chart (daily/weekly), course usage ranking with average session duration, top users table
- Active Now panel: SSE-based live feed of running sessions (5 s refresh)
- NavBar: unified tab navigation across the Users/Groups/Dashboard pages; removed the redundant 'Manage Groups' / 'Back to Users' buttons
- stats_handlers: add StatsActiveSSEHandler, avg_minutes in distribution, ActiveSession type; remove the redundant REST active endpoint
- spawner stop(): fall back to session recovery when usage_session_id is lost after a Hub restart (finds the active DB session by username)
- dark mode: use Bootstrap semantic classes instead of the Tailwind dark variant
…dless of quota_enabled
…ion times
- Backend StatsHourlyHandler accepts tz_offset (minutes ahead of UTC) and applies a SQLite datetime offset before extracting the hour, so the distribution reflects the viewer's local time instead of server UTC
- Frontend getHourlyDistribution sends tz_offset derived from new Date().getTimezoneOffset() (negated to get the offset from UTC)
- Active session start_time is now parsed as UTC and displayed in the browser's local timezone using toLocaleString()
- Store accelerator display labels alongside session data so stats APIs surface both course and accelerator names
- Extend shared stats types to expose `accelerator_display`
- Update admin dashboard tables and the user detail modal to render friendly course + accelerator labels with fallbacks

The change ensures admins always see which course and accelerator a user is consuming, even while quota tracking still keys sessions by accelerator.
feat(admin-dashboard): add usage dashboard and improve admin insights
…ectivity The dummy0 interface adds a default route with metric 1000, which takes priority over the real network interface (e.g. WiFi at metric 20600), causing all traffic to route through a virtual interface with no actual connectivity. Remove the unnecessary default route from both the install step and the systemd service — dummy0 only needs to provide a stable IP for K3s node binding. Made-with: Cursor
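A dry-run sketch of the resulting dummy0 setup, with the default route deliberately absent. The address is illustrative; only the omission of `ip route add default` reflects the fix:

```shell
# Emit the commands that bring up dummy0 with a stable IP for K3s node
# binding. Note there is no `ip route add default ... dev dummy0` line,
# which previously shadowed the real uplink at metric 1000.
dummy0_setup_cmds() {
  cat <<'EOF'
ip link add dummy0 type dummy
ip addr add 192.0.2.10/32 dev dummy0
ip link set dummy0 up
EOF
}

# Usage: dummy0_setup_cmds | sudo sh   (or paste into a systemd unit)
```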