Bug/fix dummy0 route breaks network #49

Closed
KerwinTsaiii wants to merge 66 commits into main from bug/fix-dummy0-route-breaks-network

Conversation

@KerwinTsaiii
Collaborator

Summary

Changes

Testing

Files Changed

Checklist

  • Code follows project style guidelines
  • Changes are backward compatible
  • Tested on local Kubernetes cluster
  • Documentation links updated

MioYuuIH and others added 30 commits March 6, 2026 10:24
…options

Add CLI flag parsing (--gpu, --docker, --mirror, --mirror-pip, --mirror-npm)
as alternatives to environment variables. Flags can appear in any position
and override corresponding env vars.
feat(installer): support --key=value CLI flags
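The flag-override behaviour described above can be sketched as a small bash loop. The flag names come from the commit message; the function and variable names are assumptions:

```shell
#!/usr/bin/env bash
# Sketch: parse --key=value flags at any position; a flag overrides the
# matching environment variable. Non-flag arguments pass through in order.
parse_flags() {
  ARGS=()
  local arg
  for arg in "$@"; do
    case "$arg" in
      --gpu=*)        GPU="${arg#*=}" ;;
      --docker=*)     DOCKER="${arg#*=}" ;;
      --mirror=*)     MIRROR="${arg#*=}" ;;
      --mirror-pip=*) MIRROR_PIP="${arg#*=}" ;;
      --mirror-npm=*) MIRROR_NPM="${arg#*=}" ;;
      *)              ARGS+=("$arg") ;;
    esac
  done
}

GPU=gfx90a      # value that would have come from the environment
parse_flags install --gpu=gfx1151 --mirror-pip=https://pypi.example/simple
echo "GPU=$GPU CMD=${ARGS[0]}"   # -> GPU=gfx1151 CMD=install
```

The flag wins over the environment variable regardless of where it appears on the command line.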
Add gfx1201/rdna4/dgpu to resolve_gpu_config mapping (GPU_TARGET=gfx120x).
Add gfx120x to CI build-config.json (pytorch_whl: gfx120X-all).
ROCm tarball and PyTorch wheels confirmed available at repo.amd.com.
Add 'pack' command to create self-contained offline deployment bundles.
Support all 4 image source × deploy target combinations:

  install          build locally + deploy (existing)
  install --pull   pull from GHCR + deploy (new)
  pack             pull from GHCR + save to bundle (new)
  pack --local     build locally + save to bundle (new)

Offline bundles include K3s binary/images, Helm, K9s, ROCm device
plugin manifest, all container images, and Helm chart+values.
Auto-detected via manifest.json when running from bundle directory.

Add pack-bundle.yml CI workflow for manual bundle creation.
- Add IMAGE_REGISTRY env var (default: ghcr.io/amdresearch) for
  configurable image source in pack and install --pull
- Pack now exits with error if any custom or external image fails
  to pull, preventing incomplete bundles
- Add image_registry input to pack-bundle CI workflow
- Read IMAGE_REGISTRY from bundle manifest for offline installs
Support pulling images with non-default tag prefixes (e.g. develop-gfx1151
instead of latest-gfx1151). The IMAGE_TAG is stored in the bundle manifest
and restored on offline install. Default remains "latest".
…erlay

hub.image was incorrectly nested inside the custom.resources.images block,
causing the image metadata to be misinterpreted as a hub.image property and
triggering a Helm schema validation failure.
… for consistency

Both pull and local-build modes now save hub/default images with
:latest and :${IMAGE_TAG} tags, matching GPU image behavior.
This ensures values.local.yaml references always resolve regardless
of which IMAGE_TAG was used during pack.
A silent warning on import failure could leave the cluster with missing
images that cause pod failures at runtime. The installer now exits immediately
so the user sees a clear error instead of a mysteriously broken install.
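The fail-fast behaviour might look like the following sketch (the function name and error wording are assumptions):

```shell
# Sketch: abort on the first image that fails to import, instead of
# warning and continuing with an incomplete cluster.
import_bundle_images() {
  local tarball
  for tarball in "$@"; do
    if ! docker load -i "$tarball"; then
      echo "ERROR: failed to import $tarball; aborting install" >&2
      return 1
    fi
  done
}
```

Returning non-zero from the first failure lets the caller stop the whole install before pods are scheduled against missing images.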
- Remove redundant CUSTOM_IMAGES/IMAGES arrays; GPU_CUSTOM_NAMES and
  PLAIN_CUSTOM_NAMES are the single source of truth for image lists
- Fix typo: deply_aup_learning_cloud_runtime -> deploy_aup_learning_cloud_runtime
- Remove duplicate generate_values_overlay call in deploy function
  (orchestration now handled exclusively by callers)
- Remove unused check_root function; inline root check at entry points
  of deploy_all_components and pack_bundle
- Add missing section headers for Runtime Management group
- rt install/reinstall and legacy install-runtime now correctly call
  detect/get_paths/generate_overlay before deploy
- Merge gpu_target + gpu_type into single gpu_type choice; installer
  derives GPU_TARGET internally via resolve_gpu_config
- Add rdna4 option (gfx120x) to match upstream installer support
- image_tag now defaults to current branch name (github.ref_name)
  so develop branch packs use 'develop' tag automatically
- Use env: block instead of inline var prefix for cleaner CI syntax
- Remove root check from pack_bundle; pack only needs docker/wget,
  not root access (install still requires root)
github.ref_name for feature branches contains '/' (e.g. feature/offline-pack)
which is invalid in Docker tags. Replace '/' with '-' when using branch
name as default IMAGE_TAG.
Branch names like 'feature/offline-pack' are invalid Docker tags.
Both the workflow and pack_bundle now auto-replace '/' with '-'
so no manual sanitization is needed by the caller.
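The sanitization amounts to a single substitution; shown here as a standalone helper (the function name is an assumption):

```shell
# Docker tags may not contain '/', so branch names like feature/offline-pack
# must be rewritten before being used as IMAGE_TAG.
sanitize_tag() {
  echo "${1//\//-}"     # replace every '/' with '-'
}
IMAGE_TAG="$(sanitize_tag "feature/offline-pack")"
echo "$IMAGE_TAG"       # -> feature-offline-pack
```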
- Add workflow_run trigger: fires after 'Build Docker Images' completes,
  ensuring all images (hub, base, courses) are built before packing starts
- pack-release job: matrix over all 4 GPU types, only runs on v* tags
  pushed to AMDResearch/aup-learning-cloud (main repo guard)
- pack-release attaches bundles to the existing GitHub Release
- pack-manual job: unchanged workflow_dispatch flow for manual testing
- Fix tar SIGPIPE false error in verify step (2>/dev/null)
… layers

Course images (cv/dl/llm/physim) all share auplc-base layers. Saving them
separately caused those layers to be written N times. A single docker save
call with all image refs deduplicates shared layers automatically, reducing
bundle size significantly.
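The single-save approach can be sketched as follows; the registry and image-name pattern are assumptions, and the docker command is echoed rather than executed:

```shell
# Sketch: collect all course image refs and save them in ONE docker save
# call so shared auplc-base layers are written to the tarball only once.
REGISTRY="${IMAGE_REGISTRY:-ghcr.io/amdresearch}"
IMAGE_TAG="${IMAGE_TAG:-latest}"
refs=()
for course in cv dl llm physim; do
  refs+=("$REGISTRY/auplc-$course:$IMAGE_TAG")
done
# docker deduplicates layers shared between refs within a single save:
echo docker save -o courses.tar "${refs[@]}"
```

Saving each image in its own `docker save` call would serialize the shared base layers once per image, which is what inflated the bundle.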
Ensures any v* tag (semver or not) gets pushed with the exact tag name.
Previously non-semver tags (e.g. v0.1-test) would only get sha-based tags,
causing course image builds to fail when looking for the base image by tag.

Also removes main repo restriction from pack-release trigger condition.
MioYuuIH and others added 29 commits March 11, 2026 10:33
- Remove unused ensure_system_group() function (replaced by load_groups)
- Add Release Protection button for protected groups in EditGroupModal
- Add release_protection PATCH support in handlers.py
- Move RESERVED_KEYS to module-level constant in EditGroupModal
- Update info banner with sync timing and release protection docs
- Improve lazy backfill comment in admin groups API handler
- Add release_protection to updateGroup API type signature
- Pre-create native-users group via load_groups at startup
…mment

- Add try-except for json.loads in PATCH/POST/DELETE group handlers,
  returning 400 instead of 500 on malformed request bodies
- Clarify the native-user resource fallback as defensive code
GitHub-team groups now allow admins to add/remove members manually.
This lets native users be added to GitHub-team groups to grant them
the same resources. Synced GitHub members are still auto-managed
(re-added/removed on next login).

- is_readonly_group() now only returns true for system groups
- Update UI placeholders and info banners to explain sync behavior
- Include unit tests for group protection and sync logic
Clarify that team data is captured at login and group membership
is updated at spawn time. Both banners now use consistent wording.
- Add POST /admin/api/groups/sync endpoint that fetches fresh GitHub
  teams for all users and syncs group memberships immediately
- Add Sync Now button in admin UI with loading state and result summary
- Update info banner to mention Sync Now as an alternative to waiting
  for user login/spawn
… filtering

The custom spawn.html template referenced a non-existent `spawner_options`
variable, so `window.AVAILABLE_RESOURCES` was never set. The React spawn
app then showed all resources to every user regardless of group membership.

Fix: `options_form()` now returns a `<script>` tag that injects
`AVAILABLE_RESOURCES` and `SINGLE_NODE_MODE`, and `spawn.html` renders
it via `{{ spawner_options_form | safe }}`.

Also includes UI polish:
- Dark mode: use outline-secondary for Sync Now / Manage Teams buttons
- Sync result alert auto-dismisses after 5 seconds
- Info banner is dismissible (persisted via localStorage) with a
  toggle button to re-open it
- Replace db: object with db: Session (TYPE_CHECKING import)
- Add assert user.orm_user is not None before accessing .groups
- Add type: ignore[assignment] on Group.properties assignments
  (SQLAlchemy Column type not compatible with plain dict)
- Add type: ignore[union-attr] on Group.properties.get() calls
Previously teams were cached in auth_state at login and reused on every
spawn, so removing a user from a GitHub team would not take effect until
they logged out and back in.

Now auth_state_hook fetches fresh teams from GitHub at each spawn and
updates the cache. refresh_user() also refreshes teams when proactively
refreshing an expiring token.
…gration

feat: unify GitHub Teams and JupyterHub Groups with resource isolation
k3s-uninstall.sh only handles its embedded containerd runtime. When k3s
is configured with --docker, Pod containers appear in `docker ps` with
a k8s_ prefix and are silently skipped by the uninstall script.

Add remove_k3s_docker_containers() to stop and remove all k8s_* Docker
containers after k3s-uninstall.sh runs. The function lists affected
containers and prompts for confirmation before removing. Behaviour in
non-interactive environments (CI/CD, pipes):
- stdin is not a TTY: skip automatically, print manual cleanup command
- --yes / -y flag or AUPLC_YES=1: remove without prompting

Upstream issue: k3s-io/k3s#1469
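The confirmation logic described above might be structured like this (function name, prompt text, and the manual-cleanup hint are assumptions):

```shell
# Sketch: decide whether to proceed with container removal.
#   AUPLC_YES=1 (or --yes)  -> proceed without prompting
#   stdin not a TTY          -> skip, print a manual cleanup command
#   otherwise                -> interactive y/N prompt
confirm_removal() {
  if [ "${AUPLC_YES:-0}" = "1" ]; then
    return 0
  fi
  if [ ! -t 0 ]; then
    echo "Non-interactive shell; skipping container cleanup. Run manually:" >&2
    echo "  docker rm -f \$(docker ps -aq --filter name=k8s_)" >&2
    return 1
  fi
  read -r -p "Remove leftover k8s_* containers? [y/N] " reply
  [ "$reply" = "y" ] || [ "$reply" = "Y" ]
}
```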
…ntainers

Replace the 'name=k8s_' substring filter with a label-based filter on
'io.kubernetes.pod.name', which kubelet stamps on every container it
creates via dockershim/cri-dockerd. This prevents accidentally matching
user-created containers that happen to have 'k8s_' in their name.
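Assuming the function shape, the label-based listing can be sketched as:

```shell
# Sketch: enumerate only containers that kubelet stamped with the pod-name
# label, so user containers that merely contain 'k8s_' in their names are
# never matched.
list_k3s_containers() {
  docker ps -aq --filter "label=io.kubernetes.pod.name"
}
```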
…rs-cleanup

fix(installer): clean up Docker containers left by k3s on uninstall
…ocket origins

Add a native YAML-based configuration field `custom.allowedOrigins` to control
allowed origins for notebook server WebSocket connections, replacing the need
for raw Python injection via `hub.extraConfig`.

- ParsedConfig: add `allowedOrigins: list[str]` field
- HubConfig: expose `allowed_origins` property
- RemoteLabKubeSpawner: inject `--ServerApp.allow_origin_pat` (and
  `--ServerApp.allow_origin=*` when wildcard is set) into notebook server
  startup args at spawn time
- chart/values.yaml: document new field under `custom`
- chart/values.schema.yaml + values.schema.json: add schema definition
- runtime/values.yaml: add commented usage examples
Replace the flat `custom.allowedOrigins` field with two clearly scoped fields:

- `custom.hub.allowedOrigins`: sets Access-Control-Allow-Origin on Hub HTTP
  responses via JupyterHub tornado_settings
- `custom.notebook.allowedOrigins`: injected into each notebook server's startup
  args via --ServerApp.allow_origin_pat, targeting kernel WebSocket connections

Naming avoids confusion with the z2jh top-level `singleuser` section.
Schema updated in values.schema.yaml and values.schema.json.
feat(config): add hub.allowedOrigins and notebook.allowedOrigins for origin policy
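A hypothetical values overlay using the two fields might look like this (host names are placeholders):

```yaml
# Hypothetical values.local.yaml snippet based on the field names above
custom:
  hub:
    allowedOrigins:            # Access-Control-Allow-Origin on Hub responses
      - "https://lab.example.edu"
  notebook:
    allowedOrigins:            # --ServerApp.allow_origin_pat for WebSockets
      - "https://lab.example.edu"
      - "https://*.example.edu"
```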
- Add Dashboard page to Admin UI with Recharts line/pie charts and
  Bootstrap table showing daily usage, resource distribution, and
  top users ranking
- Add Tailwind v4 with tw: prefix to Admin UI for modern card layouts
  without conflicting with Bootstrap 5
- Add NavBar component shared across Users/Groups/Dashboard pages
- Add shared TypeScript types and API helpers for stats endpoints
- Add three stats API handlers: overview, usage time series,
  distribution breakdown
- Decouple UsageSession writes from quota system: sessions are now
  always recorded regardless of quota_enabled, so all deployments
  get dashboard data by default
- New Dashboard page: date range picker, usage trend chart (daily/weekly),
  course usage ranking with avg session duration, top users table
- Active Now panel: SSE-based live feed of running sessions (5s refresh)
- NavBar: unified tab navigation across Users/Groups/Dashboard pages;
  removed redundant 'Manage Groups' / 'Back to Users' buttons
- stats_handlers: add StatsActiveSSEHandler, avg_minutes in distribution,
  ActiveSession type; remove redundant REST active endpoint
- spawner stop(): fallback session recovery when usage_session_id lost
  after Hub restart (finds active DB session by username)
- dark mode: use Bootstrap semantic classes instead of Tailwind dark variant
…ion times

- Backend StatsHourlyHandler accepts tz_offset (minutes ahead of UTC) and
  applies SQLite datetime offset before extracting hour, so the distribution
  reflects the viewer's local time instead of server UTC
- Frontend getHourlyDistribution sends tz_offset derived from
  new Date().getTimezoneOffset() (negated to get offset from UTC)
- Active session start_time is now parsed as UTC and displayed in the
  browser's local timezone using toLocaleString()
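The tz_offset hour shift (minutes ahead of UTC) reduces to modular arithmetic; a minimal sketch, with the function name assumed:

```shell
# Sketch: shift a UTC hour by tz_offset minutes and wrap into 0..23,
# mirroring the SQLite datetime offset applied server-side. The inner
# "+ 1440" keeps the modulo non-negative for west-of-UTC offsets.
local_hour() {
  local utc_hour=$1 tz_offset=$2     # tz_offset: minutes ahead of UTC
  echo $(( ((utc_hour * 60 + tz_offset) % 1440 + 1440) % 1440 / 60 ))
}
local_hour 23 480    # UTC 23:00 viewed from UTC+8 -> 7
```

On the frontend, `new Date().getTimezoneOffset()` returns minutes *behind* UTC, hence the negation mentioned above.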
- store accelerator display labels alongside session data so stats APIs surface both course and accelerator names
- extend shared stats types to expose `accelerator_display`
- update admin dashboard tables and user detail modal to render friendly course + accelerator labels with fallbacks

The change ensures admins always see which course and accelerator a user is consuming, even while quota tracking still keys sessions by accelerator.
feat(admin-dashboard): add usage dashboard and improve admin insights
…ectivity

The dummy0 interface adds a default route with metric 1000, which takes
priority over the real network interface (e.g. WiFi at metric 20600),
causing all traffic to route through a virtual interface with no actual
connectivity. Remove the unnecessary default route from both the install
step and the systemd service — dummy0 only needs to provide a stable IP
for K3s node binding.

Made-with: Cursor
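The corrected setup can be sketched as a dry run (the address is a placeholder; `ip` commands are echoed rather than executed):

```shell
# Sketch of the fixed dummy0 setup: the interface still carries a stable
# IP for K3s node binding, but no default route is installed.
setup_dummy0() {
  echo "ip link add dummy0 type dummy"
  echo "ip addr add $1/32 dev dummy0"
  echo "ip link set dummy0 up"
  # Removed: "ip route add default dev dummy0 metric 1000"
  # That route outranked the real interface (e.g. WiFi at metric 20600)
  # and blackholed all traffic; dummy0 needs no route at all.
}
setup_dummy0 10.53.0.1   # hypothetical stable node IP
```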