Bug/fix dummy0 route breaks network #49

Closed
KerwinTsaiii wants to merge 66 commits into main from bug/fix-dummy0-route-breaks-network

Conversation

@KerwinTsaiii
Collaborator

Summary

Changes

Testing

Files Changed

Checklist

  • Code follows project style guidelines
  • Changes are backward compatible
  • Tested on local Kubernetes cluster
  • Documentation links updated

MioYuuIH and others added 30 commits March 6, 2026 10:24
…options

Add CLI flag parsing (--gpu, --docker, --mirror, --mirror-pip, --mirror-npm)
as alternatives to environment variables. Flags can appear in any position
and override corresponding env vars.
feat(installer): support --key=value CLI flags
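The flag-override behaviour described above can be sketched as a small bash loop. The flag names come from the commit message; the function and variable names are assumptions:

```shell
#!/usr/bin/env bash
# Sketch: parse --key=value flags at any position; a flag overrides the
# matching environment variable. Non-flag arguments pass through in order.
parse_flags() {
  ARGS=()
  local arg
  for arg in "$@"; do
    case "$arg" in
      --gpu=*)        GPU="${arg#*=}" ;;
      --docker=*)     DOCKER="${arg#*=}" ;;
      --mirror=*)     MIRROR="${arg#*=}" ;;
      --mirror-pip=*) MIRROR_PIP="${arg#*=}" ;;
      --mirror-npm=*) MIRROR_NPM="${arg#*=}" ;;
      *)              ARGS+=("$arg") ;;
    esac
  done
}

GPU=gfx90a      # value that would have come from the environment
parse_flags install --gpu=gfx1151 --mirror-pip=https://pypi.example/simple
echo "GPU=$GPU CMD=${ARGS[0]}"   # -> GPU=gfx1151 CMD=install
```

The flag wins over the environment variable regardless of where it appears on the command line.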
Add gfx1201/rdna4/dgpu to resolve_gpu_config mapping (GPU_TARGET=gfx120x).
Add gfx120x to CI build-config.json (pytorch_whl: gfx120X-all).
ROCm tarball and PyTorch wheels confirmed available at repo.amd.com.
Add 'pack' command to create self-contained offline deployment bundles.
Support all 4 image source × deploy target combinations:

  install          build locally + deploy (existing)
  install --pull   pull from GHCR + deploy (new)
  pack             pull from GHCR + save to bundle (new)
  pack --local     build locally + save to bundle (new)

Offline bundles include K3s binary/images, Helm, K9s, ROCm device
plugin manifest, all container images, and Helm chart+values.
Auto-detected via manifest.json when running from bundle directory.

Add pack-bundle.yml CI workflow for manual bundle creation.
- Add IMAGE_REGISTRY env var (default: ghcr.io/amdresearch) for
  configurable image source in pack and install --pull
- Pack now exits with error if any custom or external image fails
  to pull, preventing incomplete bundles
- Add image_registry input to pack-bundle CI workflow
- Read IMAGE_REGISTRY from bundle manifest for offline installs
Support pulling images with non-default tag prefixes (e.g. develop-gfx1151
instead of latest-gfx1151). The IMAGE_TAG is stored in the bundle manifest
and restored on offline install. Default remains "latest".
…erlay

hub.image was incorrectly nested inside the custom.resources.images block,
causing the image metadata to be misinterpreted as a hub.image property and
triggering a Helm schema validation failure.
… for consistency

Both pull and local-build modes now save hub/default images with
:latest and :${IMAGE_TAG} tags, matching GPU image behavior.
This ensures values.local.yaml references always resolve regardless
of which IMAGE_TAG was used during pack.
A silent warning on import failure could leave the cluster with missing
images that cause pod failures at runtime. The installer now exits immediately
so the user sees a clear error instead of a mysteriously broken install.
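The fail-fast behaviour might look like the following sketch (the function name and error wording are assumptions):

```shell
# Sketch: abort on the first image that fails to import, instead of
# warning and continuing with an incomplete cluster.
import_bundle_images() {
  local tarball
  for tarball in "$@"; do
    if ! docker load -i "$tarball"; then
      echo "ERROR: failed to import $tarball; aborting install" >&2
      return 1
    fi
  done
}
```

Returning non-zero from the first failure lets the caller stop the whole install before pods are scheduled against missing images.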
- Remove redundant CUSTOM_IMAGES/IMAGES arrays; GPU_CUSTOM_NAMES and
  PLAIN_CUSTOM_NAMES are the single source of truth for image lists
- Fix typo: deply_aup_learning_cloud_runtime -> deploy_aup_learning_cloud_runtime
- Remove duplicate generate_values_overlay call in deploy function
  (orchestration now handled exclusively by callers)
- Remove unused check_root function; inline root check at entry points
  of deploy_all_components and pack_bundle
- Add missing section headers for Runtime Management group
- rt install/reinstall and legacy install-runtime now correctly call
  detect/get_paths/generate_overlay before deploy
- Merge gpu_target + gpu_type into single gpu_type choice; installer
  derives GPU_TARGET internally via resolve_gpu_config
- Add rdna4 option (gfx120x) to match upstream installer support
- image_tag now defaults to current branch name (github.ref_name)
  so develop branch packs use 'develop' tag automatically
- Use env: block instead of inline var prefix for cleaner CI syntax
- Remove root check from pack_bundle; pack only needs docker/wget,
  not root access (install still requires root)
github.ref_name for feature branches contains '/' (e.g. feature/offline-pack)
which is invalid in Docker tags. Replace '/' with '-' when using branch
name as default IMAGE_TAG.
Branch names like 'feature/offline-pack' are invalid Docker tags.
Both the workflow and pack_bundle now auto-replace '/' with '-'
so no manual sanitization is needed by the caller.
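The sanitization amounts to a single substitution; shown here as a standalone helper (the function name is an assumption):

```shell
# Docker tags may not contain '/', so branch names like feature/offline-pack
# must be rewritten before being used as IMAGE_TAG.
sanitize_tag() {
  echo "${1//\//-}"     # replace every '/' with '-'
}
IMAGE_TAG="$(sanitize_tag "feature/offline-pack")"
echo "$IMAGE_TAG"       # -> feature-offline-pack
```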
- Add workflow_run trigger: fires after 'Build Docker Images' completes,
  ensuring all images (hub, base, courses) are built before packing starts
- pack-release job: matrix over all 4 GPU types, only runs on v* tags
  pushed to AMDResearch/aup-learning-cloud (main repo guard)
- pack-release attaches bundles to the existing GitHub Release
- pack-manual job: unchanged workflow_dispatch flow for manual testing
- Fix tar SIGPIPE false error in verify step (2>/dev/null)
… layers

Course images (cv/dl/llm/physim) all share auplc-base layers. Saving them
separately caused those layers to be written N times. A single docker save
call with all image refs deduplicates shared layers automatically, reducing
bundle size significantly.
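The single-save approach can be sketched as follows; the registry and image-name pattern are assumptions, and the docker command is echoed rather than executed:

```shell
# Sketch: collect all course image refs and save them in ONE docker save
# call so shared auplc-base layers are written to the tarball only once.
REGISTRY="${IMAGE_REGISTRY:-ghcr.io/amdresearch}"
IMAGE_TAG="${IMAGE_TAG:-latest}"
refs=()
for course in cv dl llm physim; do
  refs+=("$REGISTRY/auplc-$course:$IMAGE_TAG")
done
# docker deduplicates layers shared between refs within a single save:
echo docker save -o courses.tar "${refs[@]}"
```

Saving each image in its own `docker save` call would serialize the shared base layers once per image, which is what inflated the bundle.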
Ensures any v* tag (semver or not) gets pushed with the exact tag name.
Previously non-semver tags (e.g. v0.1-test) would only get sha-based tags,
causing course image builds to fail when looking for the base image by tag.

Also removes main repo restriction from pack-release trigger condition.
MioYuuIH and others added 29 commits March 11, 2026 10:33
- Remove unused ensure_system_group() function (replaced by load_groups)
- Add Release Protection button for protected groups in EditGroupModal
- Add release_protection PATCH support in handlers.py
- Move RESERVED_KEYS to module-level constant in EditGroupModal
- Update info banner with sync timing and release protection docs
- Improve lazy backfill comment in admin groups API handler
- Add release_protection to updateGroup API type signature
- Pre-create native-users group via load_groups at startup
…mment

- Add try-except for json.loads in PATCH/POST/DELETE group handlers,
  returning 400 instead of 500 on malformed request bodies
- Clarify the native-user resource fallback as defensive code
GitHub-team groups now allow admins to add/remove members manually.
This lets native users be added to GitHub-team groups to grant them
the same resources. Synced GitHub members are still auto-managed
(re-added/removed on next login).

- is_readonly_group() now only returns true for system groups
- Update UI placeholders and info banners to explain sync behavior
- Include unit tests for group protection and sync logic
Clarify that team data is captured at login and group membership
is updated at spawn time. Both banners now use consistent wording.
- Add POST /admin/api/groups/sync endpoint that fetches fresh GitHub
  teams for all users and syncs group memberships immediately
- Add Sync Now button in admin UI with loading state and result summary
- Update info banner to mention Sync Now as an alternative to waiting
  for user login/spawn
… filtering

The custom spawn.html template referenced a non-existent `spawner_options`
variable, so `window.AVAILABLE_RESOURCES` was never set. The React spawn
app then showed all resources to every user regardless of group membership.

Fix: `options_form()` now returns a `<script>` tag that injects
`AVAILABLE_RESOURCES` and `SINGLE_NODE_MODE`, and `spawn.html` renders
it via `{{ spawner_options_form | safe }}`.

Also includes UI polish:
- Dark mode: use outline-secondary for Sync Now / Manage Teams buttons
- Sync result alert auto-dismisses after 5 seconds
- Info banner is dismissible (persisted via localStorage) with a
  toggle button to re-open it
- Replace db: object with db: Session (TYPE_CHECKING import)
- Add assert user.orm_user is not None before accessing .groups
- Add type: ignore[assignment] on Group.properties assignments
  (SQLAlchemy Column type not compatible with plain dict)
- Add type: ignore[union-attr] on Group.properties.get() calls
Previously teams were cached in auth_state at login and reused on every
spawn, so removing a user from a GitHub team would not take effect until
they logged out and back in.

Now auth_state_hook fetches fresh teams from GitHub at each spawn and
updates the cache. refresh_user() also refreshes teams when proactively
refreshing an expiring token.
…gration

feat: unify GitHub Teams and JupyterHub Groups with resource isolation
k3s-uninstall.sh only handles its embedded containerd runtime. When k3s
is configured with --docker, Pod containers appear in `docker ps` with
a k8s_ prefix and are silently skipped by the uninstall script.

Add remove_k3s_docker_containers() to stop and remove all k8s_* Docker
containers after k3s-uninstall.sh runs. The function lists affected
containers and prompts for confirmation before removing. Behaviour in
non-interactive environments (CI/CD, pipes):
- stdin is not a TTY: skip automatically, print manual cleanup command
- --yes / -y flag or AUPLC_YES=1: remove without prompting

Upstream issue: k3s-io/k3s#1469
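The confirmation logic described above might be structured like this (function name, prompt text, and the manual-cleanup hint are assumptions):

```shell
# Sketch: decide whether to proceed with container removal.
#   AUPLC_YES=1 (or --yes)  -> proceed without prompting
#   stdin not a TTY          -> skip, print a manual cleanup command
#   otherwise                -> interactive y/N prompt
confirm_removal() {
  if [ "${AUPLC_YES:-0}" = "1" ]; then
    return 0
  fi
  if [ ! -t 0 ]; then
    echo "Non-interactive shell; skipping container cleanup. Run manually:" >&2
    echo "  docker rm -f \$(docker ps -aq --filter name=k8s_)" >&2
    return 1
  fi
  read -r -p "Remove leftover k8s_* containers? [y/N] " reply
  [ "$reply" = "y" ] || [ "$reply" = "Y" ]
}
```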
…ntainers

Replace the 'name=k8s_' substring filter with a label-based filter on
'io.kubernetes.pod.name', which kubelet stamps on every container it
creates via dockershim/cri-dockerd. This prevents accidentally matching
user-created containers that happen to have 'k8s_' in their name.
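Assuming the function shape, the label-based listing can be sketched as:

```shell
# Sketch: enumerate only containers that kubelet stamped with the pod-name
# label, so user containers that merely contain 'k8s_' in their names are
# never matched.
list_k3s_containers() {
  docker ps -aq --filter "label=io.kubernetes.pod.name"
}
```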
…rs-cleanup

fix(installer): clean up Docker containers left by k3s on uninstall
…ocket origins

Add a native YAML-based configuration field `custom.allowedOrigins` to control
allowed origins for notebook server WebSocket connections, replacing the need
for raw Python injection via `hub.extraConfig`.

- ParsedConfig: add `allowedOrigins: list[str]` field
- HubConfig: expose `allowed_origins` property
- RemoteLabKubeSpawner: inject `--ServerApp.allow_origin_pat` (and
  `--ServerApp.allow_origin=*` when wildcard is set) into notebook server
  startup args at spawn time
- chart/values.yaml: document new field under `custom`
- chart/values.schema.yaml + values.schema.json: add schema definition
- runtime/values.yaml: add commented usage examples
Replace the flat `custom.allowedOrigins` field with two clearly scoped fields:

- `custom.hub.allowedOrigins`: sets Access-Control-Allow-Origin on Hub HTTP
  responses via JupyterHub tornado_settings
- `custom.notebook.allowedOrigins`: injected into each notebook server's startup
  args via --ServerApp.allow_origin_pat, targeting kernel WebSocket connections

Naming avoids confusion with the z2jh top-level `singleuser` section.
Schema updated in values.schema.yaml and values.schema.json.
feat(config): add hub.allowedOrigins and notebook.allowedOrigins for origin policy
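A hypothetical values overlay using the two fields might look like this (host names are placeholders):

```yaml
# Hypothetical values.local.yaml snippet based on the field names above
custom:
  hub:
    allowedOrigins:            # Access-Control-Allow-Origin on Hub responses
      - "https://lab.example.edu"
  notebook:
    allowedOrigins:            # --ServerApp.allow_origin_pat for WebSockets
      - "https://lab.example.edu"
      - "https://*.example.edu"
```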
- Add Dashboard page to Admin UI with Recharts line/pie charts and
  Bootstrap table showing daily usage, resource distribution, and
  top users ranking
- Add Tailwind v4 with tw: prefix to Admin UI for modern card layouts
  without conflicting with Bootstrap 5
- Add NavBar component shared across Users/Groups/Dashboard pages
- Add shared TypeScript types and API helpers for stats endpoints
- Add three stats API handlers: overview, usage time series,
  distribution breakdown
- Decouple UsageSession writes from quota system: sessions are now
  always recorded regardless of quota_enabled, so all deployments
  get dashboard data by default
- New Dashboard page: date range picker, usage trend chart (daily/weekly),
  course usage ranking with avg session duration, top users table
- Active Now panel: SSE-based live feed of running sessions (5s refresh)
- NavBar: unified tab navigation across Users/Groups/Dashboard pages;
  removed redundant 'Manage Groups' / 'Back to Users' buttons
- stats_handlers: add StatsActiveSSEHandler, avg_minutes in distribution,
  ActiveSession type; remove redundant REST active endpoint
- spawner stop(): fallback session recovery when usage_session_id lost
  after Hub restart (finds active DB session by username)
- dark mode: use Bootstrap semantic classes instead of Tailwind dark variant
…ion times

- Backend StatsHourlyHandler accepts tz_offset (minutes ahead of UTC) and
  applies SQLite datetime offset before extracting hour, so the distribution
  reflects the viewer's local time instead of server UTC
- Frontend getHourlyDistribution sends tz_offset derived from
  new Date().getTimezoneOffset() (negated to get offset from UTC)
- Active session start_time is now parsed as UTC and displayed in the
  browser's local timezone using toLocaleString()
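The tz_offset hour shift (minutes ahead of UTC) reduces to modular arithmetic; a minimal sketch, with the function name assumed:

```shell
# Sketch: shift a UTC hour by tz_offset minutes and wrap into 0..23,
# mirroring the SQLite datetime offset applied server-side. The inner
# "+ 1440" keeps the modulo non-negative for west-of-UTC offsets.
local_hour() {
  local utc_hour=$1 tz_offset=$2     # tz_offset: minutes ahead of UTC
  echo $(( ((utc_hour * 60 + tz_offset) % 1440 + 1440) % 1440 / 60 ))
}
local_hour 23 480    # UTC 23:00 viewed from UTC+8 -> 7
```

On the frontend, `new Date().getTimezoneOffset()` returns minutes *behind* UTC, hence the negation mentioned above.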
- store accelerator display labels alongside session data so stats APIs surface both course and accelerator names
- extend shared stats types to expose `accelerator_display`
- update admin dashboard tables and user detail modal to render friendly course + accelerator labels with fallbacks

The change ensures admins always see which course and accelerator a user is consuming, even while quota tracking still keys sessions by accelerator.
feat(admin-dashboard): add usage dashboard and improve admin insights
…ectivity

The dummy0 interface adds a default route with metric 1000, which takes
priority over the real network interface (e.g. WiFi at metric 20600),
causing all traffic to route through a virtual interface with no actual
connectivity. Remove the unnecessary default route from both the install
step and the systemd service — dummy0 only needs to provide a stable IP
for K3s node binding.

Made-with: Cursor
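The corrected setup can be sketched as a dry run (the address is a placeholder; `ip` commands are echoed rather than executed):

```shell
# Sketch of the fixed dummy0 setup: the interface still carries a stable
# IP for K3s node binding, but no default route is installed.
setup_dummy0() {
  echo "ip link add dummy0 type dummy"
  echo "ip addr add $1/32 dev dummy0"
  echo "ip link set dummy0 up"
  # Removed: "ip route add default dev dummy0 metric 1000"
  # That route outranked the real interface (e.g. WiFi at metric 20600)
  # and blackholed all traffic; dummy0 needs no route at all.
}
setup_dummy0 10.53.0.1   # hypothetical stable node IP
```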