Skip to content

[EPIC] Folder quota follow-ups #24

@Vicente-Cheng

Description

@Vicente-Cheng

Tracking issue for follow-up work that emerged during the folder-quota PR (#23) but was deliberately deferred to keep that PR focused. Sub-issues will be opened for each item below as we pick them up; this is the master list.

Related: #22, #23


🔴 Security

Sweep all FSAL methods for handle-reuse + symlink-swap path traversal

PR #23 added a validate_path() re-check in LocalFilesystem::write() and LocalFilesystem::setattr_size() (PR comments #39 / #40), but the same exposure exists in every method that calls resolve_handle() and then operates on the returned path:

  • read()
  • setattr_mode() — particularly dangerous: arbitrary chmod outside the export
  • setattr_owner() — particularly dangerous: arbitrary chown outside the export
  • commit()
  • link() (source side)
  • readlink(), readdir(), getattr()
  • symlink(), mknod()

Attack: client looks up pvc-a/file, retains the handle, replaces the path with a symlink pointing outside the export, then reuses the handle — OpenOptions::open follows the symlink and operates on the wrong file.

Fix: add self.validate_path(&path)?; immediately after resolve_handle() in every affected method. Mostly mechanical (~10 sites).

Tighten the validate_path → open TOCTOU

validate_path and the subsequent open are not atomic — a sufficiently fast attacker can swap a symlink in between. Mitigations to evaluate:

  • O_NOFOLLOW on opens that should never traverse a link
  • openat2(RESOLVE_NO_SYMLINKS | RESOLVE_BENEATH) on Linux ≥ 5.6
  • nix or libc wrapper as needed

🟡 Correctness

Cross-PVC hard link accounting

link() does not run through write/create's quota path, so a hard link can be created across PVCs (or to a file in an untracked area) without the target PVC being charged. After the link, all writes through any name share the same inode blocks but are charged to whichever PVC the writer's path resolves to. Decide on policy:

  • (a) refuse to create cross-PVC hard links when quota is enabled; or
  • (b) charge the destination PVC at link time (and refund only on the last unlink — already tightened via refundable_bytes_on_unlink).

PR #23 already handles the unlink side correctly via nlink checks (PR comment #38).

Atomic check-and-reserve API

check_quota + add_usage are two separate calls, so concurrent writers can both pass the pre-check and then both account, briefly exceeding the limit by up to (N-1) × max_write_size. Documented in check_quota's doc-comment as an accepted limitation for the PVC use case (PR comments #7, #27). If we ever need strict enforcement:

  • add try_add_usage(dir, bytes) that holds the write lock across check + increment + persist
  • callers (write, setattr_size, rename) switch to it
  • needs a refund-on-write-failure path

Fail-closed mode for quota persistence failures

PR #23 chose fail-open after add_usage persist failures — log a warning and let the FS operation succeed (PR comments #16, #19, #22). Good for transient redb hiccups, but if redb is genuinely broken the quota silently stops enforcing until a restart.

Optional follow-up: a quota.fail_mode = \"open\" | \"closed\" config knob, plus a health flag that flips the manager into fail-closed after N consecutive persist failures (with a clear log line so ops notices).


🔵 Performance

Per-PVC lock granularity

Today QuotaManager serialises every quota mutation on a single RwLock<HashMap> write lock held across the redb commit. Fine for one or a few PVCs; could be a bottleneck for high-fan-out deployments. Options:

  • dashmap::DashMap for the in-memory cache (sharded locks under the hood)
  • hand-rolled HashMap<String, Arc<Mutex<QuotaEntry>>> with the outer map only locked for entry insert/remove

Note that redb itself is single-writer, so this only relieves the cache layer. If that becomes the bottleneck, the next step is per-PVC redb files or a backend that supports concurrent writers.


🟢 Operability / Features

Periodic background reconciliation

Reconciliation currently runs once at startup. Out-of-band changes (e.g. an admin running cp on the host) only get repaired at the next restart. Add a timer that re-runs reconcile_all on a configurable interval (e.g. quota.reconcile_interval_secs).

Admin CLI / runtime quota management

Quota limits today can only be set via the [quota.bootstrap] config block, which is idempotent (only inserts when missing). Missing operator capabilities:

  • update an existing PVC's limit at runtime
  • query current usage
  • remove a quota entry (the remove_quota API exists but has no entry point)
  • trigger reconciliation on demand

Form factor: a separate arcticwolf-quota binary, or an admin RPC over a unix socket.

Telemetry

No external observability for quota state right now. Suggest a Prometheus / OpenMetrics endpoint exporting per-PVC limit_bytes, usage_bytes, and usage_ratio, so ops can alert on PVCs approaching their limit.

Quota visibility to NFS clients

FSSTAT already returns quota-aware byte counts (PR #23 covered this), which df on a mounted PVC honours. Could go further with NFSv4 dacl extensions or an out-of-band query mechanism, but that's a much bigger lift and probably not worth it for NFSv3.


🟠 Tests / Docs

Extend integration tests

tests/test_nfs_quota.sh covers under-limit write, over-limit EDQUOT, remove-then-write, truncate-then-write, and df. Missing end-to-end coverage:

  • cross-PVC rename
  • hard link behaviour (stays charged across the first unlink, refunds on the last)
  • symlink-swap rejection (mirrors of the new unit tests added in PR Add folder quota #23)

nfstest_posix --runtest=read,write is also a small subset of the POSIX suite — broaden over time as quota-relevant areas stabilise.

Operator documentation

Add a dedicated docs/quota.md:

  • configuration reference, including the requirement that db_path lives outside the export tree
  • sizing guidance (limits land on 4 KiB block boundaries on tmpfs/ext4 — values that are not block-aligned will round)
  • known limitations: bounded over-shoot under concurrency, hard-link cross-PVC behaviour, interaction with underlying OS quotas (XFS project quotas etc.)
  • reconciliation behaviour and how to invoke it (once we add a manual trigger)

Non-4 KiB filesystem support

Quota unit tests use BLOCK = 4096 and assume tmpfs/ext4 page size. On filesystems with different allocation units (ZFS records, btrfs CoW, network FS) the exact comparisons would need to be relaxed to ranges. Currently called out in a code comment; would be good to actually exercise on at least XFS in CI.

Metadata

Metadata

Assignees

Labels

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions