
Conversation

@devin-ai-integration

Tracking issue

Related to internal request from carlos@exa.ai

Link to Devin session: https://app.devin.ai/sessions/32b236a03baa490dac3abf979f08667d

Why are the changes needed?

This adds automated NFS storage cleanup to prevent unbounded growth of stale data on shared storage volumes across two different cluster environments (exa-cluster and cirrascale).

What changes were proposed in this pull request?

Added a new example workflow (examples/nfs_ttl_cleanup.py) that implements TTL-based directory cleanup for NFS mounts with the following features:

Core functionality:

  • Scans top-level directories in a configurable base path
  • Deletes directories where ALL files have not been accessed within the TTL period (default: 28 days/4 weeks)
  • Uses file access time (st_atime) to determine staleness (see the sketch below)
  • Returns statistics on deleted/skipped directories
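
For reference, a minimal sketch of the staleness check described above; the actual should_delete_directory() in the example may differ in details (e.g. how empty directories or symlinks are handled):

```python
import os
import time


def should_delete_directory(path: str, ttl_seconds: int) -> bool:
    """Return True only if no file under `path` has been accessed within ttl_seconds."""
    cutoff = time.time() - ttl_seconds
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                if os.stat(os.path.join(root, name)).st_atime > cutoff:
                    return False  # at least one file was accessed recently
            except FileNotFoundError:
                continue  # file disappeared mid-scan; ignore it
    return True  # every file is older than the TTL (an empty directory also returns True here)
```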

Dual-cluster support:

  • exa-cluster: Uses PVC mount (nfs-pvc)
  • cirrascale: Uses direct NFS mount (172.18.72.200:/export/metaphor)
  • Each cluster has its own task with appropriate pod templates and node selectors (illustrated below)
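
A rough illustration of how the per-cluster mounts and node selectors could be expressed with flytekit's PodTemplate. The PVC name, NFS server, and node-selector labels are the ones listed in this PR; the mount path /mnt/nfs and the container wiring are assumptions, and the actual pod templates in the example may differ:

```python
from flytekit import PodTemplate
from kubernetes.client import (
    V1Container,
    V1NFSVolumeSource,
    V1PersistentVolumeClaimVolumeSource,
    V1PodSpec,
    V1Volume,
    V1VolumeMount,
)

# exa-cluster: mount the existing PVC
exa_pod_template = PodTemplate(
    pod_spec=V1PodSpec(
        node_selector={"cluster": "exa-cluster"},
        containers=[
            V1Container(
                name="primary",
                volume_mounts=[V1VolumeMount(name="nfs", mount_path="/mnt/nfs")],
            )
        ],
        volumes=[
            V1Volume(
                name="nfs",
                persistent_volume_claim=V1PersistentVolumeClaimVolumeSource(claim_name="nfs-pvc"),
            )
        ],
    )
)

# cirrascale: mount the NFS export directly
cirrascale_pod_template = PodTemplate(
    pod_spec=V1PodSpec(
        node_selector={"cluster": "cirrascale"},
        containers=[
            V1Container(
                name="primary",
                volume_mounts=[V1VolumeMount(name="nfs", mount_path="/mnt/nfs")],
            )
        ],
        volumes=[
            V1Volume(
                name="nfs",
                nfs=V1NFSVolumeSource(server="172.18.72.200", path="/export/metaphor"),
            )
        ],
    )
)
```

Each template would typically be attached to its cluster-specific task via @task(pod_template=...).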

Launch plans:

  • Two launch plans scheduled to run daily at midnight UTC (see the sketch below)
  • Configurable TTL (default 4 weeks), base path, and dry-run mode
  • Resource allocation: 4 CPU, 8Gi memory per task
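
Roughly, the scheduling, resources, and defaults might be wired up as follows (a sketch only: the workflow/task names, Dict return type, and input names are assumptions, not necessarily what the example uses):

```python
from typing import Dict

from flytekit import CronSchedule, LaunchPlan, Resources, task, workflow


@task(requests=Resources(cpu="4", mem="8Gi"), limits=Resources(cpu="4", mem="8Gi"))
def cleanup_nfs(base_path: str, ttl_days: int, dry_run: bool) -> Dict[str, int]:
    # Placeholder body: the real task scans base_path and deletes stale top-level directories.
    return {"deleted": 0, "skipped": 0}


@workflow
def nfs_ttl_cleanup(base_path: str = "/mnt/nfs", ttl_days: int = 28, dry_run: bool = False) -> Dict[str, int]:
    return cleanup_nfs(base_path=base_path, ttl_days=ttl_days, dry_run=dry_run)


# One launch plan per cluster; only the exa-cluster one is shown here.
exa_lp = LaunchPlan.get_or_create(
    workflow=nfs_ttl_cleanup,
    name="nfs_ttl_cleanup_exa_cluster",
    schedule=CronSchedule(schedule="0 0 * * *"),  # daily at midnight UTC
    default_inputs={"ttl_days": 28, "dry_run": False},
)
```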

Documentation:

  • Comprehensive README with usage examples, safety considerations, and configuration options

How was this patch tested?

⚠️ Limited testing performed: Python syntax validation only. Full runtime testing was not possible because the required dependencies were missing in the development environment.

What was verified:

  • Python syntax validation passes
  • Code structure follows existing flytekit examples (based on monorepo patterns)

What needs verification:

  1. Cluster-specific values (NFS server IP, PVC names, paths) match your production environment
  2. Deletion logic correctly identifies stale directories
  3. File access time (st_atime) is reliably updated on your NFS mounts
  4. Resource allocations are appropriate
  5. Running as root (UID 0) on cirrascale is acceptable for your security requirements

Recommended testing approach:

  1. First run with dry_run=True to validate what would be deleted (see the example below)
  2. Test on a non-production NFS mount
  3. Verify with a small TTL value (e.g., 1 day) before deploying with 28-day default
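
As a concrete starting point, a local dry run could look like this (module, workflow, and parameter names are hypothetical and should be adjusted to the actual example file):

```python
# Local smoke test before registering: dry run against a scratch directory with a short TTL.
from examples.nfs_ttl_cleanup import nfs_ttl_cleanup  # hypothetical import path

stats = nfs_ttl_cleanup(base_path="/tmp/nfs-test", ttl_days=1, dry_run=True)
print(stats)  # e.g. {"deleted": 0, "skipped": 3}
```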

Check all the applicable boxes

  • I updated the documentation accordingly.
  • All new and existing tests passed. (No tests added; this is an example workflow.)
  • All commits are signed-off.

Human Review Checklist

Please carefully review the following potentially dangerous aspects:

  1. Deletion safety: The workflow permanently deletes directories. Verify the logic in should_delete_directory() is correct.

  2. Hardcoded values: Confirm these match your infrastructure:

    • NFS server: 172.18.72.200:/export/metaphor (cirrascale)
    • PVC name: nfs-pvc (exa-cluster)
    • Node selectors: cluster: "exa-cluster" and cluster: "cirrascale"
  3. File access time reliability: The workflow uses st_atime. This may not work correctly if NFS is mounted with noatime or similar options (a quick probe is sketched after this checklist).

  4. Scope limitation: Only top-level directories in base_path are checked, not nested subdirectories.

  5. Error handling: Permission errors cause directories to be skipped (marked as "active"). Is this the desired behavior?

  6. Security: The cirrascale task runs as root (UID 0, GID 0). Verify this is acceptable.

  7. Schedule & TTL: Default is daily at midnight UTC with 28-day TTL. Confirm these values are appropriate.
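
For point 3, a probe like the following can help confirm whether reads update st_atime on a given mount (a diagnostic sketch; the target path is a placeholder, and relatime may defer updates even when atime is enabled):

```python
import os
import time


def atime_updates_on_read(path: str) -> bool:
    """Write a probe file, read it back, and report whether its access time moved forward."""
    probe = os.path.join(path, ".atime-probe")
    with open(probe, "w") as f:
        f.write("probe")
    before = os.stat(probe).st_atime
    time.sleep(2)  # ensure a measurable gap
    with open(probe) as f:
        f.read()
    after = os.stat(probe).st_atime
    os.remove(probe)
    return after > before


# Example: atime_updates_on_read("/mnt/nfs") on the mounted NFS volume.
```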

- Add workflow to clean up old directories from NFS storage based on TTL
- Support for two clusters: exa-cluster (PVC mount) and cirrascale (direct NFS mount)
- Configurable TTL (default: 4 weeks / 28 days)
- Daily scheduled execution at midnight UTC
- Dry run mode for testing before actual deletion
- Comprehensive documentation in README

The workflow scans directories and deletes those where all files
haven't been accessed within the TTL period. Two launch plans are
configured for the different clusters with appropriate NFS mounting
and node selection.

Co-Authored-By: carlos@exa.ai <carlos@exa.ai>
@devin-ai-integration
Author

Original prompt from carlos
Received message in Slack channel #devin-land:

@Devin

• write a flyte workflow with two launchplans, one for exa-cluster and one for cirrascale
• it should correctly mount the corresponding NFS and select the correct cluster
• you should be able to define a TTL and default to 4 weeks
• it should delete all directories where all files haven't been accessed in that TTL period
• make it run every day

Thread URL: https://T0274T299K6.slack.com/archives/C090Z3VH487/p1760486286506579?thread_ts=1760486286.506579

@devin-ai-integration
Author

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

  • Address comments on this PR that start with 'DevinAI' or '@devin'.
  • Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.

⚙️ Control Options:

  • Disable automatic comment and CI monitoring
