A sophisticated Discord Bot designed to monitor High-Performance Computing (HPC) clusters managed by Slurm. It bridges the gap between complex terminal outputs and user-friendly visualizations, utilizing Google Gemini AI to provide human-readable summaries and Matplotlib for deep analytics.
- Gemini 2.5 Integration: Instead of raw numbers, get intelligent summaries like "huk120 is wide open with 128GB RAM" or "⚠️ High CPU Load on Partition Alto".
- Smart Parsing: Converts raw `sinfo`/`squeue` data into concise, emoji-coded updates.
- `/history` (Stacked Area Chart): Visualizes the cluster's Capacity vs Usage over the last 24 hours. Categories: 🟢 Idle, 🟡 Mixed, 🔴 Allocated, ⚫ Down.
- `/heatmap` (Utilization Grid): A temporal heatmap showing the exact state of every node over time. Perfect for spotting stuck nodes or usage patterns.
- `/status` (Dashboard): Instant traffic-light view of all partitions and nodes.
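The "smart parsing" step can be sketched as follows. This is a minimal illustration, not the bot's actual code: the function name, the emoji mapping, and the assumed `sinfo -h -o "%n %t"` field layout (node name, state) are all assumptions for the example.

```python
# Minimal sketch: fold raw `sinfo` output into emoji-coded state counts.
# Assumes output from `sinfo -h -o "%n %t"`; names here are illustrative.
from collections import Counter

STATE_EMOJI = {"idle": "🟢", "mix": "🟡", "alloc": "🔴", "down": "⚫"}

def summarize_sinfo(raw: str) -> dict:
    """Count nodes per Slurm state from `sinfo -h -o "%n %t"` output."""
    counts = Counter()
    for line in raw.strip().splitlines():
        _node, state = line.split()
        # Slurm appends flags like "idle*" or "down~"; strip the suffix.
        counts[state.rstrip("*~#!%$@^-")] += 1
    return {STATE_EMOJI.get(s, "❓"): n for s, n in counts.items()}

raw = "huk120 idle\nhuk121 alloc\nhuk122 mix\nhuk123 idle*\n"
print(summarize_sinfo(raw))  # → {'🟢': 2, '🔴': 1, '🟡': 1}
```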
- Deep Inspection: SSHs directly into nodes to get real-time hardware stats (Exact RAM/CPU).
- Context Aware:
- If Busy: Tells you who is running what job and for how long.
- If Idle: Tells you how long it has been idle (e.g., "Idle since 14:30").
- Job Completion: Pings you the moment your specific job finishes or crashes.
- Auto-Discovery: Automatically finds new partitions and nodes—no configuration needed.
- Resilience: Handles SSH timeouts, Bastion jumps, and connection drops gracefully.
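The resilience behavior can be sketched with a simple retry-with-backoff wrapper. The helper name and the backoff constants below are illustrative assumptions, not the bot's actual implementation (which uses Fabric for the SSH layer):

```python
# Sketch of the retry/backoff idea used for flaky SSH links.
# `with_retries` and its parameters are illustrative assumptions.
import time

def with_retries(fn, attempts=3, base_delay=1.0):
    """Call fn(); on timeout or drop, wait base_delay * 2**i and retry."""
    for i in range(attempts):
        try:
            return fn()
        except (TimeoutError, ConnectionError):
            if i == attempts - 1:
                raise  # out of retries: surface the error to the caller
            time.sleep(base_delay * 2 ** i)

calls = {"n": 0}
def flaky():
    # Fails twice, then succeeds — simulates a dropped SSH connection.
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("ssh timed out")
    return "ok"

print(with_retries(flaky, base_delay=0.01))  # → ok
```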
The bot is built on a modular Cogs architecture to ensure scalability and robustness.
```mermaid
graph TD
    User((User)) -->|Slash Command| Discord[Discord API]
    Discord -->|Interaction| Bot[Bot Entry Point]

    subgraph "🤖 Slurm Bot Core"
        Bot --> Loader{Cogs Loader}
        Loader -->|Load| CogMon[Slurm Mon Cog]
        Loader -->|Load| CogCmd[Commands Cog]
        Loader -->|Load| CogAna[Analytics Cog]
        CogMon -->|State Tracking| Data[(History Data)]
        CogAna -->|Read & Plot| Data
        CogMon -->|Summarize| Gemini[Google Gemini AI]
    end

    subgraph "HPC Infrastructure"
        CogCmd -->|Request Info| Client[Slurm Client]
        CogMon -->|Poll Status| Client
        Client -->|SSH Tunnel| Bastion[Bastion Host]
        Bastion -->|SSH| Head[Head Node]
        Head -->|sinfo/squeue| Compute[Compute Nodes]
    end

    CogAna -->|Upload Graph| Discord
    Gemini -->|Summary Text| Discord
```
```
server-notification/
├── bot_entry.py           # 🚀 Main Entry Point (Loads Cogs & Starts Bot)
├── deploy.py              # Deployment Automation (Updates & Restarts Systemd)
├── utils/
│   └── slurm_client.py    # SSH Client (Context Managers, Retries, Parsing)
├── cogs/
│   ├── slurm_mon.py       # 🔄 Background Loop (Polling, AI Logic, Alerts)
│   ├── analytics.py       # Data Science (Pandas, Matplotlib, Heatmaps)
│   └── commands.py        # 💬 Slash Commands (/status, /inspect, etc.)
├── data/
│   ├── history.csv        # Aggregate Stats (for /history)
│   └── node_history.jsonl # 📜 Granular Node Data (for /heatmap)
├── requirements.txt       # Dependencies (Pandas, Fabric, Discord.py)
└── .env                   # 🔑 Configuration Secrets
```
Create a `.env` file in the root directory (template: `.env.example`).
| Variable | Description |
|---|---|
| `SSH_PASSWORD_HUK` | Password for the Head Node. |
| `SSH_PASSWORD_BASTIAO` | Password for the Bastion Host. |
| `DISCORD_BOT_TOKEN` | Token from the Discord Developer Portal. |
| `GEMINI_API_KEY` | Google AI Studio key for intelligent summaries. |
| `CHECK_INTERVAL` | Polling frequency in seconds (default: 300). |
| `TARGET_CLUSTER_USER` | Slurm username to track for job alerts. |
| `DISCORD_USER_ID` | Your Discord ID for personal pings. |
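A minimal sketch of how a subset of these variables might be read at startup. This version uses only `os.getenv`; the function name and which variables count as "required" are assumptions for illustration, though the variable names and the 300-second default come from the table above:

```python
# Sketch: read a subset of the documented settings with fallbacks.
# Only CHECK_INTERVAL has a documented default (300 s); treating the
# others as required here is an assumption.
import os

def load_config() -> dict:
    cfg = {
        "check_interval": int(os.getenv("CHECK_INTERVAL", "300")),
        "discord_token": os.getenv("DISCORD_BOT_TOKEN"),
        "gemini_key": os.getenv("GEMINI_API_KEY"),
        "target_user": os.getenv("TARGET_CLUSTER_USER"),
    }
    missing = [k for k, v in cfg.items() if v is None]
    if missing:
        raise RuntimeError(f"missing required settings: {missing}")
    return cfg
```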
```bash
git clone <repo-url>
cd server-notification
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```

Use the included automation script to update code, install dependencies, and restart the service:

```bash
python3 deploy.py
```

| Command | Description |
|---|---|
| /status | 🟢 Visual dashboard of all partitions. |
| /queue | 📜 Leaderboard of active jobs and users. |
| /history | 📈 Stacked Area Chart of cluster capacity (24h). |
| /heatmap | 🔥 Temporal heatmap of node utilization. |
| /inspect node | 🕵️ Deep dive into a specific node (CPU/RAM/Job). |
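As a sketch of the deep-inspection step: `scontrol show node <name>` does emit `key=value` pairs, but which fields the bot actually reads, and the parser below, are assumptions for illustration (real output also contains values with embedded spaces, which this sketch ignores):

```python
# Sketch: parse the key=value pairs from `scontrol show node <name>`.
# CPUTot, CPULoad, RealMemory and State are real Slurm field names;
# which ones the bot uses is an assumption.
def parse_scontrol(raw: str) -> dict:
    fields = {}
    for token in raw.split():
        if "=" in token:
            key, _, value = token.partition("=")
            fields[key] = value
    return fields

raw = "NodeName=huk120 CPUTot=32 CPULoad=0.05 RealMemory=128000 State=IDLE"
info = parse_scontrol(raw)
print(info["State"], info["RealMemory"])  # → IDLE 128000
```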
This project is architected to evolve into a Model Context Protocol (MCP) Server.
By exposing SlurmClient as a tool, external Agents (like Claude or ChatGPT) could:
- Read the cluster state (`get_node_states`).
- Reason about resource availability ("Huk120 is free and has 256GB RAM").
- Act by scheduling jobs optimally (`sbatch`).
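As a sketch, exposing `get_node_states` as an MCP-style tool could start from a declarative description like the one below. The schema shape follows the MCP convention of JSON-Schema tool definitions, but the tool name, fields, and the `partition` parameter are hypothetical:

```python
# Hypothetical MCP tool declaration for the SlurmClient wrapper.
# The inputSchema is JSON Schema, as MCP tool definitions use;
# all names here are illustrative, not an existing API.
GET_NODE_STATES_TOOL = {
    "name": "get_node_states",
    "description": "Return the current Slurm state of every node "
                   "(idle/mixed/allocated/down) plus free CPU and RAM.",
    "inputSchema": {
        "type": "object",
        "properties": {
            "partition": {
                "type": "string",
                "description": "Optional partition filter.",
            },
        },
        "required": [],
    },
}
```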