A distributed job scheduling system designed for academic research computing workloads. Features priority-based fair-share scheduling, resource management, and concurrent job execution.
This project is a REST API that simulates a job scheduling system for research computing clusters. It implements scheduling algorithms including fair-share resource allocation, priority-based queuing, and automatic resource matching.
Built as a learning project to demonstrate:
- RESTful API design and implementation
- Distributed systems concepts (scheduling, resource management)
- Database design and SQL optimization
- JWT authentication and authorization
- Concurrent programming with goroutines
- Infrastructure software development
- RESTful API for job submission and management
- JWT Authentication with secure token generation and validation
- Priority-based scheduling with configurable job priorities
- Fair-share algorithm - ensures equitable resource distribution across research groups
- Resource matching - automatically matches jobs to workers with sufficient CPU, memory, and GPU
- Concurrent execution - runs multiple jobs simultaneously with configurable limits
- Real-time monitoring - track job status (pending → running → completed/failed)
- User isolation - users can only view and manage their own jobs
- Usage tracking - logs CPU hours for fair-share calculations
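The token-generation feature above can be illustrated with a standard-library sketch of HS256 signing. This mirrors what a library such as golang-jwt does internally; the claim names and secret here are examples, not the project's actual API.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/base64"
	"encoding/json"
	"fmt"
)

// signHS256 builds a JWT-style token (header.payload.signature) using
// HMAC-SHA256 over base64url-encoded segments, which is what HS256
// JWT libraries do under the hood.
func signHS256(claims map[string]any, secret []byte) (string, error) {
	enc := base64.RawURLEncoding
	header := enc.EncodeToString([]byte(`{"alg":"HS256","typ":"JWT"}`))
	body, err := json.Marshal(claims)
	if err != nil {
		return "", err
	}
	payload := enc.EncodeToString(body)
	mac := hmac.New(sha256.New, secret)
	mac.Write([]byte(header + "." + payload))
	sig := enc.EncodeToString(mac.Sum(nil))
	return header + "." + payload + "." + sig, nil
}

func main() {
	// Example claims; a real token would also carry exp/iat timestamps.
	tok, _ := signHS256(map[string]any{"user_id": 2, "is_admin": false}, []byte("dev-secret"))
	fmt.Println(tok)
}
```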
The scheduler uses a multi-factor priority calculation:
final_priority = base_priority × fair_share_multiplier × wait_time_boost
Where:
- base_priority: User + group priority (1-10)
- fair_share_multiplier: quota / actual_usage (prevents resource hogging)
- wait_time_boost: 1 + (wait_minutes / 60 * 0.01) (prevents starvation)
Example:
- Group A used 90% of quota → fair_share ≈ 1.11 (slight boost)
- Group B used 50% of quota → fair_share = 2.0 (high boost)
- Job waiting 10 hours → wait_boost = 1.10
- Result: Group B's jobs get scheduled first, older jobs gradually gain priority
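As a minimal sketch, the calculation above could look like this in Go. The function name, signature, and zero-usage guard are illustrative assumptions, not the project's actual code:

```go
package main

import (
	"fmt"
	"math"
)

// EffectivePriority computes final_priority = base × fair_share × wait_boost.
// basePriority is the combined user+group priority (1-10); quota and usage
// are the group's CPU-hour quota and actual usage; waitMinutes is queue time.
func EffectivePriority(basePriority, quota, usage, waitMinutes float64) float64 {
	// fair_share_multiplier = quota / actual_usage; the max() guard against
	// zero recorded usage is an assumption for this sketch.
	fairShare := quota / math.Max(usage, 1)
	// wait_time_boost = 1 + (wait_minutes / 60 * 0.01)
	waitBoost := 1 + (waitMinutes/60)*0.01
	return basePriority * fairShare * waitBoost
}

func main() {
	// Group A: 90 of 100 CPU hours used, job just submitted
	fmt.Printf("%.2f\n", EffectivePriority(5, 100, 90, 0)) // 5.56
	// Group B: 50 of 100 used, job has waited 10 hours
	fmt.Printf("%.2f\n", EffectivePriority(5, 100, 50, 600)) // 11.00
}
```

Note how the under-quota group's job outranks the heavy user's even at equal base priority, and the wait boost grows slowly enough that fairness dominates until a job has queued for many hours.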
┌─────────────┐      HTTP/REST     ┌──────────────┐
│   Client    │ ─────────────────> │  API Server  │
│   (curl,    │ <───────────────── │   (Go/Gin)   │
│   Postman)  │        JSON        └──────┬───────┘
└─────────────┘                           │
                                          │
                  ┌───────────────────────┼──────────────────┐
                  ▼                       ▼                  ▼
          ┌──────────────┐        ┌──────────────┐     ┌──────────┐
          │  PostgreSQL  │        │  Scheduler   │     │   File   │
          │   Database   │        │  (Goroutine) │     │ Storage  │
          │              │        │              │     │          │
          │  - Users     │        │  - Priority  │     │ - Logs   │
          │  - Jobs      │        │  - Matching  │     │ - Output │
          │  - Groups    │        │  - Fair-share│     └──────────┘
          │  - Workers   │        │  - Executor  │
          └──────────────┘        └──────────────┘
API Server (Go + Gin)
- Handles HTTP requests and responses
- JWT authentication middleware
- Request validation and error handling
- Routes: `/auth`, `/jobs`, `/queue`, `/admin`
Scheduler (Background Goroutine)
- Runs every 30 seconds (configurable)
- Fetches pending jobs from database
- Calculates priorities using fair-share algorithm
- Matches jobs to available workers
- Starts job execution and tracks completion
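The matching step above can be sketched as a first-fit search over available workers. The types and field names here are illustrative; the real scheduler's placement policy may differ:

```go
package main

import "fmt"

// Resources describes a job's requirements or a worker's free capacity.
type Resources struct {
	CPUCores int
	MemoryGB int
	GPUCount int
}

// Worker is a compute node tracked by the scheduler.
type Worker struct {
	ID     int
	Free   Resources
	Status string // idle/busy/offline
}

// matchWorker returns the first idle worker with enough free CPU, memory,
// and GPU for the job, or nil if no worker fits.
func matchWorker(job Resources, workers []Worker) *Worker {
	for i := range workers {
		w := &workers[i]
		if w.Status != "idle" {
			continue
		}
		if w.Free.CPUCores >= job.CPUCores &&
			w.Free.MemoryGB >= job.MemoryGB &&
			w.Free.GPUCount >= job.GPUCount {
			return w
		}
	}
	return nil
}

func main() {
	workers := []Worker{
		{ID: 1, Free: Resources{4, 16, 0}, Status: "idle"},
		{ID: 2, Free: Resources{16, 64, 2}, Status: "idle"},
	}
	// An 8-core, 32 GB, 1-GPU job skips worker 1 and lands on worker 2.
	if w := matchWorker(Resources{8, 32, 1}, workers); w != nil {
		fmt.Println("matched worker", w.ID)
	}
}
```

First-fit keeps the loop O(workers) per job; a best-fit variant would instead pick the fitting worker that leaves the least slack.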
Database (PostgreSQL)
- Stores users, groups, jobs, workers
- Tracks resource usage for fair-share
- ACID transactions for job state changes
- Indexed for fast queries
| Component | Technology | Purpose |
|---|---|---|
| Language | Go 1.21+ | High-performance, concurrent programming |
| Web Framework | Gin | Fast HTTP routing and middleware |
| Database | PostgreSQL 15 | Relational data storage with ACID guarantees |
| Authentication | JWT (golang-jwt) | Stateless authentication |
| Password Hashing | bcrypt | Secure password storage |
| Containerization | Docker | Database isolation and portability |
| API Design | REST | Standard HTTP methods and status codes |
- Go 1.21 or higher - Install Go
- PostgreSQL 15 or higher - Via Docker (recommended) or local install
- Docker (optional but recommended) - Install Docker
- Git - For cloning the repository
- curl or Postman - For testing API endpoints
git clone https://github.com/YOUR_USERNAME/research-compute-queue.git
cd research-compute-queue
go mod download

Option A: Using Docker (Recommended)
# Start PostgreSQL container
docker run --name research-queue-db \
-e POSTGRES_PASSWORD=dev123 \
-e POSTGRES_DB=research_queue \
-p 5432:5432 \
-d postgres:15
# Verify it's running
docker ps

Option B: Local PostgreSQL
# macOS with Homebrew
brew install postgresql@15
brew services start postgresql@15
createdb research_queue
# Ubuntu/Debian
sudo apt install postgresql-15
sudo systemctl start postgresql
sudo -u postgres createdb research_queue

# Using Docker
docker exec -i research-queue-db psql -U postgres -d research_queue < scripts/setup_db.sql
# Using local PostgreSQL
psql -U postgres -d research_queue -f scripts/setup_db.sql

You should see:
CREATE TABLE
CREATE TABLE
CREATE TABLE
...
INSERT 0 3
INSERT 0 3
# Copy example config
cp .env.example .env
# Edit .env with your values
# Make sure DATABASE_URL matches your setup

.env file:
DATABASE_URL=postgres://postgres:dev123@localhost:5432/research_queue?sslmode=disable
PORT=8080
ENVIRONMENT=development
JWT_SECRET=your-secret-key-change-in-production
JWT_EXPIRY_HOURS=24
SCHEDULER_INTERVAL_SECONDS=30
MAX_CONCURRENT_JOBS=10
LOG_DIRECTORY=./logs
OUTPUT_DIRECTORY=./output

go run cmd/server/main.go

Expected output:
========================================
Research Compute Queue API
Environment: development
========================================
✓ Database connection established
✓ JWT manager initialized
✓ Directories created
✓ Scheduler started (interval: 30s, max concurrent: 10)
✓ API server starting on port 8080
========================================
System is ready!
API: http://localhost:8080
Press Ctrl+C to stop
========================================
http://localhost:8080
All /api/jobs endpoints require a valid JWT token in the Authorization header:
Authorization: Bearer <your_jwt_token>
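Extracting the token from that header is a small step worth doing defensively. A stdlib sketch (the helper name is illustrative, not the project's middleware):

```go
package main

import (
	"fmt"
	"strings"
)

// bearerToken extracts the token from an Authorization header value.
// It returns false when the Bearer scheme is missing or the token is empty,
// so the middleware can reject the request with 401.
func bearerToken(authHeader string) (string, bool) {
	const prefix = "Bearer "
	if !strings.HasPrefix(authHeader, prefix) {
		return "", false
	}
	token := strings.TrimSpace(strings.TrimPrefix(authHeader, prefix))
	return token, token != ""
}

func main() {
	tok, ok := bearerToken("Bearer eyJhbGciOiJIUzI1NiJ9.x.y")
	fmt.Println(tok, ok)
}
```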
Check API Status
GET /health

Response:
{
"status": "healthy",
"message": "Research Compute Queue API is running",
"version": "1.0.0"
}

POST /api/auth/register
Content-Type: application/json
{
"email": "user@example.com",
"password": "securepassword123",
"group_id": 1
}

Response:
{
"message": "User registered successfully",
"user_id": 2
}

POST /api/auth/login
Content-Type: application/json
{
"email": "user@example.com",
"password": "securepassword123"
}

Response:
{
"message": "Login successful",
"token": "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9...",
"user": {
"id": 2,
"email": "user@example.com",
"group_id": 1,
"is_admin": false
}
}

POST /api/jobs
Authorization: Bearer <token>
Content-Type: application/json
{
"script": "python train_model.py --epochs 100",
"cpu_cores": 8,
"memory_gb": 32,
"gpu_count": 1,
"estimated_hours": 4.5,
"priority": 3
}

Response:
{
"message": "Job submitted successfully",
"job_id": 1,
"status": "pending"
}

GET /api/jobs/{job_id}
Authorization: Bearer <token>

Response:
{
"id": 1,
"user_id": 2,
"group_id": 1,
"script": "python train_model.py --epochs 100",
"cpu_cores": 8,
"memory_gb": 32,
"gpu_count": 1,
"status": "running",
"priority": 3,
"submitted_at": "2026-01-08T15:30:00Z",
"started_at": "2026-01-08T15:30:30Z",
"completed_at": null,
"worker_id": 2
}

GET /api/jobs?status=running&limit=10
Authorization: Bearer <token>

Query Parameters:
- `status` (optional): Filter by status (`pending`, `running`, `completed`, `failed`, `cancelled`)
- `limit` (optional): Maximum number of results (default: 50)
Response:
{
"jobs": [
{
"id": 1,
"status": "running",
"script": "python train_model.py",
"cpu_cores": 8,
"submitted_at": "2026-01-08T15:30:00Z"
}
],
"count": 1
}

DELETE /api/jobs/{job_id}
Authorization: Bearer <token>

Response:
{
"message": "Job cancelled successfully",
"job_id": 1
}

Save this as test.sh:
#!/bin/bash
API="http://localhost:8080"
echo "=== Testing Research Compute Queue API ==="
# 1. Health check
echo -e "\n1. Health Check:"
curl -s $API/health | jq
# 2. Register user
echo -e "\n2. Register User:"
curl -s -X POST $API/api/auth/register \
-H "Content-Type: application/json" \
-d '{"email":"test@example.com","password":"password123","group_id":1}' | jq
# 3. Login and get token
echo -e "\n3. Login:"
TOKEN=$(curl -s -X POST $API/api/auth/login \
-H "Content-Type: application/json" \
-d '{"email":"test@example.com","password":"password123"}' \
| jq -r '.token')
echo "Token: ${TOKEN:0:50}..."
# 4. Submit job
echo -e "\n4. Submit Job:"
JOB_ID=$(curl -s -X POST $API/api/jobs \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{"script":"python test.py","cpu_cores":4,"memory_gb":16,"priority":3}' \
| jq -r '.job_id')
echo "Created Job ID: $JOB_ID"
# 5. Get job status
echo -e "\n5. Get Job Status:"
curl -s $API/api/jobs/$JOB_ID \
-H "Authorization: Bearer $TOKEN" | jq
# 6. List all jobs
echo -e "\n6. List All Jobs:"
curl -s "$API/api/jobs" \
-H "Authorization: Bearer $TOKEN" | jq
echo -e "\n=== Test Complete ==="

Run tests:
chmod +x test.sh
./test.sh

1. Register and Login:
# Register
curl -X POST http://localhost:8080/api/auth/register \
-H "Content-Type: application/json" \
-d '{"email":"alice@wisc.edu","password":"password123","group_id":1}'
# Login and save token
export TOKEN=$(curl -s -X POST http://localhost:8080/api/auth/login \
-H "Content-Type: application/json" \
-d '{"email":"alice@wisc.edu","password":"password123"}' \
  | grep -o '"token":"[^"]*' | cut -d'"' -f4)

2. Submit and Monitor Jobs:
# Submit job
curl -X POST http://localhost:8080/api/jobs \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{
"script": "python train.py",
"cpu_cores": 8,
"memory_gb": 32,
"gpu_count": 1,
"priority": 5
}'
# List all jobs
curl http://localhost:8080/api/jobs \
-H "Authorization: Bearer $TOKEN"
# Get specific job
curl http://localhost:8080/api/jobs/1 \
-H "Authorization: Bearer $TOKEN"
# Filter by status
curl "http://localhost:8080/api/jobs?status=running" \
-H "Authorization: Bearer $TOKEN"

research-compute-queue/
├── cmd/
│   └── server/
│       └── main.go            # Application entry point
├── internal/
│   ├── api/
│   │   ├── handlers/          # HTTP request handlers
│   │   │   ├── auth.go        # Registration & login
│   │   │   ├── jobs.go        # Job management
│   │   │   └── health.go      # Health check
│   │   ├── middleware/        # HTTP middleware
│   │   │   ├── auth.go        # JWT validation
│   │   │   └── logging.go     # Request logging
│   │   └── router.go          # Route definitions
│   ├── auth/
│   │   └── jwt.go             # JWT token generation/validation
│   ├── models/                # Data structures
│   │   ├── user.go            # User & Group models
│   │   └── job.go             # Job models
│   ├── database/              # Database operations
│   │   └── postgres.go        # PostgreSQL connection
│   ├── scheduler/             # Job scheduling logic
│   │   ├── scheduler.go       # Main scheduler loop
│   │   ├── priority.go        # Priority calculation
│   │   ├── matcher.go         # Resource matching
│   │   └── executor.go        # Job execution
│   └── config/
│       └── config.go          # Configuration loading
├── scripts/
│   └── setup_db.sql           # Database schema
├── .env                       # Environment variables (not committed)
├── .env.example               # Example environment config
├── .gitignore                 # Git ignore rules
├── go.mod                     # Go dependencies
├── go.sum                     # Go dependency checksums
├── LICENSE                    # MIT License
└── README.md                  # This file
| Variable | Description | Default |
|---|---|---|
| `DATABASE_URL` | PostgreSQL connection string | Required |
| `PORT` | API server port | `8080` |
| `ENVIRONMENT` | Environment mode (`development`, `production`) | `development` |
| `JWT_SECRET` | Secret key for JWT signing | Required |
| `JWT_EXPIRY_HOURS` | JWT token validity duration | `24` |
| `SCHEDULER_INTERVAL_SECONDS` | How often the scheduler runs | `30` |
| `MAX_CONCURRENT_JOBS` | Max simultaneous jobs | `10` |
| `LOG_DIRECTORY` | Directory for job logs | `./logs` |
| `OUTPUT_DIRECTORY` | Directory for job outputs | `./output` |
users - User accounts with authentication
- id: Primary key
- email: Unique email address
- password_hash: bcrypt hashed password
- group_id: Foreign key to groups
- is_admin: Admin flag

groups - Research groups with resource quotas
- id: Primary key
- name: Group name
- cpu_quota: Monthly CPU hour quota
- priority: Base group priority (1-10)

jobs - Compute jobs
- id: Primary key
- user_id, group_id: Foreign keys
- script: Command to execute
- cpu_cores, memory_gb, gpu_count: Resource requirements
- status: pending/running/completed/failed/cancelled
- priority: Job priority (1-10)
- submitted_at, started_at, completed_at: Timestamps

workers - Compute nodes
- id: Primary key
- hostname: Worker identifier
- cpu_cores, memory_gb, gpu_count: Available resources
- status: idle/busy/offline

usage_logs - Resource usage tracking for fair-share
- group_id: Foreign key to groups
- job_id: Foreign key to jobs
- cpu_hours_used: Calculated CPU hours
- logged_at: Timestamp

- Job Dependencies - DAG-based workflow execution
- Queue Viewing Endpoints - See pending jobs and estimated wait times
- Admin Dashboard API - System-wide statistics and management
- WebSocket Support - Real-time log streaming
- Redis Integration - Improved queue performance and caching
- Multi-node Workers - Actual distributed execution
- Email Notifications - Notify users on job completion
- Web UI - React frontend for visualization
- S3 Integration - Store outputs in cloud storage
- Prometheus Metrics - Export metrics for monitoring
- Rate Limiting - API request throttling
- Audit Logging - Track all API actions
This project demonstrates proficiency in:
- RESTful API design principles
- HTTP methods, status codes, and error handling
- Request validation and input sanitization
- Middleware patterns (authentication, logging)
- Relational database design and normalization
- Complex SQL queries with JOINs and aggregations
- Transactions and ACID properties
- Database indexing for performance
- JWT token-based authentication
- Password hashing with bcrypt
- Authorization and access control
- Secure secret management
- Job scheduling algorithms
- Resource allocation and matching
- Fair-share scheduling
- Concurrent programming with goroutines
- Docker containerization
- Environment-based configuration
- Graceful shutdown handling
- Logging and monitoring
- Project organization and modularity
- Error handling patterns
- Testing strategies
- Version control with Git
This is a portfolio/learning project, but feedback and suggestions are welcome!
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
Samik Kundu
- University of Wisconsin-Madison - Computer Science & Data Science
- Infrastructure Engineer Intern @ Ripple Labs (Summer 2025)
- LinkedIn: samik-kundu
- Email: skundu2448@gmail.com
- GitHub: @samik-k21
- Inspiration: Enterprise job schedulers like Slurm, PBS Pro, and Kubernetes
- Learning Resources: Go documentation, PostgreSQL docs, and various software engineering blogs
- Purpose: Built during winter break 2025 as a hands-on learning project to deepen understanding of APIs, distributed systems, and infrastructure software
If you're a recruiter or developer interested in this project:
- Issues: Open an issue on GitHub
- Email: skundu2448@gmail.com
- LinkedIn: Feel free to connect and message me
⭐ If you find this project interesting, please consider starring it on GitHub!