AddA demonstrates decentralized AI inference using the Akash Network. It showcases PyTorch services distributed across multiple GPU providers for text-to-image generation.
Diffusion models are generative models that learn how to create new data (like images) by reversing a gradual noising process.
Imagine taking a beautiful photo and adding random noise again and again until it's pure static. The model is trained to do the reverse — start from random noise and gradually remove noise to reveal a coherent image.
That's diffusion: noise → clean image, one denoising step at a time.
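A toy sketch of both directions in PyTorch (illustrative only; real samplers such as DDPM or DDIM use a carefully tuned noise schedule and update rule):

```python
import torch

def add_noise(x0: torch.Tensor, alpha_bar_t: float):
    """Forward process: mix a clean image with Gaussian noise.
    alpha_bar_t in (0, 1) controls how much of the original signal survives at step t."""
    noise = torch.randn_like(x0)
    x_t = alpha_bar_t ** 0.5 * x0 + (1 - alpha_bar_t) ** 0.5 * noise
    return x_t, noise

def denoise_loop(model, x_t: torch.Tensor, num_steps: int) -> torch.Tensor:
    """Reverse process (conceptually): a trained model predicts the noise,
    and we remove a little of it each step until an image emerges."""
    for t in reversed(range(num_steps)):
        predicted_noise = model(x_t, t)     # this is what the UNet learns to do
        x_t = x_t - 0.1 * predicted_noise   # toy update; real samplers are more careful
    return x_t
```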
To make images follow a prompt ("a neon samurai in the rain"), the model learns to use text embeddings as guidance. A text encoder turns the prompt into a vector representation. The denoiser (the UNet) then uses that embedding at every step to steer the denoising process.
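For example, here is roughly how a prompt becomes an embedding with Hugging Face `transformers`, using the CLIP variant that Stable Diffusion v1.x ships with; a minimal sketch, not this project's actual code:

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

prompt = "a neon samurai in the rain"
tokens = tokenizer(prompt, padding="max_length", max_length=77,
                   truncation=True, return_tensors="pt")
with torch.no_grad():
    embeddings = text_encoder(tokens.input_ids).last_hidden_state
print(embeddings.shape)  # torch.Size([1, 77, 768]) -- one vector per token
```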
Stable Diffusion does its denoising not on the full image but in a compressed latent space — roughly 8× smaller in each dimension. This is achieved using a Variational Autoencoder (VAE).
- Text Encoder (CLIP): Turns words → numbers.
- UNet Denoiser: The heavy PyTorch model that removes noise step by step.
- VAE: Translates between latent and pixel space (used mainly at the start and end).
Because it works in this latent space, Stable Diffusion is fast and GPU-efficient, making it perfect for decentralized compute networks like Akash.
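A minimal sketch of the round trip through the VAE with `diffusers` (the checkpoint name is a placeholder; substitute whichever SD model you deploy):

```python
import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="vae")

image = torch.randn(1, 3, 512, 512)  # dummy pixel-space tensor (batch, RGB, H, W)
with torch.no_grad():
    # Encode to latent space; 0.18215 is SD's latent scaling factor.
    latent = vae.encode(image).latent_dist.sample() * 0.18215
print(latent.shape)  # torch.Size([1, 4, 64, 64]) -- 8x smaller per side
with torch.no_grad():
    decoded = vae.decode(latent / 0.18215).sample  # back to pixel space
```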
We decompose the Stable Diffusion pipeline into three services, each running independently on Akash nodes, plus a coordinator that ties them together.
```text
Prompt ➜ [Text Encoder] ➜ (Embeddings) ➜ [UNet Denoiser] ➜ (Latent) ➜ [VAE Decoder] ➜ Image
```
#### 1. svc-encoder (Text Encoder)
- Runs CLIP to convert text → embeddings (size ≈ 77×768).
- Lightweight; can run on CPU.
- Stores embeddings in MinIO and returns an embedding_id.
#### 2. svc-unet (Denoiser)
- The heavy-lift GPU stage.
- Starts with random latent noise; runs ~4–8 diffusion steps guided by text embedding.
- Produces a denoised latent tensor and returns a latent_id.
#### 3. svc-decoder (VAE)
- Converts latent → final RGB image.
- Saves to MinIO and returns a public image_url.
#### 4. Coordinator
- The orchestrator calling the three services in sequence.
- Manages retries, timing metrics, and the public /generate API.
- Powers the booth UI (Gradio or React) that visualizes the data flow.
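Putting it together, the coordinator's control flow might look like the sketch below. The endpoint paths, service hostnames, and JSON fields are assumptions for illustration, not the project's actual API:

```python
import requests

ENCODER_URL = "http://svc-encoder:8000"  # hypothetical service endpoints
UNET_URL    = "http://svc-unet:8000"
DECODER_URL = "http://svc-decoder:8000"

def generate(prompt: str) -> str:
    # 1. Text -> embeddings (stored in MinIO, referenced by id)
    emb = requests.post(f"{ENCODER_URL}/encode", json={"prompt": prompt}).json()
    # 2. Embeddings -> denoised latent
    lat = requests.post(f"{UNET_URL}/denoise",
                        json={"embedding_id": emb["embedding_id"]}).json()
    # 3. Latent -> final RGB image
    img = requests.post(f"{DECODER_URL}/decode",
                        json={"latent_id": lat["latent_id"]}).json()
    return img["image_url"]
```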
### ⚙️ Why This Architecture Works Perfectly for Akash
| Goal | How It's Achieved |
|------|-------------------|
| Showcase PyTorch | All inference runs on torch using official Diffusers models. |
| Highlight Decentralization | Each model stage runs on an independent Akash provider. |
| Visual Appeal | The UI shows the image emerging as nodes pass data along. |
| Performance | UNet (GPU-heavy) isolated on its own node; others on lighter compute. |
| Reliability | Coordinator can retry failed nodes or fall back to a single-node monolith. |
### 🧠 A Quick Intuitive Recap
- **Text Encoder**: Understands what to draw.
- **UNet**: Actually draws it — like painting the picture from noisy fog.
- **VAE**: Turns that internal sketch into a real image.
Each is a self-contained PyTorch service, so they can live on different providers yet still work together to produce one coherent output.
That's the power of distributed AI inference on Akash — decentralized GPUs acting as a single intelligent system.
## Architecture
Right now this is a monolith, but it may evolve into microservices over time.
**Resource Optimization**: In the current monolith version, all three diffusion services are identical copies running the complete pipeline, so they all require GPU nodes. In the future microservices version, only the UNet denoiser requires heavy GPU compute; the Text Encoder and VAE Decoder can run on CPU nodes or lower-end GPUs, reducing cost while maintaining performance.
```text
┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│  Akash Node A   │    │  Akash Node B    │    │  Akash Node C   │
│  (Provider 1)   │    │  (Provider 2)    │    │  (Provider 3)   │
│                 │    │                  │    │                 │
│   Diffusion     │    │   Diffusion      │    │   Diffusion     │
│   Service       │    │   Service        │    │   Service       │
│   (GPU)         │    │   (GPU)          │    │   (GPU)         │
└─────────────────┘    └──────────────────┘    └─────────────────┘
         │                      │                       │
         └──────────────────────┼───────────────────────┘
                                │
                       ┌─────────────────┐
                       │   Coordinator   │
                       │ (Load Balancer  │
                       │     & API)      │
                       └─────────────────┘
                                │
                       ┌─────────────────┐
                       │    Gradio UI    │
                       │  (Local/Cloud)  │
                       └─────────────────┘
```
### Future Vision (Microservices)
```text
┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│  Akash Node A   │    │  Akash Node B    │    │  Akash Node C   │
│  (Provider 1)   │    │  (Provider 2)    │    │  (Provider 3)   │
│                 │    │                  │    │                 │
│  Text Encoder   │    │  UNet Denoiser   │    │  VAE Decoder    │
│  (CPU/GPU)      │    │  (GPU Heavy)     │    │  (CPU/GPU)      │
└─────────────────┘    └──────────────────┘    └─────────────────┘
         │                      │                       │
         └──────────────────────┼───────────────────────┘
                                │
                       ┌─────────────────┐
                       │   Coordinator   │
                       │  (Orchestrator) │
                       └─────────────────┘
```
## Data Transfer Strategy
**Why Data Transfer is Necessary**: In a distributed microservices architecture, each service runs independently, and large artifacts (tensors, embeddings, images) must be passed between them. These objects are too large for simple HTTP parameters and need persistent storage for reliability.
**Why MinIO is Ideal**: MinIO provides distributed object storage that's perfect for storing large binary data (tensors, images) between microservices. It offers high performance, S3-compatible API, and handles the complexity of distributed storage that would otherwise require custom solutions.
### Current Approach: HTTP + Base64
We use **HTTP with Base64 encoding** for data transfer between services:
- **Images**: Generated as PNG, encoded to base64, sent via HTTP
- **Storage**: Images saved to local `images/` directory for persistence
- **Communication**: Direct HTTP calls between coordinator and diffusion services
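A minimal sketch of that encode/decode round trip (the helper names are illustrative):

```python
import base64
import io
from PIL import Image

# Service side: encode a generated PIL image as a base64 PNG for the HTTP response.
def image_to_b64(image: Image.Image) -> str:
    buf = io.BytesIO()
    image.save(buf, format="PNG")
    return base64.b64encode(buf.getvalue()).decode("utf-8")

# Coordinator side: decode the payload and persist it to the images/ directory.
def save_b64_image(b64_png: str, path: str) -> None:
    with open(path, "wb") as f:
        f.write(base64.b64decode(b64_png))
```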
### Why Not MinIO (Yet)?
**Original Plan**: Decompose into microservices (Text Encoder → UNet → VAE) with MinIO for tensor storage.
**Current Reality**: Single diffusion service with full pipeline, so no inter-service data transfer needed.
**Trade-offs**:
- ✅ **Simpler**: No external dependencies, faster development
- ✅ **Good for demos**: Shows core PyTorch functionality
- ❌ **Larger payloads**: Base64 adds ~33% overhead
- ❌ **Memory usage**: Full images in memory
### Future Evolution
When we decompose to **true microservices** on Akash:
- **MinIO will be needed** for tensor storage between services
- **Services on different providers** can't share memory
- **Reliability** requires persistent storage for service restarts
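When that happens, storing intermediate tensors could look roughly like this with the `minio` Python client; the endpoint, bucket names, and credentials are placeholders:

```python
import io
import torch
from minio import Minio  # pip install minio

client = Minio("minio:9000", access_key="ACCESS_KEY",
               secret_key="SECRET_KEY", secure=False)

def put_tensor(bucket: str, key: str, tensor: torch.Tensor) -> None:
    # Serialize the tensor into an in-memory buffer, then upload it as an object.
    buf = io.BytesIO()
    torch.save(tensor, buf)
    buf.seek(0)
    client.put_object(bucket, key, buf, length=buf.getbuffer().nbytes)

def get_tensor(bucket: str, key: str) -> torch.Tensor:
    # Download the object and deserialize it back into a tensor.
    resp = client.get_object(bucket, key)
    return torch.load(io.BytesIO(resp.read()))
```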
## Services
### 1. Diffusion Service (`services/diffusion/`)
- **Purpose**: Complete Stable Diffusion pipeline in a single service
- **Technology**: FastAPI + PyTorch + Diffusers
- **Deployment**: Akash GPU nodes
- **Features**:
- Text-to-image generation (full pipeline)
- GPU acceleration (CUDA/MPS/CPU)
- Health monitoring
- Provider identification
- Image saving to filesystem
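A minimal sketch of what such a service can look like with FastAPI and Diffusers; this is an assumption-laden outline (checkpoint name, field names), not the repository's actual implementation:

```python
import os
import uuid
import torch
from fastapi import FastAPI
from pydantic import BaseModel
from diffusers import StableDiffusionPipeline

app = FastAPI()
os.makedirs("images", exist_ok=True)

# Pick the best available accelerator: CUDA, Apple MPS, or CPU fallback.
device = ("cuda" if torch.cuda.is_available()
          else "mps" if torch.backends.mps.is_available() else "cpu")
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5").to(device)

class GenerateRequest(BaseModel):
    prompt: str

@app.get("/healthz")
def healthz():
    return {"status": "ok",
            "provider": os.getenv("PROVIDER_ID", "unknown"),
            "device": device}

@app.post("/generate")
def generate(req: GenerateRequest):
    image = pipe(req.prompt).images[0]
    path = f"images/{uuid.uuid4().hex}.png"
    image.save(path)
    return {"image_path": path}
```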
### 2. Coordinator Service (`coordinator/`)
- **Purpose**: Load balancer and API gateway
- **Technology**: FastAPI + Requests
- **Deployment**: Akash CPU nodes or local
- **Features**:
- Load balancing between diffusion services
- Request orchestration and retries
- Health monitoring
- Status tracking and metrics
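Round-robin load balancing with failover can be sketched in a few lines; the retry count and timeout below are illustrative choices:

```python
import itertools
import os
import requests

SERVICES = [os.getenv(f"SERVICE_{i}_URL") for i in (1, 2, 3)]
_rotation = itertools.cycle([u for u in SERVICES if u])

def generate_with_failover(prompt: str, attempts: int = 3) -> dict:
    last_error = None
    for _ in range(attempts):
        url = next(_rotation)  # round-robin across providers
        try:
            resp = requests.post(f"{url}/generate",
                                 json={"prompt": prompt}, timeout=120)
            resp.raise_for_status()
            return resp.json()
        except requests.RequestException as err:  # node down or timed out; try the next one
            last_error = err
    raise RuntimeError(f"all diffusion services failed: {last_error}")
```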
### 3. Gradio UI (`coordinator/gradio_ui.py`)
- **Purpose**: Interactive web interface for demos
- **Technology**: Gradio + FastAPI
- **Deployment**: Local development or Akash
- **Features**:
- Text prompt input with presets
- Real-time image display
- Generation progress tracking
- Service status monitoring
- Generated images saved to `images/` directory
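A minimal Gradio front end along these lines might look like the following; the coordinator URL and response field name are assumptions:

```python
import gradio as gr
import requests

COORDINATOR_URL = "http://localhost:8000"  # adjust to your deployment

def generate(prompt: str):
    resp = requests.post(f"{COORDINATOR_URL}/generate",
                         json={"prompt": prompt}, timeout=300)
    resp.raise_for_status()
    return resp.json().get("image_url")  # field name assumed for illustration

demo = gr.Interface(fn=generate,
                    inputs=gr.Textbox(label="Prompt"),
                    outputs=gr.Image(label="Result"))
demo.launch(server_port=7860)
```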
## Quick Start
### Local Development (M1 MacBook Compatible)
**Option A: Quick UI Test (Fast)**
```bash
# Start with placeholder images for quick testing
./scripts/quick_ui_test.sh
```
- ⚡ Fast startup (~30 seconds)
- 🎨 Placeholder images for UI testing
- ✅ Full system validation
**Option B: Real Image Generation (Complete)**
```bash
# Start with actual Stable Diffusion model
./scripts/test_complete.sh
```
- ⏳ Downloads ~2GB model (2-3 minutes on M1)
- ✅ Real AI image generation
- ✅ M1 GPU acceleration (MPS)
- ✅ Complete system test
**Access the UI:**
- Open browser to: http://localhost:7860
- Enter prompts and generate images
- Images saved to `images/` directory
**Performance on M1:**
- First generation: ~30-60 seconds
- Subsequent generations: ~10-20 seconds
### Akash Deployment
1. **Build and Push Images**
```bash
# Build services
docker build -t your-registry/adda-diffusion services/diffusion/
docker build -t your-registry/adda-coordinator coordinator/
# Push to registry
docker push your-registry/adda-diffusion
docker push your-registry/adda-coordinator
```
2. **Deploy to Akash**
```bash
# Update SDL files with your image URLs
# Deploy diffusion services to GPU providers
akash tx deployment create infra/sdl-diffusion.yaml --from your-key
# Deploy coordinator
akash tx deployment create infra/sdl-coordinator.yaml --from your-key
```
3. **Access Your Deployment**
- Get service URLs from Akash console
- Update `COORDINATOR_URL` in Gradio UI
- Test with: `curl -X POST $COORDINATOR_URL/generate -H "Content-Type: application/json" -d '{"prompt":"test"}'`
## API Endpoints
### Diffusion Service
- `GET /healthz` - Health check
- `POST /generate` - Generate image
- `GET /info` - Service information
### Coordinator Service
- `GET /healthz` - Health check
- `GET /services` - List all services
- `POST /generate` - Generate image (load balanced)
- `GET /status/{request_id}` - Check generation status
- `GET /info` - Coordinator information
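A hypothetical Python client exercising these endpoints (response field names such as `request_id` and `state` are assumed for illustration):

```python
import time
import requests

COORDINATOR = "http://coordinator-url"  # placeholder; use your deployment URL

# Submit a generation request through the load-balanced endpoint...
resp = requests.post(f"{COORDINATOR}/generate",
                     json={"prompt": "a neon samurai in the rain"})
resp.raise_for_status()
request_id = resp.json().get("request_id")  # field name assumed

# ...then poll /status/{request_id} until the result is ready.
while request_id:
    status = requests.get(f"{COORDINATOR}/status/{request_id}").json()
    if status.get("state") in ("completed", "failed"):
        print(status)
        break
    time.sleep(2)
```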
## Configuration
### Environment Variables
**Diffusion Service:**
- `PROVIDER_ID` - Provider identifier
**Coordinator Service:**
- `SERVICE_1_URL` - URL of first diffusion service
- `SERVICE_2_URL` - URL of second diffusion service
- `SERVICE_3_URL` - URL of third diffusion service
### Akash SDL Files
- `infra/sdl-diffusion.yaml` - Diffusion service deployment
- `infra/sdl-coordinator.yaml` - Coordinator deployment
- `infra/sdl-monolith.yaml` - Fallback single-node deployment
## Troubleshooting
### Common Issues
1. **Service not responding**: Check health endpoints
2. **Generation timeout**: Increase timeout values
3. **GPU not available**: Check CUDA installation
4. **Network issues**: Verify service URLs
### Debug Commands
```bash
# Check service health
curl http://service-url/healthz
# Test generation
curl -X POST http://coordinator-url/generate \
  -H "Content-Type: application/json" \
  -d '{"prompt": "test image"}'
# Check status
curl http://coordinator-url/status/request-id
```
## License
This project is for demonstration purposes. Please check model licenses before production use.
## Roadmap
See [roadmap.md](roadmap.md) for detailed development phases and milestones.
## Contributing
This is a demo project for the PyTorch conference. See git workflow rules in [.cursor/rules/git_workflow.md](.cursor/rules/git_workflow.md).