FLock FL Alliance Client

FL Alliance is a decentralized federated learning protocol where multiple participants collaboratively train a shared model — without ever exposing their private data. Participants stake tokens, train on local datasets, and are rewarded or slashed based on the quality of their contributions, all enforced by on-chain smart contracts.

This repository is the production client for FL Alliance. It handles the full lifecycle — staking, model download, local training, parameter upload, voting, aggregation, and reward claiming — so you can participate with a single command.

Key Features

  • Four operating modes — on-chain testnet, local chain dev, fully offline, and chainless pure FL
  • Two runtime backends — Docker container or local Python process; the repository defaults to runtime.mode=docker, while OCM addon deployments typically override to local (direct client, no FLocKit sidecar)
  • Seal encryption — optional end-to-end encryption of model parameters via Mysten Labs' Seal
  • LAN-ready — run multi-client simulations on a single machine or across a local network
  • Cross-platform — Linux, macOS (Apple Silicon / Intel), and Windows 11 + WSL2; see Runtime Modes for the support matrix
  • Concurrent backend runs — the FastAPI backend supports multiple simultaneous client runs with per-run port and environment isolation
  • Structured logging — rotating file logs with full source tracing for production debugging; subprocess (FLockit) errors and warnings are forwarded to the parent log in real time so model-side failures surface without a manual tail -f
  • Operator-friendly failure modes — SIGHUP / SIGTERM / SIGINT are caught and logged before exit (no more silent deaths from SSH disconnects), and every long wait emits an INFO heartbeat naming the polled URL and the subprocess log path
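The rotating-file logging described above can be sketched with the standard library. This is an illustrative sketch, not the client's actual configuration; the logger name and default path are placeholders.

```python
import logging
import os
from logging.handlers import RotatingFileHandler

def make_logger(path: str = "output/client.log") -> logging.Logger:
    """Rotating file logger with source tracing (module:line) in each record."""
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    logger = logging.getLogger("fl_client")
    logger.setLevel(logging.INFO)
    handler = RotatingFileHandler(path, maxBytes=10 * 1024 * 1024, backupCount=5)
    handler.setFormatter(logging.Formatter(
        "%(asctime)s %(levelname)s %(name)s %(module)s:%(lineno)d %(message)s"
    ))
    logger.addHandler(handler)
    return logger
```

`backupCount=5` with 10 MB files caps disk usage at roughly 60 MB while keeping enough history for post-mortem debugging.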

Prerequisites

| Requirement | When needed |
| --- | --- |
| Python >= 3.11 | Always |
| uv (recommended) or pip | Always |
| Docker | Only for docker runtime mode (default) |
| $FLOCK tokens (get whitelisted (TBD)) | Online mode only |
| Base Sepolia ETH (Alchemy Faucet) | Online mode only |

Local simulation and pure FL modes require no tokens, no ETH, and no internet (after initial dependency install).


Quick Start

The default quick start is the online testnet (on-chain) flow.

1. Clone and install

git clone https://github.com/FLock-io/FL-Alliance-Client.git
cd FL-Alliance-Client

# Using uv (recommended)
uv sync

# Or using pip
pip install -r requirements.txt

uv sync is the recommended path for the repository-managed environment. requirements.txt is kept as a compatibility install path and may pull a heavier dependency set for model/runtime workflows.

2. Configure

cp .env.onchain.example .env
# Edit .env and set:
#   PRIVATE_KEY=<your wallet private key>
#   BLOCKCHAIN_RPC=<Base Sepolia RPC URL>   # WEB3_RPC_URL is also supported
#   TOKEN_ADDRESS=<FlockToken address>
#   TASK_ADDRESS=<FlockTask address>
# Optional but recommended on testnet/mainnet:
#   EXPECTED_CHAIN_ID=84532                  # Base Sepolia; refuses to start on mismatch
#   BLOCKCHAIN_TX_RECEIPT_TIMEOUT=180        # Seconds for tx receipt (default 120)
#   FLOCK_CONTRACTS_FILE=/path/to/contracts.json   # Highest-priority contracts source
# Optional:
#   HF_TOKEN=<token for gated models>
# Optional — bump for LLM tasks on a cold cache (HF download + venv install):
#   PROCESS_STARTUP_TIMEOUT=7200             # Seconds before model startup is declared failed (default 1800)
#   PROCESS_RESPONSE_TIMEOUT=7200            # Seconds per train / evaluate / aggregate SDK call (default 3600)
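Before launching, the required keys can be sanity-checked with a few lines of Python. This is an illustrative sketch, not part of the client; the key names match the template above, including the WEB3_RPC_URL alias.

```python
def check_env(path: str = ".env") -> list[str]:
    """Return the required keys that are missing or empty in a dotenv file."""
    required = {"PRIVATE_KEY", "BLOCKCHAIN_RPC", "TOKEN_ADDRESS", "TASK_ADDRESS"}
    found: dict[str, str] = {}
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if line and not line.startswith("#") and "=" in line:
                key, _, value = line.partition("=")
                found[key.strip()] = value.strip()
    missing = sorted(k for k in required if not found.get(k))
    # WEB3_RPC_URL is an accepted alias for BLOCKCHAIN_RPC
    if "BLOCKCHAIN_RPC" in missing and found.get("WEB3_RPC_URL"):
        missing.remove("BLOCKCHAIN_RPC")
    return missing
```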

3. Run a client

python main.py -c config/conf.yaml \
  --task-address <TASK_ADDRESS> \
  --dataset <DATASET_PATH> \
  --hf-token <HF_TOKEN> \
  --gpu

Use a custom mounted env file (for example in Kubernetes):

python main.py -c config/conf.yaml --env-file /data/.env

Example:

python main.py -c config/conf.yaml \
  --task-address 0x47B0397C6ae306002788D093b29bcD2EDAd19924 \
  --dataset data/asr_sarawakmalay_whisper_format_client_ids.json \
  --hf-token $HF_TOKEN \
  --gpu

Long-running training: wrap the command in tmux / nohup / systemd so the client survives SSH disconnects. The client now installs SIGHUP / SIGTERM handlers that log the signal name before exit, but SIGKILL (OOM-killer) still terminates the process silently — a session manager is the only reliable defence.
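The signal-logging behaviour described above amounts to roughly the following sketch (illustrative; the client's actual handler may differ):

```python
import logging
import signal
import sys

def install_signal_logging(logger: logging.Logger) -> None:
    """Log the signal name before exiting, so deaths are visible in the log.

    SIGKILL (e.g. from the OOM-killer) cannot be caught, so it is not listed.
    """
    def handler(signum, frame):
        logger.warning("received %s, shutting down", signal.Signals(signum).name)
        sys.exit(128 + signum)

    for sig in (signal.SIGHUP, signal.SIGTERM, signal.SIGINT):
        signal.signal(sig, handler)
```

Exiting with `128 + signum` follows the shell convention for signal-caused deaths, so supervisors such as systemd report the cause correctly.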

4. Scale to multiple clients (optional)

# Use a different PRIVATE_KEY and runtime.port per process
python main.py -c config/conf.yaml \
  --task-address <TASK_ADDRESS> \
  --dataset <DATASET_PATH> \
  --hf-token <HF_TOKEN> \
  --gpu \
  --override runtime.port=<UNIQUE_PORT>
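When scripting many clients on one machine, a unique port can be reserved programmatically instead of hard-coded. A hypothetical helper (there is a small race window between picking the port and the client binding it, which is usually acceptable for simulations):

```python
import socket

def free_port() -> int:
    """Ask the OS for an unused TCP port (bind to port 0, read back the number)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]
```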

That's it. You are now running on Base Sepolia with the incentive-enabled FL Alliance flow.

Container image publishing

Recommended: publish both latest and a git-SHA tag, then deploy the SHA tag from flock-addon.

First, capture the commit SHA to use as the immutable tag:

export IMAGE_SHA=$(git rev-parse --short=12 HEAD)

Build locally:

make image-build IMAGE_OWNER=ray-ruisun IMAGE_TAG=latest IMAGE_IMMUTABLE_TAG="$IMAGE_SHA"

Inspect the local image:

make image-inspect IMAGE_OWNER=ray-ruisun IMAGE_TAG=latest IMAGE_IMMUTABLE_TAG="$IMAGE_SHA"

Push manually:

make image-login GHCR_USER="$GHCR_USER" GHCR_PAT="$GHCR_PAT"
make image-push IMAGE_OWNER=ray-ruisun IMAGE_TAG=latest IMAGE_IMMUTABLE_TAG="$IMAGE_SHA"

One command publish flow:

make image-publish \
  IMAGE_OWNER=ray-ruisun \
  IMAGE_TAG=latest \
  IMAGE_IMMUTABLE_TAG="$IMAGE_SHA" \
  GHCR_USER="$GHCR_USER" \
  GHCR_PAT="$GHCR_PAT"

If Docker on your machine requires sudo, use:

make image-publish \
  DOCKER='sudo docker' \
  IMAGE_OWNER=ray-ruisun \
  IMAGE_TAG=latest \
  IMAGE_IMMUTABLE_TAG="$IMAGE_SHA" \
  GHCR_USER="$GHCR_USER" \
  GHCR_PAT="$GHCR_PAT"

Print the exact published tags:

make image-print IMAGE_OWNER=ray-ruisun IMAGE_TAG=latest IMAGE_IMMUTABLE_TAG="$IMAGE_SHA"

Recommended handoff to flock-addon:

export IMAGE_TAG=$(git rev-parse --short=12 HEAD)

Automatic publishing:

  • GitHub Actions now publishes to ghcr.io/<repository-owner-lowercase>/fl-alliance-client
  • pushes on main and version tags such as v0.1.0 will publish automatically
  • workflow_dispatch can also publish on demand

Dataset format: DATASET accepts a single file or a directory. main.py first stages every input into a temporary directory by copying (shutil.copytree / shutil.copy2); the runtime backend then exposes that staging directory to the model — Docker via a read-only bind mount at /app/data, and runtime.mode=local via a symlink (falling back to an NTFS junction or a full copy on Windows). See Configuration and Runtime Modes for details.
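The staging step described above is roughly the following (a sketch of the copy logic, not the client's exact code):

```python
import shutil
import tempfile
from pathlib import Path

def stage_dataset(dataset: str) -> Path:
    """Copy a dataset file or directory into a fresh staging directory."""
    staging = Path(tempfile.mkdtemp(prefix="fl_dataset_"))
    src = Path(dataset)
    if src.is_dir():
        shutil.copytree(src, staging / src.name)
    else:
        shutil.copy2(src, staging / src.name)
    return staging
```

Copying (rather than mounting the original path directly) means the model never sees, and can never mutate, the operator's source data.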

Prefer local simulation first? Use offline mode:

cp .env.local.example .env
make chain MODEL_DEFINITION_HASH=$(sha256sum model.tar.gz | cut -d' ' -f1)
make sim1 DATASET=data/train.jsonl

On macOS, replace sha256sum with: shasum -a 256 model.tar.gz | cut -d' ' -f1
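A portable alternative to sha256sum / shasum is a short Python one-off (standard library only):

```python
import hashlib

def sha256_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Hex SHA-256 of a file, read in 1 MB chunks so large archives fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        while chunk := fh.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()
```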

For all scenarios (testnet, dev mode, offline mode, pure FL, and LAN deployment), see the Run Playbook.


Operating Modes

| Mode | Chain | Storage | Internet | Config | Command |
| --- | --- | --- | --- | --- | --- |
| Online (testnet) | Base Sepolia | S3 | Required | config/conf.yaml | python main.py -c config/conf.yaml ... |
| Dev (local chain + object storage) | Local Anvil | S3 Signer / direct S3-compatible + HuggingFace | Required | config/simulation-online.yaml | make dev1 |
| Offline (fully local) | Local Anvil | Local filesystem | Not needed | config/simulation.yaml | make sim1 |
| Pure FL (chainless) | None | Local filesystem | Not needed | config/pure-fl.yaml | make pure-fl1 |

All modes use the same client code — only the configuration differs. Each mode has a dedicated YAML config template and corresponding Makefile targets (dev/sim: up to 20 clients, pure-fl: 3 clients by default).

Choosing a mode:

  • Just exploring? Start with Offline mode (make sim1) — zero external dependencies.
  • Developing with real storage? Use Dev mode (make dev1) — local chain + S3 Signer or direct S3-compatible storage.
  • Running on testnet? Use Online mode (python main.py -c config/conf.yaml ...) — requires $FLOCK tokens and Base Sepolia ETH.
  • No blockchain needed? Use Pure FL mode (make pure-fl1) — coordination via shared files only.

For step-by-step instructions for each mode, see the Run Playbook.


Project Structure

.
├── client/                  # Core FL client runtime and managers
│   ├── contracts/           # Smart contract wrappers and ABIs
│   ├── managers/            # Container, storage, sync, metrics, coordination managers
│   ├── encryption/          # Seal encryption integration
│   └── logging_utils.py     # Centralized logging configuration
├── contracts/               # Solidity contracts and deployment scripts
├── config/                  # Configuration templates (one per mode)
│   ├── conf.yaml            # Online mode (Base Sepolia)
│   ├── simulation-online.yaml # Dev mode: local chain + online storage
│   ├── simulation.yaml      # Offline mode: local chain + local storage
│   └── pure-fl.yaml         # Pure FL mode (chainless)
├── docs/                    # Detailed documentation
├── .env.onchain.example     # .env template for online mode
├── .env.local.example       # .env template for local chain modes
├── main.py                  # Client entry point
├── docker-compose.yml       # Local chain + deployer services
├── Makefile                 # Developer shortcuts
└── output/                  # Runtime logs and task outputs (git-ignored)

Documentation

| Document | Description |
| --- | --- |
| Configuration | Config files, env vars, YAML settings, CLI overrides |
| Run Playbook | Step-by-step commands for every scenario |
| Runtime Modes | Docker / local execution backends |
| Local Chain Simulation | Offline and LAN deployment, shared storage setup (NFS/SMB/sshfs) |
| Pure FL Mode | Chainless federated learning without incentive mechanism |
| Encryption & Storage | Seal encryption, S3/Nami/local storage backends |
| FL Alliance Protocol | Protocol deep-dive and smart contract lifecycle |
| Backend API | FastAPI service for runs, metrics, events, artifacts, and task admin |

Makefile Parameters

| Parameter | Default | Description |
| --- | --- | --- |
| DATASET | (required) | Path to dataset file or directory |
| GPU | true | Enable GPU acceleration (true/false) |
| CHAIN_HOST | localhost | Anvil chain host IP (for remote LAN clients) |
| TOKEN_ADDRESS | (auto) | FlockToken contract address (auto-detected from $FLOCK_CONTRACTS_FILE, /data/contracts.json, or data/contracts.json — first match wins) |
| TASK_ADDRESS | (auto) | FlockTask contract address (auto-detected from the same set as TOKEN_ADDRESS) |
| MODEL_DEFINITION_HASH | (required for make chain) | SHA-256 hash of model archive |
| ROUNDS | 10 | Number of training rounds |
| MIN_PARTICIPANTS | 3 | Minimum participants per round |

Development

This project uses uv for Python package management:

uv sync                        # install dependencies
uv run python main.py          # run in project environment
uv add <package>               # add a dependency

Before submitting changes:

make test
uv run python -m compileall main.py client

If pytest fails during startup because an external plugin is auto-loaded by your environment, run:

PYTEST_DISABLE_PLUGIN_AUTOLOAD=1 uv run pytest -q

Troubleshooting

Local mode — dependency or module errors

When using runtime.mode: local, the client creates virtual environments in tmp_envs/ for each task. If you see ModuleNotFoundError or similar after updating baseline packages, remove the cache and retry:

rm -rf tmp_envs/

By default, local runtime environments are preserved to speed up restarts. Set FL_KEEP_MODEL_ENV=false to force cleanup on each stop.

Process exited silently after "Waiting for model to start..."

Almost always one of:

  1. SSH session dropped — the parent received SIGHUP and was killed before any handler could run on legacy builds. Recent builds log the signal name before exit; either way, run inside tmux / nohup / systemd so the client outlives the shell.
  2. OOM-killer: SIGKILL cannot be caught. Confirm with sudo dmesg -T | grep -iE 'killed process|out of memory'. Lower the batch size, lower model precision, or move to a larger box.
  3. Genuine startup timeout — bump PROCESS_STARTUP_TIMEOUT (default 1800 seconds) and PROCESS_RESPONSE_TIMEOUT (default 3600). Both are env-var overridable; LLM cold-starts (HF download + venv install + GPU load) frequently need 2 hours.

In every case the model subprocess uses start_new_session=True, so it survives parent death — inspect output/task_outputs/process_*.log to see exactly how far it got.
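The detachment described above can be sketched as follows (illustrative; the command and log path are placeholders, not the client's actual invocation):

```python
import subprocess

def launch_detached(cmd: list[str], log_path: str) -> subprocess.Popen:
    """Start a subprocess in its own session so it survives the parent's death.

    stdout/stderr go to a log file that can be inspected after the fact.
    """
    log = open(log_path, "ab")
    return subprocess.Popen(
        cmd,
        stdout=log,
        stderr=subprocess.STDOUT,
        start_new_session=True,  # new session: the parent's SIGHUP is not inherited
    )
```

`start_new_session=True` calls setsid() in the child, detaching it from the controlling terminal, which is why the model-side log keeps growing even after the parent dies.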


License

This project is licensed under the Apache License 2.0. See the LICENSE file for details.


References

FL Alliance is based on academic research by the FLock team. See the paper: Defending Against Poisoning Attacks in Federated Learning With Blockchain.
