
Tutorials

Step-by-step guides for common orchestrator workflows.


Tutorial 1: Setting up orchestrator with a MySQL topology

This tutorial walks you through setting up orchestrator to manage an existing MySQL master-replica topology.

What you will need

  • A running MySQL master with one or more replicas (MySQL 5.7 or newer)
  • Go 1.25+ installed
  • Network access from the orchestrator host to all MySQL instances on port 3306

Step 1: Build orchestrator

git clone https://github.com/proxysql/orchestrator.git
cd orchestrator
go build -o bin/orchestrator ./go/cmd/orchestrator

Step 2: Create a MySQL user for orchestrator

On your MySQL master (this will replicate to all replicas automatically):

CREATE USER 'orc_topology'@'orchestrator-host' IDENTIFIED BY 'a_secure_password';
GRANT SUPER, PROCESS, REPLICATION SLAVE, REPLICATION CLIENT ON *.* TO 'orc_topology'@'orchestrator-host';

Replace orchestrator-host with the hostname or IP of the machine running orchestrator. Use % to allow connections from any host.
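Before pointing orchestrator at the topology, it is worth confirming the user and grants work by connecting from the orchestrator host. The hostname and password below are the placeholders from the previous step; SHOW SLAVE HOSTS exercises the REPLICATION SLAVE privilege:

```shell
# From the orchestrator host: confirm the topology user can connect
# and that its grants are in place.
mysql -h your-master-host -P 3306 -u orc_topology -p'a_secure_password' \
  -e "SHOW GRANTS; SHOW SLAVE HOSTS;"
```

If this fails with an access-denied error, fix the grant before continuing; orchestrator discovery will fail in the same way.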

Step 3: Create a MySQL backend database

For production use, orchestrator should store its data in MySQL rather than SQLite. On a MySQL instance (can be the same master, or a separate server):

CREATE DATABASE orchestrator;
CREATE USER 'orc_server'@'localhost' IDENTIFIED BY 'another_secure_password';
GRANT ALL ON orchestrator.* TO 'orc_server'@'localhost';

Step 4: Write the configuration file

Create orchestrator.conf.json:

{
  "Debug": false,
  "ListenAddress": ":3000",
  "MySQLTopologyUser": "orc_topology",
  "MySQLTopologyPassword": "a_secure_password",
  "MySQLOrchestratorHost": "127.0.0.1",
  "MySQLOrchestratorPort": 3306,
  "MySQLOrchestratorDatabase": "orchestrator",
  "MySQLOrchestratorUser": "orc_server",
  "MySQLOrchestratorPassword": "another_secure_password",
  "DefaultInstancePort": 3306,
  "DiscoverByShowSlaveHosts": true,
  "InstancePollSeconds": 5,
  "ReasonableReplicationLagSeconds": 10,
  "RecoverMasterClusterFilters": ["*"],
  "RecoverIntermediateMasterClusterFilters": ["*"],
  "ApplyMySQLPromotionAfterMasterFailover": true,
  "FailureDetectionPeriodBlockMinutes": 60,
  "RecoveryPeriodBlockSeconds": 3600
}
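Before starting orchestrator, validate that the file parses as JSON (a stray trailing comma is the most common mistake):

```shell
# Fail fast on malformed JSON before handing the file to orchestrator.
python3 -m json.tool orchestrator.conf.json > /dev/null \
  && echo "config OK" \
  || echo "config INVALID"
```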

Step 5: Start orchestrator

bin/orchestrator -config orchestrator.conf.json http

Step 6: Discover the topology

curl http://localhost:3000/api/discover/your-master-host/3306

Wait a few seconds for orchestrator to crawl the replicas, then verify:

curl -s http://localhost:3000/api/topology/your-master-host/3306

You should see your full replication tree printed as indented text.

Step 7: Verify in the web UI

Open http://localhost:3000 in your browser. Click on Clusters in the navigation to see your topology visualized as a tree.

Step 8: Test a topology operation

Before moving anything, list the replicas of the master so you know what orchestrator sees:

curl -s http://localhost:3000/api/instance-replicas/your-master-host/3306
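Upstream orchestrator exposes a relocate endpoint (/api/relocate/:host/:port/:belowHost/:belowPort) for repointing a replica under a new parent. Assuming this fork keeps that route, moving one of the replicas listed above under another host looks like:

```shell
# Hypothetical example, assuming the upstream /api/relocate route:
# repoint "your-replica-host" to replicate from "another-host".
curl -s http://localhost:3000/api/relocate/your-replica-host/3306/another-host/3306
```

Re-run the topology call from Step 6 afterwards to confirm the replica moved.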

You now have a fully operational orchestrator instance managing your MySQL topology.


Tutorial 2: Configuring ProxySQL failover hooks

This tutorial sets up orchestrator to automatically update ProxySQL hostgroups during master failover, so your application traffic is rerouted without any custom scripts.

Prerequisites

  • A working orchestrator setup (see Tutorial 1)
  • ProxySQL installed and running with the Admin interface accessible
  • Your MySQL servers already configured as backends in ProxySQL

Step 1: Verify ProxySQL Admin access

mysql -h 127.0.0.1 -P 6032 -u admin -padmin -e "SELECT * FROM runtime_mysql_servers;"

You should see your MySQL servers listed with their hostgroups.

Step 2: Note your hostgroup IDs

Identify which hostgroup ID is used for writers and which for readers:

mysql -h 127.0.0.1 -P 6032 -u admin -padmin \
  -e "SELECT hostgroup_id, hostname, port, status FROM runtime_mysql_servers;"

For example, if writers are in hostgroup 10 and readers in hostgroup 20, you will use those values below.

Step 3: Add ProxySQL settings to orchestrator config

Add these fields to your orchestrator.conf.json:

{
  "ProxySQLAdminAddress": "127.0.0.1",
  "ProxySQLAdminPort": 6032,
  "ProxySQLAdminUser": "admin",
  "ProxySQLAdminPassword": "admin",
  "ProxySQLWriterHostgroup": 10,
  "ProxySQLReaderHostgroup": 20,
  "ProxySQLPreFailoverAction": "offline_soft"
}
  • ProxySQLWriterHostgroup: The hostgroup ID where the current master lives. Must be greater than 0 to enable the hooks.
  • ProxySQLReaderHostgroup: The hostgroup ID for read replicas. Optional but recommended.
  • ProxySQLPreFailoverAction: What to do with the old master before failover: offline_soft (drain connections), weight_zero, or none.

Step 4: Restart orchestrator

# Stop the running instance (Ctrl+C), then:
bin/orchestrator -config orchestrator.conf.json http

Step 5: Verify ProxySQL connectivity

curl -s http://localhost:3000/api/proxysql/servers | python3 -m json.tool

You should see your ProxySQL server list returned as JSON.

Step 6: Understand the failover flow

When orchestrator detects a dead master and performs recovery:

  1. Pre-failover: The old master is set to OFFLINE_SOFT in ProxySQL (no new connections)
  2. Topology recovery: Orchestrator promotes a replica to be the new master
  3. Post-failover: The new master is added to the writer hostgroup; the old master is removed
  4. ProxySQL applies changes immediately via LOAD MYSQL SERVERS TO RUNTIME

ProxySQL hooks are non-blocking: if ProxySQL is unreachable, the MySQL failover still proceeds.
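The hook behavior above is roughly equivalent to issuing the following statements on the ProxySQL Admin interface by hand. This is a hedged sketch using the example hostgroup 10 and placeholder hostnames; orchestrator's exact statements may differ:

```shell
mysql -h 127.0.0.1 -P 6032 -u admin -padmin <<'SQL'
-- Pre-failover: drain the old master (no new connections).
UPDATE mysql_servers SET status='OFFLINE_SOFT'
  WHERE hostgroup_id=10 AND hostname='old-master';
-- Post-failover: remove the dead master and put the promoted
-- replica in the writer hostgroup.
DELETE FROM mysql_servers WHERE hostgroup_id=10 AND hostname='old-master';
INSERT INTO mysql_servers (hostgroup_id, hostname, port)
  VALUES (10, 'new-master', 3306);
-- Apply immediately and persist to disk.
LOAD MYSQL SERVERS TO RUNTIME;
SAVE MYSQL SERVERS TO DISK;
SQL
```

orchestrator performs the equivalent automatically over the Admin interface configured in Step 3.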

Step 7: Test with a graceful takeover

To verify everything works without an actual failure, perform a graceful master takeover:

# Identify the current master
curl -s http://localhost:3000/api/clusters

# Perform a graceful takeover (promotes a replica, demotes the master)
curl -s http://localhost:3000/api/graceful-master-takeover/your-cluster-alias/your-new-master-host/3306

Check ProxySQL to confirm the hostgroups updated:

mysql -h 127.0.0.1 -P 6032 -u admin -padmin \
  -e "SELECT hostgroup_id, hostname, port, status FROM runtime_mysql_servers;"

For more details, see the full ProxySQL hooks documentation.


Tutorial 3: Monitoring orchestrator with Prometheus

This tutorial sets up Prometheus to scrape orchestrator metrics and shows useful queries for alerting.

Prerequisites

  • A working orchestrator setup (see Tutorial 1)
  • A Prometheus server you can configure

Step 1: Enable Prometheus metrics in orchestrator

Prometheus metrics are enabled by default. To make this explicit, add the following to your orchestrator.conf.json (or confirm the setting is not disabled):

{
  "PrometheusEnabled": true
}

Restart orchestrator if you changed the config.

Step 2: Verify the metrics endpoint

curl -s http://localhost:3000/metrics | head -20

You should see Prometheus-formatted metrics output.

Step 3: Configure Prometheus to scrape orchestrator

Add a scrape job to your prometheus.yml:

scrape_configs:
  - job_name: orchestrator
    static_configs:
      - targets: ['orchestrator-host:3000']
    metrics_path: /metrics
    scrape_interval: 15s

Replace orchestrator-host with the actual hostname or IP. Reload Prometheus:

kill -HUP $(pgrep prometheus)
# or restart the Prometheus service

Step 4: Verify in Prometheus

Open the Prometheus UI (typically http://prometheus-host:9090) and query:

orchestrator_instances_total

You should see the number of MySQL instances orchestrator is managing.

Step 5: Useful queries

Total known instances and clusters:

orchestrator_instances_total
orchestrator_clusters_total

Discovery error rate (over last 5 minutes):

rate(orchestrator_discovery_errors_total[5m])

Recovery operations by type:

sum by (type) (orchestrator_recoveries_total)

Recovery duration (p95 over last hour):

histogram_quantile(0.95, rate(orchestrator_recovery_duration_seconds_bucket[1h]))

Step 6: Set up alerting rules

Create an alerting rule file (e.g., orchestrator-alerts.yml):

groups:
  - name: orchestrator
    rules:
      - alert: OrchestratorHighDiscoveryErrors
        expr: rate(orchestrator_discovery_errors_total[5m]) > 0.1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Orchestrator has a high discovery error rate"
          description: "More than 0.1 discovery errors/second for the last 10 minutes."

      - alert: OrchestratorRecoveryOccurred
        expr: increase(orchestrator_recoveries_total[5m]) > 0
        labels:
          severity: critical
        annotations:
          summary: "Orchestrator performed a recovery"
          description: "A failover or recovery event occurred in the last 5 minutes."

      - alert: OrchestratorDown
        expr: up{job="orchestrator"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Orchestrator is unreachable"

Reference this file in your prometheus.yml:

rule_files:
  - orchestrator-alerts.yml
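Prometheus ships a promtool binary that can validate both files before you reload the server:

```shell
# Validate the alerting rules and the main config before reloading.
promtool check rules orchestrator-alerts.yml
promtool check config prometheus.yml
```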

Step 7: Kubernetes health endpoints

If running orchestrator in Kubernetes, use the built-in health check endpoints for liveness and readiness probes:

livenessProbe:
  httpGet:
    path: /api/status
    port: 3000
  initialDelaySeconds: 10
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /api/status
    port: 3000
  initialDelaySeconds: 5
  periodSeconds: 5

For the full list of metrics, see the Observability documentation.


Tutorial 4: Using the API v2

This tutorial introduces the v2 REST API, which provides structured JSON responses and proper HTTP status codes.

Prerequisites

  • A running orchestrator instance with at least one discovered topology

Step 1: Understand the response format

All v2 endpoints return a consistent JSON envelope:

{
  "status": "ok",
  "data": { ... }
}

On errors:

{
  "status": "error",
  "error": {
    "code": "ERROR_CODE",
    "message": "Human-readable description"
  }
}

HTTP status codes (200, 400, 404, 500, 503) are used correctly, unlike the v1 API which always returns 200.
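Because the envelope shape is fixed, scripts can branch on the status field instead of parsing free-form text. A small sketch using jq; the response is hard-coded here for illustration, but in practice it would come from a curl call:

```shell
# Sample v2 error envelope; in practice: response=$(curl -s "$url")
response='{"status":"error","error":{"code":"NOT_FOUND","message":"instance not found"}}'

# Branch on the envelope's status field.
if [ "$(printf '%s' "$response" | jq -r '.status')" = "ok" ]; then
  printf '%s\n' "$response" | jq '.data'
else
  echo "request failed: $(printf '%s' "$response" | jq -r '.error.code')"
fi
# prints: request failed: NOT_FOUND
```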

Step 2: List all clusters

curl -s http://localhost:3000/api/v2/clusters | python3 -m json.tool

Example response:

{
  "status": "ok",
  "data": [
    {
      "clusterName": "master.example.com:3306",
      "clusterAlias": "production",
      "instanceCount": 5
    }
  ]
}

Step 3: Get cluster details

curl -s http://localhost:3000/api/v2/clusters/master.example.com:3306 | python3 -m json.tool

Step 4: List instances in a cluster

curl -s http://localhost:3000/api/v2/clusters/master.example.com:3306/instances | python3 -m json.tool

Step 5: Get a specific instance

curl -s http://localhost:3000/api/v2/instances/replica1.example.com/3306 | python3 -m json.tool

Step 6: View the topology

curl -s http://localhost:3000/api/v2/clusters/master.example.com:3306/topology | python3 -m json.tool

Step 7: Check orchestrator health

curl -s -o /dev/null -w "%{http_code}" http://localhost:3000/api/v2/status

A 200 response means the node is healthy. A 500 response means it is not.

Step 8: View recent recoveries

# All recent recoveries
curl -s http://localhost:3000/api/v2/recoveries | python3 -m json.tool

# Filter by cluster
curl -s "http://localhost:3000/api/v2/recoveries?cluster=master.example.com:3306" | python3 -m json.tool

# Active recoveries only
curl -s http://localhost:3000/api/v2/recoveries/active | python3 -m json.tool

Step 9: Query ProxySQL servers via API v2

If ProxySQL hooks are configured:

# All servers
curl -s http://localhost:3000/api/v2/proxysql/servers | python3 -m json.tool

If ProxySQL is not configured, you will receive a 503 status:

{
  "status": "error",
  "error": {
    "code": "PROXYSQL_NOT_CONFIGURED",
    "message": "ProxySQL is not configured"
  }
}

Step 10: Scripting with the v2 API

The structured responses make scripting straightforward. Example: get all instance hostnames in a cluster using jq:

curl -s http://localhost:3000/api/v2/clusters/master.example.com:3306/instances \
  | jq -r '.data[].Key.Hostname'

Check if any recoveries happened in the last hour:

STATUS=$(curl -s -o /dev/null -w "%{http_code}" http://localhost:3000/api/v2/recoveries/active)
if [ "$STATUS" = "200" ]; then
  ACTIVE=$(curl -s http://localhost:3000/api/v2/recoveries/active | jq '.data | length')
  echo "Active recoveries: $ACTIVE"
fi

For the full endpoint reference, see the API v2 documentation. An OpenAPI 3.0 specification is also available for client generation.


Tutorial 5: Setting up orchestrator with PostgreSQL streaming replication

This tutorial walks you through configuring orchestrator to manage a PostgreSQL streaming replication topology. Orchestrator discovers PostgreSQL primaries and standbys, monitors replication health, and can perform automated failover when a primary fails.

What you will need

  • PostgreSQL 12+ primary with one or more streaming replication standbys already configured
  • Go 1.25+ installed (for building from source)
  • Network access from the orchestrator host to all PostgreSQL instances on port 5432

Step 1: Build orchestrator

git clone https://github.com/proxysql/orchestrator.git
cd orchestrator
go build -o bin/orchestrator ./go/cmd/orchestrator

Step 2: Create an orchestrator user on PostgreSQL

On your PostgreSQL primary (with streaming replication, a user created on the primary is automatically replicated to every standby):

CREATE USER orchestrator WITH PASSWORD 'orch_pass';
GRANT pg_monitor TO orchestrator;

The pg_monitor role grants read access to pg_stat_replication, pg_stat_wal_receiver, and other monitoring views that orchestrator needs for discovery.

Note: On PostgreSQL 9.6 or older (not recommended), the pg_monitor role does not exist (it was introduced in PostgreSQL 10); grant SELECT on the individual monitoring views instead.
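To confirm the grant took effect, connect as the new user and read one of those monitoring views (your-primary-host is a placeholder):

```shell
# pg_stat_replication is readable thanks to pg_monitor; expect one
# row per connected standby.
psql "host=your-primary-host user=orchestrator password=orch_pass dbname=postgres" \
  -c "SELECT application_name, state, replay_lsn FROM pg_stat_replication;"
```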

Step 3: Ensure pg_hba.conf allows connections

On each PostgreSQL instance, ensure pg_hba.conf allows connections from the orchestrator host:

# TYPE  DATABASE    USER           ADDRESS               METHOD
host    all         orchestrator   orchestrator-host/32  md5

Replace orchestrator-host/32 with the orchestrator host's IP address in CIDR notation; a bare hostname (without the /32 suffix) also works. If your cluster uses SCRAM authentication (the default from PostgreSQL 14), use scram-sha-256 in place of md5.

Reload PostgreSQL after editing:

psql -c "SELECT pg_reload_conf();"
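From the orchestrator host, you can then confirm the rule works end to end (pg-primary is the same placeholder used in the discovery step below):

```shell
# Should print a single row; a failure here points to a pg_hba.conf
# or network problem, not an orchestrator one.
psql "host=pg-primary port=5432 user=orchestrator password=orch_pass sslmode=require dbname=postgres" \
  -c "SELECT 1;"
```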

Step 4: Configure orchestrator for PostgreSQL

Create orchestrator.conf.json:

{
  "Debug": true,
  "ListenAddress": ":3000",
  "ProviderType": "postgresql",
  "PostgreSQLTopologyUser": "orchestrator",
  "PostgreSQLTopologyPassword": "orch_pass",
  "PostgreSQLSSLMode": "require",
  "BackendDB": "sqlite",
  "SQLite3DataFile": "/tmp/orchestrator.sqlite3",
  "DefaultInstancePort": 5432,
  "InstancePollSeconds": 5,
  "RecoverMasterClusterFilters": ["*"],
  "RecoverIntermediateMasterClusterFilters": ["*"],
  "FailureDetectionPeriodBlockMinutes": 60,
  "RecoveryPeriodBlockSeconds": 3600
}

Key fields explained:

  • ProviderType: Set to "postgresql" to enable PostgreSQL mode. The default is "mysql".
  • PostgreSQLTopologyUser / PostgreSQLTopologyPassword: Credentials orchestrator uses to connect to your PostgreSQL instances.
  • PostgreSQLSSLMode: SSL mode for PostgreSQL connections: disable, require, verify-ca, or verify-full.
  • DefaultInstancePort: Set to 5432 for PostgreSQL (the default is 3306, for MySQL).

Step 5: Start orchestrator

bin/orchestrator -config orchestrator.conf.json http

You should see output indicating the service has started and is listening on port 3000.

Step 6: Discover your PostgreSQL topology

Tell orchestrator about your PostgreSQL primary. Replace pg-primary with the actual hostname or IP:

curl http://localhost:3000/api/discover/pg-primary/5432

Expected output:

{
  "Key": {"Hostname": "pg-primary", "Port": 5432},
  "Uptime": 1,
  "FlavorName": "PostgreSQL",
  "Version": "16.2",
  "ReadOnly": false
}

Orchestrator connects to the primary, queries pg_stat_replication to discover connected standbys, and recursively probes each standby.

Step 7: Verify discovery in the web UI

Open your browser to http://localhost:3000. You should see your PostgreSQL replication topology visualized as a tree:

  • The primary node at the top (read-only: false)
  • Standby nodes underneath (read-only: true)
  • Replication lag displayed for each standby

Step 8: Verify via the API

List discovered clusters:

curl -s http://localhost:3000/api/clusters

View the topology:

curl -s http://localhost:3000/api/topology/pg-primary/5432

Check replication analysis (should show NoProblem if everything is healthy):

curl -s http://localhost:3000/api/replication-analysis

Example healthy output:

[]

An empty array means no problems detected.

Step 9: Test graceful failover

To verify failover works, you can simulate a primary failure by stopping PostgreSQL on the primary:

# On the primary host:
pg_ctl stop -D /var/lib/postgresql/16/main -m fast

Within a few seconds, orchestrator will detect the DeadPrimary condition and, if automated recovery is enabled, will:

  1. Select the best standby for promotion (lowest lag, most up-to-date WAL position)
  2. Call pg_promote() on the selected standby
  3. Reconfigure remaining standbys to replicate from the new primary via ALTER SYSTEM SET primary_conninfo and pg_reload_conf()
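For reference, here is a hedged sketch of the manual equivalent of steps 2 and 3, in case you ever need to run the promotion by hand. The hostnames and the replicator credentials are placeholders, and orchestrator's exact statements may differ:

```shell
# Step 2 equivalent: promote the chosen standby to primary.
psql -h chosen-standby -U postgres -c "SELECT pg_promote();"

# Step 3 equivalent: repoint each remaining standby at the new primary.
psql -h other-standby -U postgres <<'SQL'
ALTER SYSTEM SET primary_conninfo =
  'host=chosen-standby port=5432 user=replicator password=repl_pass';
SELECT pg_reload_conf();
SQL
```

Note that primary_conninfo is reloadable from PostgreSQL 13 onward; on PostgreSQL 12 the repointed standby must be restarted instead.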

Monitor the recovery in the web UI or via the API:

curl -s http://localhost:3000/api/replication-analysis

After recovery completes:

curl -s http://localhost:3000/api/topology/new-primary-host/5432

You should see the new primary at the top with the remaining standbys underneath.

Step 10: Verify the recovered topology

Check that all standbys are replicating from the new primary:

# On the new primary:
psql -c "SELECT client_hostname, state, sent_lsn, replay_lsn FROM pg_stat_replication;"

Check orchestrator's view:

curl -s http://localhost:3000/api/v2/clusters | python3 -m json.tool

Differences from MySQL mode

When running in PostgreSQL mode, be aware of these differences:

  • No intermediate masters. PostgreSQL streaming replication does not support cascading replication in the same way as MySQL. Orchestrator treats all standbys as direct replicas of the primary.
  • No GTID/Pseudo-GTID. PostgreSQL uses WAL (Write-Ahead Log) positions (LSN) instead of GTIDs. Orchestrator maps LSN to its internal binlog coordinate system.
  • Promotion uses pg_promote(). Instead of STOP SLAVE / RESET SLAVE / CHANGE MASTER, orchestrator calls pg_promote() on the standby to make it a primary.
  • Standby reconfiguration uses ALTER SYSTEM. To repoint a standby to a new primary, orchestrator updates primary_conninfo via ALTER SYSTEM and reloads the configuration.
  • ProxySQL integration is not supported in PostgreSQL mode. Use PgBouncer or another PostgreSQL-aware connection pooler.

Next steps