diff --git a/docs/quickstart.md b/docs/quickstart.md new file mode 100644 index 00000000..0a1543c5 --- /dev/null +++ b/docs/quickstart.md @@ -0,0 +1,132 @@ +# Quick Start Guide + +Orchestrator is a MySQL high availability and replication management tool that discovers replication topologies, enables refactoring of replica trees, and performs automated or manual failover. It runs as a service with a web UI, HTTP API, and CLI. + +This guide gets you from zero to a running orchestrator instance in under 5 minutes. + +## Prerequisites + +- **Go 1.25+** (for building from source) +- **gcc** (required by the SQLite driver via cgo) +- **MySQL 5.6+ or 8.0+** replication topology to manage (optional for initial setup) +- No external database required -- orchestrator can use a built-in SQLite backend + +## Step 1: Build from source + +```bash +git clone https://github.com/proxysql/orchestrator.git +cd orchestrator +go build -o bin/orchestrator ./go/cmd/orchestrator +``` + +Verify the build: + +```bash +bin/orchestrator --help +``` + +## Step 2: Create a minimal configuration + +Create a file called `orchestrator.conf.json` in the project root: + +```json +{ + "Debug": true, + "ListenAddress": ":3000", + "MySQLTopologyUser": "orc_client_user", + "MySQLTopologyPassword": "orc_client_password", + "MySQLOrchestratorHost": "", + "MySQLOrchestratorPort": 0, + "MySQLOrchestratorDatabase": "", + "BackendDB": "sqlite", + "SQLite3DataFile": "/tmp/orchestrator.sqlite3", + "DefaultInstancePort": 3306, + "DiscoverByShowSlaveHosts": true, + "InstancePollSeconds": 5, + "RecoverMasterClusterFilters": ["*"], + "RecoverIntermediateMasterClusterFilters": ["*"] +} +``` + +**Key fields explained:** + +| Field | Purpose | +|-------|---------| +| `ListenAddress` | HTTP listen address. `:3000` means all interfaces, port 3000. | +| `MySQLTopologyUser` / `Password` | Credentials orchestrator uses to connect to your MySQL instances. 
This user needs `SUPER`, `PROCESS`, `REPLICATION SLAVE`, and `REPLICATION CLIENT` privileges. | +| `BackendDB` / `SQLite3DataFile` | Use SQLite as orchestrator's own backend -- no external database needed. | +| `InstancePollSeconds` | How often orchestrator polls each MySQL instance. | +| `RecoverMasterClusterFilters` | Which clusters are eligible for automatic master recovery. `["*"]` enables all. | + +> **Tip:** For production deployments, use a MySQL backend instead of SQLite. See the [full configuration docs](configuration.md) for details. + +## Step 3: Create a MySQL user on your topology + +On your MySQL master (this will replicate to all replicas): + +```sql +CREATE USER 'orc_client_user'@'%' IDENTIFIED BY 'orc_client_password'; +GRANT SUPER, PROCESS, REPLICATION SLAVE, REPLICATION CLIENT ON *.* TO 'orc_client_user'@'%'; +``` + +## Step 4: Start orchestrator + +```bash +bin/orchestrator -config orchestrator.conf.json http +``` + +You should see output indicating the service has started and is listening on port 3000. + +## Step 5: Discover your topology + +Tell orchestrator about your MySQL master. Replace `your-master-host` with the actual hostname or IP: + +```bash +curl http://localhost:3000/api/discover/your-master-host/3306 +``` + +Orchestrator will connect to this instance, discover its replicas, and recursively crawl the entire topology. + +## Step 6: View in the web UI + +Open your browser to: + +``` +http://localhost:3000 +``` + +You will see your MySQL replication topology visualized as an interactive tree. 
From here you can: + +- Click on instances to see their details +- Drag and drop replicas to move them between masters +- View replication lag and errors at a glance + +## Step 7: Verify via CLI + +List discovered clusters: + +```bash +curl http://localhost:3000/api/clusters +``` + +View the topology as ASCII art: + +```bash +curl http://localhost:3000/api/topology/your-master-host/3306 +``` + +Check cluster health: + +```bash +curl http://localhost:3000/api/replication-analysis +``` + +## Next steps + +- [Full configuration guide](configuration.md) -- backend database options, discovery tuning, security, and more +- [ProxySQL integration](proxysql-hooks.md) -- built-in failover hooks for ProxySQL +- [API v2 reference](api-v2.md) -- structured REST API with proper HTTP status codes +- [Failure detection & recovery](failure-detection.md) -- how orchestrator detects and recovers from failures +- [High availability](high-availability.md) -- running orchestrator itself in an HA configuration +- [First steps](first-steps.md) -- deeper walkthrough of CLI commands and topology operations +- [Tutorials](tutorials.md) -- step-by-step guides for common workflows diff --git a/docs/reference.md b/docs/reference.md new file mode 100644 index 00000000..244f389d --- /dev/null +++ b/docs/reference.md @@ -0,0 +1,1365 @@ +# Orchestrator Reference Manual + +Complete reference for orchestrator configuration, CLI commands, and HTTP API endpoints. + +**Source of truth:** This document is generated from the orchestrator source code. Configuration fields come from the `Configuration` struct in `go/config/config.go`. CLI commands come from `go/app/cli.go`. API endpoints come from `go/http/api.go` and `go/http/apiv2.go`. + +--- + +## Table of Contents + +- [1. 
Configuration Reference](#1-configuration-reference) + - [1.1 General / Debug](#11-general--debug) + - [1.2 HTTP / Network](#12-http--network) + - [1.3 MySQL Topology Credentials](#13-mysql-topology-credentials) + - [1.4 MySQL Topology TLS](#14-mysql-topology-tls) + - [1.5 PostgreSQL Topology](#15-postgresql-topology) + - [1.6 Orchestrator Backend Database](#16-orchestrator-backend-database) + - [1.7 Orchestrator Backend TLS](#17-orchestrator-backend-tls) + - [1.8 MySQL Connection Timeouts](#18-mysql-connection-timeouts) + - [1.9 Raft Consensus](#19-raft-consensus) + - [1.10 Discovery](#110-discovery) + - [1.11 Instance Polling and Buffering](#111-instance-polling-and-buffering) + - [1.12 Hostname Resolution](#112-hostname-resolution) + - [1.13 Replication Lag and Checks](#113-replication-lag-and-checks) + - [1.14 Cluster Classification and Detection](#114-cluster-classification-and-detection) + - [1.15 Pseudo-GTID](#115-pseudo-gtid) + - [1.16 Binlog Analysis](#116-binlog-analysis) + - [1.17 Failure Detection and Recovery](#117-failure-detection-and-recovery) + - [1.18 Recovery Hook Processes](#118-recovery-hook-processes) + - [1.19 Master Failover Behavior](#119-master-failover-behavior) + - [1.20 Semi-Sync](#120-semi-sync) + - [1.21 Authentication and Security](#121-authentication-and-security) + - [1.22 SSL / TLS (Web)](#122-ssl--tls-web) + - [1.23 Agents](#123-agents) + - [1.24 Agent TLS](#124-agent-tls) + - [1.25 Audit](#125-audit) + - [1.26 Pools](#126-pools) + - [1.27 Promotion and Filters](#127-promotion-and-filters) + - [1.28 Consul / ZooKeeper / KV Stores](#128-consul--zookeeper--kv-stores) + - [1.29 Graphite](#129-graphite) + - [1.30 ProxySQL](#130-proxysql) + - [1.31 Prometheus](#131-prometheus) + - [1.32 Web UI / Miscellaneous](#132-web-ui--miscellaneous) +- [2. 
CLI Command Reference](#2-cli-command-reference) + - [2.1 Smart Relocation](#21-smart-relocation) + - [2.2 Classic file:pos Relocation](#22-classic-filepos-relocation) + - [2.3 Binlog Server Relocation](#23-binlog-server-relocation) + - [2.4 GTID Relocation](#24-gtid-relocation) + - [2.5 Pseudo-GTID Relocation](#25-pseudo-gtid-relocation) + - [2.6 Replication, General](#26-replication-general) + - [2.7 Replication Information](#27-replication-information) + - [2.8 Instance](#28-instance) + - [2.9 Binary Logs](#29-binary-logs) + - [2.10 Pools](#210-pools) + - [2.11 Information](#211-information) + - [2.12 Key-Value Stores](#212-key-value-stores) + - [2.13 Tags](#213-tags) + - [2.14 Instance Management](#214-instance-management) + - [2.15 Recovery](#215-recovery) + - [2.16 Instance Meta](#216-instance-meta) + - [2.17 Meta / Orchestrator Operations](#217-meta--orchestrator-operations) + - [2.18 Global Recoveries](#218-global-recoveries) + - [2.19 Bulk Operations](#219-bulk-operations) + - [2.20 ProxySQL](#220-proxysql) + - [2.21 Agent](#221-agent) +- [3. 
API v1 Reference](#3-api-v1-reference) + - [3.1 Smart Relocation](#31-smart-relocation) + - [3.2 Classic file:pos Relocation](#32-classic-filepos-relocation) + - [3.3 Binlog Server](#33-binlog-server) + - [3.4 GTID Relocation](#34-gtid-relocation) + - [3.5 Pseudo-GTID Relocation](#35-pseudo-gtid-relocation) + - [3.6 Topology / Promotion](#36-topology--promotion) + - [3.7 Replication Control](#37-replication-control) + - [3.8 Replication Information](#38-replication-information) + - [3.9 Instance Control](#39-instance-control) + - [3.10 Binary Logs](#310-binary-logs) + - [3.11 Pools](#311-pools) + - [3.12 Search and Discovery](#312-search-and-discovery) + - [3.13 Cluster Information](#313-cluster-information) + - [3.14 Tags](#314-tags) + - [3.15 Instance Management](#315-instance-management) + - [3.16 Recovery and Analysis](#316-recovery-and-analysis) + - [3.17 Problems and Audit](#317-problems-and-audit) + - [3.18 Health and Raft](#318-health-and-raft) + - [3.19 Hostname and Configuration](#319-hostname-and-configuration) + - [3.20 Bulk Operations](#320-bulk-operations) + - [3.21 Discovery Metrics](#321-discovery-metrics) + - [3.22 Agents](#322-agents) + - [3.23 ProxySQL](#323-proxysql) +- [4. API v2 Reference](#4-api-v2-reference) +- [5. ProxySQL Configuration](#5-proxysql-configuration) +- [6. Observability](#6-observability) + +--- + +## 1. Configuration Reference + +Orchestrator is configured via a JSON file. All fields belong to the `Configuration` struct in `go/config/config.go`. Passwords support environment variable substitution in the form `${ENV_VAR_NAME}`. + +### 1.1 General / Debug + +| Field | Type | Default | Description | +|-------|------|---------|-------------| +| `Debug` | bool | `false` | Set debug mode (similar to `--debug` option) | +| `EnableSyslog` | bool | `false` | Should logs be directed (in addition) to syslog daemon? 
| + +### 1.2 HTTP / Network + +| Field | Type | Default | Description | +|-------|------|---------|-------------| +| `ListenAddress` | string | `":3000"` | Where orchestrator HTTP should listen for TCP | +| `ListenSocket` | string | `""` | Where orchestrator HTTP should listen for unix socket (when given, TCP is disabled) | +| `HTTPAdvertise` | string | `""` | For raft setups, HTTP address this node advertises to peers. Must include scheme, host, and port (e.g. `http://11.22.33.44:3030`). Must not include a path. | +| `AgentsServerPort` | string | `":3001"` | Port orchestrator agents talk back to | +| `URLPrefix` | string | `""` | URL prefix to run orchestrator on non-root web path, e.g. `/orchestrator` for running behind nginx | +| `StatusEndpoint` | string | `"/api/status"` | Override the status endpoint | +| `StatusOUVerify` | bool | `false` | If true, try to verify OUs when Mutual TLS is on | + +### 1.3 MySQL Topology Credentials + +| Field | Type | Default | Description | +|-------|------|---------|-------------| +| `MySQLTopologyUser` | string | `""` | Username for connecting to topology MySQL instances | +| `MySQLTopologyPassword` | string | `""` | Password for connecting to topology MySQL instances. Supports `${ENV_VAR}` syntax. | +| `MySQLTopologyCredentialsConfigFile` | string | `""` | my.cnf-style config file for topology credentials. Reads `user` and `password` from the `[client]` section. 
| +| `MySQLTopologyMaxAllowedPacket` | int32 | `-1` | `max_allowed_packet` value when connecting to topology instances | + +### 1.4 MySQL Topology TLS + +| Field | Type | Default | Description | +|-------|------|---------|-------------| +| `MySQLTopologySSLPrivateKeyFile` | string | `""` | Private key file for TLS authentication with topology instances | +| `MySQLTopologySSLCertFile` | string | `""` | Certificate PEM file for TLS authentication with topology instances | +| `MySQLTopologySSLCAFile` | string | `""` | Certificate Authority PEM file for topology instance TLS | +| `MySQLTopologySSLSkipVerify` | bool | `false` | If true, do not strictly validate mutual TLS certs for topology instances | +| `MySQLTopologyUseMutualTLS` | bool | `false` | Turn on TLS authentication with the topology MySQL instances | +| `MySQLTopologyUseMixedTLS` | bool | `true` | Mixed TLS and non-TLS authentication with topology instances | +| `TLSCacheTTLFactor` | uint | `100` | Factor of `InstancePollSeconds` used as TLS info cache expiry | + +### 1.5 PostgreSQL Topology + +| Field | Type | Default | Description | +|-------|------|---------|-------------| +| `PostgreSQLTopologyUser` | string | `""` | Username for connecting to PostgreSQL topology instances | +| `PostgreSQLTopologyPassword` | string | `""` | Password for connecting to PostgreSQL topology instances | +| `PostgreSQLSSLMode` | string | `"require"` | SSL mode for PostgreSQL connections: `disable`, `require`, `verify-ca`, `verify-full` | + +### 1.6 Orchestrator Backend Database + +| Field | Type | Default | Description | +|-------|------|---------|-------------| +| `BackendDB` | string | `"mysql"` | EXPERIMENTAL: type of backend db; either `"mysql"` or `"sqlite3"` | +| `SQLite3DataFile` | string | `""` | Full path to sqlite3 data file (required when `BackendDB` is `"sqlite3"`) | +| `SkipOrchestratorDatabaseUpdate` | bool | `false` | When true, do not check backend schema nor attempt to update it. 
Useful when running multiple orchestrator versions. | +| `PanicIfDifferentDatabaseDeploy` | bool | `false` | When true, panic if backend DB was provisioned by a different version | +| `MySQLOrchestratorHost` | string | `""` | Hostname of the orchestrator backend MySQL instance | +| `MySQLOrchestratorPort` | uint | `3306` | Port of the orchestrator backend MySQL instance | +| `MySQLOrchestratorDatabase` | string | `""` | Database name for orchestrator backend | +| `MySQLOrchestratorUser` | string | `""` | Username for orchestrator backend MySQL | +| `MySQLOrchestratorPassword` | string | `""` | Password for orchestrator backend MySQL. Supports `${ENV_VAR}` syntax. | +| `MySQLOrchestratorCredentialsConfigFile` | string | `""` | my.cnf-style config file for backend credentials. Reads `user` and `password` from `[client]`. | +| `MySQLOrchestratorMaxPoolConnections` | int | `128` | Maximum size of the connection pool to the backend DB | +| `MySQLOrchestratorReadTimeoutSeconds` | int | `30` | Seconds before backend MySQL read operation is aborted (driver-side) | +| `MySQLOrchestratorRejectReadOnly` | bool | `false` | Reject read-only connections to backend | +| `MySQLOrchestratorMaxAllowedPacket` | int32 | `-1` | `max_allowed_packet` for backend MySQL connections | + +### 1.7 Orchestrator Backend TLS + +| Field | Type | Default | Description | +|-------|------|---------|-------------| +| `MySQLOrchestratorSSLPrivateKeyFile` | string | `""` | Private key file for TLS with backend MySQL | +| `MySQLOrchestratorSSLCertFile` | string | `""` | Certificate PEM file for TLS with backend MySQL | +| `MySQLOrchestratorSSLCAFile` | string | `""` | Certificate Authority PEM file for backend MySQL TLS | +| `MySQLOrchestratorSSLSkipVerify` | bool | `false` | Skip strict validation of mutual TLS certs for backend | +| `MySQLOrchestratorUseMutualTLS` | bool | `false` | Turn on TLS authentication with the backend MySQL instance | + +### 1.8 MySQL Connection Timeouts + +| Field | Type | 
Default | Description | +|-------|------|---------|-------------| +| `MySQLConnectTimeoutSeconds` | int | `2` | Seconds before connection is aborted (driver-side) | +| `MySQLDiscoveryReadTimeoutSeconds` | int | `10` | Seconds before topology read for discovery is aborted | +| `MySQLTopologyReadTimeoutSeconds` | int | `600` | Seconds before topology read (non-discovery) is aborted | +| `MySQLConnectionLifetimeSeconds` | int | `0` | Seconds the driver keeps a connection alive before recycling. 0 means unlimited. | +| `DefaultInstancePort` | int | `3306` | Default port when not specified on command line | + +### 1.9 Raft Consensus + +| Field | Type | Default | Description | +|-------|------|---------|-------------| +| `RaftEnabled` | bool | `false` | When true, set up orchestrator in a raft consensus layout. When false, all `Raft*` variables are ignored. | +| `RaftBind` | string | `"127.0.0.1:10008"` | Address to bind for raft communication | +| `RaftAdvertise` | string | `""` | Address to advertise for raft. Defaults to `RaftBind` if empty. | +| `RaftDataDir` | string | `""` | Directory for raft data storage (required when `RaftEnabled` is true) | +| `DefaultRaftPort` | int | `10008` | Default port for `RaftNodes` entries that don't specify a port | +| `RaftNodes` | []string | `[]` | Raft nodes to make initial connection with | +| `ExpectFailureAnalysisConcensus` | bool | `true` | Expect failure analysis consensus before recovery | + +### 1.10 Discovery + +| Field | Type | Default | Description | +|-------|------|---------|-------------| +| `DiscoverByShowSlaveHosts` | bool | `false` | Attempt `SHOW SLAVE HOSTS` before `PROCESSLIST` | +| `DiscoveryMaxConcurrency` | uint | `300` | Number of goroutines doing host discovery | +| `DiscoveryQueueCapacity` | uint | `100000` | Buffer size of the discovery queue. Should be greater than the number of DB instances. 
| +| `DiscoveryQueueMaxStatisticsSize` | int | `120` | Maximum number of individual per-second statistics kept for the discovery queue | +| `DiscoveryCollectionRetentionSeconds` | uint | `120` | Seconds to retain discovery collection information | +| `DiscoverySeeds` | []string | `[]` | Hard-coded array of `hostname:port` ensuring orchestrator discovers these on startup | +| `DiscoveryIgnoreReplicaHostnameFilters` | []string | `[]` | Regexp filters to prevent auto-discovering new replicas | +| `DiscoveryIgnoreMasterHostnameFilters` | []string | `[]` | Regexp filters to prevent auto-discovering a master | +| `DiscoveryIgnoreHostnameFilters` | []string | `[]` | Regexp filters to prevent discovering instances of any kind | +| `DiscoveryIgnoreReplicationUsernameFilters` | []string | `[]` | Regexp filters to prevent discovering instances with a matching replication username | +| `EnableDiscoveryFiltersLogs` | bool | `true` | Log filtered instances during discovery | +| `UnseenInstanceForgetHours` | uint | `240` | Hours after which an unseen instance is forgotten | +| `SnapshotTopologiesIntervalHours` | uint | `0` | Interval in hours between snapshot-topologies invocations. 0 disables. | + +### 1.11 Instance Polling and Buffering + +| Field | Type | Default | Description | +|-------|------|---------|-------------| +| `InstancePollSeconds` | uint | `5` | Seconds between instance reads | +| `DeadInstancePollSecondsMultiplyFactor` | float32 | `1` | Multiply factor for dead instance poll interval. Must be >= 1. | +| `DeadInstancePollSecondsMax` | uint | `300` | Maximum delay between dead instance read attempts | +| `DeadInstanceDiscoveryMaxConcurrency` | uint | `0` | Number of goroutines doing dead host discovery. 0 means unlimited. 
| +| `DeadInstanceDiscoveryLogsEnabled` | bool | `false` | Enable logs for dead instance discoveries | +| `ReasonableInstanceCheckSeconds` | uint | `1` | Seconds an instance read is allowed to take before `LastCheckValid` becomes false | +| `InstanceWriteBufferSize` | int | `100` | Max number of instances to flush in one `INSERT ... ON DUPLICATE KEY UPDATE` | +| `BufferInstanceWrites` | bool | `false` | Set to true for write-optimization on backend table (writes can be stale and overwrite non-stale data) | +| `InstanceFlushIntervalMilliseconds` | int | `100` | Max interval between instance write buffer flushes | +| `InstanceBulkOperationsWaitTimeoutSeconds` | uint | `10` | Time to wait on a single instance during bulk operations | +| `SkipMaxScaleCheck` | bool | `true` | Skip MaxScale BinlogServer checks. Set to true if you never have MaxScale in your topology. | +| `LowerReplicaVersionAllowed` | bool | `false` | Allow lower version replica to replicate from higher version master (produces a warning) | +| `UseSuperReadOnly` | bool | `false` | Should orchestrator use `super_read_only` any time it sets `read_only` | +| `MaxConcurrentReplicaOperations` | int | `5` | Maximum number of concurrent operations on replicas | + +### 1.12 Hostname Resolution + +| Field | Type | Default | Description | +|-------|------|---------|-------------| +| `HostnameResolveMethod` | string | `"default"` | Method to normalize hostname: `"none"`, `"default"`, `"cname"` | +| `MySQLHostnameResolveMethod` | string | `"@@hostname"` | Method to normalize hostname via MySQL: `"none"`, `"@@hostname"`, `"@@report_host"` | +| `SkipBinlogServerUnresolveCheck` | bool | `true` | Skip the double-check that an unresolved hostname resolves back for binlog servers | +| `ExpiryHostnameResolvesMinutes` | int | `60` | Minutes after which hostname resolves expire | +| `RejectHostnameResolvePattern` | string | `""` | Regexp pattern for resolved hostnames that will be rejected (not cached, not written to db) | 
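For illustration, the polling and hostname-resolution fields above might be combined in a config fragment like the following (the values, and the reject pattern in particular, are made-up examples rather than recommendations):

```json
{
  "InstancePollSeconds": 5,
  "HostnameResolveMethod": "cname",
  "MySQLHostnameResolveMethod": "@@hostname",
  "ExpiryHostnameResolvesMinutes": 60,
  "RejectHostnameResolvePattern": "localhost|\\.localdomain$"
}
```

With `HostnameResolveMethod` set to `"cname"`, orchestrator normalizes discovered hostnames via CNAME resolution before storing them; any resolved name matching `RejectHostnameResolvePattern` is neither cached nor written to the backend database.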
+ +### 1.13 Replication Lag and Checks + +| Field | Type | Default | Description | +|-------|------|---------|-------------| +| `SlaveLagQuery` | string | `""` | Synonym for `ReplicationLagQuery` (deprecated, use `ReplicationLagQuery`) | +| `ReplicationLagQuery` | string | `""` | Custom query to check replica lag (e.g. heartbeat table). Must return one row, one numeric column. | +| `ReplicationCredentialsQuery` | string | `""` | Custom query returning replication credentials: username, password, SSLCaCert, SSLCert, SSLKey. Optional. | +| `ReasonableReplicationLagSeconds` | int | `10` | Above this value, replication lag is considered a problem | +| `ReasonableMaintenanceReplicationLagSeconds` | int | `20` | Above this value, move-up and move-below are blocked | +| `ProblemIgnoreHostnameFilters` | []string | `[]` | Regexp filters to minimize problem visualization for matching hostnames | +| `VerifyReplicationFilters` | bool | `false` | Include replication filters check before approving topology refactoring | +| `ReduceReplicationAnalysisCount` | bool | `true` | When true, analysis only reports instances where problems are possible (skips most leaf nodes) | + +### 1.14 Cluster Classification and Detection + +| Field | Type | Default | Description | +|-------|------|---------|-------------| +| `ClusterNameToAlias` | map[string]string | `{}` | Map between regex matching cluster name to a human-friendly alias | +| `DetectClusterAliasQuery` | string | `""` | Optional query (on topology instance) that returns the alias of a cluster. Executed on master only. Must return one row, one column. | +| `DetectClusterDomainQuery` | string | `""` | Optional query (on topology instance) that returns the VIP/CNAME/domain for the cluster master. Must return one row, one column. | +| `DetectInstanceAliasQuery` | string | `""` | Optional query (on topology instance) that returns the alias of an instance. Must return one row, one column. 
| +| `DetectPromotionRuleQuery` | string | `""` | Optional query (on topology instance) that returns the promotion rule of an instance. Must return one row, one column. | +| `DataCenterPattern` | string | `""` | Regexp with one group, extracting datacenter name from hostname | +| `RegionPattern` | string | `""` | Regexp with one group, extracting region name from hostname | +| `PhysicalEnvironmentPattern` | string | `""` | Regexp with one group, extracting physical environment from hostname | +| `DetectDataCenterQuery` | string | `""` | Optional query returning the data center of an instance. Overrides `DataCenterPattern`. | +| `DetectRegionQuery` | string | `""` | Optional query returning the region of an instance. Overrides `RegionPattern`. | +| `DetectPhysicalEnvironmentQuery` | string | `""` | Optional query returning the physical environment of an instance. Overrides `PhysicalEnvironmentPattern`. | +| `DetectSemiSyncEnforcedQuery` | string | `""` | Optional query to determine whether semi-sync is fully enforced. Must return 0 or 1. | +| `RemoveTextFromHostnameDisplay` | string | `""` | Text to strip from hostname on cluster/clusters pages | +| `ReadOnly` | bool | `false` | When true, orchestrator operates in read-only mode | + +### 1.15 Pseudo-GTID + +| Field | Type | Default | Description | +|-------|------|---------|-------------| +| `AutoPseudoGTID` | bool | `false` | Should orchestrator automatically inject Pseudo-GTID entries to masters. When true, overrides `PseudoGTIDPattern` and related settings. | +| `PseudoGTIDPattern` | string | `""` | Pattern to look for in binlogs as a unique entry. When empty, Pseudo-GTID refactoring is disabled. 
| +| `PseudoGTIDPatternIsFixedSubstring` | bool | `false` | If true, `PseudoGTIDPattern` is a fixed substring (not regex), boosting search time | +| `PseudoGTIDMonotonicHint` | string | `""` | Substring in Pseudo-GTID entry indicating entries are monotonically increasing | +| `DetectPseudoGTIDQuery` | string | `""` | Optional query to authoritatively decide whether Pseudo-GTID is enabled on an instance | + +### 1.16 Binlog Analysis + +| Field | Type | Default | Description | +|-------|------|---------|-------------| +| `BinlogEventsChunkSize` | int | `10000` | Chunk size (X) for `SHOW BINLOG EVENTS LIMIT ?,X`. Smaller means less locking, more work. | +| `SkipBinlogEventsContaining` | []string | `[]` | When scanning binlogs for Pseudo-GTID, skip entries containing these substrings (not regex). | + +### 1.17 Failure Detection and Recovery + +| Field | Type | Default | Description | +|-------|------|---------|-------------| +| `FailureDetectionPeriodBlockMinutes` | int | `60` | Time (minutes) an instance's failure discovery is kept active, preventing concurrent discoveries | +| `RecoveryPeriodBlockMinutes` | int | `60` | (Deprecated: use `RecoveryPeriodBlockSeconds`) Time for which a recovery is kept active | +| `RecoveryPeriodBlockSeconds` | int | `3600` | Overrides `RecoveryPeriodBlockMinutes`. Time for which a recovery is kept active. 
| +| `RecoveryIgnoreHostnameFilters` | []string | `[]` | Recovery analysis completely ignores hosts matching these patterns | +| `RecoverMasterClusterFilters` | []string | `[]` | Only do master recovery on clusters matching these regexp patterns (`".*"` matches all) | +| `RecoverIntermediateMasterClusterFilters` | []string | `[]` | Only do intermediate-master recovery on clusters matching these patterns | +| `RecoverNonWriteableMaster` | bool | `false` | When true, treat a read-only master as a failure scenario and attempt to make it writeable | + +### 1.18 Recovery Hook Processes + +All process hook fields accept a list of shell commands. Placeholders available: `{failureType}`, `{instanceType}`, `{isMaster}`, `{isCoMaster}`, `{failureDescription}`, `{command}`, `{failedHost}`, `{failureCluster}`, `{failureClusterAlias}`, `{failureClusterDomain}`, `{failedPort}`, `{successorHost}`, `{successorPort}`, `{successorAlias}`, `{successorBinlogCoordinates}`, `{countReplicas}`, `{replicaHosts}`, `{isDowntimed}`, `{autoMasterRecovery}`, `{autoIntermediateMasterRecovery}`, `{isSuccessful}`, `{lostReplicas}`, `{countLostReplicas}`. + +| Field | Type | Default | Description | +|-------|------|---------|-------------| +| `ProcessesShellCommand` | string | `"bash"` | Shell that executes command scripts | +| `OnFailureDetectionProcesses` | []string | `[]` | Processes to execute when detecting a failover scenario (before deciding whether to failover) | +| `PreGracefulTakeoverProcesses` | []string | `[]` | Processes before a graceful takeover. Non-zero exit aborts the operation. | +| `PreFailoverProcesses` | []string | `[]` | Processes before a failover. Non-zero exit aborts the operation. 
| +| `PostFailoverProcesses` | []string | `[]` | Processes after a failover | +| `PostUnsuccessfulFailoverProcesses` | []string | `[]` | Processes after a not-completely-successful failover | +| `PostMasterFailoverProcesses` | []string | `[]` | Processes after a master failover | +| `PostIntermediateMasterFailoverProcesses` | []string | `[]` | Processes after an intermediate-master failover | +| `PostGracefulTakeoverProcesses` | []string | `[]` | Processes after a graceful master takeover | +| `PostTakeMasterProcesses` | []string | `[]` | Processes after a successful take-master event | + +### 1.19 Master Failover Behavior + +| Field | Type | Default | Description | +|-------|------|---------|-------------| +| `CoMasterRecoveryMustPromoteOtherCoMaster` | bool | `true` | When true, only the other co-master can be promoted. When false, any instance is eligible. | +| `DetachLostSlavesAfterMasterFailover` | bool | `true` | Synonym for `DetachLostReplicasAfterMasterFailover` | +| `DetachLostReplicasAfterMasterFailover` | bool | `false` | Forcibly detach replicas that were more up-to-date than the promoted replica | +| `ApplyMySQLPromotionAfterMasterFailover` | bool | `true` | Should orchestrator apply MySQL master promotion: `set read_only=0`, detach replication, etc. | +| `PreventCrossDataCenterMasterFailover` | bool | `false` | When true, cross-DC failover is not allowed | +| `PreventCrossRegionMasterFailover` | bool | `false` | When true, cross-region failover is not allowed | +| `MasterFailoverLostInstancesDowntimeMinutes` | uint | `0` | Minutes to downtime servers lost after master failover. 0 disables. | +| `MasterFailoverDetachSlaveMasterHost` | bool | `false` | Synonym for `MasterFailoverDetachReplicaMasterHost` | +| `MasterFailoverDetachReplicaMasterHost` | bool | `false` | Issue detach-replica-master-host on newly promoted master. Meaningless if `ApplyMySQLPromotionAfterMasterFailover` is true. 
|
+| `FailMasterPromotionOnLagMinutes` | uint | `0` | Fail master promotion if the candidate replica is lagging >= this many minutes. Requires `ReplicationLagQuery`. |
+| `FailMasterPromotionIfSQLThreadNotUpToDate` | bool | `false` | Abort promotion if the candidate has not consumed all relay logs. Cannot be true together with `DelayMasterPromotionIfSQLThreadNotUpToDate`. |
+| `DelayMasterPromotionIfSQLThreadNotUpToDate` | bool | `false` | Delay promotion until the SQL thread catches up. Cannot be true together with `FailMasterPromotionIfSQLThreadNotUpToDate`. |
+| `PostponeSlaveRecoveryOnLagMinutes` | uint | `0` | Synonym for `PostponeReplicaRecoveryOnLagMinutes` |
+| `PostponeReplicaRecoveryOnLagMinutes` | uint | `0` | On crash recovery, lagging replicas are resurrected late, after master/intermediate-master election. 0 disables. |
+
+### 1.20 Semi-Sync
+
+| Field | Type | Default | Description |
+|-------|------|---------|-------------|
+| `EnforceExactSemiSyncReplicas` | bool | `false` | If true, semi-sync replicas are enabled/disabled to match the wait count, in priority order |
+| `RecoverLockedSemiSyncMaster` | bool | `false` | If true, recover from a `LockedSemiSync` state by enabling semi-sync on replicas |
+| `ReasonableLockedSemiSyncMasterSeconds` | uint | `0` | Time to evaluate the `LockedSemiSync` hypothesis. Falls back to `ReasonableReplicationLagSeconds` if 0. |
+
+### 1.21 Authentication and Security
+
+| Field | Type | Default | Description |
+|-------|------|---------|-------------|
+| `AuthenticationMethod` | string | `""` | Type of authentication: `""` (none), `"basic"`, `"multi"`, `"proxy"`, `"token"` |
+| `HTTPAuthUser` | string | `""` | Username for HTTP Basic authentication (blank disables) |
+| `HTTPAuthPassword` | string | `""` | Password for HTTP Basic authentication |
+| `AuthUserHeader` | string | `"X-Forwarded-User"` | HTTP header indicating the auth user when `AuthenticationMethod` is `"proxy"` |
+| `PowerAuthUsers` | []string | `["*"]` | When `AuthenticationMethod` is `"proxy"`, the list of users that can make changes. All others are read-only. |
+| `PowerAuthGroups` | []string | `[]` | Unix groups the authenticated user must belong to for write access |
+| `AccessTokenUseExpirySeconds` | uint | `60` | Window within which an issued access token must first be used |
+| `AccessTokenExpiryMinutes` | uint | `1440` | Time after which an HTTP access token expires |
+| `OAuthClientId` | string | `""` | OAuth client ID |
+| `OAuthClientSecret` | string | `""` | OAuth client secret |
+| `OAuthScopes` | []string | `nil` | OAuth scopes |
+
+### 1.22 SSL / TLS (Web)
+
+| Field | Type | Default | Description |
+|-------|------|---------|-------------|
+| `UseSSL` | bool | `false` | Use SSL on the server web port |
+| `UseMutualTLS` | bool | `false` | Use mutual TLS for web and API connections |
+| `SSLSkipVerify` | bool | `false` | Ignore SSL certificate errors |
+| `SSLPrivateKeyFile` | string | `""` | SSL private key file |
+| `SSLCertFile` | string | `""` | SSL certificate file |
+| `SSLCAFile` | string | `""` | Certificate Authority file |
+| `SSLValidOUs` | []string | `[]` | Valid organizational units for mutual TLS |
+
+### 1.23 Agents
+
+| Field | Type | Default | Description |
+|-------|------|---------|-------------|
+| `ServeAgentsHttp` | bool | `false` | Spawn another HTTP interface dedicated for orchestrator-agent |
+| `AgentPollMinutes` | uint | `60` | Minutes between agent polling |
+| `UnseenAgentForgetHours` | uint | `6` | Hours after which an unseen agent is forgotten |
+| `StaleSeedFailMinutes` | uint | `60` | Minutes after which a stale (no progress) seed is considered failed |
+| `SeedAcceptableBytesDiff` | int64 | `8192` | Acceptable byte difference between seed source and target data |
+| `SeedWaitSecondsBeforeSend` | int64 | `2` | Seconds to wait before starting the send command on an agent |
+
+### 1.24 Agent TLS
+
+| Field | Type | Default | Description |
+|-------|------|---------|-------------|
+| `AgentsUseSSL` | bool | `false` | Listen on the agents port with SSL and connect to agents via SSL |
+| `AgentsUseMutualTLS` | bool | `false` | Use mutual TLS for server-to-agent communication |
+| `AgentSSLSkipVerify` | bool | `false` | Ignore SSL certificate errors for agents |
+| `AgentSSLPrivateKeyFile` | string | `""` | Agent SSL private key file |
+| `AgentSSLCertFile` | string | `""` | Agent SSL certificate file |
+| `AgentSSLCAFile` | string | `""` | Agent Certificate Authority file |
+| `AgentSSLValidOUs` | []string | `[]` | Valid organizational units for mutual TLS with agents |
+
+### 1.25 Audit
+
+| Field | Type | Default | Description |
+|-------|------|---------|-------------|
+| `AuditLogFile` | string | `""` | Name of log file for audit operations. Empty disables file logging. |
+| `AuditToSyslog` | bool | `false` | Write audit messages to syslog |
+| `AuditToBackendDB` | bool | `false` | Write audit messages to the backend DB's `audit` table |
+| `AuditPurgeDays` | uint | `7` | Days after which audit entries are purged from the database |
+
+### 1.26 Pools
+
+| Field | Type | Default | Description |
+|-------|------|---------|-------------|
+| `SupportFuzzyPoolHostnames` | bool | `true` | Allow `submit-pool-instances` to pass fuzzy (non-FQDN) instance names. Implies more backend queries. |
+| `InstancePoolExpiryMinutes` | uint | `60` | Minutes after which `database_instance_pool` entries expire |
+| `CandidateInstanceExpireMinutes` | uint | `60` | Minutes after which a candidate replica suggestion expires |
+
+### 1.27 Promotion and Filters
+
+| Field | Type | Default | Description |
+|-------|------|---------|-------------|
+| `PromotionIgnoreHostnameFilters` | []string | `[]` | Orchestrator will not promote replicas with hostnames matching these patterns |
+| `OSCIgnoreHostnameFilters` | []string | `[]` | OSC replica recommendation will ignore matching hostnames |
+
+### 1.28 Consul / ZooKeeper / KV Stores
+
+| Field | Type | Default | Description |
+|-------|------|---------|-------------|
+| `ConsulAddress` | string | `""` | Address of Consul HTTP API (e.g. `127.0.0.1:8500`) |
+| `ConsulScheme` | string | `"http"` | Scheme for Consul: `http` or `https` |
+| `ConsulAclToken` | string | `""` | ACL token for writing to Consul KV |
+| `ConsulCrossDataCenterDistribution` | bool | `false` | Auto-deduce all Consul DCs and write KVs in all DCs |
+| `ConsulKVStoreProvider` | string | `"consul"` | Consul KV store provider: `"consul"` or `"consul-txn"` |
+| `ConsulMaxKVsPerTransaction` | int | `5` | Maximum KV operations per single Consul Transaction. Requires the `"consul-txn"` provider. Range: 5-64. |
+| `ZkAddress` | string | `""` | Not yet supported. ZooKeeper server addresses in `srv1[:port1][,srv2[:port2]...]` format. Default port is 2181. |
+| `KVClusterMasterPrefix` | string | `"mysql/master"` | Prefix for cluster master entries in KV stores |
+
+### 1.29 Graphite
+
+| Field | Type | Default | Description |
+|-------|------|---------|-------------|
+| `GraphiteAddr` | string | `""` | Address of graphite port. If supplied, metrics will be written here. |
+| `GraphitePath` | string | `""` | Prefix for graphite path. May include `{hostname}` placeholder. |
+| `GraphiteConvertHostnameDotsToUnderscores` | bool | `true` | Convert hostname dots to underscores in graphite path |
+| `GraphitePollSeconds` | int | `60` | Graphite writes interval. 0 disables. |
+
+### 1.30 ProxySQL
+
+| Field | Type | Default | Description |
+|-------|------|---------|-------------|
+| `ProxySQLAdminAddress` | string | `""` | Address of ProxySQL Admin interface (e.g. `127.0.0.1`). Empty disables ProxySQL hooks. |
+| `ProxySQLAdminPort` | int | `6032` | Port of ProxySQL Admin interface |
+| `ProxySQLAdminUser` | string | `"admin"` | Username for ProxySQL Admin |
+| `ProxySQLAdminPassword` | string | `""` | Password for ProxySQL Admin |
+| `ProxySQLAdminUseTLS` | bool | `false` | Use TLS for ProxySQL Admin connection |
+| `ProxySQLWriterHostgroup` | int | `0` | ProxySQL hostgroup ID for the writer (master). Must be > 0 to enable hooks. |
+| `ProxySQLReaderHostgroup` | int | `0` | ProxySQL hostgroup ID for readers (replicas). Optional. |
+| `ProxySQLPreFailoverAction` | string | `"offline_soft"` | Pre-failover action on the old master: `"offline_soft"`, `"weight_zero"`, or `"none"` |
+
+### 1.31 Prometheus
+
+| Field | Type | Default | Description |
+|-------|------|---------|-------------|
+| `PrometheusEnabled` | bool | `true` | When true, expose Prometheus metrics on the `/metrics` endpoint |
+
+### 1.32 Web UI / Miscellaneous
+
+| Field | Type | Default | Description |
+|-------|------|---------|-------------|
+| `WebMessage` | string | `""` | If provided, shown on all web pages below the title bar |
+| `PrependMessagesWithOrcIdentity` | string | `""` | Use `"FQDN"`, `"hostname"`, or `"custom"` to prefix error messages. Empty/`"none"` disables. |
+| `CustomOrcIdentity` | string | `""` | Custom identity string when `PrependMessagesWithOrcIdentity` is `"custom"` |
+
+---
+
+## 2. CLI Command Reference
+
+Usage: `orchestrator -c <command> [-i <host>[:<port>]] [-d <host>[:<port>]] [options]`
+
+When `RaftEnabled` is true, CLI access is blocked by default.
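Every CLI operation has an HTTP API counterpart (section 3), so a raft deployment can drive the same operations over HTTP. As a minimal sketch, assuming the default `:3000` listen address and using `api_url` as a hypothetical helper name (not part of orchestrator), the CLI-to-API translation looks like this:

```shell
#!/usr/bin/env bash
# api_url is a hypothetical helper: it maps a command plus host:port
# arguments onto the /api/<command>/<host>/<port>[/<host>/<port>] path
# shape used throughout the API reference in section 3.
api_url() {
  local base="http://localhost:3000/api" path="$1"; shift
  local inst
  for inst in "$@"; do
    path="$path/${inst%%:*}/${inst##*:}"   # host:port -> host/port
  done
  echo "$base/$path"
}

# Equivalent of: orchestrator -c relocate -i replica1:3306 -d newmaster:3306
api_url relocate replica1:3306 newmaster:3306
# → http://localhost:3000/api/relocate/replica1/3306/newmaster/3306
```

Feeding the resulting URL to `curl` performs the operation over HTTP; this is roughly the translation that `orchestrator-client` automates.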
+Use `--ignore-raft-setup` to override, or use the `orchestrator-client` script, which speaks to the HTTP API.
+
+### Command Synonyms
+
+The following legacy command names are automatically mapped to their current equivalents:
+
+| Legacy Name | Current Name |
+|-------------|-------------|
+| `stop-slave` | `stop-replica` |
+| `start-slave` | `start-replica` |
+| `restart-slave` | `restart-replica` |
+| `reset-slave` | `reset-replica` |
+| `restart-slave-statements` | `restart-replica-statements` |
+| `relocate-slaves` | `relocate-replicas` |
+| `regroup-slaves` | `regroup-replicas` |
+| `move-up-slaves` | `move-up-replicas` |
+| `repoint-slaves` | `repoint-replicas` |
+| `enslave-siblings` | `take-siblings` |
+| `enslave-master` | `take-master` |
+| `get-candidate-slave` | `get-candidate-replica` |
+| `move-slaves-gtid` | `move-replicas-gtid` |
+| `regroup-slaves-gtid` | `regroup-replicas-gtid` |
+| `match-slaves` | `match-replicas` |
+| `match-up-slaves` | `match-up-replicas` |
+| `regroup-slaves-pgtid` | `regroup-replicas-pgtid` |
+| `which-cluster-osc-slaves` | `which-cluster-osc-replicas` |
+| `which-cluster-gh-ost-slaves` | `which-cluster-gh-ost-replicas` |
+| `which-slaves` | `which-replicas` |
+| `detach-slave`, `detach-replica`, `detach-slave-master-host` | `detach-replica-master-host` |
+| `reattach-slave`, `reattach-replica`, `reattach-slave-master-host` | `reattach-replica-master-host` |
+
+### 2.1 Smart Relocation
+
+| Command | Description | Example |
+|---------|-------------|---------|
+| `relocate` | Relocate a replica beneath another instance | `orchestrator -c relocate -i replica1:3306 -d newmaster:3306` |
+| `relocate-below` | Synonym for `relocate` (will be deprecated) | `orchestrator -c relocate-below -i replica1:3306 -d newmaster:3306` |
+| `relocate-replicas` | Relocate all or part of the replicas of a given instance under another instance | `orchestrator -c relocate-replicas -i oldmaster:3306 -d newmaster:3306` |
+| `take-siblings` | Turn all siblings of a replica into its sub-replicas | `orchestrator -c take-siblings -i replica1:3306` |
+| `regroup-replicas` | Given an instance, pick one of its replicas and make it the local master of its siblings | `orchestrator -c regroup-replicas -i master:3306` |
+
+### 2.2 Classic file:pos Relocation
+
+| Command | Description | Example |
+|---------|-------------|---------|
+| `move-up` | Move a replica one level up the topology | `orchestrator -c move-up -i replica1:3306` |
+| `move-up-replicas` | Move replicas of the given instance one level up the topology | `orchestrator -c move-up-replicas -i intermediate:3306` |
+| `move-below` | Move a replica beneath its sibling. Both must replicate from the same master. | `orchestrator -c move-below -i replica1:3306 -d sibling:3306` |
+| `move-equivalent` | Move a replica beneath another server using previously recorded equivalence coordinates | `orchestrator -c move-equivalent -i replica1:3306 -d target:3306` |
+| `repoint` | Make the given instance replicate from another instance without changing binlog coordinates. Use with care. | `orchestrator -c repoint -i replica1:3306 -d newmaster:3306` |
+| `repoint-replicas` | Repoint all replicas of the given instance to replicate back from the instance. Use with care. | `orchestrator -c repoint-replicas -i master:3306` |
+| `take-master` | Turn an instance into a master of its own master; essentially switch the two | `orchestrator -c take-master -i replica1:3306` |
+| `make-co-master` | Create master-master replication. The given instance is a replica replicating directly from a master. | `orchestrator -c make-co-master -i replica1:3306` |
+| `get-candidate-replica` | Suggest the most up-to-date replica of a given instance that is good for promotion | `orchestrator -c get-candidate-replica -i master:3306` |
+
+### 2.3 Binlog Server Relocation
+
+| Command | Description | Example |
+|---------|-------------|---------|
+| `regroup-replicas-bls` | Regroup Binlog Server replicas of a given instance | `orchestrator -c regroup-replicas-bls -i master:3306` |
+
+### 2.4 GTID Relocation
+
+| Command | Description | Example |
+|---------|-------------|---------|
+| `move-gtid` | Move a replica beneath another instance using GTID | `orchestrator -c move-gtid -i replica1:3306 -d newmaster:3306` |
+| `move-replicas-gtid` | Move all replicas of a given instance under another using GTID | `orchestrator -c move-replicas-gtid -i oldmaster:3306 -d newmaster:3306` |
+| `regroup-replicas-gtid` | Given an instance, pick one of its replicas and make it the local master of its siblings, using GTID | `orchestrator -c regroup-replicas-gtid -i master:3306` |
+
+### 2.5 Pseudo-GTID Relocation
+
+| Command | Description | Example |
+|---------|-------------|---------|
+| `match` | Match a replica beneath another instance using Pseudo-GTID | `orchestrator -c match -i replica1:3306 -d target:3306` |
+| `match-up` | Transport the replica one level up the hierarchy using Pseudo-GTID | `orchestrator -c match-up -i replica1:3306` |
+| `rematch` | Reconnect a replica onto its master via Pseudo-GTID | `orchestrator -c rematch -i replica1:3306` |
+| `match-replicas` | Match all replicas of a given instance under another using Pseudo-GTID | `orchestrator -c match-replicas -i oldmaster:3306 -d target:3306` |
+| `match-up-replicas` | Match replicas of the given instance one level up, making them siblings, using Pseudo-GTID | `orchestrator -c match-up-replicas -i intermediate:3306` |
+| `regroup-replicas-pgtid` | Given an instance, pick one of its replicas and make it the local master of its siblings, using Pseudo-GTID | `orchestrator -c regroup-replicas-pgtid -i master:3306` |
+
+### 2.6 Replication, General
+
+| Command | Description | Example |
+|---------|-------------|---------|
+| `enable-gtid` | If possible, turn on GTID replication | `orchestrator -c enable-gtid -i replica1:3306` |
+| `disable-gtid` | Turn off GTID replication, reverting to file:pos | `orchestrator -c disable-gtid -i replica1:3306` |
+| `which-gtid-errant` | Get the errant GTID set (empty if there is no errant GTID) | `orchestrator -c which-gtid-errant -i replica1:3306` |
+| `gtid-errant-reset-master` | Reset master on the instance, removing errant GTID transactions | `orchestrator -c gtid-errant-reset-master -i replica1:3306` |
+| `skip-query` | Skip a single statement on a replica (GTID or non-GTID) | `orchestrator -c skip-query -i replica1:3306` |
+| `stop-replica` | Issue a STOP SLAVE on an instance | `orchestrator -c stop-replica -i replica1:3306` |
+| `start-replica` | Issue a START SLAVE on an instance | `orchestrator -c start-replica -i replica1:3306` |
+| `restart-replica` | STOP and START SLAVE on an instance | `orchestrator -c restart-replica -i replica1:3306` |
+| `reset-replica` | Issue a RESET SLAVE command; use with care | `orchestrator -c reset-replica -i replica1:3306` |
+| `detach-replica-master-host` | Stop replication and modify Master_Host to an impossible yet reversible value | `orchestrator -c detach-replica-master-host -i replica1:3306` |
+| `reattach-replica-master-host` | Undo a detach-replica-master-host operation | `orchestrator -c reattach-replica-master-host -i replica1:3306` |
+| `master-pos-wait` | Wait until the replica reaches the given replication coordinates (`--binlog=file:pos`) | `orchestrator -c master-pos-wait -i replica1:3306 --binlog=mysql-bin.000003:12345` |
+| `enable-semi-sync-master` | Enable semi-sync replication (master-side) | `orchestrator -c enable-semi-sync-master -i master:3306` |
+| `disable-semi-sync-master` | Disable semi-sync replication (master-side) | `orchestrator -c disable-semi-sync-master -i master:3306` |
+| `enable-semi-sync-replica` | Enable semi-sync replication (replica-side) | `orchestrator -c enable-semi-sync-replica -i replica1:3306` |
+| `disable-semi-sync-replica` | Disable semi-sync replication (replica-side) | `orchestrator -c disable-semi-sync-replica -i replica1:3306` |
+| `restart-replica-statements` | Get a list of statements to stop, then restore the replica to the same execution state. Use `--statement` for an injected statement. | `orchestrator -c restart-replica-statements -i replica1:3306` |
+
+### 2.7 Replication Information
+
+| Command | Description | Example |
+|---------|-------------|---------|
+| `can-replicate-from` | Can instance (`-i`) replicate from another (`-d`) per replication rules? Prints the destination if yes. | `orchestrator -c can-replicate-from -i replica1:3306 -d master:3306` |
+| `is-replicating` | Is an instance actively replicating right now? | `orchestrator -c is-replicating -i replica1:3306` |
+| `is-replication-stopped` | Is an instance a replica with both replication threads stopped? | `orchestrator -c is-replication-stopped -i replica1:3306` |
+
+### 2.8 Instance
+
+| Command | Description | Example |
+|---------|-------------|---------|
+| `set-read-only` | Turn an instance read-only via `SET GLOBAL read_only := 1` | `orchestrator -c set-read-only -i master:3306` |
+| `set-writeable` | Turn an instance writeable via `SET GLOBAL read_only := 0` | `orchestrator -c set-writeable -i master:3306` |
+
+### 2.9 Binary Logs
+
+| Command | Description | Example |
+|---------|-------------|---------|
+| `flush-binary-logs` | Flush binary logs on an instance | `orchestrator -c flush-binary-logs -i master:3306` |
+| `purge-binary-logs` | Purge binary logs of an instance (requires `--binlog`) | `orchestrator -c purge-binary-logs -i master:3306 --binlog=mysql-bin.000003` |
+| `last-pseudo-gtid` | Find the latest Pseudo-GTID entry in the instance's binary logs | `orchestrator -c last-pseudo-gtid -i master:3306` |
+| `locate-gtid-errant` | List binary logs containing errant GTIDs | `orchestrator -c locate-gtid-errant -i replica1:3306` |
+| `last-executed-relay-entry` | Find coordinates of the last executed relay log entry | `orchestrator -c last-executed-relay-entry -i replica1:3306` |
+| `correlate-relaylog-pos` | Given an instance (`-i`) and relaylog coordinates (`--binlog=file:pos`), find the correlated coordinates in another instance's relay logs (`-d`) | `orchestrator -c correlate-relaylog-pos -i replica1:3306 -d replica2:3306 --binlog=relay-bin.000003:12345` |
+| `find-binlog-entry` | Get binlog file:pos of the entry given by `--pattern` (exact match) in a given instance | `orchestrator -c find-binlog-entry -i master:3306 --pattern "DROP VIEW"` |
+| `correlate-binlog-pos` | Given an instance (`-i`) and binlog coordinates (`--binlog=file:pos`), find the correlated coordinates in another instance (`-d`) | `orchestrator -c correlate-binlog-pos -i master:3306 -d replica1:3306 --binlog=mysql-bin.000003:12345` |
+
+### 2.10 Pools
+
+| Command | Description | Example |
+|---------|-------------|---------|
+| `submit-pool-instances` | Submit a pool name with a list of instances in that pool | `orchestrator -c submit-pool-instances --pool mypool -i "host1:3306,host2:3306"` |
+| `cluster-pool-instances` | List all pools and their associated instances | `orchestrator -c cluster-pool-instances` |
+| `which-heuristic-cluster-pool-instances` | List instances of a given cluster in any or a specific pool | `orchestrator -c which-heuristic-cluster-pool-instances --alias mycluster --pool mypool` |
+
+### 2.11 Information
+
+| Command | Description | Example |
+|---------|-------------|---------|
+| `find` | Find instances whose hostname matches given regex pattern | `orchestrator -c find --pattern "db-prod.*"` |
+| `search` | Search instances by name, version, version comment, port | `orchestrator -c search --pattern "5.7"` |
+| `clusters` | List all clusters known to orchestrator | `orchestrator -c clusters` |
+| `clusters-alias` | List all clusters with their aliases | `orchestrator -c clusters-alias` |
+| `all-clusters-masters` | List writeable masters, one per cluster | `orchestrator -c all-clusters-masters` |
+| `topology` | Show an ASCII-graph of a replication topology | `orchestrator -c topology -i master:3306` or `orchestrator -c topology --alias mycluster` |
+| `topology-tabulated` | Show an ASCII-graph of a replication topology (tabulated format) | `orchestrator -c topology-tabulated --alias mycluster` |
+| `topology-tags` | Show an ASCII-graph of a replication topology with instance tags | `orchestrator -c topology-tags --alias mycluster` |
+| `all-instances` | The complete list of known instances | `orchestrator -c all-instances` |
+| `which-instance` | Output the fully-qualified hostname:port of the given instance | `orchestrator -c which-instance -i host:3306` |
+| `which-cluster` | Output the cluster name an instance belongs to | `orchestrator -c which-cluster -i host:3306` |
+| `which-cluster-alias` | Output the alias of the cluster an instance belongs to | `orchestrator -c which-cluster-alias -i host:3306` |
+| `which-cluster-domain` | Output the domain name of the cluster an instance belongs to | `orchestrator -c which-cluster-domain -i host:3306` |
+| `which-heuristic-domain-instance` | Returns the instance associated as writer with a cluster's domain name | `orchestrator -c which-heuristic-domain-instance --alias mycluster` |
+| `which-cluster-master` | Output the name of the master in a given cluster | `orchestrator -c which-cluster-master --alias mycluster` |
+| `which-cluster-instances` | Output the list of instances in the same cluster | `orchestrator -c which-cluster-instances --alias mycluster` |
+| `which-cluster-osc-replicas` | Output replicas in a cluster suitable for pt-online-schema-change | `orchestrator -c which-cluster-osc-replicas --alias mycluster` |
+| `which-cluster-gh-ost-replicas` | Output replicas in a cluster suitable as a gh-ost working server | `orchestrator -c which-cluster-gh-ost-replicas --alias mycluster` |
+| `which-master` | Output the hostname:port of a given instance's master | `orchestrator -c which-master -i replica1:3306` |
+| `which-downtimed-instances` | List instances currently downtimed, optionally filtered by cluster | `orchestrator -c which-downtimed-instances --alias mycluster` |
+| `which-replicas` | Output the hostname:port list of replicas of a given instance | `orchestrator -c which-replicas -i master:3306` |
+| `which-lost-in-recovery` | List instances marked as downtimed for being lost in a recovery process | `orchestrator -c which-lost-in-recovery` |
+| `instance-status` | Output short status on a given instance | `orchestrator -c instance-status -i host:3306` |
+| `get-cluster-heuristic-lag` | Output a heuristic representative lag for a given cluster | `orchestrator -c get-cluster-heuristic-lag --alias mycluster` |
+
+### 2.12 Key-Value Stores
+
+| Command | Description | Example |
+|---------|-------------|---------|
+| `submit-masters-to-kv-stores` | Submit master of a specific cluster, or all masters, to key-value stores | `orchestrator -c submit-masters-to-kv-stores --alias mycluster` |
+
+### 2.13 Tags
+
+| Command | Description | Example |
+|---------|-------------|---------|
+| `tags` | List tags for a given instance | `orchestrator -c tags -i host:3306` |
+| `tag-value` | Get tag value for a specific instance (requires `--tag`) | `orchestrator -c tag-value -i host:3306 --tag "role"` |
+| `tagged` | List instances tagged by tag-string. Format: `"tagname"` or `"tagname=tagvalue"` or comma-separated for intersection. | `orchestrator -c tagged --tag "role=primary"` |
+| `tag` | Add a tag to a given instance (requires `--tag`) | `orchestrator -c tag -i host:3306 --tag "role=primary"` |
+| `untag` | Remove a tag from an instance (requires `--tag`) | `orchestrator -c untag -i host:3306 --tag "role"` |
+| `untag-all` | Remove a tag from all matching instances (requires `--tag`) | `orchestrator -c untag-all --tag "role"` |
+
+### 2.14 Instance Management
+
+| Command | Description | Example |
+|---------|-------------|---------|
+| `discover` | Lookup an instance, investigate it | `orchestrator -c discover -i host:3306` |
+| `forget` | Forget about an instance's existence | `orchestrator -c forget -i host:3306` |
+| `begin-maintenance` | Request a maintenance lock on an instance (requires `--reason`) | `orchestrator -c begin-maintenance -i host:3306 --reason "hardware upgrade" --duration 1h` |
+| `end-maintenance` | Remove maintenance lock from an instance | `orchestrator -c end-maintenance -i host:3306` |
+| `in-maintenance` | Check whether instance is under maintenance | `orchestrator -c in-maintenance -i host:3306` |
+| `begin-downtime` | Mark an instance as downtimed (requires `--reason`) | `orchestrator -c begin-downtime -i host:3306 --reason "planned maintenance" --duration 2h` |
+| `end-downtime` | Indicate an instance is no longer downtimed | `orchestrator -c end-downtime -i host:3306` |
+
+### 2.15 Recovery
+
+| Command | Description | Example |
+|---------|-------------|---------|
+| `recover` | Do auto-recovery given a dead instance | `orchestrator -c recover -i dead-master:3306` |
+| `recover-lite` | Do auto-recovery without executing external processes | `orchestrator -c recover-lite -i dead-master:3306` |
+| `force-master-failover` | Forcibly discard master and initiate failover, even if no problem detected. Orchestrator chooses the replacement. | `orchestrator -c force-master-failover --alias mycluster` |
+| `force-master-takeover` | Forcibly discard master and promote specified instance (`-d`) | `orchestrator -c force-master-takeover --alias mycluster -d newmaster:3306` |
+| `graceful-master-takeover` | Gracefully promote a new master. Specify identity via `-d` or have a single direct replica. | `orchestrator -c graceful-master-takeover --alias mycluster -d newmaster:3306` |
+| `graceful-master-takeover-auto` | Gracefully promote a new master. Orchestrator attempts to pick the replica automatically. | `orchestrator -c graceful-master-takeover-auto --alias mycluster` |
+| `replication-analysis` | Request an analysis of potential crash incidents in all known topologies | `orchestrator -c replication-analysis` |
+| `ack-all-recoveries` | Acknowledge all recoveries; unblocks pending future recoveries (requires `--reason`) | `orchestrator -c ack-all-recoveries --reason "all clear"` |
+| `ack-cluster-recoveries` | Acknowledge recoveries for a given cluster (requires `--reason`) | `orchestrator -c ack-cluster-recoveries --alias mycluster --reason "resolved"` |
+| `ack-instance-recoveries` | Acknowledge recoveries for a given instance (requires `--reason`) | `orchestrator -c ack-instance-recoveries -i host:3306 --reason "resolved"` |
+
+### 2.16 Instance Meta
+
+| Command | Description | Example |
+|---------|-------------|---------|
+| `register-candidate` | Indicate that an instance is a preferred candidate for master promotion | `orchestrator -c register-candidate -i replica1:3306 --promotion-rule prefer` |
+| `register-hostname-unresolve` | Assigns the given instance a virtual ("unresolved") name | `orchestrator -c register-hostname-unresolve -i host:3306 --hostname virtualname` |
+| `deregister-hostname-unresolve` | Deregister/disassociate a hostname with an "unresolved" name | `orchestrator -c deregister-hostname-unresolve -i host:3306` |
+| `set-heuristic-domain-instance` | Associate domain name of given cluster with the writer master | `orchestrator -c set-heuristic-domain-instance --alias mycluster` |
+
+### 2.17 Meta / Orchestrator Operations
+
+| Command | Description | Example |
+|---------|-------------|---------|
+| `snapshot-topologies` | Take a snapshot of existing topologies | `orchestrator -c snapshot-topologies` |
+| `continuous` | Enter continuous mode: actively poll for instances, diagnose problems, do maintenance | `orchestrator -c continuous` |
+| `active-nodes` | List currently active orchestrator nodes | `orchestrator -c active-nodes` |
+| `access-token` | Get an HTTP access token | `orchestrator -c access-token` |
+| `resolve` | Resolve given hostname | `orchestrator -c resolve -i host:3306` |
+| `reset-hostname-resolve-cache` | Clear the hostname resolve cache | `orchestrator -c reset-hostname-resolve-cache` |
+| `dump-config` | Print out configuration in JSON format | `orchestrator -c dump-config` |
+| `show-resolve-hosts` | Show the content of the hostname_resolve table (debugging) | `orchestrator -c show-resolve-hosts` |
+| `show-unresolve-hosts` | Show the content of the hostname_unresolve table (debugging) | `orchestrator -c show-unresolve-hosts` |
+| `redeploy-internal-db` | Force internal schema migration to current backend structure | `orchestrator -c redeploy-internal-db` |
+| `internal-suggest-promoted-replacement` | Internal only, used to test promotion logic in CI | `orchestrator -c internal-suggest-promoted-replacement -i old:3306 -d candidate:3306` |
+
+### 2.18 Global Recoveries
+
+| Command | Description | Example |
+|---------|-------------|---------|
+| `disable-global-recoveries` | Disallow orchestrator from performing recoveries globally | `orchestrator -c disable-global-recoveries` |
+| `enable-global-recoveries` | Allow orchestrator to perform recoveries globally | `orchestrator -c enable-global-recoveries` |
+| `check-global-recoveries` | Show the global recovery configuration | `orchestrator -c check-global-recoveries` |
+
+### 2.19 Bulk Operations
+
+| Command | Description | Example |
+|---------|-------------|---------|
+| `bulk-instances` | Return a sorted list of instance names known to orchestrator | `orchestrator -c bulk-instances` |
+| `bulk-promotion-rules` | Return a list of promotion rules known to orchestrator | `orchestrator -c bulk-promotion-rules` |
+
+### 2.20 ProxySQL
+
+| Command | Description | Example |
+|---------|-------------|---------|
+| `proxysql-test` | Test connectivity to ProxySQL Admin interface | `orchestrator -c proxysql-test` |
+| `proxysql-servers` | Show `mysql_servers` from ProxySQL | `orchestrator -c proxysql-servers` |
+
+### 2.21 Agent
+
+| Command | Description | Example |
+|---------|-------------|---------|
+| `custom-command` | Execute a custom command on the agent as defined in the agent conf | `orchestrator -c custom-command --hostname agenthost --pattern commandname` |
+
+---
+
+## 3. API v1 Reference
+
+All v1 API endpoints are accessible under `/api/`. All endpoints return JSON. When `URLPrefix` is configured, endpoints are at `/<URLPrefix>/api/`.
+
+Base URL: `http://<orchestrator-host>:3000/api/`
+
+### 3.1 Smart Relocation
+
+| Endpoint | Description |
+|----------|-------------|
+| `GET /api/relocate/{host}/{port}/{belowHost}/{belowPort}` | Relocate a replica beneath another instance |
+| `GET /api/relocate-below/{host}/{port}/{belowHost}/{belowPort}` | Same as relocate |
+| `GET /api/relocate-slaves/{host}/{port}/{belowHost}/{belowPort}` | Relocate all replicas of an instance to another |
+
+### 3.2 Classic file:pos Relocation
+
+| Endpoint | Description |
+|----------|-------------|
+| `GET /api/move-up/{host}/{port}` | Move a replica one level up |
+| `GET /api/move-up-slaves/{host}/{port}` | Move replicas of an instance one level up |
+| `GET /api/move-below/{host}/{port}/{siblingHost}/{siblingPort}` | Move a replica beneath its sibling |
+| `GET /api/move-equivalent/{host}/{port}/{belowHost}/{belowPort}` | Move via equivalence coordinates |
+| `GET /api/repoint/{host}/{port}/{belowHost}/{belowPort}` | Repoint without changing binlog coordinates |
+| `GET /api/repoint-slaves/{host}/{port}` | Repoint all replicas |
+| `GET /api/make-co-master/{host}/{port}` | Create co-master replication |
+| `GET /api/enslave-siblings/{host}/{port}` | Turn siblings into sub-replicas |
+| `GET /api/enslave-master/{host}/{port}` | Take over the master role |
+| `GET /api/master-equivalent/{host}/{port}/{logFile}/{logPos}` | Find equivalent coordinates on master |
+
+### 3.3 Binlog Server
+
+| Endpoint | Description |
+|----------|-------------|
+| `GET /api/regroup-slaves-bls/{host}/{port}` | Regroup Binlog Server replicas |
+
+### 3.4 GTID Relocation
+
+| Endpoint | Description |
+|----------|-------------|
+| `GET /api/move-below-gtid/{host}/{port}/{belowHost}/{belowPort}` | Move using GTID |
+| `GET /api/move-slaves-gtid/{host}/{port}/{belowHost}/{belowPort}` | Move replicas using GTID |
+| `GET /api/regroup-slaves-gtid/{host}/{port}` | Regroup replicas using GTID |
+
+### 3.5 Pseudo-GTID Relocation
+
+| Endpoint | Description |
+|----------|-------------|
+| `GET /api/match/{host}/{port}/{belowHost}/{belowPort}` | Match using Pseudo-GTID |
+| `GET /api/match-below/{host}/{port}/{belowHost}/{belowPort}` | Same as match |
+| `GET /api/match-up/{host}/{port}` | Match-up one level using Pseudo-GTID |
+| `GET /api/match-slaves/{host}/{port}/{belowHost}/{belowPort}` | Match all replicas using Pseudo-GTID |
+| `GET /api/match-up-slaves/{host}/{port}` | Match replicas up one level using Pseudo-GTID |
+| `GET /api/regroup-slaves-pgtid/{host}/{port}` | Regroup replicas using Pseudo-GTID |
+
+### 3.6 Topology / Promotion
+
+| Endpoint | Description |
+|----------|-------------|
+| `GET /api/make-master/{host}/{port}` | Promote an instance to master |
+| `GET /api/make-local-master/{host}/{port}` | Promote an instance to local master |
+| `GET /api/regroup-slaves/{host}/{port}` | Regroup replicas (smart mode) |
+
+### 3.7 Replication Control
+
+| Endpoint | Description |
+|----------|-------------|
+| `GET /api/enable-gtid/{host}/{port}` | Enable GTID replication |
+| `GET /api/disable-gtid/{host}/{port}` | Disable GTID replication |
+| `GET /api/locate-gtid-errant/{host}/{port}` | Locate errant GTID entries |
+| `GET /api/gtid-errant-reset-master/{host}/{port}` | Reset master to clear errant GTIDs |
+| `GET /api/gtid-errant-inject-empty/{host}/{port}` | Inject empty transactions for errant GTIDs |
+| `GET /api/skip-query/{host}/{port}` | Skip a query on a replica |
+| `GET /api/start-slave/{host}/{port}` | Start replication |
+| `GET /api/restart-slave/{host}/{port}` | Restart replication |
+| `GET /api/stop-slave/{host}/{port}` | Stop replication |
+| `GET /api/stop-slave-nice/{host}/{port}` | Stop replication nicely |
+| `GET /api/reset-slave/{host}/{port}` | Reset replication |
+| `GET /api/detach-slave/{host}/{port}` | Detach replica master host |
+| `GET /api/reattach-slave/{host}/{port}` | Reattach replica master host |
+| `GET /api/detach-slave-master-host/{host}/{port}` | Detach replica master host |
+| `GET /api/reattach-slave-master-host/{host}/{port}` | Reattach replica master host |
+| `GET /api/flush-binary-logs/{host}/{port}` | Flush binary logs |
+| `GET /api/purge-binary-logs/{host}/{port}/{logFile}` | Purge binary logs to given file |
+| `GET /api/restart-slave-statements/{host}/{port}` | Get restart replica statements |
+| `GET /api/enable-semi-sync-master/{host}/{port}` | Enable semi-sync (master-side) |
+| `GET /api/disable-semi-sync-master/{host}/{port}` | Disable semi-sync (master-side) |
+| `GET /api/enable-semi-sync-replica/{host}/{port}` | Enable semi-sync (replica-side) |
+| `GET /api/disable-semi-sync-replica/{host}/{port}` | Disable semi-sync (replica-side) |
+| `GET /api/delay-replication/{host}/{port}/{seconds}` | Set replication delay |
+
+### 3.8 Replication Information
+
+| Endpoint | Description |
+|----------|-------------|
+| `GET /api/can-replicate-from/{host}/{port}/{belowHost}/{belowPort}` | Check if replication is possible |
+| `GET /api/can-replicate-from-gtid/{host}/{port}/{belowHost}/{belowPort}` | Check GTID-based replication possibility |
+
+### 3.9 Instance Control
+
+| Endpoint | Description |
+|----------|-------------|
+| `GET /api/set-read-only/{host}/{port}` | Set instance read-only |
+| `GET /api/set-writeable/{host}/{port}` | Set instance writeable |
+| `GET /api/kill-query/{host}/{port}/{process}` | Kill a specific query |
+
+### 3.10 Binary Logs
+
+| Endpoint | Description |
+|----------|-------------|
+| `GET /api/last-pseudo-gtid/{host}/{port}` | Find last Pseudo-GTID entry |
+
+### 3.11 Pools
+
+| Endpoint | Description |
+|----------|-------------|
+| `GET /api/submit-pool-instances/{pool}` | Submit pool instances |
+| `GET /api/cluster-pool-instances/{clusterName}` | List pool instances for a cluster |
+| `GET /api/cluster-pool-instances/{clusterName}/{pool}` | List instances in a specific pool for a cluster |
+| `GET /api/heuristic-cluster-pool-instances/{clusterName}` | Heuristic pool instances for a cluster |
+| `GET /api/heuristic-cluster-pool-instances/{clusterName}/{pool}` | Heuristic instances for a specific pool |
+| `GET /api/heuristic-cluster-pool-lag/{clusterName}` | Heuristic pool lag for a cluster |
+| `GET /api/heuristic-cluster-pool-lag/{clusterName}/{pool}` | Heuristic lag for a specific pool |
+
+### 3.12 Search and Discovery
+
+| Endpoint | Description |
+|----------|-------------|
+| `GET /api/search/{searchString}` | Search instances by various attributes |
+| `GET /api/search` | Search instances (empty search returns all) |
+| `GET /api/instance/{host}/{port}` | Get instance details |
+| `GET /api/discover/{host}/{port}` | Discover an instance |
+| `GET /api/async-discover/{host}/{port}` | Asynchronously discover an instance |
+| `GET /api/refresh/{host}/{port}` | Refresh instance data |
+| `GET /api/forget/{host}/{port}` | Forget an instance |
+| `GET /api/forget-cluster/{clusterHint}` | Forget an entire cluster |
+
+### 3.13 Cluster Information
+
+| Endpoint | Description |
+|----------|-------------|
+| `GET /api/cluster/{clusterHint}` | Get cluster instances |
+| `GET /api/cluster/alias/{clusterAlias}` | Get cluster by alias |
+| `GET /api/cluster/instance/{host}/{port}` | Get cluster by instance |
+| `GET /api/cluster-info/{clusterHint}` | Get cluster info |
+| `GET /api/cluster-info/alias/{clusterAlias}` | Get cluster info by alias |
+| `GET /api/cluster-osc-slaves/{clusterHint}` | Get OSC replicas |
`GET /api/set-cluster-alias/{clusterName}` | Set a manual cluster alias override | +| `GET /api/clusters` | List all clusters | +| `GET /api/clusters-info` | List all clusters with info | +| `GET /api/masters` | List all masters | +| `GET /api/master/{clusterHint}` | Get master for a cluster | +| `GET /api/instance-replicas/{host}/{port}` | List replicas of an instance | +| `GET /api/all-instances` | List all instances | +| `GET /api/downtimed` | List all downtimed instances | +| `GET /api/downtimed/{clusterHint}` | List downtimed instances for a cluster | +| `GET /api/topology/{clusterHint}` | ASCII topology for a cluster | +| `GET /api/topology/{host}/{port}` | ASCII topology via instance | +| `GET /api/topology-tabulated/{clusterHint}` | Tabulated ASCII topology | +| `GET /api/topology-tabulated/{host}/{port}` | Tabulated ASCII topology via instance | +| `GET /api/topology-tags/{clusterHint}` | ASCII topology with tags | +| `GET /api/topology-tags/{host}/{port}` | ASCII topology with tags via instance | +| `GET /api/snapshot-topologies` | Snapshot all topologies | + +### 3.14 Tags + +| Endpoint | Description | +|----------|-------------| +| `GET /api/tagged` | List instances matching tag query | +| `GET /api/tags/{host}/{port}` | List tags for an instance | +| `GET /api/tag-value/{host}/{port}` | Get tag value | +| `GET /api/tag-value/{host}/{port}/{tagName}` | Get specific tag value | +| `GET /api/tag/{host}/{port}` | Set a tag | +| `GET /api/tag/{host}/{port}/{tagName}/{tagValue}` | Set a tag with name and value | +| `GET /api/untag/{host}/{port}` | Remove a tag | +| `GET /api/untag/{host}/{port}/{tagName}` | Remove a specific tag | +| `GET /api/untag-all` | Remove tag from all instances | +| `GET /api/untag-all/{tagName}/{tagValue}` | Remove specific tag from all instances | + +### 3.15 Instance Management + +| Endpoint | Description | +|----------|-------------| +| `GET /api/begin-maintenance/{host}/{port}/{owner}/{reason}` | Begin maintenance on an instance 
| +| `GET /api/end-maintenance/{host}/{port}` | End maintenance by instance key | +| `GET /api/in-maintenance/{host}/{port}` | Check if instance is in maintenance | +| `GET /api/end-maintenance/{maintenanceKey}` | End maintenance by maintenance key | +| `GET /api/maintenance` | List all active maintenance entries | +| `GET /api/begin-downtime/{host}/{port}/{owner}/{reason}` | Begin downtime | +| `GET /api/begin-downtime/{host}/{port}/{owner}/{reason}/{duration}` | Begin downtime with duration | +| `GET /api/end-downtime/{host}/{port}` | End downtime | + +### 3.16 Recovery and Analysis + +| Endpoint | Description | +|----------|-------------| +| `GET /api/replication-analysis` | Get replication analysis for all topologies | +| `GET /api/replication-analysis/{clusterName}` | Analysis for a specific cluster | +| `GET /api/replication-analysis/instance/{host}/{port}` | Analysis for a specific instance | +| `GET /api/recover/{host}/{port}` | Initiate recovery | +| `GET /api/recover/{host}/{port}/{candidateHost}/{candidatePort}` | Recover with candidate | +| `GET /api/recover-lite/{host}/{port}` | Recover without external processes | +| `GET /api/recover-lite/{host}/{port}/{candidateHost}/{candidatePort}` | Recover-lite with candidate | +| `GET /api/graceful-master-takeover/{host}/{port}` | Graceful master takeover | +| `GET /api/graceful-master-takeover/{host}/{port}/{designatedHost}/{designatedPort}` | Graceful takeover with designated | +| `GET /api/graceful-master-takeover/{clusterHint}` | Graceful takeover by cluster | +| `GET /api/graceful-master-takeover/{clusterHint}/{designatedHost}/{designatedPort}` | Graceful takeover by cluster with designated | +| `GET /api/graceful-master-takeover-auto/{host}/{port}` | Auto graceful takeover | +| `GET /api/graceful-master-takeover-auto/{host}/{port}/{designatedHost}/{designatedPort}` | Auto takeover with designated | +| `GET /api/graceful-master-takeover-auto/{clusterHint}` | Auto takeover by cluster | +| `GET 
/api/graceful-master-takeover-auto/{clusterHint}/{designatedHost}/{designatedPort}` | Auto takeover by cluster with designated | +| `GET /api/force-master-failover/{host}/{port}` | Force master failover | +| `GET /api/force-master-failover/{clusterHint}` | Force failover by cluster | +| `GET /api/force-master-takeover/{clusterHint}/{designatedHost}/{designatedPort}` | Force takeover by cluster | +| `GET /api/force-master-takeover/{host}/{port}/{designatedHost}/{designatedPort}` | Force takeover with specific instance | +| `GET /api/register-candidate/{host}/{port}/{promotionRule}` | Register promotion candidate | +| `GET /api/automated-recovery-filters` | Get recovery filters | +| `GET /api/audit-failure-detection` | Audit failure detections | +| `GET /api/audit-failure-detection/{page}` | Audit failure detections (paginated) | +| `GET /api/audit-failure-detection/id/{id}` | Audit failure detection by ID | +| `GET /api/audit-failure-detection/alias/{clusterAlias}` | Audit failure detection by alias | +| `GET /api/audit-failure-detection/alias/{clusterAlias}/{page}` | Audit failure detection by alias (paginated) | +| `GET /api/replication-analysis-changelog` | Replication analysis changelog | +| `GET /api/audit-recovery` | Audit recovery operations | +| `GET /api/audit-recovery/{page}` | Audit recovery (paginated) | +| `GET /api/audit-recovery/id/{id}` | Audit recovery by ID | +| `GET /api/audit-recovery/uid/{uid}` | Audit recovery by UID | +| `GET /api/audit-recovery/cluster/{clusterName}` | Audit recovery by cluster | +| `GET /api/audit-recovery/cluster/{clusterName}/{page}` | Audit recovery by cluster (paginated) | +| `GET /api/audit-recovery/alias/{clusterAlias}` | Audit recovery by alias | +| `GET /api/audit-recovery/alias/{clusterAlias}/{page}` | Audit recovery by alias (paginated) | +| `GET /api/audit-recovery-steps/{uid}` | Get recovery steps by UID | +| `GET /api/active-cluster-recovery/{clusterName}` | Active recoveries for a cluster | +| `GET 
/api/recently-active-cluster-recovery/{clusterName}` | Recently active recoveries for a cluster | +| `GET /api/recently-active-instance-recovery/{host}/{port}` | Recently active recoveries for an instance | +| `GET /api/ack-recovery/cluster/{clusterHint}` | Acknowledge cluster recovery | +| `GET /api/ack-recovery/cluster/alias/{clusterAlias}` | Acknowledge recovery by cluster alias | +| `GET /api/ack-recovery/instance/{host}/{port}` | Acknowledge instance recovery | +| `GET /api/ack-recovery/{recoveryId}` | Acknowledge recovery by ID | +| `GET /api/ack-recovery/uid/{uid}` | Acknowledge recovery by UID | +| `GET /api/ack-all-recoveries` | Acknowledge all recoveries | +| `GET /api/blocked-recoveries` | List blocked recoveries | +| `GET /api/blocked-recoveries/cluster/{clusterName}` | List blocked recoveries for a cluster | +| `GET /api/disable-global-recoveries` | Disable recoveries globally | +| `GET /api/enable-global-recoveries` | Enable recoveries globally | +| `GET /api/check-global-recoveries` | Check global recovery status | + +### 3.17 Problems and Audit + +| Endpoint | Description | +|----------|-------------| +| `GET /api/problems` | List all detected problems | +| `GET /api/problems/{clusterName}` | List problems for a cluster | +| `GET /api/audit` | Audit log | +| `GET /api/audit/{page}` | Audit log (paginated) | +| `GET /api/audit/instance/{host}/{port}` | Audit log for an instance | +| `GET /api/audit/instance/{host}/{port}/{page}` | Audit log for an instance (paginated) | +| `GET /api/resolve/{host}/{port}` | Resolve hostname | + +### 3.18 Health and Raft + +These endpoints do NOT proxy through the raft leader. 
+ +| Endpoint | Description | +|----------|-------------| +| `GET /api/headers` | Show request headers (for auth debugging) | +| `GET /api/health` | Health check | +| `GET /api/lb-check` | Load-balancer health check | +| `GET /api/_ping` | Same as `lb-check` | +| `GET /api/leader-check` | Returns 200 if this node is the leader | +| `GET /api/leader-check/{errorStatusCode}` | Leader check with custom error status code | +| `GET /api/grab-election` | Grab leadership election | +| `GET /api/raft-add-peer/{addr}` | Add a raft peer (proxied to leader) | +| `GET /api/raft-remove-peer/{addr}` | Remove a raft peer (proxied to leader) | +| `GET /api/raft-yield/{node}` | Yield raft leadership to a specific node | +| `GET /api/raft-yield-hint/{hint}` | Yield raft leadership with hint | +| `GET /api/raft-peers` | List raft peers | +| `GET /api/raft-state` | Get raft state | +| `GET /api/raft-leader` | Get current raft leader | +| `GET /api/raft-health` | Raft health check | +| `GET /api/raft-status` | Raft status | +| `GET /api/raft-snapshot` | Trigger raft snapshot | +| `GET /api/raft-follower-health-report/{authenticationToken}/{raftBind}/{raftAdvertise}` | Raft follower health report | +| `GET /api/reload-configuration` | Reload configuration from file | +| `GET /api/hostname-resolve-cache` | Show hostname resolve cache | +| `GET /api/reset-hostname-resolve-cache` | Reset hostname resolve cache | + +### 3.19 Hostname and Configuration + +| Endpoint | Description | +|----------|-------------| +| `GET /api/routed-leader-check` | Leader check (proxied through raft) | +| `GET /api/reelect` | Trigger re-election | +| `GET /api/reload-cluster-alias` | Reload cluster alias configuration | +| `GET /api/deregister-hostname-unresolve/{host}/{port}` | Deregister hostname unresolve | +| `GET /api/register-hostname-unresolve/{host}/{port}/{virtualname}` | Register hostname unresolve | + +### 3.20 Bulk Operations + +| Endpoint | Description | +|----------|-------------| +| `GET 
/api/bulk-instances` | Sorted list of all instance names | +| `GET /api/bulk-promotion-rules` | List of all promotion rules | + +### 3.21 Discovery Metrics + +| Endpoint | Description | +|----------|-------------| +| `GET /api/discovery-metrics-raw/{seconds}` | Raw discovery metrics | +| `GET /api/discovery-metrics-aggregated/{seconds}` | Aggregated discovery metrics | +| `GET /api/discovery-queue-metrics-raw/{seconds}` | Raw discovery queue metrics | +| `GET /api/discovery-queue-metrics-aggregated/{seconds}` | Aggregated discovery queue metrics | +| `GET /api/discovery-queue-metrics-raw/{queue}/{seconds}` | Raw metrics for a specific queue | +| `GET /api/discovery-queue-metrics-aggregated/{queue}/{seconds}` | Aggregated metrics for a specific queue | +| `GET /api/backend-query-metrics-raw/{seconds}` | Raw backend query metrics | +| `GET /api/backend-query-metrics-aggregated/{seconds}` | Aggregated backend query metrics | +| `GET /api/write-buffer-metrics-raw/{seconds}` | Raw write buffer metrics | +| `GET /api/write-buffer-metrics-aggregated/{seconds}` | Aggregated write buffer metrics | + +### 3.22 Agents + +| Endpoint | Description | +|----------|-------------| +| `GET /api/agents` | List all agents | +| `GET /api/agent/{host}` | Get agent details | +| `GET /api/agent-umount/{host}` | Unmount agent LV | +| `GET /api/agent-mount/{host}` | Mount agent LV | +| `GET /api/agent-create-snapshot/{host}` | Create LVM snapshot | +| `GET /api/agent-removelv/{host}` | Remove LVM logical volume | +| `GET /api/agent-mysql-stop/{host}` | Stop MySQL on agent | +| `GET /api/agent-mysql-start/{host}` | Start MySQL on agent | +| `GET /api/agent-seed/{targetHost}/{sourceHost}` | Seed (clone) from source to target | +| `GET /api/agent-active-seeds/{host}` | Active seeds for a host | +| `GET /api/agent-recent-seeds/{host}` | Recent seeds for a host | +| `GET /api/agent-seed-details/{seedId}` | Seed details | +| `GET /api/agent-seed-states/{seedId}` | Seed states | +| `GET 
/api/agent-abort-seed/{seedId}` | Abort a seed | +| `GET /api/agent-custom-command/{host}/{command}` | Execute custom agent command | +| `GET /api/seeds` | List all seeds | + +### 3.23 ProxySQL + +| Endpoint | Description | +|----------|-------------| +| `GET /api/proxysql/servers` | List all servers from ProxySQL `runtime_mysql_servers` | +| `GET /api/proxysql/servers/{hostgroup}` | List servers filtered by hostgroup ID | + +### 3.24 Status + +| Endpoint | Description | +|----------|-------------| +| `GET /api/status` | Status check (when `StatusEndpoint` is configured) | + +### 3.25 KV Stores + +| Endpoint | Description | +|----------|-------------| +| `GET /api/submit-masters-to-kv-stores` | Submit all masters to KV stores | +| `GET /api/submit-masters-to-kv-stores/{clusterHint}` | Submit specific cluster master to KV stores | + +--- + +## 4. API v2 Reference + +API v2 uses structured JSON envelopes with consistent response format. All v2 endpoints are under `/api/v2/` (respects `URLPrefix`). + +### Response Envelope + +All v2 responses use this structure: + +```json +{ + "status": "ok", + "data": { ... }, + "message": "" +} +``` + +Error responses: + +```json +{ + "status": "error", + "error": { + "code": "ERROR_CODE", + "message": "Human-readable message" + } +} +``` + +### Endpoints + +| Method | Endpoint | Description | +|--------|----------|-------------| +| `GET` | `/api/v2/clusters` | List all known clusters with metadata | +| `GET` | `/api/v2/clusters/{name}` | Detailed information about a specific cluster | +| `GET` | `/api/v2/clusters/{name}/instances` | All instances belonging to a given cluster | +| `GET` | `/api/v2/clusters/{name}/topology` | ASCII topology representation for a cluster | +| `GET` | `/api/v2/instances/{host}/{port}` | Detailed information about a specific MySQL instance | +| `GET` | `/api/v2/recoveries` | Recent recovery entries. Query params: `cluster`, `alias`, `page`. 
| +| `GET` | `/api/v2/recoveries/active` | Currently active (in-progress) recoveries | +| `GET` | `/api/v2/status` | Health status of the orchestrator node | +| `GET` | `/api/v2/proxysql/servers` | All servers from ProxySQL `runtime_mysql_servers` table | + +### Example Requests + +```bash +# List all clusters +curl http://localhost:3000/api/v2/clusters + +# Get cluster detail +curl http://localhost:3000/api/v2/clusters/mycluster + +# Get instances in a cluster +curl http://localhost:3000/api/v2/clusters/mycluster/instances + +# Get instance detail +curl http://localhost:3000/api/v2/instances/db1.example.com/3306 + +# Get recent recoveries filtered by cluster +curl "http://localhost:3000/api/v2/recoveries?cluster=mycluster&page=0" + +# Get active recoveries +curl http://localhost:3000/api/v2/recoveries/active + +# Health status +curl http://localhost:3000/api/v2/status + +# ProxySQL servers +curl http://localhost:3000/api/v2/proxysql/servers +``` + +--- + +## 5. ProxySQL Configuration + +Orchestrator has built-in support for updating ProxySQL hostgroups during failover. When configured, orchestrator automatically drains the old master and promotes the new master in ProxySQL without custom scripts. + +### Configuration Fields + +| Field | Type | Default | Description | +|-------|------|---------|-------------| +| `ProxySQLAdminAddress` | string | `""` | ProxySQL Admin host. Leave empty to disable all ProxySQL hooks. | +| `ProxySQLAdminPort` | int | `6032` | ProxySQL Admin port | +| `ProxySQLAdminUser` | string | `"admin"` | Admin interface username | +| `ProxySQLAdminPassword` | string | `""` | Admin interface password | +| `ProxySQLAdminUseTLS` | bool | `false` | Use TLS for Admin connection | +| `ProxySQLWriterHostgroup` | int | `0` | Writer hostgroup ID. Must be > 0 to enable hooks. | +| `ProxySQLReaderHostgroup` | int | `0` | Reader hostgroup ID. Optional. 
| +| `ProxySQLPreFailoverAction` | string | `"offline_soft"` | Action on old master before failover | + +### Minimal Configuration Example + +```json +{ + "ProxySQLAdminAddress": "127.0.0.1", + "ProxySQLAdminPort": 6032, + "ProxySQLAdminUser": "admin", + "ProxySQLAdminPassword": "admin", + "ProxySQLWriterHostgroup": 10, + "ProxySQLReaderHostgroup": 20, + "ProxySQLPreFailoverAction": "offline_soft" +} +``` + +### Pre-Failover Actions + +| Action | Behavior | +|--------|----------| +| `offline_soft` | Sets old master's status to `OFFLINE_SOFT`. Existing connections complete; no new ones are routed. | +| `weight_zero` | Sets old master's weight to 0. Similar effect but preserves the status field. | +| `none` | No pre-failover ProxySQL update. | + +### Post-Failover Behavior + +1. Old master is removed from the writer hostgroup +2. New master is added to the writer hostgroup +3. If reader hostgroup is configured: new master is removed from readers +4. If reader hostgroup is configured: old master is added to reader hostgroup as `OFFLINE_SOFT` +5. `LOAD MYSQL SERVERS TO RUNTIME` is executed +6. `SAVE MYSQL SERVERS TO DISK` is executed + +### Failover Timeline Integration + +``` +Dead master detected + -> OnFailureDetectionProcesses (scripts) + -> PreFailoverProcesses (scripts) + -> ProxySQL pre-failover: drain old master + -> [topology manipulation: elect new master] + -> KV store updates (Consul/ZK) + -> ProxySQL post-failover: promote new master + -> PostMasterFailoverProcesses (scripts) + -> PostFailoverProcesses (scripts) +``` + +ProxySQL hooks run alongside existing script-based hooks. They are non-blocking: if ProxySQL is unreachable, the failover proceeds normally. Post-failover errors are logged but do not mark the recovery as failed. 
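The post-failover sequence above maps to plain ProxySQL Admin SQL. As an illustration only (the hostnames and hostgroup IDs `10`/`20` are assumptions, and the exact statements orchestrator issues may differ), a sketch that builds the equivalent statements:

```shell
#!/bin/sh
# Sketch: build ProxySQL Admin SQL equivalent to orchestrator's
# post-failover steps. Hostnames and hostgroup IDs are examples.
OLD_MASTER="old-master.example.com"
NEW_MASTER="new-master.example.com"
WRITER_HG=10
READER_HG=20

build_post_failover_sql() {
  cat <<EOF
DELETE FROM mysql_servers WHERE hostgroup_id=${WRITER_HG} AND hostname='${OLD_MASTER}';
INSERT INTO mysql_servers (hostgroup_id, hostname, port) VALUES (${WRITER_HG}, '${NEW_MASTER}', 3306);
DELETE FROM mysql_servers WHERE hostgroup_id=${READER_HG} AND hostname='${NEW_MASTER}';
INSERT INTO mysql_servers (hostgroup_id, hostname, port, status) VALUES (${READER_HG}, '${OLD_MASTER}', 3306, 'OFFLINE_SOFT');
LOAD MYSQL SERVERS TO RUNTIME;
SAVE MYSQL SERVERS TO DISK;
EOF
}

build_post_failover_sql
# To apply for real, pipe into the Admin interface, e.g.:
# build_post_failover_sql | mysql -h 127.0.0.1 -P 6032 -u admin -padmin
```

Note the trailing `LOAD ... TO RUNTIME` / `SAVE ... TO DISK` pair: ProxySQL only acts on `mysql_servers` changes once they are loaded to runtime.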
+ +### CLI Commands + +```bash +# Test ProxySQL connectivity +orchestrator -c proxysql-test + +# Show ProxySQL server list +orchestrator -c proxysql-servers +``` + +### API Endpoints + +```bash +# List all servers +GET /api/proxysql/servers + +# List servers by hostgroup +GET /api/proxysql/servers/:hostgroup + +# V2 endpoint +GET /api/v2/proxysql/servers +``` + +### Multiple ProxySQL Instances + +For ProxySQL Cluster deployments, configure orchestrator to connect to one ProxySQL node. Changes propagate automatically via ProxySQL's cluster synchronization. For non-cluster setups, use `PostMasterFailoverProcesses` script hooks for additional ProxySQL instances. + +--- + +## 6. Observability + +### Prometheus Metrics + +When `PrometheusEnabled` is `true` (default), orchestrator exposes a `/metrics` endpoint in Prometheus scraping format. + +| Metric | Type | Description | +|--------|------|-------------| +| `orchestrator_discoveries_total` | Counter | Total number of discovery attempts | +| `orchestrator_discovery_errors_total` | Counter | Total number of failed discoveries | +| `orchestrator_instances_total` | Gauge | Total number of known instances | +| `orchestrator_clusters_total` | Gauge | Total number of known clusters | +| `orchestrator_recoveries_total` | Counter | Recovery attempts (labels: `type`, `result`) | +| `orchestrator_recovery_duration_seconds` | Histogram | Duration of recovery operations | + +#### Prometheus Scrape Configuration + +```yaml +scrape_configs: + - job_name: orchestrator + static_configs: + - targets: ['orchestrator:3000'] + metrics_path: /metrics + scrape_interval: 15s +``` + +### Health Check Endpoints + +| Endpoint | Purpose | Success | Failure | +|----------|---------|---------|---------| +| `GET /health/live` | Liveness probe. Returns 200 if process is running. | `{"status": "alive"}` | N/A (process down) | +| `GET /health/ready` | Readiness probe. Returns 200 if backend DB is connected and health checks pass. 
| `{"status": "ready"}` | 503 `{"status": "not ready"}` | +| `GET /health/leader` | Leader check. Returns 200 if this is the raft leader or active node. | `{"status": "leader"}` | 503 `{"status": "not leader"}` | + +### Additional Health Endpoints (API v1) + +| Endpoint | Purpose | +|----------|---------| +| `GET /api/health` | General health check | +| `GET /api/lb-check` | Load-balancer health check | +| `GET /api/_ping` | Same as `lb-check` | +| `GET /api/leader-check` | Leader check for load balancers | +| `GET /api/raft-health` | Raft-specific health check | +| `GET /api/raft-status` | Raft status details | +| `GET /api/status` | Status check (configurable via `StatusEndpoint`) | +| `GET /api/v2/status` | V2 status endpoint | + +### Graphite Integration + +Configure `GraphiteAddr` and `GraphitePath` to push metrics to Graphite: + +```json +{ + "GraphiteAddr": "graphite.example.com:2003", + "GraphitePath": "orchestrator.{hostname}", + "GraphiteConvertHostnameDotsToUnderscores": true, + "GraphitePollSeconds": 60 +} +``` + +### Kubernetes Deployment + +```yaml +livenessProbe: + httpGet: + path: /health/live + port: 3000 + initialDelaySeconds: 5 + periodSeconds: 10 +readinessProbe: + httpGet: + path: /health/ready + port: 3000 + initialDelaySeconds: 10 + periodSeconds: 5 +``` + +Use `/health/leader` to direct traffic only to the leader in multi-node raft deployments. 
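A wrapper script or load balancer only needs the HTTP status code from `/health/leader` to route to the leader. A minimal sketch of that routing logic, with the HTTP probe stubbed out so it runs without a live cluster (the stub and node names are illustrative; the commented `curl` line shows the assumed real call):

```shell
#!/bin/sh
# Find the raft leader among a list of orchestrator nodes by probing
# /health/leader on each. The probe is injectable for illustration.
probe_leader() {
  # Real deployments would use something like:
  # curl -s -o /dev/null -w '%{http_code}' "http://$1:3000/health/leader"
  fake_probe "$1"
}

find_leader() {
  for node in "$@"; do
    if [ "$(probe_leader "$node")" = "200" ]; then
      echo "$node"
      return 0
    fi
  done
  return 1   # no leader found
}

# Stub: pretend orc2 is the current leader.
fake_probe() { [ "$1" = "orc2" ] && echo 200 || echo 503; }

find_leader orc1 orc2 orc3   # prints: orc2
```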
+ +### Discovery Metrics API + +For internal monitoring, orchestrator exposes raw and aggregated metrics via the v1 API: + +| Endpoint | Description | +|----------|-------------| +| `GET /api/discovery-metrics-raw/{seconds}` | Raw discovery metrics for the last N seconds | +| `GET /api/discovery-metrics-aggregated/{seconds}` | Aggregated discovery metrics | +| `GET /api/discovery-queue-metrics-raw/{seconds}` | Raw discovery queue metrics | +| `GET /api/discovery-queue-metrics-aggregated/{seconds}` | Aggregated discovery queue metrics | +| `GET /api/backend-query-metrics-raw/{seconds}` | Raw backend query metrics | +| `GET /api/backend-query-metrics-aggregated/{seconds}` | Aggregated backend query metrics | +| `GET /api/write-buffer-metrics-raw/{seconds}` | Raw write buffer metrics | +| `GET /api/write-buffer-metrics-aggregated/{seconds}` | Aggregated write buffer metrics | diff --git a/docs/tutorials.md b/docs/tutorials.md new file mode 100644 index 00000000..7e6211e2 --- /dev/null +++ b/docs/tutorials.md @@ -0,0 +1,499 @@ +# Tutorials + +Step-by-step guides for common orchestrator workflows. + +--- + +## Tutorial 1: Setting up orchestrator with a MySQL topology + +This tutorial walks you through setting up orchestrator to manage an existing MySQL master-replica topology. 
+ +### What you will need + +- A running MySQL master with one or more replicas (MySQL 5.7+ or 8.0+) +- Go 1.25+ installed +- Network access from the orchestrator host to all MySQL instances on port 3306 + +### Step 1: Build orchestrator + +```bash +git clone https://github.com/proxysql/orchestrator.git +cd orchestrator +go build -o bin/orchestrator ./go/cmd/orchestrator +``` + +### Step 2: Create a MySQL user for orchestrator + +On your MySQL **master** (this will replicate to all replicas automatically): + +```sql +CREATE USER 'orc_topology'@'orchestrator-host' IDENTIFIED BY 'a_secure_password'; +GRANT SUPER, PROCESS, REPLICATION SLAVE, REPLICATION CLIENT ON *.* TO 'orc_topology'@'orchestrator-host'; +``` + +Replace `orchestrator-host` with the hostname or IP of the machine running orchestrator. Use `%` for any host. + +### Step 3: Create a MySQL backend database + +For production use, orchestrator should store its data in MySQL rather than SQLite. On a MySQL instance (can be the same master, or a separate server): + +```sql +CREATE DATABASE orchestrator; +CREATE USER 'orc_server'@'localhost' IDENTIFIED BY 'another_secure_password'; +GRANT ALL ON orchestrator.* TO 'orc_server'@'localhost'; +``` + +### Step 4: Write the configuration file + +Create `orchestrator.conf.json`: + +```json +{ + "Debug": false, + "ListenAddress": ":3000", + "MySQLTopologyUser": "orc_topology", + "MySQLTopologyPassword": "a_secure_password", + "MySQLOrchestratorHost": "127.0.0.1", + "MySQLOrchestratorPort": 3306, + "MySQLOrchestratorDatabase": "orchestrator", + "MySQLOrchestratorUser": "orc_server", + "MySQLOrchestratorPassword": "another_secure_password", + "DefaultInstancePort": 3306, + "DiscoverByShowSlaveHosts": true, + "InstancePollSeconds": 5, + "ReasonableReplicationLagSeconds": 10, + "RecoverMasterClusterFilters": ["*"], + "RecoverIntermediateMasterClusterFilters": ["*"], + "ApplyMySQLPromotionAfterMasterFailover": true, + "FailureDetectionPeriodBlockMinutes": 60, + 
"RecoveryPeriodBlockSeconds": 3600 +} +``` + +### Step 5: Start orchestrator + +```bash +bin/orchestrator -config orchestrator.conf.json http +``` + +### Step 6: Discover the topology + +```bash +curl http://localhost:3000/api/discover/your-master-host/3306 +``` + +Wait a few seconds for orchestrator to crawl the replicas, then verify: + +```bash +curl -s http://localhost:3000/api/topology/your-master-host/3306 +``` + +You should see your full replication tree printed as indented text. + +### Step 7: Verify in the web UI + +Open `http://localhost:3000` in your browser. Click on **Clusters** in the navigation to see your topology visualized as a tree. + +### Step 8: Test a topology operation + +Move a replica to a different position (dry run with the API): + +```bash +# List replicas of the master +curl -s http://localhost:3000/api/instance-replicas/your-master-host/3306 +``` + +You now have a fully operational orchestrator instance managing your MySQL topology. + +--- + +## Tutorial 2: Configuring ProxySQL failover hooks + +This tutorial sets up orchestrator to automatically update ProxySQL hostgroups during master failover, so your application traffic is rerouted without any custom scripts. + +### Prerequisites + +- A working orchestrator setup (see Tutorial 1) +- ProxySQL installed and running with the Admin interface accessible +- Your MySQL servers already configured as backends in ProxySQL + +### Step 1: Verify ProxySQL Admin access + +```bash +mysql -h 127.0.0.1 -P 6032 -u admin -padmin -e "SELECT * FROM runtime_mysql_servers;" +``` + +You should see your MySQL servers listed with their hostgroups. + +### Step 2: Note your hostgroup IDs + +Identify which hostgroup ID is used for writers and which for readers: + +```bash +mysql -h 127.0.0.1 -P 6032 -u admin -padmin \ + -e "SELECT hostgroup_id, hostname, port, status FROM runtime_mysql_servers;" +``` + +For example, if writers are in hostgroup `10` and readers in hostgroup `20`, you will use those values below. 
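When the server list is long, the distinct hostgroup IDs can be pulled out of the query output mechanically. A sketch against sample whitespace-separated output, as produced by `mysql -N -B` (the sample rows are made up):

```shell
#!/bin/sh
# Extract distinct hostgroup IDs from `SELECT hostgroup_id, hostname,
# port, status` output. The sample stands in for real ProxySQL output.
sample_output="10 db1.example.com 3306 ONLINE
20 db2.example.com 3306 ONLINE
20 db3.example.com 3306 ONLINE"

hostgroups=$(printf '%s\n' "$sample_output" | awk '{print $1}' | sort -n | uniq)
printf '%s\n' "$hostgroups"   # prints: 10 and 20, one per line
```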
+ +### Step 3: Add ProxySQL settings to orchestrator config + +Add these fields to your `orchestrator.conf.json`: + +```json +{ + "ProxySQLAdminAddress": "127.0.0.1", + "ProxySQLAdminPort": 6032, + "ProxySQLAdminUser": "admin", + "ProxySQLAdminPassword": "admin", + "ProxySQLWriterHostgroup": 10, + "ProxySQLReaderHostgroup": 20, + "ProxySQLPreFailoverAction": "offline_soft" +} +``` + +| Field | Description | +|-------|-------------| +| `ProxySQLWriterHostgroup` | The hostgroup ID where the current master lives. Must be > 0 to enable hooks. | +| `ProxySQLReaderHostgroup` | The hostgroup ID for read replicas. Optional but recommended. | +| `ProxySQLPreFailoverAction` | What to do with the old master before failover: `offline_soft` (drain connections), `weight_zero`, or `none`. | + +### Step 4: Restart orchestrator + +```bash +# Stop the running instance (Ctrl+C), then: +bin/orchestrator -config orchestrator.conf.json http +``` + +### Step 5: Verify ProxySQL connectivity + +```bash +curl -s http://localhost:3000/api/proxysql/servers | python3 -m json.tool +``` + +You should see your ProxySQL server list returned as JSON. + +### Step 6: Understand the failover flow + +When orchestrator detects a dead master and performs recovery: + +1. **Pre-failover:** The old master is set to `OFFLINE_SOFT` in ProxySQL (no new connections) +2. **Topology recovery:** Orchestrator promotes a replica to be the new master +3. **Post-failover:** The new master is added to the writer hostgroup; the old master is removed +4. ProxySQL applies changes immediately via `LOAD MYSQL SERVERS TO RUNTIME` + +ProxySQL hooks are non-blocking: if ProxySQL is unreachable, the MySQL failover still proceeds. 
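The non-blocking behavior can be expressed as a simple pattern: attempt the ProxySQL update, log on failure, never propagate the error. A sketch of that pattern (the `proxysql_cmd` stand-in is an assumption for illustration, not orchestrator's actual code):

```shell
#!/bin/sh
# Non-blocking hook pattern: a failed ProxySQL update is logged and
# swallowed so the MySQL failover itself always proceeds.
proxysql_update() {
  if proxysql_cmd "$@"; then
    echo "proxysql: updated"
  else
    echo "warn: ProxySQL unreachable, continuing failover" >&2
  fi
  return 0   # never propagate the error to the caller
}

# Stand-in for a real Admin-interface call; it always fails here,
# as if ProxySQL were down.
proxysql_cmd() { return 1; }

proxysql_update "drain old master" && echo "failover continues"
```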
+ +### Step 7: Test with a graceful takeover + +To verify everything works without an actual failure, perform a graceful master takeover: + +```bash +# Identify the current master +curl -s http://localhost:3000/api/clusters + +# Perform a graceful takeover (promotes a replica, demotes the master) +curl -s http://localhost:3000/api/graceful-master-takeover/your-cluster-alias/your-new-master-host/3306 +``` + +Check ProxySQL to confirm the hostgroups updated: + +```bash +mysql -h 127.0.0.1 -P 6032 -u admin -padmin \ + -e "SELECT hostgroup_id, hostname, port, status FROM runtime_mysql_servers;" +``` + +For more details, see the full [ProxySQL hooks documentation](proxysql-hooks.md). + +--- + +## Tutorial 3: Monitoring orchestrator with Prometheus + +This tutorial sets up Prometheus to scrape orchestrator metrics and shows useful queries for alerting. + +### Prerequisites + +- A running orchestrator instance +- Prometheus installed (see [prometheus.io/docs](https://prometheus.io/docs/introduction/first_steps/)) + +### Step 1: Enable Prometheus metrics in orchestrator + +Prometheus metrics are enabled by default. Verify by adding this to your `orchestrator.conf.json` (or confirm it is not explicitly disabled): + +```json +{ + "PrometheusEnabled": true +} +``` + +Restart orchestrator if you changed the config. + +### Step 2: Verify the metrics endpoint + +```bash +curl -s http://localhost:3000/metrics | head -20 +``` + +You should see Prometheus-formatted metrics output. + +### Step 3: Configure Prometheus to scrape orchestrator + +Add a scrape job to your `prometheus.yml`: + +```yaml +scrape_configs: + - job_name: orchestrator + static_configs: + - targets: ['orchestrator-host:3000'] + metrics_path: /metrics + scrape_interval: 15s +``` + +Replace `orchestrator-host` with the actual hostname or IP. 
Reload Prometheus: + +```bash +kill -HUP $(pgrep prometheus) +# or restart the Prometheus service +``` + +### Step 4: Verify in Prometheus + +Open the Prometheus UI (typically `http://prometheus-host:9090`) and query: + +```promql +orchestrator_instances_total +``` + +You should see the number of MySQL instances orchestrator is managing. + +### Step 5: Useful queries + +**Total known instances and clusters:** + +```promql +orchestrator_instances_total +orchestrator_clusters_total +``` + +**Discovery error rate (over last 5 minutes):** + +```promql +rate(orchestrator_discovery_errors_total[5m]) +``` + +**Recovery operations by type:** + +```promql +sum by (type) (orchestrator_recoveries_total) +``` + +**Recovery duration (p95 over last hour):** + +```promql +histogram_quantile(0.95, rate(orchestrator_recovery_duration_seconds_bucket[1h])) +``` + +### Step 6: Set up alerting rules + +Create an alerting rule file (e.g., `orchestrator-alerts.yml`): + +```yaml +groups: + - name: orchestrator + rules: + - alert: OrchestratorHighDiscoveryErrors + expr: rate(orchestrator_discovery_errors_total[5m]) > 0.1 + for: 10m + labels: + severity: warning + annotations: + summary: "Orchestrator has a high discovery error rate" + description: "More than 0.1 discovery errors/second for the last 10 minutes." + + - alert: OrchestratorRecoveryOccurred + expr: increase(orchestrator_recoveries_total[5m]) > 0 + labels: + severity: critical + annotations: + summary: "Orchestrator performed a recovery" + description: "A failover or recovery event occurred in the last 5 minutes." 
+
+      - alert: OrchestratorDown
+        expr: up{job="orchestrator"} == 0
+        for: 2m
+        labels:
+          severity: critical
+        annotations:
+          summary: "Orchestrator is unreachable"
+```
+
+Reference this file in your `prometheus.yml`:
+
+```yaml
+rule_files:
+  - orchestrator-alerts.yml
+```
+
+### Step 7: Kubernetes health endpoints
+
+If running orchestrator in Kubernetes, use the built-in health check endpoints for liveness and readiness probes:
+
+```yaml
+livenessProbe:
+  httpGet:
+    path: /health/live
+    port: 3000
+  initialDelaySeconds: 10
+  periodSeconds: 10
+readinessProbe:
+  httpGet:
+    path: /health/ready
+    port: 3000
+  initialDelaySeconds: 5
+  periodSeconds: 5
+```
+
+For the full list of metrics, see the [Observability documentation](observability.md).
+
+---
+
+## Tutorial 4: Using the API v2
+
+This tutorial introduces the v2 REST API, which provides structured JSON responses and proper HTTP status codes.
+
+### Prerequisites
+
+- A running orchestrator instance with at least one discovered topology
+
+### Step 1: Understand the response format
+
+All v2 endpoints return a consistent JSON envelope:
+
+```json
+{
+  "status": "ok",
+  "data": { ... }
+}
+```
+
+On errors:
+
+```json
+{
+  "status": "error",
+  "error": {
+    "code": "ERROR_CODE",
+    "message": "Human-readable description"
+  }
+}
+```
+
+HTTP status codes (200, 400, 404, 500, 503) are used correctly, unlike the v1 API, which always returns 200.
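Scripts can branch on the envelope's `status` field before touching `data`. A minimal sketch, using `python3` for JSON parsing; the sample responses below are hard-coded copies of the formats shown above, standing in for live `curl` output:

```shell
# Sample v2 envelopes; in practice you would capture one with, e.g.:
#   RESP=$(curl -s http://localhost:3000/api/v2/clusters)
OK_RESP='{"status": "ok", "data": []}'
ERR_RESP='{"status": "error", "error": {"code": "PROXYSQL_NOT_CONFIGURED", "message": "ProxySQL is not configured"}}'

json_field() {
  # Extract a dotted path (e.g. "error.code") from a JSON document on stdin
  python3 -c '
import json, sys
doc = json.load(sys.stdin)
for key in sys.argv[1].split("."):
    doc = doc[key]
print(doc)' "$1"
}

status=$(printf '%s' "$OK_RESP" | json_field status)
if [ "$status" = "ok" ]; then
  echo "request succeeded"
else
  printf '%s' "$OK_RESP" | json_field error.code
fi

# On an error envelope, surface the machine-readable code instead:
printf '%s' "$ERR_RESP" | json_field error.code
```

The same pattern works for any v2 endpoint, since every response carries the same envelope.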
+ +### Step 2: List all clusters + +```bash +curl -s http://localhost:3000/api/v2/clusters | python3 -m json.tool +``` + +Example response: + +```json +{ + "status": "ok", + "data": [ + { + "clusterName": "master.example.com:3306", + "clusterAlias": "production", + "instanceCount": 5 + } + ] +} +``` + +### Step 3: Get cluster details + +```bash +curl -s http://localhost:3000/api/v2/clusters/master.example.com:3306 | python3 -m json.tool +``` + +### Step 4: List instances in a cluster + +```bash +curl -s http://localhost:3000/api/v2/clusters/master.example.com:3306/instances | python3 -m json.tool +``` + +### Step 5: Get a specific instance + +```bash +curl -s http://localhost:3000/api/v2/instances/replica1.example.com/3306 | python3 -m json.tool +``` + +### Step 6: View the topology + +```bash +curl -s http://localhost:3000/api/v2/clusters/master.example.com:3306/topology | python3 -m json.tool +``` + +### Step 7: Check orchestrator health + +```bash +curl -s -o /dev/null -w "%{http_code}" http://localhost:3000/api/v2/status +``` + +A `200` response means the node is healthy. A `500` response means it is not. 
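In deployment scripts it is handy to block until the node reports healthy. A sketch of a hypothetical polling helper built on the status check above (the helper name, default URL, and timeout are placeholders, not part of orchestrator):

```shell
# Poll the v2 status endpoint until it returns HTTP 200 or the timeout expires.
wait_for_orchestrator() {
  url="${1:-http://localhost:3000/api/v2/status}"
  timeout="${2:-30}"   # seconds
  elapsed=0
  while [ "$elapsed" -lt "$timeout" ]; do
    code=$(curl -s -m 2 -o /dev/null -w '%{http_code}' "$url" 2>/dev/null)
    [ "$code" = "200" ] && return 0
    sleep 1
    elapsed=$((elapsed + 1))
  done
  return 1
}
```

Call it as `wait_for_orchestrator http://orchestrator:3000/api/v2/status 60` before running anything that depends on the service being up.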
+ +### Step 8: View recent recoveries + +```bash +# All recent recoveries +curl -s http://localhost:3000/api/v2/recoveries | python3 -m json.tool + +# Filter by cluster +curl -s "http://localhost:3000/api/v2/recoveries?cluster=master.example.com:3306" | python3 -m json.tool + +# Active recoveries only +curl -s http://localhost:3000/api/v2/recoveries/active | python3 -m json.tool +``` + +### Step 9: Query ProxySQL servers via API v2 + +If ProxySQL hooks are configured: + +```bash +# All servers +curl -s http://localhost:3000/api/v2/proxysql/servers | python3 -m json.tool +``` + +If ProxySQL is not configured, you will receive a `503` status: + +```json +{ + "status": "error", + "error": { + "code": "PROXYSQL_NOT_CONFIGURED", + "message": "ProxySQL is not configured" + } +} +``` + +### Step 10: Scripting with the v2 API + +The structured responses make scripting straightforward. Example: get all instance hostnames in a cluster using `jq`: + +```bash +curl -s http://localhost:3000/api/v2/clusters/master.example.com:3306/instances \ + | jq -r '.data[].Key.Hostname' +``` + +Check if any recoveries happened in the last hour: + +```bash +STATUS=$(curl -s -o /dev/null -w "%{http_code}" http://localhost:3000/api/v2/recoveries/active) +if [ "$STATUS" = "200" ]; then + ACTIVE=$(curl -s http://localhost:3000/api/v2/recoveries/active | jq '.data | length') + echo "Active recoveries: $ACTIVE" +fi +``` + +For the full endpoint reference, see the [API v2 documentation](api-v2.md). An [OpenAPI 3.0 specification](api/openapi.yaml) is also available for client generation. 
diff --git a/docs/user-manual.md b/docs/user-manual.md new file mode 100644 index 00000000..e0680d5f --- /dev/null +++ b/docs/user-manual.md @@ -0,0 +1,1148 @@ +# Orchestrator User Manual + +## Table of Contents + +- [Chapter 1: Introduction](#chapter-1-introduction) +- [Chapter 2: Installation and Configuration](#chapter-2-installation-and-configuration) +- [Chapter 3: Topology Discovery](#chapter-3-topology-discovery) +- [Chapter 4: Failure Detection](#chapter-4-failure-detection) +- [Chapter 5: Automated Recovery](#chapter-5-automated-recovery) +- [Chapter 6: ProxySQL Integration](#chapter-6-proxysql-integration) +- [Chapter 7: Monitoring and Observability](#chapter-7-monitoring-and-observability) +- [Chapter 8: High Availability](#chapter-8-high-availability) +- [Chapter 9: API Usage](#chapter-9-api-usage) +- [Chapter 10: Troubleshooting](#chapter-10-troubleshooting) + +--- + +## Chapter 1: Introduction + +### What Orchestrator Does + +Orchestrator is a MySQL high availability and replication management tool. It runs as a service and provides command line access, an HTTP API, and a web interface. Its three core functions are: + +- **Discovery**: Orchestrator actively crawls MySQL replication topologies, mapping master-replica relationships and reading replication status and configuration from each server. +- **Refactoring**: Orchestrator understands replication rules (binlog file:position, GTID, Pseudo-GTID, Binlog Servers) and can safely move replicas between masters. Illegal refactoring attempts are rejected. +- **Recovery**: Orchestrator uses a holistic approach to detect master and intermediate master failures. Based on the state of the topology at the time of failure, it can perform automated or manual failover. + +### Architecture Overview + +Orchestrator follows a continuous loop architecture: + +1. 
**Discovery loop**: Every few seconds (`InstancePollSeconds`), orchestrator probes each known MySQL instance, reading replication status, server variables, and replica lists. New instances found through replication relationships are automatically added to the discovery queue. + +2. **Analysis loop**: Every second, orchestrator runs failure analysis across all known clusters. It cross-references the health of masters with the replication status reported by their replicas to reach conclusions about failures. + +3. **Recovery**: When a failure is detected and recovery is enabled for that cluster, orchestrator executes pre-recovery hooks, heals the topology (promotes a replica, rearranges siblings), and executes post-recovery hooks. + +Orchestrator stores all topology data in a backend database (MySQL or SQLite). In high-availability deployments, multiple orchestrator nodes coordinate via raft consensus or a shared synchronous database backend. + +### Components + +Orchestrator exposes three interfaces: + +- **HTTP service**: The primary mode of operation. Start with `orchestrator http`. Serves the web UI, the REST API, and runs the continuous discovery/analysis loop. Listens on port 3000 by default. + +- **Command line interface (CLI)**: The `orchestrator` binary and the `orchestrator-client` shell script both provide CLI access. `orchestrator-client` is preferred for raft deployments because it communicates via the HTTP API rather than directly accessing the backend database. + +- **Web UI**: A browser-based interface for visualizing topologies, dragging replicas between masters, viewing cluster analysis, auditing recoveries, and initiating manual operations. 
+ +--- + +## Chapter 2: Installation and Configuration + +### Build from Source + +Building orchestrator requires Go and gcc (for SQLite support via cgo): + +```bash +# Clone the repository +git clone https://github.com/proxysql/orchestrator.git +cd orchestrator + +# Build the binary +./script/build +# Binary is output to bin/orchestrator + +# Or build directly with go: +go build -o bin/orchestrator go/cmd/orchestrator/main.go +``` + +Run unit tests: + +```bash +go test ./go/... +``` + +Build distribution packages (requires fpm): + +```bash +./build.sh # build + package (.deb/.rpm/.tgz) +./build.sh -b # build only, no packages +``` + +### Docker Deployment + +Orchestrator provides Docker-based workflows through the `script/dock` helper: + +```bash +script/dock alpine # Build and run orchestrator service +script/dock test # Build, run unit tests, integration tests, doc tests +script/dock pkg # Build and create distribution packages +script/dock system # Full CI environment with MySQL topology, HAProxy, Consul +``` + +### Configuration File Structure + +Orchestrator reads configuration from a JSON file. It looks in these locations, in order: + +1. `/etc/orchestrator.conf.json` +2. `conf/orchestrator.conf.json` +3. `orchestrator.conf.json` + +You can specify a custom location: + +```bash +orchestrator --config=/path/to/config.json http +``` + +The configuration is a single JSON object. A minimal configuration covers the backend database, MySQL topology credentials, and the listen address: + +```json +{ + "Debug": false, + "ListenAddress": ":3000", + "BackendDB": "sqlite", + "SQLite3DataFile": "/var/lib/orchestrator/orchestrator.db", + "MySQLTopologyUser": "orchestrator", + "MySQLTopologyPassword": "orc_topology_password", + "InstancePollSeconds": 5, + "RecoverMasterClusterFilters": ["*"], + "RecoverIntermediateMasterClusterFilters": ["*"] +} +``` + +The full list of configuration variables is defined in `go/config/config.go`. 
Key configuration areas are covered throughout this manual. + +### Backend Options: MySQL vs SQLite + +Orchestrator supports two backend databases for storing its own topology data: + +**SQLite** (simplest, no external dependencies): + +```json +{ + "BackendDB": "sqlite", + "SQLite3DataFile": "/var/lib/orchestrator/orchestrator.db" +} +``` + +SQLite is embedded within orchestrator. If the file does not exist, orchestrator creates it. The orchestrator process needs write permissions to the specified path. + +**MySQL** (better performance for busy setups): + +```json +{ + "MySQLOrchestratorHost": "orchestrator.backend.master.com", + "MySQLOrchestratorPort": 3306, + "MySQLOrchestratorDatabase": "orchestrator", + "MySQLOrchestratorCredentialsConfigFile": "/etc/mysql/orchestrator-backend.cnf" +} +``` + +The credentials config file uses MySQL client format: + +```ini +[client] +user=orchestrator_srv +password=${ORCHESTRATOR_PASSWORD} +``` + +Grant the necessary privileges on the backend MySQL server: + +```sql +CREATE USER 'orchestrator_srv'@'orc_host' IDENTIFIED BY 'orc_server_password'; +GRANT ALL ON orchestrator.* TO 'orchestrator_srv'@'orc_host'; +``` + +Alternatively, provide credentials directly in the config (less secure): + +```json +{ + "MySQLOrchestratorUser": "orchestrator_srv", + "MySQLOrchestratorPassword": "orc_server_password" +} +``` + +### Network Requirements + +- **Listen port** (default 3000): The HTTP service port. Must be accessible to users, API clients, and other orchestrator nodes in a raft setup. +- **Raft port** (default 10008): Used for inter-node communication in raft deployments. Must be open between all orchestrator nodes. +- **MySQL topology access**: Orchestrator must be able to reach every MySQL server it monitors on their MySQL ports. +- **MySQL backend access** (if using MySQL backend): Orchestrator must reach its backend database. +- **ProxySQL admin port** (default 6032): Required only if ProxySQL integration is enabled. 
+ +--- + +## Chapter 3: Topology Discovery + +### How Discovery Works + +Orchestrator continuously polls known MySQL instances. The process works as follows: + +1. On startup, orchestrator has no knowledge of any topology. You must seed it with at least one instance per topology. +2. When orchestrator probes an instance, it reads replication status (`SHOW SLAVE STATUS`) and replica information (`SHOW SLAVE HOSTS` or the process list). It discovers the instance's master and its replicas. +3. Newly discovered instances are added to the discovery queue and will be probed on the next cycle. +4. This recursive crawl maps the entire replication topology from a single seed instance. + +Each instance is probed once every `InstancePollSeconds` seconds (default: 5). + +You can seed discovery through any of these methods: + +```bash +# Via orchestrator-client +orchestrator-client -c discover -i mysql-master.example.com:3306 + +# Via the API +curl http://orchestrator:3000/api/discover/mysql-master.example.com/3306 + +# Via a cron job on each MySQL server (recommended for production) +0 0 * * * root "/usr/bin/perl -le 'sleep rand 600' && /usr/bin/orchestrator-client -c discover -i this.hostname.com" +``` + +To disable the continuous polling (for development/testing only): + +```bash +orchestrator --discovery=false http +``` + +### Configuring MySQL Credentials + +Grant orchestrator access to all MySQL topology servers: + +```sql +CREATE USER 'orchestrator'@'orc_host' IDENTIFIED BY 'orc_topology_password'; +GRANT SUPER, PROCESS, REPLICATION SLAVE, REPLICATION CLIENT, RELOAD ON *.* TO 'orchestrator'@'orc_host'; +GRANT SELECT ON meta.* TO 'orchestrator'@'orc_host'; +-- Only for NDB Cluster: +GRANT SELECT ON ndbinfo.processes TO 'orchestrator'@'orc_host'; +-- Only for Group Replication / InnoDB Cluster: +GRANT SELECT ON performance_schema.replication_group_members TO 'orchestrator'@'orc_host'; +``` + +Configure the credentials in orchestrator's config file. 
The recommended approach uses a credentials file: + +```json +{ + "MySQLTopologyCredentialsConfigFile": "/etc/mysql/orchestrator-topology.cnf", + "InstancePollSeconds": 5, + "DiscoverByShowSlaveHosts": false +} +``` + +Where `/etc/mysql/orchestrator-topology.cnf` contains: + +```ini +[client] +user=orchestrator +password=orc_topology_password +``` + +### Cluster Aliases + +By default, orchestrator names clusters after the master's `hostname:port`. You can assign human-readable aliases using a detection query. Create a metadata table on your masters: + +```sql +CREATE TABLE IF NOT EXISTS meta.cluster ( + anchor TINYINT NOT NULL, + cluster_name VARCHAR(128) CHARSET ascii NOT NULL DEFAULT '', + PRIMARY KEY (anchor) +) ENGINE=InnoDB DEFAULT CHARSET=utf8; + +INSERT INTO meta.cluster VALUES (1, 'my_production_cluster'); +``` + +Then configure orchestrator: + +```json +{ + "DetectClusterAliasQuery": "SELECT cluster_name FROM meta.cluster WHERE anchor = 1" +} +``` + +Similarly, you can detect data center and other metadata: + +```json +{ + "DetectDataCenterQuery": "SELECT dc FROM meta.cluster_info WHERE anchor = 1", + "DataCenterPattern": "[.]([a-z]{2}[0-9]+)[.]", + "PhysicalEnvironmentPattern": "[.]([a-z]{2}[0-9]+[-][a-z]+)[.]" +} +``` + +### Discovery Filters + +You can exclude specific hosts from discovery using regex patterns: + +```json +{ + "DiscoveryIgnoreHostnameFilters": [ + "utility-node[.]example[.]com:5000", + ".*[.]staging[.]example[.]com:3306" + ], + "DiscoveryIgnoreReplicaHostnameFilters": [ + "backup-replica[.]example[.]com", + ".*[.]unreachable-dc[.]example[.]com" + ], + "DiscoveryIgnoreMasterHostnameFilters": [ + "old-master[.]example[.]com:3306" + ] +} +``` + +You can also filter by the replication user name used by a replica: + +```json +{ + "DiscoveryIgnoreReplicationUsernameFilters": [ + "debezium_repl", + "ghost_repl_.*" + ] +} +``` + +Control logging of filtered discoveries: + +```json +{ + "EnableDiscoveryFiltersLogs": true +} +``` + +--- + +## 
Chapter 4: Failure Detection + +### How Orchestrator Detects Failures + +Orchestrator does not rely on simple "can I connect to this server?" checks. Instead, it uses a holistic approach that cross-references information from multiple sources in the replication topology. + +For example, to diagnose a `DeadMaster`: + +1. Orchestrator fails to contact the master. +2. Orchestrator contacts the master's replicas and confirms they also cannot reach the master (replication is broken on all of them). + +This approach triages failures by multiple independent observers rather than by time delays. When all replicas agree their master is unreachable, the replication topology is broken de facto, and failover is justified. + +Detection runs every second. It is always enabled and independent of whether recovery is configured. + +### Types of Failures + +Orchestrator recognizes many failure scenarios. The most important ones: + +**Master failures** (can trigger master failover): + +| Failure Type | Condition | +|---|---| +| `DeadMaster` | Master unreachable, all replicas have broken replication | +| `DeadMasterAndSomeReplicas` | Master unreachable, some replicas also unreachable, remaining replicas have broken replication | +| `DeadMasterAndReplicas` | Master and all replicas unreachable | +| `DeadMasterWithoutReplicas` | Master unreachable, had no replicas | +| `UnreachableMaster` | Master unreachable but replicas still report replication is working (possible network glitch; triggers emergency re-read of replicas) | +| `UnreachableMasterWithLaggingReplicas` | Master unreachable, all replicas are lagging (possible overloaded master; orchestrator restarts replication on replicas to force a connection test) | +| `AllMasterReplicasNotReplicating` | Master is reachable but none of its replicas are replicating | + +**Intermediate master failures** (can trigger intermediate master recovery): + +| Failure Type | Condition | +|---|---| +| `DeadIntermediateMaster` | Intermediate master 
unreachable, all its replicas have broken replication | +| `DeadIntermediateMasterAndSomeReplicas` | Intermediate master and some of its replicas unreachable | +| `DeadIntermediateMasterWithSingleReplica` | Intermediate master unreachable, has a single replica with broken replication | +| `UnreachableIntermediateMaster` | Intermediate master unreachable but its replicas still report healthy replication | + +**Semi-sync related:** + +| Failure Type | Condition | +|---|---| +| `LockedSemiSyncMaster` | Semi-sync enabled master with insufficient connected semi-sync replicas and a high timeout, causing write locks | +| `MasterWithTooManySemiSyncReplicas` | More semi-sync replicas connected than the configured wait count (requires `EnforceExactSemiSyncReplicas`) | + +**Not considered failures**: Simple replica failures (leaf nodes) and replication lag, even severe lag, are not treated as failure scenarios by orchestrator. + +### Detection Timing and Configuration + +Detection analysis runs every second. Configure anti-spam behavior: + +```json +{ + "FailureDetectionPeriodBlockMinutes": 60 +} +``` + +This prevents orchestrator from firing the same detection notification repeatedly. The detection itself still runs; only the hook invocation is throttled. + +Configure detection hooks: + +```json +{ + "OnFailureDetectionProcesses": [ + "echo 'Detected {failureType} on {failureCluster}. Affected replicas: {countReplicas}' >> /tmp/detection.log" + ] +} +``` + +To improve detection speed, configure your MySQL servers: + +```sql +-- Short heartbeat interval so replicas detect failure quickly +SET GLOBAL slave_net_timeout = 4; + +-- Fast reconnection attempts to recover from brief network issues +CHANGE MASTER TO MASTER_CONNECT_RETRY=1, MASTER_RETRY_COUNT=86400; +``` + +Without `slave_net_timeout = 4`, some failure scenarios may take up to a minute to detect. 
+ +View the current analysis at any time: + +```bash +# CLI +orchestrator-client -c replication-analysis + +# API +curl http://orchestrator:3000/api/replication-analysis + +# Web UI: Clusters -> Failure analysis +``` + +--- + +## Chapter 5: Automated Recovery + +### Recovery Types + +Orchestrator supports failover for topologies using: + +- **Oracle GTID** (with `MASTER_AUTO_POSITION=1`) +- **MariaDB GTID** +- **Pseudo-GTID** (orchestrator's own mechanism for non-GTID environments) +- **Binlog Servers** + +For GTID and Pseudo-GTID, promotable servers must have `log_bin` and `log_slave_updates` enabled. + +Automated recovery is opt-in. Enable it per cluster or globally: + +```json +{ + "RecoverMasterClusterFilters": ["*"], + "RecoverIntermediateMasterClusterFilters": ["*"], + "RecoveryPeriodBlockSeconds": 3600 +} +``` + +To enable only for specific clusters: + +```json +{ + "RecoverMasterClusterFilters": [ + "production-cluster", + "critical-cluster" + ], + "RecoverIntermediateMasterClusterFilters": ["*"] +} +``` + +A recovery proceeds through these phases: + +1. Pre-recovery hooks execute sequentially. If any returns a non-zero exit code, recovery aborts. +2. Orchestrator heals the topology based on its current state (not hard-coded configuration). +3. Post-recovery hooks execute. + +### Pre/Post Failover Hooks + +Configure hooks as arrays of shell commands: + +```json +{ + "PreFailoverProcesses": [ + "/usr/local/bin/check-failover-ok.sh {failureCluster} {failureType}" + ], + "PostMasterFailoverProcesses": [ + "/usr/local/bin/update-dns.sh {successorHost}", + "/usr/local/bin/notify-team.sh 'Master failover on {failureClusterAlias}: {failedHost} -> {successorHost}'" + ], + "PostIntermediateMasterFailoverProcesses": [], + "PostFailoverProcesses": [ + "echo 'Recovered {failureType} on {failureCluster}. 
Failed: {failedHost}:{failedPort}; Successor: {successorHost}:{successorPort}' >> /tmp/recovery.log" + ], + "PostUnsuccessfulFailoverProcesses": [ + "/usr/local/bin/alert-oncall.sh 'FAILED recovery for {failureType} on {failureCluster}'" + ] +} +``` + +Hooks receive failure/recovery information through two mechanisms: + +**Environment variables** (available in hook scripts): + +- `ORC_FAILURE_TYPE`, `ORC_FAILED_HOST`, `ORC_FAILED_PORT` +- `ORC_FAILURE_CLUSTER`, `ORC_FAILURE_CLUSTER_ALIAS` +- `ORC_SUCCESSOR_HOST`, `ORC_SUCCESSOR_PORT` (on successful recovery) +- `ORC_IS_MASTER`, `ORC_COUNT_REPLICAS`, `ORC_LOST_REPLICAS` +- `ORC_COMMAND` (e.g., `graceful-master-takeover`, `force-master-failover`) + +**Template variables** (replaced in the command string): + +- `{failureType}`, `{failedHost}`, `{failedPort}` +- `{failureCluster}`, `{failureClusterAlias}` +- `{successorHost}`, `{successorPort}` (on successful recovery) +- `{countReplicas}`, `{lostReplicas}`, `{replicaHosts}` + +Any command ending with `"&"` executes asynchronously and its failure is ignored. + +### Recovery Blocking and Cooldown + +Orchestrator prevents cascading failures (flapping) with a blocking period: + +```json +{ + "RecoveryPeriodBlockSeconds": 3600 +} +``` + +After a cluster experiences a recovery, no further automated recoveries will run on that same cluster for the specified duration. This block applies only to the same cluster; concurrent recoveries on different clusters are allowed. + +Unblock recoveries before the cooldown expires by acknowledging: + +```bash +orchestrator-client -c ack-cluster-recoveries -alias mycluster +``` + +Manual recovery (`orchestrator-client -c recover` or `force-master-failover`) ignores the blocking period. 
+ +### Promotion Rules + +Control which replicas are preferred for promotion: + +```bash +# Register a candidate with a promotion preference (expires after 1 hour) +orchestrator-client -c register-candidate -i replica.example.com --promotion-rule prefer +``` + +Supported promotion rules: `prefer`, `neutral`, `prefer_not`, `must_not`. + +Set up a cron job to keep preferences current: + +``` +*/2 * * * * root "/usr/bin/perl -le 'sleep rand 10' && /usr/bin/orchestrator-client -c register-candidate -i this.hostname.com --promotion-rule prefer" +``` + +Additional promotion controls: + +```json +{ + "ApplyMySQLPromotionAfterMasterFailover": true, + "PreventCrossDataCenterMasterFailover": false, + "PreventCrossRegionMasterFailover": false, + "FailMasterPromotionIfSQLThreadNotUpToDate": true, + "DelayMasterPromotionIfSQLThreadNotUpToDate": false, + "DetachLostReplicasAfterMasterFailover": true, + "MasterFailoverLostInstancesDowntimeMinutes": 10 +} +``` + +### Graceful Master Takeover + +For planned maintenance (upgrades, host migration), use graceful master takeover instead of waiting for a failure: + +```bash +# Specify the designated new master +orchestrator-client -c graceful-master-takeover -alias mycluster -d new-master.example.com:3306 + +# Let orchestrator choose the best replica and start replication on demoted master +orchestrator-client -c graceful-master-takeover-auto -alias mycluster +``` + +The process: + +1. The designated replica takes over its siblings as intermediate master. +2. The current master is set to `read-only`. +3. The designated replica catches up with replication. +4. The designated replica is promoted as the new master and set to writable. +5. The old master is demoted and placed as a replica of the new master. 
+ +Dedicated hooks are available: + +```json +{ + "PreGracefulTakeoverProcesses": [ + "/usr/local/bin/silence-alerts.sh {failureCluster}" + ], + "PostGracefulTakeoverProcesses": [ + "/usr/local/bin/restore-alerts.sh {failureCluster}" + ] +} +``` + +These run in addition to the standard failover hooks. Within the standard hooks, check the `{command}` placeholder or `ORC_COMMAND` environment variable for the value `graceful-master-takeover` to distinguish planned from unplanned failovers. + +### Manual and Forced Failover + +When auto-recovery is disabled or blocked, you can manually trigger recovery: + +```bash +# Manual recovery (instance must be recognized as failed) +orchestrator-client -c recover -i dead.instance.com:3306 + +# Force master failover regardless of orchestrator's analysis +orchestrator-client -c force-master-failover --alias mycluster +``` + +--- + +## Chapter 6: ProxySQL Integration + +### Setting Up ProxySQL Hooks + +Orchestrator has built-in support for updating ProxySQL hostgroups during failover. No custom scripts are needed. + +Add the following to `orchestrator.conf.json`: + +```json +{ + "ProxySQLAdminAddress": "127.0.0.1", + "ProxySQLAdminPort": 6032, + "ProxySQLAdminUser": "admin", + "ProxySQLAdminPassword": "admin", + "ProxySQLWriterHostgroup": 10, + "ProxySQLReaderHostgroup": 20, + "ProxySQLPreFailoverAction": "offline_soft" +} +``` + +Configuration reference: + +| Setting | Default | Description | +|---|---|---| +| `ProxySQLAdminAddress` | (empty) | ProxySQL admin host. Leave empty to disable hooks. | +| `ProxySQLAdminPort` | 6032 | ProxySQL admin port | +| `ProxySQLAdminUser` | admin | Admin interface username | +| `ProxySQLAdminPassword` | (empty) | Admin interface password | +| `ProxySQLAdminUseTLS` | false | Use TLS for admin connection | +| `ProxySQLWriterHostgroup` | 0 | Writer hostgroup ID. Must be > 0 to enable hooks. 
| +| `ProxySQLReaderHostgroup` | 0 | Reader hostgroup ID (optional) | +| `ProxySQLPreFailoverAction` | offline_soft | Pre-failover action: `offline_soft`, `weight_zero`, or `none` | + +### How Failover Updates ProxySQL + +**Pre-failover** (when a dead master is detected): + +- `offline_soft`: Sets the old master to `OFFLINE_SOFT` in ProxySQL. Existing connections finish but no new ones are routed. +- `weight_zero`: Sets the old master's weight to 0. +- `none`: Skips pre-failover ProxySQL changes. + +**Post-failover** (after a new master is promoted): + +1. Old master is removed from the writer hostgroup. +2. New master is added to the writer hostgroup. +3. If a reader hostgroup is configured: new master is removed from readers; old master is added to readers as `OFFLINE_SOFT`. +4. `LOAD MYSQL SERVERS TO RUNTIME` and `SAVE MYSQL SERVERS TO DISK` are executed. + +The failover timeline: + +``` +Dead master detected + -> OnFailureDetectionProcesses (scripts) + -> PreFailoverProcesses (scripts) + -> ProxySQL pre-failover: drain old master + -> [topology manipulation: elect new master] + -> KV store updates (Consul/ZK) + -> ProxySQL post-failover: promote new master + -> PostMasterFailoverProcesses (scripts) + -> PostFailoverProcesses (scripts) +``` + +ProxySQL hooks run alongside existing script-based hooks. They are non-blocking: if ProxySQL is unreachable, failover proceeds normally. Post-failover ProxySQL errors are logged but do not mark the recovery as failed. 
+ +### ProxySQL Topology API + +Query ProxySQL's runtime server list through orchestrator's API: + +```bash +# List all servers +curl http://orchestrator:3000/api/proxysql/servers + +# List servers in a specific hostgroup +curl http://orchestrator:3000/api/proxysql/servers/10 +``` + +Response format: + +```json +{ + "Code": "OK", + "Message": "Found 4 servers", + "Details": [ + { + "hostgroup_id": 10, + "hostname": "db1.example.com", + "port": 3306, + "status": "ONLINE", + "weight": 1000, + "max_connections": 100, + "comment": "" + } + ] +} +``` + +### CLI Commands + +```bash +# Test ProxySQL connectivity +orchestrator-client -c proxysql-test + +# Show ProxySQL server list +orchestrator-client -c proxysql-servers +``` + +### Multiple ProxySQL Instances + +For ProxySQL Cluster deployments, configure orchestrator to connect to one ProxySQL node. Changes propagate automatically via ProxySQL's built-in cluster synchronization. + +For non-clustered ProxySQL, use `PostMasterFailoverProcesses` script hooks to update additional ProxySQL instances. + +--- + +## Chapter 7: Monitoring and Observability + +### Prometheus Metrics + +Orchestrator exposes a `/metrics` endpoint in Prometheus scrape format. It is enabled by default. 
+ +```json +{ + "PrometheusEnabled": true +} +``` + +Available metrics: + +| Metric | Type | Description | +|---|---|---| +| `orchestrator_discoveries_total` | Counter | Total discovery attempts | +| `orchestrator_discovery_errors_total` | Counter | Failed discoveries | +| `orchestrator_instances_total` | Gauge | Known instances | +| `orchestrator_clusters_total` | Gauge | Known clusters | +| `orchestrator_recoveries_total` | Counter | Recovery attempts (labels: `type`, `result`) | +| `orchestrator_recovery_duration_seconds` | Histogram | Duration of recovery operations | + +Prometheus scrape configuration: + +```yaml +scrape_configs: + - job_name: orchestrator + static_configs: + - targets: ['orchestrator:3000'] + metrics_path: /metrics + scrape_interval: 15s +``` + +### Health Endpoints for Kubernetes + +Three health check endpoints are provided: + +**`GET /health/live`** -- Liveness probe. Returns `200 OK` if the orchestrator process is running. Lightweight; does not query any backend. + +```json +{"status": "alive"} +``` + +**`GET /health/ready`** -- Readiness probe. Returns `200 OK` if the backend database is connected and health check registration is succeeding. Returns `503` otherwise. + +```json +{"status": "ready"} +``` + +**`GET /health/leader`** -- Leader check. Returns `200 OK` if this node is the raft leader (or the active node in non-raft setups). Returns `503` otherwise. Useful for directing writes only to the leader via a load balancer. 
+ +```json +{"status": "leader"} +``` + +Kubernetes deployment example: + +```yaml +apiVersion: apps/v1 +kind: Deployment +metadata: + name: orchestrator +spec: + template: + spec: + containers: + - name: orchestrator + ports: + - containerPort: 3000 + livenessProbe: + httpGet: + path: /health/live + port: 3000 + initialDelaySeconds: 5 + periodSeconds: 10 + readinessProbe: + httpGet: + path: /health/ready + port: 3000 + initialDelaySeconds: 10 + periodSeconds: 5 +``` + +### Web UI Overview + +The web interface (accessible at `http://orchestrator:3000/`) provides: + +- **Cluster visualization**: Interactive topology diagrams showing masters, replicas, replication status, and problems. +- **Drag-and-drop refactoring**: Move replicas between masters by dragging them in the topology view. +- **Cluster analysis**: View current failure analysis under Clusters -> Failure analysis. +- **Recovery audit**: Audit past recoveries at `/web/audit-recovery`. +- **Manual actions**: Click on instances to begin downtime, recover, or inspect details. + +### Graphite (Legacy) + +Orchestrator can also publish metrics to Graphite. This is a legacy feature; Prometheus is the recommended monitoring integration. Refer to the Graphite-related configuration variables in `go/config/config.go` if needed. + +--- + +## Chapter 8: High Availability + +Orchestrator itself can be deployed in a highly available configuration. There are two primary approaches. + +### Shared Backend Deployment + +Multiple orchestrator nodes all connect to the same database backend. The backend must be a synchronous replication cluster: + +- Galera +- Percona XtraDB Cluster +- InnoDB Cluster +- NDB Cluster + +``` +[ orchestrator-1 ] --\ +[ orchestrator-2 ] ----> [ Galera / XtraDB Cluster ] +[ orchestrator-3 ] --/ +``` + +Two variations: + +- **Single-writer mode**: All orchestrator nodes talk to one writer DB via a proxy. 
If the writer fails, the synchronous cluster promotes a new writer and the proxy redirects traffic. +- **Multi-writer mode**: Each orchestrator node is paired with a local DB node. Since replication is synchronous, there is no split brain. Only one orchestrator node is the leader. + +Access any healthy orchestrator node for API requests. For writes, direct traffic to the leader using `/api/leader-check` as a proxy health check. + +### Raft Consensus Deployment + +Orchestrator nodes communicate directly via the raft consensus algorithm. Each node has its own private backend database (MySQL or SQLite). + +``` +[ orchestrator-1 + SQLite ] <--raft--> [ orchestrator-2 + SQLite ] <--raft--> [ orchestrator-3 + SQLite ] +``` + +Configure on each node: + +```json +{ + "RaftEnabled": true, + "RaftDataDir": "/var/lib/orchestrator", + "RaftBind": "10.0.0.1", + "DefaultRaftPort": 10008, + "RaftNodes": [ + "10.0.0.1", + "10.0.0.2", + "10.0.0.3" + ] +} +``` + +Each node must have `RaftBind` set to its own address. `RaftNodes` lists all nodes and must be identical across the cluster. + +For NAT or firewall scenarios: + +```json +{ + "RaftAdvertise": "public-ip-or-fqdn", + "HTTPAdvertise": "http://my.public.hostname:3000" +} +``` + +Key behaviors in a raft deployment: + +- Only the leader runs recoveries. +- All nodes independently discover and probe MySQL topologies. +- All nodes run failure detection. +- Non-leader nodes reverse-proxy HTTP requests to the leader. +- Clients must only interact with the leader. Use a proxy with `/api/leader-check`, or use `orchestrator-client` with multiple backends (it auto-detects the leader). + +Recommended setup: 3 or 5 nodes. SQLite requires no external dependencies; MySQL outperforms SQLite for busy environments. 
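The leader-check flow above can be sketched client-side: probe each node's `/api/leader-check` and keep the one node that answers `200`. A minimal Python sketch (the function names are illustrative, not part of orchestrator or its client tooling):

```python
from urllib.request import urlopen
from urllib.error import HTTPError, URLError

def leader_check_status(base_url, timeout=2.0):
    """HTTP status of GET <base_url>/api/leader-check (0 if unreachable)."""
    try:
        with urlopen(base_url + "/api/leader-check", timeout=timeout) as resp:
            return resp.status
    except HTTPError as e:
        return e.code  # non-leader nodes answer with a non-200 status
    except URLError:
        return 0       # node down or unreachable

def pick_leader(statuses):
    """Given {node_url: http_status}, return the single node answering 200."""
    leaders = [node for node, code in statuses.items() if code == 200]
    return leaders[0] if len(leaders) == 1 else None

# Usage (hosts are placeholders):
# nodes = ["http://10.0.0.1:3000", "http://10.0.0.2:3000", "http://10.0.0.3:3000"]
# leader = pick_leader({n: leader_check_status(n) for n in nodes})
```

This is essentially what a proxy health check or `orchestrator-client` with multiple backends does for you; the sketch is only useful if you need the same logic in your own tooling.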
+ +### Active/Passive Setup + +A simpler but less resilient approach uses a MySQL backend with standard replication: + +``` +[ orchestrator-1 (active) ] --> [ MySQL master ] +[ orchestrator-2 (standby) ] --> [ MySQL master ] + | + [ MySQL replica ] +``` + +Multiple orchestrator nodes talk to the same MySQL master. If the MySQL master dies, someone or something else must failover the orchestrator backend. Orchestrator cannot failover its own backend database. + +A master-master MySQL backend with a proxy (e.g., HAProxy using `first` algorithm) provides slightly better resilience, but split brain is possible depending on the physical setup and proxy configuration. + +--- + +## Chapter 9: API Usage + +### v1 API Overview + +The v1 API is accessible under `/api/`. All endpoints use HTTP GET. The web UI is built entirely on these API calls. + +Common endpoints: + +```bash +# Instance operations +/api/instance/:host/:port # Read instance details +/api/discover/:host/:port # Trigger discovery of an instance +/api/relocate/:host/:port/:belowHost/:belowPort # Move a replica + +# Cluster operations +/api/cluster-info/:clusterHint # Cluster metadata +/api/cluster/alias/:alias # All instances in a cluster +/api/replication-analysis # Current failure analysis + +# Recovery operations +/api/recover/:host/:port # Initiate recovery +/api/recover-lite/:host/:port # Recover without invoking external hooks +/api/force-master-failover/:clusterHint # Force immediate failover +/api/graceful-master-takeover/:clusterHint/:host/:port # Planned failover + +# Recovery management +/api/audit-recovery # List past recoveries +/api/blocked-recoveries # View blocked recoveries +/api/ack-recovery/cluster/:clusterHint # Acknowledge a recovery +/api/disable-global-recoveries # Disable all recoveries globally +/api/enable-global-recoveries # Re-enable recoveries + +# Health +/api/leader-check # 200 if this node is leader +``` + +Example workflows: + +```bash +# Get cluster info +curl -s 
"http://orchestrator:3000/api/cluster-info/my_cluster" | jq . + +# Find the master of a cluster +curl -s "http://orchestrator:3000/api/cluster/alias/my_cluster" | jq '.[] | select(.ReplicationDepth==0) .Key.Hostname' -r + +# Find direct replicas of the master +curl -s "http://orchestrator:3000/api/cluster/alias/my_cluster" | jq '.[] | select(.ReplicationDepth==1) .Key.Hostname' -r + +# Find instances without binary logging +curl -s "http://orchestrator:3000/api/cluster/alias/my_cluster" | jq '.[] | select(.LogBinEnabled==false) .Key.Hostname' -r +``` + +### v2 API with Structured Responses + +The v2 API is mounted under `/api/v2/` and provides consistent JSON envelopes, proper HTTP status codes, and RESTful URL structure. + +An OpenAPI 3.0 specification is available at `docs/api/openapi.yaml`. + +All responses follow this format: + +```json +{ + "status": "ok", + "data": { ... } +} +``` + +Error responses: + +```json +{ + "status": "error", + "error": { + "code": "NOT_FOUND", + "message": "Cluster not found" + } +} +``` + +HTTP status codes: `200` success, `400` bad input, `404` not found, `500` internal error, `503` service unavailable. 
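Because the envelope is fixed, clients can unwrap every v2 response the same way. A small Python sketch (the helper and exception names are mine, not part of the API):

```python
class OrchestratorAPIError(Exception):
    """Raised when a v2 response carries an error envelope."""
    def __init__(self, code, message):
        super().__init__("%s: %s" % (code, message))
        self.code = code
        self.message = message

def unwrap(envelope):
    """Return the `data` payload of an ok envelope; raise on an error one."""
    if envelope.get("status") == "ok":
        return envelope.get("data")
    err = envelope.get("error") or {}
    raise OrchestratorAPIError(err.get("code", "UNKNOWN"),
                               err.get("message", ""))
```

For example, `unwrap(json.loads(body))` after fetching `/api/v2/clusters/{name}` either yields the cluster details or raises with the documented error code.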
+ +Available v2 endpoints: + +``` +GET /api/v2/clusters # List all clusters +GET /api/v2/clusters/{name} # Cluster details +GET /api/v2/clusters/{name}/instances # Instances in a cluster +GET /api/v2/clusters/{name}/topology # ASCII topology +GET /api/v2/instances/{host}/{port} # Instance details +GET /api/v2/recoveries # Recent recoveries (?cluster=, ?alias=, ?page=) +GET /api/v2/recoveries/active # In-progress recoveries +GET /api/v2/status # Node health status +GET /api/v2/proxysql/servers # ProxySQL server list +``` + +### Common API Workflows + +**Monitor cluster health** -- poll the analysis endpoint and alert on new failures: + +```bash +curl -s http://orchestrator:3000/api/replication-analysis | jq '.[] | select(.IsActionableRecovery==true)' +``` + +**Automate discovery of new clusters** -- call the discover endpoint when provisioning new MySQL servers: + +```bash +curl -s http://orchestrator:3000/api/discover/new-server.example.com/3306 +``` + +**Check recovery status after a failover**: + +```bash +# v1 +curl -s http://orchestrator:3000/api/audit-recovery | jq '.[0]' + +# v2 +curl -s http://orchestrator:3000/api/v2/recoveries | jq '.data[0]' +``` + +**Integrate with CI/CD** -- verify no active recoveries before deploying: + +```bash +active=$(curl -s http://orchestrator:3000/api/v2/recoveries/active | jq '.data | length') +if [ "$active" -gt 0 ]; then + echo "Active recovery in progress, delaying deployment" + exit 1 +fi +``` + +--- + +## Chapter 10: Troubleshooting + +### Common Issues and Solutions + +**Orchestrator cannot connect to MySQL topology servers** + +- Verify the grants: orchestrator needs `SUPER, PROCESS, REPLICATION SLAVE, REPLICATION CLIENT, RELOAD` on all topology servers. +- Check that `MySQLTopologyCredentialsConfigFile` points to a valid, readable file. +- Ensure network connectivity between the orchestrator host and all MySQL servers on their MySQL ports. +- Check for firewall rules blocking access. 
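One way to audit the first bullet programmatically is to diff the rows of `SHOW GRANTS` against the required privilege set. A rough Python sketch (assumes exactly the privileges listed above; the helper name is illustrative):

```python
REQUIRED = {"SUPER", "PROCESS", "REPLICATION SLAVE", "REPLICATION CLIENT", "RELOAD"}

def missing_privileges(show_grants_rows):
    """Return the required global privileges absent from SHOW GRANTS output."""
    granted = set()
    for row in show_grants_rows:
        upper = row.upper()
        if " ON *.* " not in upper:
            continue  # only global (*.*) grants satisfy the requirement
        for priv in REQUIRED:
            if priv in upper or "ALL PRIVILEGES" in upper:
                granted.add(priv)
    return REQUIRED - granted
```

Running this against the grant from Step 3 of the quick start would report `RELOAD` as missing, which is worth adding before relying on automated recovery.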
+ +**Orchestrator backend database errors** + +- For MySQL backend: verify the backend MySQL server is running and the orchestrator user has `ALL` privileges on the `orchestrator` database. +- For SQLite: verify the directory for `SQLite3DataFile` exists and is writable by the orchestrator process. +- Check the `MySQLOrchestratorMaxAllowedPacket` setting if you see packet-size errors. + +**Cluster shows as "unknown" or has no alias** + +- Ensure `DetectClusterAliasQuery` is configured and returns a value from each cluster's master. +- The metadata table must exist and be populated on the master. +- Verify orchestrator has `SELECT` privileges on the metadata schema. + +**Instances not appearing in topology** + +- Seed discovery: `orchestrator-client -c discover -i hostname:port` +- Check `DiscoveryIgnoreHostnameFilters` and related filters to make sure the host is not excluded. +- Verify `DiscoverByShowSlaveHosts` matches your MySQL configuration. If set to `true`, replicas need `report_host` configured. + +### Debug Logging + +Enable verbose logging with command-line flags: + +```bash +# Debug messages +orchestrator --debug http + +# Debug messages with stack traces on errors +orchestrator --debug --stack http +``` + +The `Debug` configuration option can also be set in the config file: + +```json +{ + "Debug": true +} +``` + +### Recovery Not Triggering + +If orchestrator detects a failure but does not recover: + +1. **Check if recovery is enabled for the cluster**: + ```bash + curl -s http://orchestrator:3000/api/cluster-info/mycluster | jq '.HasAutomatedMasterRecovery' + ``` + Verify `RecoverMasterClusterFilters` includes the cluster name/alias or `"*"`. + +2. **Check if global recoveries are disabled**: + ```bash + orchestrator-client -c check-global-recoveries + ``` + +3. **Check if the instance is downtimed**: + ```bash + orchestrator-client -c replication-analysis + ``` + Downtimed instances are skipped for automated recovery. 
Use `orchestrator-client -c end-downtime -i hostname:port` to remove downtime. + +4. **Check for anti-flapping blocking**: + ```bash + curl -s http://orchestrator:3000/api/blocked-recoveries + ``` + A recent recovery on the same cluster blocks further automated recoveries for `RecoveryPeriodBlockSeconds`. Acknowledge the previous recovery to unblock: + ```bash + orchestrator-client -c ack-cluster-recoveries -alias mycluster + ``` + +5. **Check the failure type**: Not all failure types trigger recovery. `UnreachableMaster` (where replicas still report healthy replication) does not trigger recovery -- it triggers an emergency re-read of replicas instead. + +6. **In raft mode, check that this node is the leader**: Only the leader runs recoveries. + ```bash + curl -s http://orchestrator:3000/api/leader-check + ``` + +### ProxySQL Hook Failures + +ProxySQL hook errors do not block failover. Check the orchestrator log for ProxySQL-related messages. + +**ProxySQL not configured error**: +- Verify `ProxySQLAdminAddress` is set (non-empty) and `ProxySQLWriterHostgroup` is greater than 0. + +**Connection refused to ProxySQL admin**: +- Verify `ProxySQLAdminPort` (default 6032), `ProxySQLAdminUser`, and `ProxySQLAdminPassword`. +- Test connectivity: + ```bash + orchestrator-client -c proxysql-test + ``` + +**Changes not reflected in ProxySQL**: +- Orchestrator executes `LOAD MYSQL SERVERS TO RUNTIME` and `SAVE MYSQL SERVERS TO DISK` after changes. Verify these commands succeeded in the log. +- For ProxySQL Cluster setups, changes propagate via `proxysql_servers` synchronization. Verify the cluster is healthy. + +**Post-failover ProxySQL update failed but recovery succeeded**: +- ProxySQL post-failover errors are logged but do not mark the MySQL recovery as failed. Manually update ProxySQL if needed: + ```sql + -- Connect to ProxySQL admin + UPDATE mysql_servers SET hostgroup_id=10 WHERE hostname='new-master'; + LOAD MYSQL SERVERS TO RUNTIME; + SAVE MYSQL SERVERS TO DISK; + ```
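If you script that manual fix-up, templating the statements keeps them consistent across incidents. A hypothetical helper (the default writer hostgroup `10` matches the example above; adjust it to your `ProxySQLWriterHostgroup`):

```python
def proxysql_promote_statements(new_master_host, writer_hostgroup=10):
    """Admin statements to point the writer hostgroup at a promoted master,
    mirroring the manual SQL shown above. Feed these to the ProxySQL admin
    interface in order."""
    return [
        "UPDATE mysql_servers SET hostgroup_id=%d WHERE hostname='%s'"
        % (writer_hostgroup, new_master_host),
        "LOAD MYSQL SERVERS TO RUNTIME",
        "SAVE MYSQL SERVERS TO DISK",
    ]
```

The load/save pair matters: `RUNTIME` applies the change immediately, `DISK` persists it across ProxySQL restarts.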