Introduction
MySQL replication distributes data across multiple servers for read scaling, high availability, and disaster recovery. This comprehensive guide covers setting up master-slave and master-master replication, monitoring, failover procedures, and troubleshooting common issues.
Understanding MySQL Replication
Replication works through binary logs: master records changes (binary log), slave reads changes (relay log) and applies them. Asynchronous by default: master commits without waiting for slaves. Semi-synchronous: master waits for at least one slave acknowledgment. Replication formats: STATEMENT (default up to 5.7, logs SQL statements), ROW (recommended, logs row changes, consistent), MIXED (automatic choice). GTID (Global Transaction Identifiers, MySQL 5.6+) simplifies failover: every transaction has unique ID, slaves can auto-skip applied transactions. MTS (Multi-Threaded Slave) improves performance for write-heavy workloads (MySQL 8.0+).
Master-Slave Replication Setup
Master configuration (/etc/mysql/my.cnf or /etc/my.cnf): [mysqld], server-id = 1 (unique), log_bin = /var/log/mysql/mysql-bin.log, binlog_format = ROW, binlog_do_db = myapp (optional, specify databases), binlog_ignore_db = mysql (ignore system db). Restart MySQL: systemctl restart mysql. Create replication user on master: CREATE USER 'replica'@'%' IDENTIFIED BY 'secure_password', GRANT REPLICATION SLAVE ON *.* TO 'replica'@'%', FLUSH PRIVILEGES. Get master status: SHOW MASTER STATUS; (record File and Position). Export data: mysqldump --all-databases --master-data=2 --single-transaction > master_dump.sql on master. Copy dump to slave. Slave configuration: server-id = 2, relay_log = /var/log/mysql/mysql-relay-bin.log, read_only = 1 (prevents writes), log_slave_updates = 1 (for cascading). Import dump on slave: mysql < master_dump.sql. Configure replication: CHANGE MASTER TO MASTER_HOST='master_host', MASTER_USER='replica', MASTER_PASSWORD='secure_password', MASTER_LOG_FILE='mysql-bin.000001', MASTER_LOG_POS=123; START SLAVE;. Verify: SHOW SLAVE STATUS G (Slave_IO_Running: Yes, Slave_SQL_Running: Yes).
Master-Master (Circular) Replication
Active-active setup (both servers accept writes). Unique server IDs: Master1 ID 1, Master2 ID 2. Both: log_bin, auto_increment_increment=2 (step size), auto_increment_offset=1 (master1) and 2 (master2) to prevent key conflicts. Each configured as slave of other. Steps: configure Master1 as slave of Master2 (CHANGE MASTER TO ...), configure Master2 as slave of Master1. Both: log_slave_updates = 1 (propagate changes from other master). Potential issues: conflict resolution (same row updated on both), circular dependency, auto_increment management. Use with caution: single-threaded writes per table, application-level partitioning recommended. Alternative: use MySQL Group Replication (InnoDB Cluster) instead of master-master for better consistency.
GTID-Based Replication
GTID simplifies failover and replication setup. Configure all servers: gtid_mode = ON, enforce_gtid_consistency = ON. Show GTID executed: SHOW MASTER STATUS; shows Executed_Gtid_Set. Setup without dump coordinates: CHANGE MASTER TO MASTER_HOST='master', MASTER_USER='replica', MASTER_PASSWORD='pass', MASTER_AUTO_POSITION = 1; START SLAVE;. GTID auto-sync finds correct position. Failover: promote slave to master, point other slaves to new master with same command. Skip problematic transactions: STOP SLAVE; SET GTID_NEXT='uuid:transaction_id'; BEGIN; COMMIT; SET GTID_NEXT='AUTOMATIC'; START SLAVE;. GTID limitations: temporary tables, non-transactional engines, CREATE TABLE ... SELECT may cause issues.
Replication Filtering
Replicate only specific databases or tables. Master-side (binlog_do_db/binlog_ignore_db in my.cnf) - affects what goes into binary log. Slave-side replication filters (more common): replicate_do_db = myapp, replicate_ignore_table = myapp.logs, replicate_wild_do_table = myapp.%, replicate_wild_ignore_table = myapp.temp_%. Set in slave my.cnf: replicate-ignore-db=mysql, replicate-do-db=myapp. Dynamic change: STOP SLAVE SQL_THREAD; CHANGE REPLICATION FILTER REPLICATE_DO_DB = (myapp); START SLAVE SQL_THREAD;. Filtering risks: cross-database queries problematic, DDL statements may not replicate correctly. Test thoroughly before production.
Monitoring Replication Health
Key metrics from SHOW SLAVE STATUS G: Seconds_Behind_Master (replication lag, 0 ideal), Slave_IO_Running/Slave_SQL_Running (both Yes), Last_IO_Errno/Last_IO_Error, Last_SQL_Errno/Last_SQL_Error, Relay_Log_Space, Relay_Log_File/Relay_Log_Pos. Monitoring queries: SELECT * FROM performance_schema.replication_applier_status; SELECT * FROM performance_schema.replication_connection_status;. Heartbeat table for lag measurement: CREATE TABLE heartbeat (id INT PRIMARY KEY, ts TIMESTAMP); insert row on master, check diff on slave. Nagios/Icinga plugin: check_mysql_replication. Prometheus exporter: mysql_exporter collects replication metrics. Alert when Seconds_Behind_Master > 60 or replication stops. Log replication errors to separate file: log_error = /var/log/mysql/error.log.
Handling Replication Lag
Lag causes: slow queries on slave, insufficient slave resources, binary log contention, large transactions, row-based replication with no indexes. Solutions: upgrade slave hardware (more RAM, faster disk), use SSD/NVMe for relay logs, enable multi-threaded slave: slave_parallel_workers = 4, slave_parallel_type = LOGICAL_CLOCK (MySQL 8.0). Set slave_compressed_protocol = 1 for slow network. Use tmpdir on ramdisk for temporary tables. Partition large tables, break large transactions into smaller batches. Monitor lag with pt-heartbeat (Percona Toolkit) for sub-second resolution. Write-intensive workloads: consider read replica in same region, separate analytics queries.
Replication Failover Procedures
Planned failover (maintenance): Stop writes on master, wait for slaves to catch up (Seconds_Behind_Master=0), promote slave to master: STOP SLAVE; RESET MASTER; on slave. Point other slaves to new master: CHANGE MASTER TO MASTER_HOST='new_master'. Update applications with new master address (DNS, config). Unplanned failover (master crash): Identify most up-to-date slave (Relay_Master_Log_File, Exec_Master_Log_Pos). On candidate slave: STOP SLAVE; RESET MASTER;. On other slaves: CHANGE MASTER TO MASTER_HOST='new_master', MASTER_LOG_FILE='...', MASTER_LOG_POS=...;. Potential data loss (asynchronous replication) - may need semi-sync for zero loss. Document runbook, practice quarterly. Use Orchestrator or MHA for automated failover.
Semi-Synchronous Replication
Semi-sync ensures at least one slave confirms receipt before master commits. Install plugin: INSTALL PLUGIN rpl_semi_sync_master SONAME 'semisync_master.so', INSTALL PLUGIN rpl_semi_sync_slave SONAME 'semisync_slave.so'. Enable: SET GLOBAL rpl_semi_sync_master_enabled = 1; SET GLOBAL rpl_semi_sync_slave_enabled = 1;. Settings: rpl_semi_sync_master_timeout = 10000 (fallback to async after 10sec), rpl_semi_sync_master_wait_for_slave_count = 1, rpl_semi_sync_master_wait_point = AFTER_SYNC (default, better consistency) or AFTER_COMMIT. Monitor: SHOW STATUS LIKE 'Rpl_semi_sync%';. Trade-off: increased write latency (RTT to slave). Use for critical data (payments, user accounts), not for logs or analytics. Combine with proxy layer to route non-critical writes to async slaves.
Encrypted and Secure Replication
Encrypt replication traffic with TLS: On master: GRANT REPLICATION SLAVE ON *.* TO 'replica'@'%' REQUIRE SSL;. Generate certificates: openssl req -newkey rsa:2048 -nodes -keyout master-key.pem -x509 -days 365 -out master-cert.pem. Configure SSL in my.cnf: master: ssl_ca, ssl_cert, ssl_key. Slave connection: CHANGE MASTER TO MASTER_SSL=1, MASTER_SSL_CA='ca-cert.pem', MASTER_SSL_CERT='client-cert.pem', MASTER_SSL_KEY='client-key.pem'. Restrict replication user source IP: GRANT REPLICATION SLAVE ON *.* TO 'replica'@'192.168.1.%'. Use non-standard port (3307) for replication. Separate replication network (VLAN or private IP). audit log on slave for compliance.
Troubleshooting Common Issues
Error 1032 (key not found): row missing on slave. Workaround: SET GLOBAL SQL_SLAVE_SKIP_COUNTER = 1; START SLAVE; (skip one statement). Debug: compare row on master and slave, repair with pt-table-sync. Error 1062 (duplicate entry): duplicate key on slave. Often from INSERT. Skip or resolve: REPLACE INTO or DELETE duplicate. Use slave_exec_mode=IDEMPOTENT for STATEMENT replication (not ROW). Connection errors: check SHOW SLAVE STATUS Last_IO_Error. Common: access denied (wrong password), host unreachable (firewall), master changed (change master to update host). Relay log corruption: STOP SLAVE; RESET SLAVE; START SLAVE; (re-fetch from master). Disk full on slave: clean relay logs: PURGE RELAY LOGS BEFORE NOW();. Consistent snapshot errors: use mysqldump --master-data or --dump-slave for proper replication start position.
Delayed Replication for Disaster Recovery
Delayed slave intentionally lags behind master to recover from user errors (DROP TABLE). Configure: CHANGE MASTER TO MASTER_DELAY = 3600 (1 hour lag). Slave applies events after delay. Recovery after accidental DROP: STOP SLAVE; SET GLOBAL SQL_SLAVE_SKIP_COUNTER = 1; to skip bad statement, or use mysqlbinlog --start-datetime to recover. Checkpoint: relay logs kept for delay period. Delayed slave also useful for reporting, analytics (consistent view over time). Monitor delay: SELECT timediff(now(), (SELECT variable_value FROM information_schema.global_status WHERE variable_name='LAST_IO_ERRNO') )) ;. Multiple delayed slaves for different RPOs (1h, 6h, 24h).
Alternative: Group Replication (InnoDB Cluster)
MySQL 8.0 Group Replication provides multi-master consistent replication. Set up: [mysqld], loose-group_replication_group_name=... , loose-group_replication_local_address=... . Enable: SET GLOBAL group_replication_bootstrap_group=ON; START GROUP_REPLICATION;. Add nodes: SET GLOBAL group_replication_group_seeds=...; START GROUP_REPLICATION;. Single-primary (writes to one node) or multi-primary (writes to any). Quorum consensus (majority). Automatic failover and recovery. Limitations: all tables must have primary key, network latency affects performance. Use MySQL Router for client connection routing. InnoDB Cluster + MySQL Router = complete HA solution. For small to medium workloads, easier than traditional replication.
Backup Strategies with Replication
Offload backups to replica: run mysqldump or XtraBackup on slave to avoid impacting master. Backup replication coordinates: mysqldump --master-data=2 --dump-slave=2 (records replication position at backup time). Point-in-time recovery: replay binary logs from backup position to error time. For slave backup: STOP SLAVE SQL_THREAD; then backup (consistent snapshot), START SLAVE;. Use slave as source for logical backups (requires storage space for binary logs). Test restoration process quarterly. Backup retention matches delayed slave capabilities. Consider Percona XtraBackup for hot physical backups (faster restore).
Conclusion
MySQL replication provides powerful scaling and resilience. Start with simple master-slave for read scaling, add monitoring first, implement GTID for easier failover, and consider semi-sync for critical writes. Practice failover regularly, document procedures, and test disaster recovery scenarios annually.