Redis Sentinel Bug Report

Describe the bug

Redis Sentinel promotes slaves holding another cluster's data during failover because it performs no replication ID or replication source validation. When a slave is monitored by multiple Sentinel groups (for example, after IP address reuse), Sentinel can promote a slave containing data from a completely different cluster, causing massive data corruption across the entire target cluster.

To reproduce

Environment Setup That Led to This Bug

Background: IP Address Reuse Scenario

  1. Original State: IP address was previously used in Cluster A Redis deployment
  2. Infrastructure Change: Same IP was later reassigned to Cluster B Redis deployment
  3. Configuration Gap: Cluster A Sentinel configurations were not cleaned up after IP reassignment
  4. Result: Single Redis instance monitored by TWO different Sentinel groups

Reproduction Scenario

This bug can be reproduced in environments where:

  1. IP Address Reuse: An IP address is reassigned from one Redis cluster to another
  2. Stale Sentinel Configuration: Old Sentinel configurations are not properly cleaned up
  3. Dual Monitoring: The same Redis instance gets monitored by multiple Sentinel groups

The Oscillation Pattern:

  • The Redis instance receives conflicting REPLICAOF commands from different Sentinel groups
  • It oscillates between replicating from different cluster masters (see the sketch below)
  • At failover time, the instance contains data from the "wrong" cluster
  • Sentinel promotes it without validating the data source
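
In an environment where this dual monitoring is in place, the oscillation is visible directly on the affected instance. A minimal way to watch it (a sketch, assuming the instance listens on port 6381 as in the test case below):

# Poll the instance's replication target; under dual monitoring the reported
# master flips between the two clusters as each Sentinel group reconfigures it
watch -n 5 "redis-cli -p 6381 INFO replication | grep -E 'master_host|master_port'"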

Critical Sequence:

  1. The Redis instance replicates from the Cluster A master (contains Cluster A data)
  2. The Cluster B master fails
  3. Cluster B Sentinels promote the instance as the new Cluster B master
  4. Bug: Cluster A data is now served as Cluster B data
  5. All Cluster B slaves sync from the contaminated master
  6. Massive data corruption spreads across all of Cluster B

Simplified Test Case for Redis Developers

# Setup two Redis clusters
redis-server --port 6379 --replicaof no one  # Cluster A Master
redis-server --port 6380 --replicaof no one  # Cluster B Master  
redis-server --port 6381 --replicaof 127.0.0.1 6380  # Oscillating slave (starts with Cluster B)

# Configure dual Sentinel monitoring (simulating IP reuse)
# Sentinel Group A (monitors Cluster A)
sentinel monitor cluster-a 127.0.0.1 6379 1
sentinel monitor cluster-a-slave 127.0.0.1 6381 1  # WRONG: monitoring Cluster B slave

# Sentinel Group B (monitors Cluster B)  
sentinel monitor cluster-b 127.0.0.1 6380 1
sentinel monitor cluster-b-slave 127.0.0.1 6381 1  # CORRECT

# Populate different data
redis-cli -p 6379 SET cluster-a-key "cluster-a-data"
redis-cli -p 6380 SET cluster-b-key "cluster-b-data"

# Trigger failover of Cluster A master
redis-cli -p 6379 DEBUG SEGFAULT

# BUG: Sentinel Group A will promote the slave containing Cluster B data
# Result: Cluster A now serves Cluster B data
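
If the failover proceeds as described, the contamination is easy to confirm from the data itself. A quick check, assuming the keys and ports set up above:

# The instance promoted for Cluster A should hold Cluster A data, but:
redis-cli -p 6381 EXISTS cluster-a-key   # returns 0: the Cluster A key is absent
redis-cli -p 6381 GET cluster-b-key      # returns "cluster-b-data": the wrong cluster's data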

Expected behavior

Sentinel should validate that a slave candidate is actually replicating from the failed master before promotion. The slave at port 6381 should be rejected for the Cluster A failover because:

  1. It is replicating from the Cluster B master (127.0.0.1:6380), not the Cluster A master (127.0.0.1:6379)
  2. Its replication ID belongs to Cluster B, not Cluster A
  3. Its data is from Cluster B, not Cluster A
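
The information needed for such a check is already exposed by the candidate itself via INFO replication. For example, against the test-case ports above (a sketch; the fields shown are standard INFO replication output):

# Ask the candidate which master it is actually replicating from
redis-cli -p 6381 INFO replication | grep -E 'role|master_host|master_port|master_replid'
# A valid Cluster A candidate would report master_port:6379; this slave
# reports master_port:6380 (the Cluster B master), so it should be rejected.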

Additional information

Real-World Impact from Production Incident

Timeline of Data Corruption

  • 12:46:21: Cluster B master fails, Sentinel promotes oscillating slave with Cluster A data
  • 12:46:22: Cluster B slaves reject sync due to replication ID mismatch
  • 12:47:33: Second failover promotes stable master, inheriting contaminated replication ID
  • 16:36: Manual data reconciliation required to restore service

Quantified Data Corruption

  • 8% of Cluster A keys found in Cluster B post-failover
  • Multiple Cluster B keys missing, including critical application data
  • 4+ hours of degraded service until manual reconciliation
  • Business impact: Significant service degradation

Evidence of Contaminated Replication Lineage

# Final Cluster B master showing Cluster A replication ID as secondary
redis-cli -p 6379 INFO replication
master_replid:b6c3b153655fe8ecbbc57bc2714f7e10ce051baf
master_replid2:39dea5d7ddb9b366bfba1e82ecb260e25935c167  # ← Cluster A ID!

Root Cause: Missing Validation in Sentinel

Current Sentinel slave selection criteria:

  1. ✅ Slave priority (lower = better)
  2. ✅ Replication offset (higher = more up-to-date)
  3. ✅ Slave lag (lower = better connectivity)
  4. ✅ Run ID (lexicographically smaller as tiebreaker)

Missing critical validations:

  5. ❌ Replication source validation: Is the slave replicating from the correct master?
  6. ❌ Replication ID consistency: Does the slave's replication ID match the failed master's?
  7. ❌ Data lineage verification: Does the slave contain the expected cluster's data?

Proposed Solution

Enhance Sentinel's slave selection logic to include replication lineage validation:

  1. Replication source validation: Verify slave is replicating from the correct master
  2. Replication ID consistency check: Ensure slave's replication ID matches the failed master
  3. Data lineage verification: Validate slave contains data from the expected cluster

This would prevent Sentinel from promoting slaves that contain data from different clusters.
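
Until such validation exists inside Sentinel, the replication ID consistency check can be approximated operationally. A minimal sketch, assuming the test-case ports above and that each cluster's replication ID is recorded while its master is healthy:

# Record Cluster A's replication ID while its master is healthy
CLUSTER_A_REPLID=$(redis-cli -p 6379 INFO replication | tr -d '\r' | awk -F: '/^master_replid:/{print $2}')

# Before accepting a candidate as the new Cluster A master, confirm that the
# lineage it reports (while still a replica) matches the recorded Cluster A ID
CANDIDATE_REPLID=$(redis-cli -p 6381 INFO replication | tr -d '\r' | awk -F: '/^master_replid:/{print $2}')
[ "$CANDIDATE_REPLID" = "$CLUSTER_A_REPLID" ] || echo "WARNING: candidate replication ID does not match Cluster A"

Sentinel already polls INFO from every instance it monitors, so an equivalent check inside Sentinel would not require this kind of out-of-band bookkeeping.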

Configuration to Prevent This Issue

Until fixed, prevent with proper Sentinel hygiene:

# 1. Clean up stale Sentinel configurations after IP changes
#    (run against each Sentinel, default port 26379)
redis-cli -p 26379 SENTINEL REMOVE <old-master-name>
redis-cli -p 26379 SENTINEL RESET <pattern>

# 2. Check for instances monitored by more than one Sentinel group
redis-cli -p 26379 SENTINEL MASTERS                 # compare master addresses across groups
redis-cli -p 26379 SENTINEL REPLICAS <master-name>  # compare replica addresses across groups

# 3. Monitor replication IDs for unexpected changes
redis-cli -p 6379 INFO replication | grep master_replid

This bug represents a critical gap in Redis Sentinel's validation logic that can cause catastrophic data corruption in production environments with IP address reuse scenarios.

How the Oscillation Occurred in Production

Cluster A Sentinel Logs Evidence

From Cluster A sentinel logs, we found evidence of the dual monitoring:

28 Apr 2025 16:11:50.268 * +slave slave <IP_ADDRESS>:6379 @ cluster-a-shard

This shows that during a rolling restart, the Cluster A Sentinels discovered the IP and added it as a slave to their cluster. This happened because:

  1. Stale Configuration: Old Cluster A Sentinel configs still referenced the reassigned IP
  2. Rolling Restart: During restart, Sentinels re-discovered topology and found the "new" slave
  3. No Validation: Sentinels didn't validate that this IP now belonged to a different cluster
  4. Dual Monitoring Established: Both sentinel groups monitored the same instance

The Oscillation Pattern

For 17 days, the Redis instance received conflicting REPLICAOF commands:

# Cluster B Sentinels commanding: "Be a Cluster B slave"
REPLICAOF <cluster-b-master> 6379

# Cluster A Sentinels commanding: "Be a Cluster A slave"  
REPLICAOF <cluster-a-master> 6379

# Back and forth, creating data inconsistency over time

Critical Timing During Failover

At the moment the Cluster B master failed:

  • The oscillating instance was in "Cluster A mode": it contained Cluster A data
  • Cluster B Sentinels saw it as an "available Cluster B slave" due to the stale monitoring config
  • No validation was performed: the Sentinels did not check what data it actually contained
  • The wrong promotion occurred: Cluster A data became Cluster B master data

Why Current Sentinel Logic Failed

The current sentinelSelectSlave() function in Redis checks:

// Current validation (insufficient)
if (slave->flags & (SRI_DISCONNECTED|SRI_S_DOWN)) continue;
if (slave->link->last_avail_time < (mstime() - down_after_period * 5)) continue;
if (slave->slave_priority == 0) continue;

// Missing: What master is this slave actually replicating from?
// Missing: Does this slave's data belong to our cluster?
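
// Sketch of one missing check (illustrative only, not actual Redis source):
// reject candidates whose reported replication source is not the master this
// Sentinel group is failing over. Sentinel already records the master_host /
// master_port each replica reports via INFO.
if (slave->slave_master_host == NULL ||
    strcasecmp(slave->slave_master_host, master->addr->ip) ||
    slave->slave_master_port != master->addr->port) continue;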

The bug is that Sentinel assumes any slave it monitors is a valid candidate for its cluster, without verifying the slave's actual replication source or data lineage.

Test Cases for Validation

  1. Normal failover: Slave with correct replication ID should be promoted
  2. Cross-cluster slave: Slave replicating from different cluster should be rejected
  3. IP reuse scenario: Dual-monitored slave should not be considered by wrong cluster
  4. Partial sync validation: Ensure promoted slave can serve partial syncs to existing slaves

Redis Version: 6.2.4
Environment: Production multi-cluster Redis deployment with Sentinel
Severity: Critical - Data corruption, extended outage, business impact

Comment From: zhijun42

I can't seem to reproduce the issue with your simplified test case. Specifically, these two lines would not work:

sentinel monitor cluster-a-slave 127.0.0.1 6381 1
sentinel monitor cluster-b-slave 127.0.0.1 6381 1

According to the syntax and instructions in the sentinel.conf file, we can only use sentinel monitor against a master server.

# sentinel monitor <master-name> <ip> <redis-port> <quorum>
#
# Tells Sentinel to monitor this master, and to consider it in O_DOWN
# (Objectively Down) state only if at least <quorum> sentinels agree.
#
# Note that whatever is the ODOWN quorum, a Sentinel will require to
# be elected by the majority of the known Sentinels in order to
# start a failover, so no failover can be performed in minority.
#
# Replicas are auto-discovered, so you don't need to specify replicas in
# any way. Sentinel itself will rewrite this configuration file adding
# the replicas using additional configuration options.