Hi, I have 3 nodes, each running a Redis server and a Sentinel. The server "redis1" (the master, IP 10.178.7.101) was affected by hardware problems. At that point the failover process started, but no server became the new master. The error was: failover-abort-not-elected
The weird thing is that both redis2 and redis3 voted for redis3 (7a32d16d2bcc7e940e516565ef2ed6d63c38b29d) as the new leader, yet the takeover never completed.
The logs are as follows (all the times are the same):
2019-01-09T16:52:27+01:00 7:S 09 Jan 15:52:27.549 # Error condition on socket for SYNC: No route to host
2019-01-09T16:52:27+01:00 8:S 09 Jan 15:52:27.172 # Error condition on socket for SYNC: No route to host
2019-01-09T16:52:26+01:00 8:X 09 Jan 15:52:26.813 # Next failover delay: I will not start a failover before Wed Jan 9 15:52:31 2019
2019-01-09T16:52:26+01:00 8:X 09 Jan 15:52:26.745 # -failover-abort-not-elected master mymaster 10.178.7.101 6379
2019-01-09T16:52:25+01:00 7:S 09 Jan 15:52:25.241 * MASTER <-> SLAVE sync started
2019-01-09T16:52:25+01:00 7:S 09 Jan 15:52:25.241 * Connecting to MASTER 10.178.7.101:6379
2019-01-09T16:52:24+01:00 8:S 09 Jan 15:52:24.715 * MASTER <-> SLAVE sync started
2019-01-09T16:52:24+01:00 8:S 09 Jan 15:52:24.714 * Connecting to MASTER 10.178.7.101:6379
2019-01-09T16:52:24+01:00 7:S 09 Jan 15:52:24.543 # Error condition on socket for SYNC: No route to host
2019-01-09T16:52:24+01:00 8:S 09 Jan 15:52:24.165 # Error condition on socket for SYNC: No route to host
2019-01-09T16:52:22+01:00 7:S 09 Jan 15:52:22.236 * MASTER <-> SLAVE sync started
2019-01-09T16:52:22+01:00 7:S 09 Jan 15:52:22.236 * Connecting to MASTER 10.178.7.101:6379
2019-01-09T16:52:21+01:00 8:S 09 Jan 15:52:21.707 * MASTER <-> SLAVE sync started
2019-01-09T16:52:21+01:00 8:S 09 Jan 15:52:21.707 * Connecting to MASTER 10.178.7.101:6379
2019-01-09T16:52:21+01:00 7:S 09 Jan 15:52:21.537 # Error condition on socket for SYNC: No route to host
2019-01-09T16:52:21+01:00 8:S 09 Jan 15:52:21.159 # Error condition on socket for SYNC: No route to host
2019-01-09T16:52:20+01:00 8:X 09 Jan 15:52:20.937 # Next failover delay: I will not start a failover before Wed Jan 9 15:52:31 2019
2019-01-09T16:52:20+01:00 8:X 09 Jan 15:52:20.878 # d0971f1dec5ef8dd1a997949b17164ac6fc56eae voted for 7a32d16d2bcc7e940e516565ef2ed6d63c38b29d 366
2019-01-09T16:52:20+01:00 8:X 09 Jan 15:52:20.875 # +vote-for-leader 7a32d16d2bcc7e940e516565ef2ed6d63c38b29d 366
2019-01-09T16:52:20+01:00 8:X 09 Jan 15:52:20.872 # +new-epoch 366
2019-01-09T16:52:20+01:00 8:X 09 Jan 15:52:20.869 # +vote-for-leader 7a32d16d2bcc7e940e516565ef2ed6d63c38b29d 366
2019-01-09T16:52:20+01:00 8:X 09 Jan 15:52:20.868 # +try-failover master mymaster 10.178.7.101 6379
2019-01-09T16:52:20+01:00 8:X 09 Jan 15:52:20.868 # +new-epoch 366
2019-01-09T16:52:19+01:00 7:S 09 Jan 15:52:19.231 * MASTER <-> SLAVE sync started
2019-01-09T16:52:19+01:00 7:S 09 Jan 15:52:19.230 * Connecting to MASTER 10.178.7.101:6379
2019-01-09T16:52:18+01:00 8:S 09 Jan 15:52:18.701 * MASTER <-> SLAVE sync started
2019-01-09T16:52:18+01:00 8:S 09 Jan 15:52:18.701 * Connecting to MASTER 10.178.7.101:6379
2019-01-09T16:52:18+01:00 7:S 09 Jan 15:52:18.531 # Error condition on socket for SYNC: No route to host
2019-01-09T16:52:18+01:00 8:S 09 Jan 15:52:18.153 # Error condition on socket for SYNC: No route to host
2019-01-09T16:52:16+01:00 8:X 09 Jan 15:52:16.569 # Next failover delay: I will not start a failover before Wed Jan 9 15:52:21 2019
2019-01-09T16:52:16+01:00 8:X 09 Jan 15:52:16.517 # -failover-abort-not-elected master mymaster 10.178.7.101 6379
.... .....
I would appreciate it if someone could help me, because I am unable to reproduce this error manually and I don't know how to solve it next time. Thanks and kind regards.
Comment From: ArkayZheng
Hello @sanzio21
The reason is that your Redis Sentinel cluster cannot elect a new leader when only 2 sentinels remain. The Sentinel failover process is a state machine; see void sentinelFailoverStateMachine(sentinelRedisInstance *ri) in sentinel.c.
First of all, a sentinel checks whether it is the elected leader; if it is not, this error occurs. See void sentinelFailoverWaitStart(sentinelRedisInstance *ri) in sentinel.c.
To solve the problem, just make sure that a majority of your sentinels stays reachable.
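For intuition, here is a minimal sketch of the vote-counting rule applied during leader election, loosely based on sentinelGetLeader() in sentinel.c; the run IDs and counts are made-up examples, not from this incident. The key point: the required majority is computed over all known sentinels, not just the reachable ones.

```python
# Simplified sketch of Sentinel leader-election vote counting
# (loosely based on sentinelGetLeader() in sentinel.c).
# The run IDs and counts below are made-up examples.

def elect_leader(votes, num_known_sentinels, quorum):
    """votes maps a voter's run_id to the run_id it voted for.

    A candidate wins only with a strict majority of ALL known
    sentinels (reachable or not) and at least `quorum` votes.
    """
    tally = {}
    for voted_for in votes.values():
        tally[voted_for] = tally.get(voted_for, 0) + 1
    if not tally:
        return None
    winner, max_votes = max(tally.items(), key=lambda kv: kv[1])
    needed = max(num_known_sentinels // 2 + 1, quorum)
    return winner if max_votes >= needed else None

# 3 known sentinels, quorum 2: two surviving sentinels voting for the
# same candidate is a majority (2 of 3), so the election succeeds.
print(elect_leader({"s2": "redis3", "s3": "redis3"}, 3, 2))  # → redis3

# But if the sentinels believe there are 5 peers (e.g. stale entries),
# the same 2 votes are no longer a majority and no leader is elected.
print(elect_leader({"s2": "redis3", "s3": "redis3"}, 5, 2))  # → None
```

Under this rule, a clean 2-of-3 vote like the one visible in the logs above would normally succeed, which is why the reported abort is surprising.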
Comment From: sanzio21
Hi, I appreciate your interest and answer. My doubt now is why the master didn't switch to another node, given that two healthy nodes remained, the quorum was met (2/3), and there was consensus on the new master node.
For several days I've been trying to reproduce the error by shutting down the nodes in different orders, stopping the Redis services, the Sentinel services, etc., but the master election always works fine and switches to another node without problems.
Do you have any idea? Many thanks.
Comment From: vattezhang
I encountered a similar problem a long time ago. Mine was caused by setting bind 127.0.0.0; when I changed the bind option to a specific IP, it worked fine. Hope this helps.
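For reference, a minimal sentinel.conf sketch along those lines; the IP addresses below are invented examples, not taken from anyone's actual config:

```
# sentinel.conf (example values only)
# Bind to a concrete, externally reachable interface so the other
# sentinels and Redis instances can connect; binding only to a
# loopback address breaks inter-sentinel communication.
bind 10.178.7.102
port 26379
sentinel monitor mymaster 10.178.7.101 6379 2
```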
Comment From: sanzio21
Hi, bind is set to 0.0.0.0. Nevertheless, I appreciate your interest. Thanks.
Comment From: Eder0ne
Hi, was the problem solved? I ran into the same situation :(
Comment From: sanzio21
Hi, no, we never found out what happened, but the problem never recurred. Thanks for your interest. Regards.
Comment From: Eder0ne
Thanks for the quick response! I couldn't find any docs about "-failover-abort-not-elected", nor anything by searching the source code.
Comment From: rgarrigue
We just had a similar issue. Slightly different: the two remaining nodes tried to elect a leader, but since they had the same epoch they were unable to choose one on the first attempt. I wish the documentation were clearer, with some recommendations on the config-epoch & leader-epoch settings.
Comment From: xjplke
I have the same problem with version 4.0.14, and I deploy Redis HA with Example 2 mentioned in https://redis.io/topics/sentinel. Sometimes it works, but sometimes I get "-failover-abort-not-elected" in the sentinel logs.
Comment From: adom4311
Hi, I have the same problem with version 5.0.3, and I know why it occurs.
In my case, some sentinels were holding stale sentinel information: they still had entries for sentinels that had previously been linked. So when the master went down and a sentinel leader had to be elected, no leader could be selected.
So, check your sentinel information.
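To make this failure mode concrete, here is a tiny sketch of the majority requirement (a simplification of the check in sentinelGetLeader() in sentinel.c; the sentinel counts are an assumed scenario):

```python
# How stale Sentinel entries raise the bar for leader election.
# The majority is computed over ALL sentinels a node knows about,
# including stale entries for instances that no longer exist.

def votes_needed(num_known_sentinels, quorum):
    # Strict majority of known sentinels, never below the quorum.
    return max(num_known_sentinels // 2 + 1, quorum)

print(votes_needed(3, 2))  # → 2 (healthy view: 2 live voters suffice)
print(votes_needed(5, 2))  # → 3 (2 stale entries: 3 votes needed, but
                           #      only 2 live sentinels can vote)
```

So two stale entries are enough to make a 3-node deployment unable to elect a leader after losing its master node.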
Comment From: sabretus
@adom4311 Thanks!
Just want to add: the fix is to run sentinel reset * on each sentinel, waiting at least 30 seconds between instances.
Here is the related quote from the documentation (https://redis.io/topics/sentinel):
"Sentinels never forget already seen Sentinels, even if they are not reachable for a long time, since we don't want to dynamically change the majority needed to authorize a failover and the creation of a new configuration number."
Comment From: peoples21
Hi, I'm testing the latest version, 7.0.4, and I see that failover still doesn't work when Redis and Sentinel are stopped (killed) on the master node (-failover-abort-not-elected master mymaster).
The command sentinel reset * on the replicas doesn't work either.
Does it work for anyone?
Thanks and regards.
Comment From: xdlsy
Thanks @adom4311, checking the sentinel information worked for me.