Because of https://github.com/redis/redis/pull/11785, in 6.2.11 a node that is loading an RDB marks its connections to the other cluster nodes as inactive and cannot finish the sync. It does this again and again in a never-ending cycle, and eventually gets a protocol error (but that is another case). If I raise cluster_node_timeout 1000 times in the config (effectively neutralizing the patch), the node is able to sync, but this parameter is used in many places in cluster.c, ordinary clients are no longer timed out, and the client max connection limit is reached very fast.

The general issue is missing keepalives while loading, i.e. while the slave retrieves a full snapshot from the master. In the slave's log I see a lot of "Address updated for node #nodeid, now ?:6379" messages with 6.2.11; with 6.2.10 I see them too, but only a few (2-5 lines). Each node in my cluster holds about 50 GB; loading the RDB or AOF takes ~8 minutes, a bgsave 8-9 minutes. On the master I see "FAIL message received from" entries in the log from some other node about that slave.

Comment From: enjoy-binbin

@soloestoy please take a look

Comment From: soloestoy

A TCP keep-alive packet doesn't trigger an I/O event; it works at the transport level, not the application level, so this is not a keep-alive issue. You can check whether the cluster-node-timeout config is too small, or you can revert the commit to test and confirm.
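
For reference, a minimal sketch of what kernel-level TCP keepalive tuning on a socket looks like (illustration only, loosely modeled on the keepalive helper Redis keeps in anet.c; the idea that #11785 derives the interval for cluster bus connections from cluster-node-timeout is an assumption based on the logs later in this thread):

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Illustration only, not Redis's actual code. Enables kernel-level TCP
 * keepalive on fd: the kernel sends the probes and tears the connection
 * down itself, so the application never sees an I/O event for them. */
static int set_tcp_keepalive(int fd, int interval_secs) {
    int yes = 1;
    if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &yes, sizeof(yes)) == -1)
        return -1;
#ifdef TCP_KEEPIDLE
    /* Idle time before the first probe. */
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &interval_secs,
                   sizeof(interval_secs)) == -1) return -1;
#endif
#ifdef TCP_KEEPINTVL
    /* Interval between unanswered probes. */
    int intvl = interval_secs / 3;
    if (intvl == 0) intvl = 1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &intvl,
                   sizeof(intvl)) == -1) return -1;
#endif
#ifdef TCP_KEEPCNT
    /* Number of probes to send before the kernel drops the connection. */
    int cnt = 3;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &cnt,
                   sizeof(cnt)) == -1) return -1;
#endif
    return 0;
}

The kernel answers and sends these probes without ever involving the application, which is the transport-versus-application distinction made above.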

Comment From: gzivdo

I know that it is only the TCP keep-alive socket option. The behavior I am imagining may well be wrong, but the patch definitely affects this somehow. I reverted only that patch in 6.2.11 and the slave node loads fine now, with cluster-node-timeout 5000. I also found that rdbcompression yes has a strong influence and is possibly related to issue https://github.com/redis/redis/issues/10073 too.

With that patch and "rdbcompression yes", all cluster nodes are forgotten right after loading the retrieved RDB ("Address updated for node #nodeid, now ?:6379"). The slave is not able to start; it starts resyncing. Without the patch, with "rdbcompression yes", there are only 3-5 such messages, and the slave syncs and starts fine. This is consistent; I tested both cases about 7 times. Without the patch and with "rdbcompression no" there are no such messages at all.

P.S. Clusters with 2-3 GB run fine in my environment, but this big cluster does not with that change. P.P.S. I tried upgrading a slave; the master was 6.2.10, kernel 5.4.

Comment From: nantiferov

Observing similar behaviour with big clusters (>7 GB), 3 masters, 3 replicas. Redis v6.2.13, official deb packages on Ubuntu 22.04, arm64.

I think this might help to reproduce:
* create a cluster with 6 nodes (3 masters, 3 replicas)
* fill each node with 7 GB or more of data
* restart a replica
* with high probability the replica won't be able to sync, failing with an error like == CRITICAL == This replica is sending an error to its master: 'Protocol error: too big inline request' after processing the command '<unknown>'
* sometimes, though, it finishes successfully (~10% of cases)

Comment From: nantiferov

@soloestoy, @enjoy-binbin, so what is the plan for this issue?

Apparently this change, https://github.com/redis/redis/pull/11785, made Redis Cluster unusable with large amounts of data.

Should everyone with such clusters create their own packages with this patch removed? Or stay on 6.2.10/7.0.8 forever?

Comment From: enjoy-binbin

https://github.com/redis/redis/issues/12001#issuecomment-1743066121

So with the steps in this comment, it can be reproduced? I will take a deep look tomorrow and check it.

Comment From: nantiferov

So with the steps in this comment, it can be reproduced?

Yep, I initially found this while stress testing a new version (6.2.14), and I would say it's reproducible in >90% of cases or so.

I will take a deep look tomorrow and check it.

Thank you!

Comment From: enjoy-binbin

@nantiferov sorry, I am not able to reproduce it. (My guess is that my data set is not big enough; I am short on memory, I was only able to build a 7 GB master-slave setup (3 masters, 3 slaves), and I only have one machine. To compensate, I adjusted cluster-node-timeout to 1000, or an even smaller value of 100. PS: with cluster-node-timeout=100, the keepalive value becomes 0.)
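
A quick worked example of why that value comes out as 0, assuming the interval is computed as cluster-node-timeout / 1000 seconds with C integer division (that formula is my reading of the logs, not something stated in this comment):

int cluster_node_timeout = 100;                    /* configured value, in milliseconds */
int keepalive_secs = cluster_node_timeout / 1000;  /* 100 / 1000 == 0 with integer division */
/* A result of 0 matches the "keep alive value: 0" lines in the log below. */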

My server logs (with some extra logging added to print the keepalive value and the accept info):

22374:S 18 Feb 2024 18:08:08.909 # User requested shutdown...
22374:S 18 Feb 2024 18:08:08.909 * Removing the pid file.
22374:S 18 Feb 2024 18:08:08.909 # Redis is now ready to exit, bye bye...
30152:C 18 Feb 2024 18:08:09.356 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
30152:C 18 Feb 2024 18:08:09.356 # Redis version=6.2.13, bits=64, commit=24c0bf5c, modified=1, pid=30152, just started
30152:C 18 Feb 2024 18:08:09.356 # Configuration loaded
30152:M 18 Feb 2024 18:08:09.356 * Increased maximum number of open files to 10032 (it was originally set to 1024).
30152:M 18 Feb 2024 18:08:09.356 * monotonic clock: POSIX clock_gettime
30152:M 18 Feb 2024 18:08:09.357 * Node configuration loaded, I'm 83f8fcb7742ff91a49f85bd0b1c9abf9f2b07e66
30152:M 18 Feb 2024 18:08:09.357 * Running mode=cluster, port=30001.
30152:M 18 Feb 2024 18:08:09.357 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
30152:M 18 Feb 2024 18:08:09.357 # Server initialized
30152:M 18 Feb 2024 18:08:09.358 * Ready to accept connections
30152:S 18 Feb 2024 18:08:09.358 * Before turning into a replica, using my own master parameters to synthesize a cached master: I may be able to synchronize with the new master with just a partial transfer.
30152:S 18 Feb 2024 18:08:09.358 * Connecting to MASTER 127.0.0.1:30004
30152:S 18 Feb 2024 18:08:09.358 * MASTER <-> REPLICA sync started
30152:S 18 Feb 2024 18:08:09.358 # Cluster state changed: ok
30152:S 18 Feb 2024 18:08:09.358 * Non blocking connect for SYNC fired the event.
30152:S 18 Feb 2024 18:08:09.358 * Master replied to PING, replication can continue...
30152:S 18 Feb 2024 18:08:09.358 * Trying a partial resynchronization (request 92770ae934cf91964ba8a8badb8f34a75ca6c62f:1).
30152:S 18 Feb 2024 18:08:09.423 # keep alive value: 0
30152:S 18 Feb 2024 18:08:09.423 # Accepting cluster node connection from 127.0.0.1:52108
30152:S 18 Feb 2024 18:08:09.427 # keep alive value: 0
30152:S 18 Feb 2024 18:08:09.427 # Accepting cluster node connection from 127.0.0.1:52110
30152:S 18 Feb 2024 18:08:09.431 # keep alive value: 0
30152:S 18 Feb 2024 18:08:09.431 # Accepting cluster node connection from 127.0.0.1:52112
30152:S 18 Feb 2024 18:08:09.436 # keep alive value: 0
30152:S 18 Feb 2024 18:08:09.436 # Accepting cluster node connection from 127.0.0.1:52114
30152:S 18 Feb 2024 18:08:09.466 * Full resync from master: afde72180cb210c53b9b833dc7cb009b67b70ec0:7433701806
30152:S 18 Feb 2024 18:08:09.466 * Discarding previously cached master state.
30152:S 18 Feb 2024 18:08:09.466 # keep alive value: 0
30152:S 18 Feb 2024 18:08:09.466 # Accepting cluster node connection from 127.0.0.1:52116
...
(many more keep alive and accept log lines)
30152:S 18 Feb 2024 18:09:40.706 # keep alive value: 0
30152:S 18 Feb 2024 18:09:40.706 # Accepting cluster node connection from 127.0.0.1:53392
30152:S 18 Feb 2024 18:09:40.706 # keep alive value: 0
30152:S 18 Feb 2024 18:09:40.706 # Accepting cluster node connection from 127.0.0.1:53394
30152:S 18 Feb 2024 18:09:40.706 # keep alive value: 0
30152:S 18 Feb 2024 18:09:40.706 # Accepting cluster node connection from 127.0.0.1:53396
30152:S 18 Feb 2024 18:09:40.706 # keep alive value: 0
30152:S 18 Feb 2024 18:09:40.706 # Accepting cluster node connection from 127.0.0.1:53398
30152:S 18 Feb 2024 18:09:40.706 # keep alive value: 0
30152:S 18 Feb 2024 18:09:40.707 # Accepting cluster node connection from 127.0.0.1:53400
30152:S 18 Feb 2024 18:09:40.707 # keep alive value: 0
30152:S 18 Feb 2024 18:09:40.707 # Accepting cluster node connection from 127.0.0.1:53402
30152:S 18 Feb 2024 18:09:40.707 # keep alive value: 0
...
30152:S 18 Feb 2024 18:10:01.566 * MASTER <-> REPLICA sync: Flushing old data
30152:S 18 Feb 2024 18:10:01.566 * MASTER <-> REPLICA sync: Loading DB in memory
30152:S 18 Feb 2024 18:10:01.950 * Loading RDB produced by version 6.2.13
30152:S 18 Feb 2024 18:10:01.950 * RDB age 112 seconds
30152:S 18 Feb 2024 18:10:01.950 * RDB memory usage when created 7462.07 Mb
30152:S 18 Feb 2024 18:10:01.966 # keep alive value: 0
30152:S 18 Feb 2024 18:10:01.966 # Accepting cluster node connection from 127.0.0.1:54346
30152:S 18 Feb 2024 18:10:01.966 # keep alive value: 0
30152:S 18 Feb 2024 18:10:01.966 # Accepting cluster node connection from 127.0.0.1:54348
30152:S 18 Feb 2024 18:10:01.966 # keep alive value: 0
30152:S 18 Feb 2024 18:10:01.966 # Accepting cluster node connection from 127.0.0.1:54350
30152:S 18 Feb 2024 18:10:01.966 # keep alive value: 0
30152:S 18 Feb 2024 18:10:01.966 # Accepting cluster node connection from 127.0.0.1:54352
30152:S 18 Feb 2024 18:10:01.966 # keep alive value: 0
30152:S 18 Feb 2024 18:10:01.966 # Accepting cluster node connection from 127.0.0.1:54354
30152:S 18 Feb 2024 18:10:02.036 # keep alive value: 0
30152:S 18 Feb 2024 18:10:02.036 # Accepting cluster node connection from 127.0.0.1:54356
30152:S 18 Feb 2024 18:10:02.036 # keep alive value: 0
30152:S 18 Feb 2024 18:10:02.036 # Accepting cluster node connection from 127.0.0.1:54358
30152:S 18 Feb 2024 18:10:02.036 # keep alive value: 0
30152:S 18 Feb 2024 18:10:02.036 # Accepting cluster node connection from 127.0.0.1:54360
30152:S 18 Feb 2024 18:10:02.048 # keep alive value: 0
30152:S 18 Feb 2024 18:10:02.048 # Accepting cluster node connection from 127.0.0.1:54362
30152:S 18 Feb 2024 18:10:02.048 # keep alive value: 0
30152:S 18 Feb 2024 18:10:02.048 # Accepting cluster node connection from 127.0.0.1:54364

30152:S 18 Feb 2024 18:10:30.654 # Done loading RDB, keys loaded: 8555939, keys expired: 0.
30152:S 18 Feb 2024 18:10:30.654 * MASTER <-> REPLICA sync: Finished with success

@soloestoy @oranagra Do you have any other insights?

Comment From: nantiferov

Hi, sorry for the delay, I was away last week.

I'll try to reproduce this again and add details.

So far my guess is that it wasn't reproduced in your case because all the Redis instances were running on one host, so network latency was minimal compared to 3 EC2 instances across different AZs.

Comment From: oranagra

@madolson PTAL

Comment From: Damned01

Hi, I have the same problem on Redis 7.2.4 in Kubernetes with the latest Bitnami helm chart.

My RDB is 70 GB, and once loaded into RAM it takes 160 GB.

Everything seems to work well: the RDB is created on the master, transferred to the SLAVE, and loaded into RAM. Then the pod restarts and does the same thing again.

What is strange is that if I delete the dump.rdb and the AOF files, the SLAVE manages to synchronize.

The problem is that if the MASTER fails and becomes a SLAVE, it is never going to synchronize unless I delete the files.

If you have an idea for a setting that I could test to remedy this, I would be very grateful. If it really is this bug, I hope it will be fixed, in view of the latest news...

Comment From: nantiferov

Hi @enjoy-binbin

Sorry for the delay, I think I found a way to reproduce this issue reliably.

Looks like there are at least 2 prerequisites:
* a big enough amount of data on the node (something like 7 GB or more)
* the cluster itself is under load; I used redis-benchmark with these parameters: redis-benchmark --cluster -l -r 1024

Then a replica restart causes an endless loop when trying to sync from the master, failing with a Protocol error (Master using the inline protocol. Desync?) error.

Logs from replica at this moment:

883:S 09 Apr 2024 09:46:56.879 * MASTER <-> REPLICA sync: Loading DB in memory
883:S 09 Apr 2024 09:46:56.881 * Loading RDB produced by version 6.2.13
883:S 09 Apr 2024 09:46:56.881 * RDB age 271 seconds
883:S 09 Apr 2024 09:46:56.881 * RDB memory usage when created 14334.92 Mb
883:S 09 Apr 2024 09:46:57.328 # Address updated for node f2f27672857da30cfadbf1d02818dd13f5762e1f, now ?:6378
883:S 09 Apr 2024 09:46:57.328 # Address updated for node 308eef4a27d6f8830ea0f9ca94f878113cb5c943, now ?:6379
883:S 09 Apr 2024 09:46:57.328 * Connecting to MASTER ?:6379
883:S 09 Apr 2024 09:46:57.328 # Unable to connect to MASTER: No such file or directory
883:S 09 Apr 2024 09:46:57.328 # Address updated for node db21a01e95a8c7b022f5c453ca074ddc604663de, now ?:6378
883:S 09 Apr 2024 09:46:57.328 # Address updated for node f764916bb69b8371f04bf25f71c072823060cec9, now ?:6379
883:S 09 Apr 2024 09:46:57.328 # Address updated for node ce59a1977397576ce9b4fd937ab648b2cf9f9cde, now ?:6379
883:S 09 Apr 2024 09:46:57.329 # Address updated for node f2f27672857da30cfadbf1d02818dd13f5762e1f, now 172.28.222.118:6378
883:S 09 Apr 2024 09:46:57.329 # Address updated for node 308eef4a27d6f8830ea0f9ca94f878113cb5c943, now 172.28.222.118:6379
883:S 09 Apr 2024 09:46:57.329 * Connecting to MASTER 172.28.222.118:6379
883:S 09 Apr 2024 09:46:57.330 * MASTER <-> REPLICA sync started
883:S 09 Apr 2024 09:46:57.330 # Address updated for node db21a01e95a8c7b022f5c453ca074ddc604663de, now 172.28.207.138:6378
883:S 09 Apr 2024 09:46:57.330 # Address updated for node f764916bb69b8371f04bf25f71c072823060cec9, now 172.28.207.138:6379
883:S 09 Apr 2024 09:46:57.330 # Address updated for node ce59a1977397576ce9b4fd937ab648b2cf9f9cde, now 172.28.226.247:6379
883:S 09 Apr 2024 09:46:57.331 # Address updated for node ce59a1977397576ce9b4fd937ab648b2cf9f9cde, now ?:6379
883:S 09 Apr 2024 09:46:57.331 # Address updated for node f2f27672857da30cfadbf1d02818dd13f5762e1f, now ?:6378
883:S 09 Apr 2024 09:46:57.331 # Address updated for node 308eef4a27d6f8830ea0f9ca94f878113cb5c943, now ?:6379
883:S 09 Apr 2024 09:46:57.331 * Connecting to MASTER ?:6379
883:S 09 Apr 2024 09:46:57.331 # Unable to connect to MASTER: Transport endpoint is not connected
883:S 09 Apr 2024 09:46:57.331 # Address updated for node db21a01e95a8c7b022f5c453ca074ddc604663de, now ?:6378
883:S 09 Apr 2024 09:46:57.331 # Address updated for node f764916bb69b8371f04bf25f71c072823060cec9, now ?:6379
883:S 09 Apr 2024 09:46:57.331 # Address updated for node f2f27672857da30cfadbf1d02818dd13f5762e1f, now 172.28.222.118:6378
883:S 09 Apr 2024 09:46:57.331 # Address updated for node db21a01e95a8c7b022f5c453ca074ddc604663de, now 172.28.207.138:6378
883:S 09 Apr 2024 09:46:57.331 # Address updated for node 308eef4a27d6f8830ea0f9ca94f878113cb5c943, now 172.28.222.118:6379
883:S 09 Apr 2024 09:46:57.331 * Connecting to MASTER 172.28.222.118:6379
883:S 09 Apr 2024 09:46:57.331 * MASTER <-> REPLICA sync started
883:S 09 Apr 2024 09:46:57.331 * FAIL message received from 308eef4a27d6f8830ea0f9ca94f878113cb5c943 about f2f27672857da30cfadbf1d02818dd13f5762e1f
883:S 09 Apr 2024 09:46:57.332 # Address updated for node f764916bb69b8371f04bf25f71c072823060cec9, now 172.28.207.138:6379
883:S 09 Apr 2024 09:46:57.332 # Address updated for node ce59a1977397576ce9b4fd937ab648b2cf9f9cde, now 172.28.226.247:6379
883:S 09 Apr 2024 09:46:57.332 * Non blocking connect for SYNC fired the event.
883:S 09 Apr 2024 09:46:57.520 * Master replied to PING, replication can continue...
883:S 09 Apr 2024 09:46:57.694 * Partial resynchronization not possible (no cached master)
883:S 09 Apr 2024 09:47:02.325 * Full resync from master: c250aed920a545e204f692e5e73e9a809b212992:1313811083
883:S 09 Apr 2024 09:47:02.743 * MASTER <-> REPLICA sync: receiving streamed RDB from master with EOF to disk
883:S 09 Apr 2024 09:51:17.289 # Done loading RDB, keys loaded: 98596866, keys expired: 0.
883:S 09 Apr 2024 09:51:17.291 * MASTER <-> REPLICA sync: Finished with success
883:S 09 Apr 2024 09:51:17.292 # WARNING: Receiving inline protocol from master, master stream corruption? Closing the master connection and discarding the cached master.
883:S 09 Apr 2024 09:51:17.292 # Protocol error (Master using the inline protocol. Desync?) from client: id=26 addr=172.28.222.118:6379 laddr=172.28.226.247:39675 fd=9 name= age=0 idle=0 flags=M db=0 sub=0 psub=0 multi=-1 qbuf=16384 qbuf-free=24570 argv-mem=0 obl=0 oll=0 omem=0 tot-mem=61440 events=r cmd=NULL user=(superuser) redir=-1. Query buffer during protocol error: 'rnitwlErK1yFM.Value29759.|.' (... more 16256 bytes ...) '.w'
883:S 09 Apr 2024 09:51:17.362 * Clear FAIL state for node f2f27672857da30cfadbf1d02818dd13f5762e1f: replica is reachable again.
883:S 09 Apr 2024 09:51:17.993 # Connection with master lost.
883:S 09 Apr 2024 09:51:17.993 * Reconnecting to MASTER 172.28.222.118:6379
883:S 09 Apr 2024 09:51:17.993 * MASTER <-> REPLICA sync started
883:S 09 Apr 2024 09:51:17.994 * Non blocking connect for SYNC fired the event.
883:S 09 Apr 2024 09:51:17.995 * Master replied to PING, replication can continue...
883:S 09 Apr 2024 09:51:17.996 * Partial resynchronization not possible (no cached master)
883:S 09 Apr 2024 09:51:22.327 * Full resync from master: c250aed920a545e204f692e5e73e9a809b212992:1313811447
883:S 09 Apr 2024 09:51:22.768 * MASTER <-> REPLICA sync: receiving streamed RDB from master with EOF to disk
883:S 09 Apr 2024 09:53:06.547 * MASTER <-> REPLICA sync: Flushing old data
883:S 09 Apr 2024 09:56:03.777 * MASTER <-> REPLICA sync: Loading DB in memory
883:S 09 Apr 2024 09:56:03.778 * Loading RDB produced by version 6.2.13
883:S 09 Apr 2024 09:56:03.778 * RDB age 281 seconds
883:S 09 Apr 2024 09:56:03.778 * RDB memory usage when created 14334.92 Mb

Master logs:

499:M 09 Apr 2024 09:45:58.972 * Marking node f2f27672857da30cfadbf1d02818dd13f5762e1f as failing (quorum reached).
499:M 09 Apr 2024 09:46:57.328 # Connection with replica 172.28.226.247:6378 lost.
499:M 09 Apr 2024 09:46:57.333 * Clear FAIL state for node 386ad5298d9b994bde5537d0a485363b79a79ef3: replica is reachable again.
499:M 09 Apr 2024 09:46:57.694 * Replica 172.28.226.247:6378 asks for synchronization
499:M 09 Apr 2024 09:46:57.694 * Full resync requested by replica 172.28.226.247:6378
499:M 09 Apr 2024 09:46:57.694 * Delay next BGSAVE for diskless SYNC
499:M 09 Apr 2024 09:47:02.224 * Starting BGSAVE for SYNC with target: replicas sockets
499:M 09 Apr 2024 09:47:02.442 * Background RDB transfer started by pid 1447
499:M 09 Apr 2024 09:48:47.123 * Clear FAIL state for node f2f27672857da30cfadbf1d02818dd13f5762e1f: replica is reachable again.
499:M 09 Apr 2024 09:51:17.994 # Connection with replica client id #20559 lost.
499:M 09 Apr 2024 09:51:17.994 # Diskless rdb transfer, last replica dropped, killing fork child.
1447:signal-handler (1712656277) Received SIGUSR1 in child, exiting now.
499:M 09 Apr 2024 09:51:17.996 * Replica 172.28.226.247:6378 asks for synchronization
499:M 09 Apr 2024 09:51:17.996 * Full resync requested by replica 172.28.226.247:6378
499:M 09 Apr 2024 09:51:17.996 * Current BGSAVE has socket target. Waiting for next BGSAVE for SYNC
499:M 09 Apr 2024 09:51:18.314 # Background transfer terminated by signal 10
499:M 09 Apr 2024 09:51:22.326 * Starting BGSAVE for SYNC with target: replicas sockets
499:M 09 Apr 2024 09:51:22.544 * Background RDB transfer started by pid 1451
1451:C 09 Apr 2024 09:53:06.764 * RDB: 1 MB of memory used by copy-on-write
499:M 09 Apr 2024 09:53:06.764 # Diskless rdb transfer, done reading from pipe, 1 replicas still up.
499:M 09 Apr 2024 09:53:07.145 * Background RDB transfer terminated with success
499:M 09 Apr 2024 09:53:07.145 * Streamed RDB transfer with replica 172.28.226.247:6378 succeeded (socket). Waiting for REPLCONF ACK from slave to enable streaming
499:M 09 Apr 2024 09:53:13.779 * FAIL message received from f2f27672857da30cfadbf1d02818dd13f5762e1f about 386ad5298d9b994bde5537d0a485363b79a79ef3
499:M 09 Apr 2024 09:55:07.126 * Marking node f2f27672857da30cfadbf1d02818dd13f5762e1f as failing (quorum reached).
499:M 09 Apr 2024 09:56:04.252 # Connection with replica 172.28.226.247:6378 lost.
499:M 09 Apr 2024 09:56:04.265 * Clear FAIL state for node 386ad5298d9b994bde5537d0a485363b79a79ef3: replica is reachable again.
499:M 09 Apr 2024 09:56:04.669 * Replica 172.28.226.247:6378 asks for synchronization
499:M 09 Apr 2024 09:56:04.669 * Full resync requested by replica 172.28.226.247:6378
499:M 09 Apr 2024 09:56:04.669 * Delay next BGSAVE for diskless SYNC
499:M 09 Apr 2024 09:56:09.403 * Starting BGSAVE for SYNC with target: replicas sockets
499:M 09 Apr 2024 09:56:09.616 * Background RDB transfer started by pid 1491
499:M 09 Apr 2024 09:57:54.370 * Clear FAIL state for node f2f27672857da30cfadbf1d02818dd13f5762e1f: replica is reachable again.

And it keeps going in this loop, even after I stopped redis-benchmark.

After downgrading to 6.2.10 and restarting, the replica was able to sync without issues. Logs:

2622:M 09 Apr 2024 10:05:08.648 # Server initialized
2622:M 09 Apr 2024 10:05:08.650 * Loading RDB produced by version 6.2.13
2622:M 09 Apr 2024 10:05:08.650 * RDB age 287 seconds
2622:M 09 Apr 2024 10:05:08.650 * RDB memory usage when created 14334.92 Mb
2622:M 09 Apr 2024 10:09:11.724 # Done loading RDB, keys loaded: 98596866, keys expired: 0.
2622:M 09 Apr 2024 10:09:11.724 * DB loaded from disk: 243.074 seconds
2622:M 09 Apr 2024 10:09:11.724 * Before turning into a replica, using my own master parameters to synthesize a cached master: I may be able to synchronize with the new master with just a partial transfer.
2622:M 09 Apr 2024 10:09:11.724 * Ready to accept connections
2622:S 09 Apr 2024 10:09:11.743 * Discarding previously cached master state.
2622:S 09 Apr 2024 10:09:11.743 * Before turning into a replica, using my own master parameters to synthesize a cached master: I may be able to synchronize with the new master with just a partial transfer.
2622:S 09 Apr 2024 10:09:11.743 * Connecting to MASTER 172.28.222.118:6379
2622:S 09 Apr 2024 10:09:11.743 * MASTER <-> REPLICA sync started
2622:S 09 Apr 2024 10:09:11.743 # Cluster state changed: ok
2622:S 09 Apr 2024 10:09:11.747 * Non blocking connect for SYNC fired the event.
2622:S 09 Apr 2024 10:09:11.747 * Master replied to PING, replication can continue...
2622:S 09 Apr 2024 10:09:11.749 * Trying a partial resynchronization (request c250aed920a545e204f692e5e73e9a809b212992:1313812204).
2622:S 09 Apr 2024 10:09:11.749 * Successful partial resynchronization with master.
2622:S 09 Apr 2024 10:09:11.749 * MASTER <-> REPLICA sync: Master accepted a Partial Resynchronization.

NOTE: This replica's id is 386ad5298d9b994bde5537d0a485363b79a79ef3, and its master's id is 308eef4a27d6f8830ea0f9ca94f878113cb5c943. There is another replica failing in the logs, with id f2f27672857da30cfadbf1d02818dd13f5762e1f.
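
For context on the "Protocol error (Master using the inline protocol. Desync?)" line in the replica log above: Redis picks the parser for each request from the first byte of the input buffer, and anything that does not start with '*' is treated as an inline command; inline data arriving on the master link is treated as stream corruption and the link is closed. A simplified sketch of that check (not Redis's exact code):

/* Simplified sketch, not Redis's actual parsing code. */
typedef enum { REQ_MULTIBULK, REQ_INLINE } req_type;

static req_type detect_request_type(const char *querybuf) {
    /* RESP commands from a well-behaved master always begin with '*'. */
    return (querybuf[0] == '*') ? REQ_MULTIBULK : REQ_INLINE;
}

/* When the connection is the master link and the type comes out as REQ_INLINE,
 * the replica logs the "inline protocol from master" warning and drops the link.
 * In the log above the query buffer starts with raw value bytes
 * ('rnitwlErK1yFM.Value29759...'), which suggests the replication stream and the
 * parser are out of sync right after the full resync finishes. */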

Comment From: ericsmalling

Any updates on this?

Comment From: xnox

Given that there seems to be a reproducer above, has anybody attempted a bisect to pinpoint when the regression happened?

Comment From: nantiferov

It's mentioned in the initial post that the reason is this PR: https://github.com/redis/redis/pull/11785

Last working versions: 6.2.10 and 7.0.8

Comment From: xnox

Proposing consistent keep-alive handling in https://github.com/redis/redis/pull/13308, as well as the ability to turn it off, if it is indeed the root cause for certain deployments.

I.e. some may need to set a higher or lower tcp-keepalive setting, or turn it off completely by setting it to zero, which should now be possible with the above patch.
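
For anyone who wants to try that once the PR lands, a hedged example of the knob being referred to, in redis.conf (whether it also governs cluster bus links depends on the proposal in #13308; that part is an assumption):

tcp-keepalive 0    # 0 disables TCP keepalive probes; the default is 300 seconds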

Comment From: nantiferov

Update: I tried to reproduce this bug with Redis 7.4.0, without success, using the same approach I used before with 6.2.13.

I tried a 6-node cluster, each node with a 24 GB maxmemory limit and 22 GB of data loaded. Each time the replica was able to sync back without this error, whether or not I cleaned up the bgsave dump.

Comment From: sureshvasanthkumar

Any update on this? We are still facing this issue. Any idea when, and in which version, this will be fixed?

Comment From: pnivlek

Hi, we have also hit this error on 6.2.18, on a cluster with 24 nodes of 32 GB each and an RDB memory usage of 16 GB. Is there a fix coming soon, possibly with #13308?

Comment From: xnox

I signed the CLA in case people want to merge my pull request. The issue is real, and the PR seems to fix it for me.