We are continuously getting the error below, and the replication between master and slave has broken.

Error accepting a client connection: (null) (addr=100.96.63.144:48974 laddr=100.96.11.72:6379)
Error accepting a client connection: (null) (addr=100.96.63.144:48976 laddr=100.96.11.72:6379)

TLS is enabled here. We have also verified the client certificates and they are good. The Redis version is 7.2.4.

Could you please share the probable causes of these errors? Why is the error shown as null?

Comment From: sundb

@SarthakSahu how did you change the config for master and slave?

Comment From: SarthakSahu

Hi @sundb, sorry, I couldn't get which config you are referring to. Also, the accept is not failing between master and slave; it is failing between the client (100.96.63.144) and the Redis server (100.96.11.72).

Comment From: sundb

@SarthakSahu thx, do you have the correct certificate for the client?

Comment From: SarthakSahu

Yes, the certificates are fine. We have done a curl with the certificates and received an empty result.

And I just learnt that the client is doing an SSCAN on a set with a pattern, and it is getting "context deadline exceeded". The problem started after that.

But the requested key is not available in the db.

Comment From: sundb

@SarthakSahu I'm a bit confused. Since the client has already received an empty reply, it indicates that the accept was successful. But did you only see this log after your program showed "context deadline exceeded"? Was it a timeout that caused the client to disconnect and then reconnect?

Comment From: SarthakSahu

@sundb I think you have mixed things up.

  1. To verify the certificates, we used the curl command with them and received an empty response. This tells us the certificates are fine.
  2. Yes, the client program says "context deadline exceeded" (Unable to scan due to timeout) and then tries to run the SSCAN again. We have only a single connection to the Redis server, which is re-used. After a couple of retries, the server pod loses its desired role. On the next SSCAN, the error is "CLUSTERDOWN The cluster is down" (Unable to scan).

Comment From: sundb

@SarthakSahu I want to clarify the following:

  1. How did you use the SSCAN command? Did you use a large COUNT option?
  2. Is the node going down caused by SSCAN?

Comment From: SarthakSahu

Hi @sundb,

  1. We use the SSCAN command with a key and a pattern (a rough example is shown below).
  2. Yes, we suspect the same, but we are not sure, because as soon as we bring down the client application, the cluster recovers.
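
For context, this is roughly the shape of the call from redis-cli; the key name and pattern are placeholders, not the actual ones from our application:

```
# Placeholder key and pattern. The second argument is the cursor: start at 0
# and repeat the call with the returned cursor until it comes back as 0.
# COUNT is omitted here, so Redis uses its default of 10 per call.
SSCAN myset 0 MATCH user:*
```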

Comment From: sundb

@SarthakSahu what is the size of the set, and what COUNT option (default 10) do you use? Please check if there are any relevant records in the slowlog. The overhead of using patterns is quite high; if the COUNT is always greater than the number of members that can be found, it means we traverse all the members on every call.
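
For example, you can check this from redis-cli (the key name in your slowlog entries will of course be your own):

```
# Commands slower than slowlog-log-slower-than (default 10000 microseconds)
# are recorded; look for SSCAN entries with a MATCH pattern here.
CONFIG GET slowlog-log-slower-than
SLOWLOG GET 10     # the 10 most recent slow commands
SLOWLOG RESET      # optionally clear the log before reproducing the issue
```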

> Yes, we suspect the same, but we are not sure, because as soon as we bring down the client application, the cluster recovers.

That's reasonable; the node will be discovered again once it becomes available.

Comment From: SarthakSahu

@sundb The COUNT is 500 here. But why is it impacting the cluster topology and making the node unresponsive? Can you suggest how to mitigate it?

Comment From: sundb

@SarthakSahu If a node gets stuck, it may cause other nodes to think it has gone down. You can reduce the COUNT (you can use the default of 10) to avoid the entire node becoming unavailable due to a single request. Otherwise, you can increase the cluster-node-timeout config.
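
For illustration only (15000 ms is the default; the 30000 below is just an example value, not a recommendation):

```
# Option 1: in redis.conf on each node (value in milliseconds)
cluster-node-timeout 30000

# Option 2: at runtime via redis-cli on each node
CONFIG SET cluster-node-timeout 30000
CONFIG REWRITE     # persist the change back to redis.conf if desired
```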

Comment From: SarthakSahu

Hi @sundb, the Redis database has a few sets with around 50k entries. We have deleted those. Also, the client query had been corrupted; after fixing those issues, the Redis cluster became healthy.

Do we have any dimensioning guidelines for the SSCAN command?

Comment From: sundb

> Hi @sundb, the Redis database has a few sets with around 50k entries. We have deleted those. Also, the client query had been corrupted; after fixing those issues, the Redis cluster became healthy.
>
> Do we have any dimensioning guidelines for the SSCAN command?

There is no standard answer. Anyway, it should be noted that pattern matching is not fast, and you should not use a large COUNT option.
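
As a rough sketch of what that means in practice (the key and pattern are placeholders): check the set size first, then walk the cursor in small batches instead of asking for a large COUNT in a single call:

```
SCARD myset                          # O(1): how many members a full scan will visit
SSCAN myset 0 MATCH user:* COUNT 10  # first batch
# Repeat with the returned cursor until it comes back as 0. Each call does a
# bounded amount of work, so one request cannot stall the node for long.
```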