Crash report
We are running Redis 5.0.4 with 4 master + 4 slave nodes and 40+ GiB of data per node.
Two of the four slave nodes crash every time BGSAVE is executed, and no RDB file has been saved successfully since the crashes began. The crashes are memory related: Redis either tries to allocate an absurdly large amount of memory or hits a NULL dict pointer.
In the former case, the huge number that Redis tried to allocate as a byte count is actually part of a key string. For example,
3472901146147829813 + 3 = 3472901146147829816 = 0x3032393134303838, stored little-endian as 0x38 0x38 0x30 0x34 0x31 0x39 0x32 0x30 = "88041920"
This implies that sdslen(obj->ptr) returned a wrong length because the string object is corrupted.
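The arithmetic above can be checked with a short sketch. It assumes the eight key bytes were read as a little-endian unsigned 64-bit length (as on x86-64); the function name is illustrative and not part of Redis.

```python
def bytes_behind_length(bogus_len: int, width: int = 8) -> bytes:
    """Return the raw bytes that, read as a little-endian unsigned
    integer of `width` bytes, produce `bogus_len`."""
    return bogus_len.to_bytes(width, "little")


# Byte count from the "Out Of Memory allocating ..." log line,
# plus the 3 bytes noted in the report.
alloc_request = 3472901146147829813
raw = bytes_behind_length(alloc_request + 3)

print(raw.hex(" "))          # 38 38 30 34 31 39 32 30
print(raw.decode("ascii"))   # 88041920
```

A length that decodes to printable ASCII digits like this is consistent with key data having been interpreted as a length field.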
We are wondering what the cause could be and how to fix it. Thanks in advance.
Typical log excerpts follow. Complete sample logs are attached.
4804:S 05 Nov 2024 02:57:10.830 # Background saving terminated by signal 11
4804:S 05 Nov 2024 08:46:51.041 * Background saving started by pid 13134
13134:C 05 Nov 2024 08:50:34.709 # Out Of Memory allocating 3472901146147829813 bytes!
...
13134:C 05 Nov 2024 08:50:34.710 # (forcing SIGSEGV in order to print the stack trace)
13134:C 05 Nov 2024 08:50:34.710 # ------------------------------------------------
13134:C 05 Nov 2024 08:50:34.710 # Redis 5.0.4 crashed by signal: 11
13134:C 05 Nov 2024 08:50:34.710 # Crashed running the instruction at: 0x46f103
13134:C 05 Nov 2024 08:50:34.710 # Accessing address: 0xffffffffffffffff
13134:C 05 Nov 2024 08:50:34.710 # Failed assertion: <no assertion failed> (<no file>:0)
29363:C 05 Nov 2024 03:11:00.862 # Redis 5.0.4 crashed by signal: 11
29363:C 05 Nov 2024 03:11:00.862 # Crashed running the instruction at: 0x42a820
29363:C 05 Nov 2024 03:11:00.862 # Accessing address: (nil)
29363:C 05 Nov 2024 03:11:00.862 # Failed assertion: <no assertion failed> (<no file>:0)
crash-1.log crash-2.log crash-3.log
Comment From: sundb
@bpint Do you have keys larger than 2 GB?
Comment From: bpint
Nope. I don't think we have large keys or values. We are also very concerned about data integrity right now.
Similar issues are,
Some inspiring investigations,
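One way to confirm the key sizes rather than estimate them is to scan a node and ask Redis for per-key memory. This is a minimal sketch, assuming a redis-py style client that provides scan_iter() and memory_usage() (wrappers around SCAN and MEMORY USAGE). The function name and the 2 GiB threshold are illustrative, and in a cluster it would have to be run against each node separately.

```python
from typing import Iterator, Tuple

TWO_GIB = 2 * 1024 ** 3


def find_large_keys(client, threshold: int = TWO_GIB) -> Iterator[Tuple[bytes, int]]:
    """Yield (key, approximate_bytes) for keys at or above `threshold`.

    `client` is assumed to behave like redis.Redis: scan_iter() walks the
    keyspace incrementally and memory_usage(key) returns an int or None.
    """
    for key in client.scan_iter(count=1000):
        size = client.memory_usage(key)
        # A key may expire or be deleted between SCAN and MEMORY USAGE,
        # in which case the reply is None and the key is skipped.
        if size is not None and size >= threshold:
            yield key, size
```

MEMORY USAGE estimates aggregate types by sampling their elements, so the reported sizes are approximate; an empty result only suggests, and does not prove, that no key exceeds the threshold.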
Comment From: sundb
@bpint Can you share your executable file (redis-server) with me (debing.sun@redis.com)? Thanks.
Comment From: bpint
@sundb Really sorry that it took so long to get the binary file from the production system.
The file has been sent to your mailbox, and the crash log is:
Core was generated by `redis-rdb-bgsave 198.218.61.34:17382 [cl'.
Program terminated with signal 11, Segmentation fault.
#0 0x0000000000442ed8 in sdssplitargs (line=0x3 <Address 0x3 out of bounds>, argc=0x0) at sds.c:1097
1097    in sds.c
Thanks a lot.
Comment From: zhuleiandy888
@bpint @sundb Did you find anything?