Pandas version checks
- [x] I have checked that this issue has not already been reported.
- [x] I have confirmed this bug exists on the latest version of pandas.
- [x] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
```python
import pandas as pd

while True:
    dr = pd.date_range(start='1/1/2019', end='1/2/2019', freq='s', tz='UTC')
    df = pd.DataFrame({'col1': [12.34]}, index=dr)
    df.reset_index().to_json(orient='values', date_format='iso')
```
Issue Description
There appears to be a memory leak in pandas `to_json()` when converting datetime values. When running the reproducer above, the system's memory use increases continuously.
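One way to observe the growth (a sketch, not part of the original report; it bounds the loop and assumes a Unix-like platform where `resource.getrusage` reports peak resident set size) is to print the RSS delta across a fixed number of iterations:

```python
import resource

import pandas as pd

def peak_rss() -> int:
    # ru_maxrss is the peak resident set size of this process
    # (KiB on Linux, bytes on macOS).
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

before = peak_rss()
for _ in range(200):
    dr = pd.date_range(start='1/1/2019', periods=1000, freq='s', tz='UTC')
    df = pd.DataFrame({'col1': [12.34] * len(dr)}, index=dr)
    df.reset_index().to_json(orient='values', date_format='iso')
print(f"peak RSS grew by {peak_rss() - before}")
```

On an affected pandas build the reported delta keeps climbing as the iteration count is raised; on a fixed build it should level off after warm-up.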
Expected Behavior
Memory use should be stable.
Installed Versions
Comment From: Alvaro-Kothe
I was able to reproduce the memory leak and noticed it doesn't occur when `reset_index()` is removed.
Comment From: swt2c
Yeah, the `reset_index()` is needed in this particular example because otherwise the `datetime64[ns, UTC]` column won't be rendered by `to_json()`.

The memory leak seems to occur when `to_json()` is handling `datetime64[ns, UTC]` values.
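To illustrate the point (a sketch, not from the original thread): with `orient='values'` the index is not serialized at all, so the tz-aware datetimes only reach the ISO conversion path once `reset_index()` turns them into a regular column:

```python
import pandas as pd

dr = pd.date_range('1/1/2019', periods=2, freq='D', tz='UTC')
df = pd.DataFrame({'col1': [1.0, 2.0]}, index=dr)

# The index is dropped by orient='values', so no datetimes are emitted:
print(df.to_json(orient='values', date_format='iso'))

# reset_index() moves the tz-aware datetimes into a column, which is emitted:
print(df.reset_index().to_json(orient='values', date_format='iso'))
```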
Comment From: swt2c
Hi @Alvaro-Kothe and @mroeschke, thank you for the fixes. Unfortunately, they did not fix the memory leak that I'm seeing. Can someone reopen this issue please?
The reproducer shown in #62210 is unfortunately slightly different from my use case. Tweaking it as below shows the issue I'm still experiencing (note the `date_format='iso'`).
```python
import pandas as pd

for _ in range(10_000):
    df = pd.DataFrame({'col1': [12.34]},
                      index=pd.date_range('1/1/2019', '10/1/2019', freq="D", tz="UTC"))
    result = df.reset_index().to_json(date_format='iso')
```
I investigated the issue myself with valgrind and it shows the following:
```text
==4137== 1,150,800 bytes in 27,400 blocks are definitely lost in loss record 17,727 of 17,727
==4137==    at 0x4846828: malloc (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==4137==    by 0x85F5067: PyDateTimeToIso (pd_datetime.c:173)
==4137==    by 0x12B0F29B: PyDateTimeToIsoCallback (objToJSON.c:350)
==4137==    by 0x12B13B1F: Object_getStringValue (objToJSON.c:1891)
==4137==    by 0x12B1664E: encode (ultrajsonenc.c:1084)
==4137==    by 0x12B1635A: encode (ultrajsonenc.c:1027)
==4137==    by 0x12B1635A: encode (ultrajsonenc.c:1027)
==4137==    by 0x12B16C70: JSON_EncodeObject (ultrajsonenc.c:1195)
==4137==    by 0x12B14383: objToJSON (objToJSON.c:2080)
==4137==    by 0x55401A: ??? (in /usr/bin/python3.11)
==4137==    by 0x52ED62: _PyObject_MakeTpCall (in /usr/bin/python3.11)
==4137==    by 0x53C61C: _PyEval_EvalFrameDefault (in /usr/bin/python3.11)
```
Looking at the code, it initially looks like the fix would be to update `PyDateTimeToIsoCallback` to match `NpyTimeDeltaToIsoCallback`, which stores the returned `char *` in `GET_TC(tc)->cStr` so it can be freed later. However, in https://github.com/pandas-dev/pandas/commit/fb6c4e33c45938d7675d4c9a132324cd08df2f3c, @WillAyd removed the call to `PyObject_Free(GET_TC(tc)->cStr);`, so the `cStr` will not be freed anyway.
Thus, @WillAyd do you have any thoughts on how this should be resolved?
Comment From: Alvaro-Kothe
Hi @swt2c, thanks for pointing out that the problem still exists. Can you check out my pull request (#62217) and verify whether the memory leak persists?