Pandas version checks
- [x] I have checked that this issue has not already been reported.
- [x] I have confirmed this bug exists on the latest version of pandas.
- [x] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
On pandas < 2.1 (e.g. 1.5.3, 2.0.3):
import pandas as pd
pd.Series(['a'], name='hi').to_pickle('G:/temp/test.pkl')
On pandas 2.3.0 and main:
import pandas as pd
ser = pd.read_pickle('G:/temp/test.pkl') # appears to work
ser2 = pd.Series(['a'], name='hi') # works
pd.testing.assert_series_equal(ser, ser2) # works
pd.testing.assert_series_equal(ser, ser.copy()) # Attribute "name" are different
Issue Description
While migrating from 1.5.3 to the 2.x series we hit an issue where copying an unpickled Series drops its name (the actual operation was a .reindex_like, which called .copy under the hood). The bug begins with the pandas 2.1 series; I believe it may have been introduced in #51784, when the Series metadata was changed from name to _name.
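A quick way to see what appears to be going on (this is my reading of the mechanism, not confirmed; the attributes inspected here are pandas internals) is to compare the _metadata carried by the unpickled object with that of a freshly created Series, since __finalize__ only propagates the attributes listed there:
import pandas as pd

# read the legacy pickle created on pandas < 2.1 (same path as above)
ser = pd.read_pickle('G:/temp/test.pkl')
ser2 = pd.Series(['a'], name='hi')

# assumption: the unpickled object still carries the old metadata list,
# so a copy finds no shared metadata entries to propagate
print(ser._metadata)   # expected: ['name'] (restored from the legacy pickle)
print(ser2._metadata)  # ['_name'] on pandas >= 2.1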
Expected Behavior
It seems like an unpickled Series and its copy should be equal in all attributes, since that's what .copy does. However, anything that does a copy (including implicit copies, such as calling .reindex()) currently causes the name to be dropped inadvertently.
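For instance (a hypothetical illustration, reusing the legacy pickle from the example above):
import pandas as pd

ser = pd.read_pickle('G:/temp/test.pkl')
print(ser.name)                     # 'hi'
print(ser.copy().name)              # None -- dropped by the explicit copy
print(ser.reindex(ser.index).name)  # None -- dropped by the implicit copy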
Now I'm not sure to what extent read_pickle guarantees that all actions on an unpickled legacy object work the same way on a newly-created object. That said, one reason this may be worth fixing is that the problem seems to persist in new versions, i.e. rewriting the pickle with the new version directly doesn't mitigate the problem:
# using version 2.3.0
# read legacy pickle
ser = pd.read_pickle('G:/temp/test.pkl')
# write out new pickle of the object
ser.to_pickle('G:/temp/ser_copy.pkl')
# read in new pickle
ser_copy = pd.read_pickle('G:/temp/ser_copy.pkl')
pd.testing.assert_series_equal(ser, ser_copy) # works
pd.testing.assert_series_equal(ser_copy, ser_copy.copy()) # fails, even though ser_copy is read in from a pickle created in 2.3.0
And of course, calling ser.copy() to get a new pandas 2.3 object does not work either.
Thus it seems the only workaround is to:
1) Read in the legacy pickle
2) Serialize the unpickled object to some other format
3) Deserialize the other format
4) Serialize the newly-created object as a replacement pickle
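A minimal sketch of that round trip, assuming parquet as the intermediate format (requires a parquet engine such as pyarrow; the Series has to be wrapped in a DataFrame for parquet):
import pandas as pd

# 1) read in the legacy pickle
ser = pd.read_pickle('G:/temp/test.pkl')
# 2) serialize the unpickled object to some other format
ser.to_frame().to_parquet('G:/temp/test.parquet')
# 3) deserialize the other format into a newly-created object
ser_new = pd.read_parquet('G:/temp/test.parquet')[ser.name]
# 4) serialize the newly-created object as a replacement pickle
ser_new.to_pickle('G:/temp/test.pkl')

pd.testing.assert_series_equal(ser_new, ser_new.copy())  # now passes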
Installed Versions
Comment From: rhshadrach
Thanks for the report. From https://pandas.pydata.org/pandas-docs/dev/user_guide/io.html#pickling:
read_pickle() is only guaranteed backwards compatible back to a few minor release.
So this indeed is not a supported case.
I would go even further and think about dropping the promise of "a few minor releases". Pickles are really not meant for transferring data across environments, and trying to do so is going to be a constant source of edge cases. We should instead encourage users to use proper data formats like parquet, which handle the vast majority of cases (just not general Python objects).
cc @pandas-dev/pandas-core
Comment From: TomAugspurger
We should instead encourage users to use proper data formats
+1
Comment From: shiyangcao
pickled_obj = pd.read_pickle(file_obj)
if isinstance(pickled_obj, pd.Series):
    pickled_obj = pd.Series(pickled_obj, copy=False)
For some reason this seems to work while .copy() does not; is that expected? @Liam3851 perhaps you can try that.
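A quick check with the example pickle from above (this only verifies the behavior; my guess is that re-wrapping builds a brand-new Series whose metadata comes from the current class rather than from the pickled state, but that is an assumption):
import pandas as pd

pickled_obj = pd.read_pickle('G:/temp/test.pkl')
if isinstance(pickled_obj, pd.Series):
    # re-wrap in the current Series constructor without copying the data
    pickled_obj = pd.Series(pickled_obj, copy=False)

print(pickled_obj.name)         # 'hi'
print(pickled_obj.copy().name)  # 'hi' -- the name now survives a copy
pd.testing.assert_series_equal(pickled_obj, pickled_obj.copy())  # passes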