Pandas version checks
- [x] I have checked that this issue has not already been reported.
- [x] I have confirmed this bug exists on the latest version of pandas.
- [x] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import pandas as pd
class MySeq(pd.Series):
    _metadata = ['property']
    @property
    def _constructor(self):
        return MySeq
seq = MySeq([*'abc'], name='data')
assert seq.name == 'data'
assert seq[1:2].name == 'data'
assert seq[[1, 2]].name is None
assert seq.drop_duplicates().name is None
Issue Description
pandas 2.2.3
Let's consider two ways of defining a custom subclass of pandas.Series. In the first one, no custom properties are added, while in the second one, custom metadata is included:
import pandas as pd
class MySeries(pd.Series):
    @property
    def _constructor(self):
        return MySeries
seq = MySeries([*'abc'], name='data')
print(f'''Case without _metadata:
{isinstance(seq[0:1], MySeries) = }
{isinstance(seq[[0, 1]], MySeries) = }
{seq[0:1].name = }
{seq[[0, 1]].name = }
''')
class MySeries(pd.Series):
    _metadata = ['property']
    @property
    def _constructor(self):
        return MySeries
seq = MySeries([*'abc'], name='data')
seq.property = 'MyProperty'
print(f'''Case with defined _metadata:
{isinstance(seq[0:1], MySeries) = }
{isinstance(seq[[0, 1]], MySeries) = }
{seq[0:1].name = }
{seq[[0, 1]].name = }
{getattr(seq[0:1], 'property', 'NA') = }
{getattr(seq[[0, 1]], 'property', 'NA') = }
''')
The output of the code above will be:
Case without _metadata:
isinstance(seq[0:1], MySeries) = True
isinstance(seq[[0, 1]], MySeries) = True
seq[0:1].name = 'data'
seq[[0, 1]].name = 'data'
Case with defined _metadata:
isinstance(seq[0:1], MySeries) = True
isinstance(seq[[0, 1]], MySeries) = True
seq[0:1].name = 'data'
seq[[0, 1]].name = None <<< Problematic result of indexing
getattr(seq[0:1], 'property', 'NA') = 'MyProperty'
getattr(seq[[0, 1]], 'property', 'NA') = 'MyProperty'
So, if _metadata is defined, the series name is preserved when slicing but lost when indexing with a list, whereas without _metadata the name is preserved in both cases.
As a workaround, we can add 'name' to _metadata:
class MySeries(pd.Series):
    _metadata = ['property', 'name']
    @property
    def _constructor(self):
        return MySeries
seq = MySeries([*'abc'], name='data')
assert seq[0:1].name == 'data'
assert seq[[0, 1]].name == 'data'
However, I'm not sure whether treating name as a metadata attribute could cause other issues later on.
The problem arose when applying PyJanitor methods to user-defined DataFrames with _metadata. Specifically, drop_duplicates was applied to a separate column, followed by an attempt to access its name in order to combine the result into a new DataFrame.
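Where the subclass itself cannot be changed, one possible stopgap for this scenario is to restore the name explicitly after the call. This is only a minimal sketch reusing the MySeq class from the reproducible example; the variable names column, unique and frame are illustrative and do not come from PyJanitor:

```python
import pandas as pd


class MySeq(pd.Series):
    _metadata = ['property']  # custom metadata only, the name is not listed

    @property
    def _constructor(self):
        return MySeq


column = MySeq([*'abca'], name='data')

# Reattach the original name in case drop_duplicates() dropped it.
unique = column.drop_duplicates().rename(column.name)

# The name can now be used as the column label of a new DataFrame.
frame = pd.DataFrame({unique.name: unique})
```

This does not address the underlying behavior; it only keeps the calling code working regardless of whether the name survived the operation.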
Expected Behavior
import pandas as pd
class MySeq(pd.Series):
    _metadata = ['property']
    @property
    def _constructor(self):
        return MySeq
seq = MySeq([*'abc'], name='data')
assert seq[[1, 2]].name == 'data'
assert seq.drop_duplicates().name == 'data'
Installed Versions
Comment From: vitalizzare
I've found that a Series object has _metadata = ['_name'] by default. This means that when manually defining _metadata in a custom Series subclass, we need to explicitly add '_name' to it as well. I couldn't find this information in the documentation. Maybe it should be mentioned here: https://pandas.pydata.org/pandas-docs/stable/development/extending.html#define-original-properties. What do you think?
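A minimal sketch of what that could look like in practice: instead of hard-coding '_name', the subclass extends whatever list the parent class already defines. This assumes pd.Series._metadata contains the entry pandas uses for the name (as observed above on 2.2.3); it is one possible pattern, not something taken from the pandas documentation:

```python
import pandas as pd


class MySeries(pd.Series):
    # Extend the parent's list rather than replacing it, so the entry
    # that carries the name is kept alongside the custom attribute.
    _metadata = [*pd.Series._metadata, 'property']

    @property
    def _constructor(self):
        return MySeries


seq = MySeries([*'abc'], name='data')
seq.property = 'MyProperty'

by_list = seq[[0, 1]]                # indexing with a list
deduplicated = seq.drop_duplicates()  # the call from the original report

print(f'{by_list.name = }')
print(f'{deduplicated.name = }')
print(f'{by_list.property = }')
```

Building on the parent's list avoids depending on whether the internal entry is spelled 'name' or '_name'.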