Pandas version checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of pandas.
- [ ] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import pandas as pd

# Create a DataFrame with multi-level columns
data = {('Apple Inc.', 'abstract'): [1, 2], ('Apple Inc.', 'web_url'): [3, 4],
        ('Adobe Inc.', 'abstract'): [5, 6], ('Adobe Inc.', 'web_url'): [7, 8]}
test_df = pd.DataFrame(data)
# It will look something like this:
# (Apple Inc., abstract) (Apple Inc., web_url) (Adobe Inc., abstract) (Adobe Inc., web_url)
# 1 3 5 7
# 2 4 6 8

# Create a DataFrame with company-ticker mapping
NASDAQ_Ticker = pd.DataFrame({'Company': ['Apple Inc.', 'Adobe Inc.'],
                              'Ticker': ['AAPL', 'ADBE']})

def company_to_ticker_index(df):
    new_columns = {}
    for item in df.columns:
        # Directly unpack the tuple into variables
        company_name, label = item
        # Find the corresponding ticker symbol for the company
        ticker = NASDAQ_Ticker.loc[NASDAQ_Ticker['Company'] == company_name]['Ticker'].squeeze()
        # Create the new column label
        new_label = (ticker, label)
        # Add the new label to the dictionary
        new_columns[item] = new_label
    # Rename the columns using the dictionary
    df.rename(columns=new_columns, inplace=True)

# Test the function
company_to_ticker_index(test_df)
# Print the new column names to check
print(test_df.columns)
Issue Description
Here I attempt to rename the columns, which should now be tuples with ticker symbols instead of company names. However, the resulting dataframe still unexpectedly reflects the company labels.
Expected Behavior
Relabeling of the DataFrame's multi-index columns from (company, X) to (ticker, X).
Installed Versions
Comment From: miltonsin345
That's messed
Comment From: hedeershowk
I think you just need to do something more like df.rename(columns={'Apple Inc.': 'AAPL'}). You don't need the tuple there. See this stack overflow reply for more details.
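A minimal sketch of that suggestion applied to the frame from the report. It assumes the goal is only to swap the first-level labels; the name `company_to_ticker` is introduced here for illustration, and `level=0` is added so that a matching label on the second level would be left alone.

```python
import pandas as pd

data = {('Apple Inc.', 'abstract'): [1, 2], ('Apple Inc.', 'web_url'): [3, 4],
        ('Adobe Inc.', 'abstract'): [5, 6], ('Adobe Inc.', 'web_url'): [7, 8]}
test_df = pd.DataFrame(data)

NASDAQ_Ticker = pd.DataFrame({'Company': ['Apple Inc.', 'Adobe Inc.'],
                              'Ticker': ['AAPL', 'ADBE']})

# Hypothetical helper mapping: company name -> ticker (plain labels, not tuples).
company_to_ticker = dict(zip(NASDAQ_Ticker['Company'], NASDAQ_Ticker['Ticker']))

# Keys are matched against individual level values; level=0 limits the
# substitution to the first (company) level.
renamed = test_df.rename(columns=company_to_ticker, level=0)
print(renamed.columns.tolist())
```

The second-level labels ('abstract', 'web_url') pass through untouched because none of them is a key in the mapping.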
Comment From: nickzoic
Yeah, I've found the same thing, I think, and this is maybe an easier demonstration:
import pandas as pd
df1 = pd.DataFrame([{'a': 1, 'b': 2}, {'a': 3, 'b': 4}]).groupby('a').agg({'b': ('count', 'sum')})
print("DF1:")
print(df1)
df2 = df1.rename(columns={('b', 'count'): 'fnord'}, errors='raise')
print("DF2:")
print(df2)
df1.rename(columns={('b', 'fnord'): 'c'}, errors='raise')
produces the output:
DF1:
      b
  count sum
a
1     1   2
3     1   4
DF2:
      b
  count sum
a
1     1   2
3     1   4
Traceback (most recent call last):
  File "/home/nick/Work/wehi/pandas_test/test2.py", line 13, in <module>
    df1.rename(columns={('b'): 'c'}, errors='raise')
  File "/home/nick/Work/wehi/pandas_test/.direnv/python-3.10.7/lib/python3.10/site-packages/pandas/core/frame.py", line 5640, in rename
    return super()._rename(
  File "/home/nick/Work/wehi/pandas_test/.direnv/python-3.10.7/lib/python3.10/site-packages/pandas/core/generic.py", line 1090, in _rename
    raise KeyError(f"{missing_labels} not found in axis")
KeyError: "[('b', 'fnord')] not found in axis"
You can see that df2 is unchanged from df1.
It isn't just that the column ('b', 'count') isn't found: I've set errors='raise', and if you try renaming some other column combination, e.g. ('b', 'fnord'), it does raise a KeyError.
Seems to do the same in 2.0.3, 2.1.1 and e0d6051f985994e594b07a2b93b9ca2eff43eae4
Comment From: nickzoic
OK looking at the source code a piece of the puzzle falls into place:
- the checking of the allowed values is done by pandas.core.generic._rename
  - the check is done only if errors == "raise" (see #13473)
  - this checks for the whole index value tuple's existence, not the individual levels.
- the transforming of the index values is done by pandas.core.indexes.base._transform_index.
  - this substitutes each part of the index value tuple on whichever level (unless level is not None, in which case only on that level)
  - this works fine if errors != "raise".
So these are incompatible, and in the case where errors == "raise" you can't rename multi level indexes.
Either the checking should be fixed or the transforming should be changed.
The former is probably less problematic (even though it doesn't solve my problem[1]) as people will have used the rename-on-every-level behaviour without errors == "raise" and changing this would break stuff.
[1] ... which is better solved by to_flat_index ...
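A small sketch of the per-level substitution described above, using a frame built directly with the same columns as df1. The names `by_label`, `by_level` and `outcome` are introduced here for illustration. The errors="raise" call is wrapped in try/except because, per this thread, it raises a KeyError on the affected versions, and that may differ on others.

```python
import pandas as pd

# Same column layout as df1 in the demonstration, built directly.
columns = pd.MultiIndex.from_tuples([("b", "count"), ("b", "sum")])
df1 = pd.DataFrame([[1, 2], [1, 4]],
                   index=pd.Index([1, 3], name="a"),
                   columns=columns)

# A key that is a single level value is substituted wherever it appears.
by_label = df1.rename(columns={"count": "fnord"})
print(by_label.columns.tolist())

# level=1 restricts the substitution to the second level only.
by_level = df1.rename(columns={"count": "fnord"}, level=1)
print(by_level.columns.tolist())

# The same key with errors="raise" and no level: the existence check
# compares against whole tuples, so this may raise on affected versions.
try:
    df1.rename(columns={"count": "fnord"}, errors="raise")
    outcome = "renamed"
except KeyError as exc:
    outcome = f"KeyError: {exc}"
print(outcome)
```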
Comment From: TabLand
I was stung by this issue earlier today. I was able to work around it by exporting to a dict, performing my column renames there, and then creating a new DataFrame. Whilst renaming a MultiIndex column using a tuple feels intuitive, it introduces additional complexity in terms of adding or removing levels... I also hit some circumstances where pandas treats a tuple key as an individual column name (e.g. when all sub- and super-column names are unique and/or rows are not indexed, which makes sense).
From what I currently understand of the complexity, I agree that it makes sense to have the checking code consistently match the actual behaviour of the transformation code, and perhaps to improve the documentation to clarify that all column labels, including sub-column labels, are renamed individually.
I will try to draft a PR within the next week...
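For reference, a rough sketch of the dict round-trip workaround mentioned above. The exact steps used were not posted, so the use of `to_dict(orient="list")` and the `tuple_map` name are assumptions made for illustration.

```python
import pandas as pd

df = pd.DataFrame({("Apple Inc.", "abstract"): [1, 2],
                   ("Apple Inc.", "web_url"): [3, 4]})

# Hypothetical mapping from whole old tuples to whole new tuples.
tuple_map = {("Apple Inc.", "abstract"): ("AAPL", "abstract")}

# Export column -> list of values; the dict keys are the column tuples.
as_dict = df.to_dict(orient="list")

# Rename the keys in plain Python, then build a new DataFrame from the result.
renamed = pd.DataFrame({tuple_map.get(key, key): values
                        for key, values in as_dict.items()},
                       index=df.index)
print(renamed.columns.tolist())
```

Going through plain lists can change dtypes in some cases and drops the column level names, so this suits small frames better than large ones.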
Comment From: TabLand
Hi Team, Could someone review & test the patch provided in #56936?
Comment From: ramwin
The bug still exists on pandas 2.3.1. Currently you can avoid it by using del:
df[("new_level1", "new_level2")] = df[("old_level1", "old_level2")]
del df[("old_level1", "old_level2")]
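The assign-then-delete approach moves the renamed column to the end. An alternative sketch that relabels whole tuples in place by rebuilding the column index is shown here; it is not taken from the thread, and `tuple_map` and `relabeled` are hypothetical names.

```python
import pandas as pd

df = pd.DataFrame({("old_level1", "old_level2"): [1, 2],
                   ("keep", "me"): [3, 4]})

# Hypothetical mapping from whole old tuples to whole new tuples.
tuple_map = {("old_level1", "old_level2"): ("new_level1", "new_level2")}

relabeled = df.copy()
# Iterating a MultiIndex yields one tuple per column, so each tuple can be
# looked up as a whole; unmapped tuples are kept unchanged.
relabeled.columns = pd.MultiIndex.from_tuples(
    [tuple_map.get(col, col) for col in df.columns],
    names=df.columns.names,
)
print(relabeled.columns.tolist())
```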