Pandas version checks
-
[x] I have checked that this issue has not already been reported.
-
[x] I have confirmed this bug exists on the latest version of pandas.
-
[x] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import pandas as pd
a = pd.DataFrame({0:[1,2,3]})
b = pd.DataFrame(index=[1,2,3],data={0:[4,5,6]})
a.join(b, on=0, rsuffix='_test')
Issue Description
When dataframe a and b have the same column name, a key_ column is created unexpectedly after the join operation.
key_0 0 0_test 0 1 1 4 1 2 2 5 2 3 3 6
Expected Behavior
Expecting result without the key_0 column.
Installed Versions
Comment From: ShayanG9
This seems to be the expected behavior if you look at the documentation examples. Moreover, it seems it is why we pass lsuffix and rsuffix. Please let me know if I am missing somthing.
Comment From: albb318
In this case the "key_0" columns seems redundant, as the logic is straightforward. Dataframe a's column 0 is the target column for the join operation, although Dataframe b also has column 0, but it can just rename it to 0_test (as specifed by rsuffix), and join by its index. Expecting result as the following
0 0_test 0 1 4 1 2 5 2 3 6
Interestingly, the behavior is different with letter column name, see the following example
a = pd.DataFrame({'a':[1,2,3]})
b = pd.DataFrame(index=[1,2,3],data={'a':[4,5,6]})
a.join(b, on='a', rsuffix='_test')
it will get result as
a a_test 0 1 4 1 2 5 2 3 6
Comment From: ShayanG9
take
Comment From: ShayanG9
Thanks for clarifying! I think I have found the issue. The join function is basically a wrapper for the merge function, so down stream in pandas/core/reshape/merge.py
There is a line of code that does
result.insert(i, name or f"key_{i}", key_col)
This basically adds the column to the DataFrame, with the variable on or "key_{column_name}". The intent for this was to catch when name was None. But, because 0
is truthy, while every other number is falsey the result is that you get the latter for columns with 0. This seems like just a oversight from the person who implemented this.
The fix would be
result.insert(i, name if name is not None else f"key_{i}", key_col)
However, this is not the actual issue the issue comes a bit earlier and it is specifically due to how join passes the suffix "" to merge instead of None. This causes behavior like if any of the column names is an integer it does not join like their string counter parts instead making two columns: one an integer and one a string.
There are two solutions to this. One would be to instead of defining lsuffix
and rsuffix
to be ""
it should be None
. The other is changing the renamer
function to not touch the column if the suffix string is empty.
Both would have some minor backward compatibility issues, the suffix changes in join would be more localized to just the join function while the latter would affect the merge feature and the join feature. However, Its seems pretty uncommon to add a empty suffix to a merge operation and even less common to use integer columns. And, changing join to None would be more inline with users expectations when using .join function.
Thus, changing the join signature would impact backwards compatibility the least ensuring developers using merge with empty strings maintain their functionality, while aligning the join feature to be more like what happens when a string counterpart is used.
Comment From: rhshadrach
One would be to instead of defining
lsuffix
andrsuffix
to be""
it should beNone
.
+1. In addition we should document in these args that when either is specified, any non-string columns will be converted to strings before applying the suffix.
Comment From: ShayanG9
@rhshadrach I'm not sure how I should go about this. Should I raise a FutureWarning
and leave the default value as is. Or, should I change the default value to None
and put up a DepreciationWarning
?
Comment From: rhshadrach
@ShayanG9 - it should start as a DeprecationWarning
. You will likely need to change the default value lib.no_default
to be able to tell if the user is passing None
or not. You can search the code for other uses of lib.no_default
as examples.
Comment From: KevsterAmp
@ShayanG9 can I work on this issue?
Comment From: ShayanG9
@KevsterAmp Go right ahead it should be an easy fix. I've been really busy lately.
Comment From: KevsterAmp
Thanks
Comment From: KevsterAmp
take