Pandas version checks

  • [x] I have checked that this issue has not already been reported.

  • [x] I have confirmed this bug exists on the latest version of pandas.

  • [x] I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd

a = pd.DataFrame({0:[1,2,3]})
b = pd.DataFrame(index=[1,2,3],data={0:[4,5,6]})
a.join(b, on=0, rsuffix='_test')

Issue Description

When dataframe a and b have the same column name, a key_ column is created unexpectedly after the join operation.

key_0 0 0_test 0 1 1 4 1 2 2 5 2 3 3 6

Expected Behavior

Expecting result without the key_0 column.

Installed Versions

pandas : 2.2.3 numpy : 1.26.4

Comment From: ShayanG9

This seems to be the expected behavior if you look at the documentation examples. Moreover, it seems it is why we pass lsuffix and rsuffix. Please let me know if I am missing somthing.

Comment From: albb318

In this case the "key_0" columns seems redundant, as the logic is straightforward. Dataframe a's column 0 is the target column for the join operation, although Dataframe b also has column 0, but it can just rename it to 0_test (as specifed by rsuffix), and join by its index. Expecting result as the following

0 0_test 0 1 4 1 2 5 2 3 6

Interestingly, the behavior is different with letter column name, see the following example

a = pd.DataFrame({'a':[1,2,3]})
b = pd.DataFrame(index=[1,2,3],data={'a':[4,5,6]})
a.join(b, on='a', rsuffix='_test')

it will get result as

a a_test 0 1 4 1 2 5 2 3 6

Comment From: ShayanG9

take

Comment From: ShayanG9

Thanks for clarifying! I think I have found the issue. The join function is basically a wrapper for the merge function, so down stream in pandas/core/reshape/merge.py There is a line of code that does

result.insert(i, name or f"key_{i}", key_col)

This basically adds the column to the DataFrame, with the variable on or "key_{column_name}". The intent for this was to catch when name was None. But, because 0 is truthy, while every other number is falsey the result is that you get the latter for columns with 0. This seems like just a oversight from the person who implemented this.

The fix would be

result.insert(i, name if name is not None else f"key_{i}", key_col)

However, this is not the actual issue the issue comes a bit earlier and it is specifically due to how join passes the suffix "" to merge instead of None. This causes behavior like if any of the column names is an integer it does not join like their string counter parts instead making two columns: one an integer and one a string.

There are two solutions to this. One would be to instead of defining lsuffix and rsuffix to be "" it should be None. The other is changing the renamer function to not touch the column if the suffix string is empty.

Both would have some minor backward compatibility issues, the suffix changes in join would be more localized to just the join function while the latter would affect the merge feature and the join feature. However, Its seems pretty uncommon to add a empty suffix to a merge operation and even less common to use integer columns. And, changing join to None would be more inline with users expectations when using .join function.

Thus, changing the join signature would impact backwards compatibility the least ensuring developers using merge with empty strings maintain their functionality, while aligning the join feature to be more like what happens when a string counterpart is used.

Comment From: rhshadrach

One would be to instead of defining lsuffix and rsuffix to be "" it should be None.

+1. In addition we should document in these args that when either is specified, any non-string columns will be converted to strings before applying the suffix.

Comment From: ShayanG9

@rhshadrach I'm not sure how I should go about this. Should I raise a FutureWarning and leave the default value as is. Or, should I change the default value to None and put up a DepreciationWarning?

Comment From: rhshadrach

@ShayanG9 - it should start as a DeprecationWarning. You will likely need to change the default value lib.no_default to be able to tell if the user is passing None or not. You can search the code for other uses of lib.no_default as examples.

Comment From: KevsterAmp

@ShayanG9 can I work on this issue?

Comment From: ShayanG9

@KevsterAmp Go right ahead it should be an easy fix. I've been really busy lately.

Comment From: KevsterAmp

Thanks

Comment From: KevsterAmp

take