Feature Type

  • [ ] Adding new functionality to pandas

  • [X] Changing existing functionality in pandas

  • [ ] Removing existing functionality in pandas

Problem Description

Having duplicated columns can lead to confusing downstream behavior that can be difficult to detect. For example, we recently had this occur for a couple of Altair users: https://github.com/altair-viz/altair/issues/2718.
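As a minimal illustration of the kind of surprise duplicated columns cause (the column names and data here are made up for the example), selecting a duplicated label returns a DataFrame rather than a Series:

```python
import pandas as pd

# Illustrative data: the label "a" appears twice in the columns.
df = pd.DataFrame([[1, 2, 3]], columns=["a", "a", "b"])

unique_col = df["b"]  # unique label: a Series
dup_col = df["a"]     # duplicated label: a DataFrame holding both "a" columns

print(type(unique_col).__name__)
print(type(dup_col).__name__, dup_col.shape)
print(df.columns.is_unique)
```

Code that expects `df["a"]` to be a Series can then fail far from the place where the duplicate was introduced.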

Feature Description

The PR that introduced the flag to disallow duplicates suggested that this might be suitable as a default option in the future: https://github.com/pandas-dev/pandas/pull/28394#issuecomment-530868784. I couldn't find a follow-up discussion, so I'm opening this issue to suggest making it the default behavior. This would protect users from doing things they might not intend, like selecting the same column twice.

Alternative Solutions

Keep the current default

Additional Context

No response

Comment From: topper-123

I'm not sure what my opinion is on this, but open to discussions.

Currently, we disallow duplicates by setting an attribute in flags (see here), which IMO is the wrong API; we should rather have a parameter in the Index constructor, like Index(..., allow_duplicates=False). Then it would be easier to discuss whether the parameter should default to False or True.
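For reference, a short sketch of how the existing flag-based API behaves (the example frames are illustrative): the flag defaults to allowing duplicates, and opting out is done per object via `set_flags`.

```python
import pandas as pd
from pandas.errors import DuplicateLabelError

df = pd.DataFrame([[1, 2]], columns=["a", "b"])
default_flag = df.flags.allows_duplicate_labels  # current default allows duplicates

# set_flags returns a new object with the flag changed.
strict = df.set_flags(allows_duplicate_labels=False)

# Opting out on an object that already has duplicate labels raises.
dup = pd.DataFrame([[1, 2]], columns=["a", "a"])
try:
    dup.set_flags(allows_duplicate_labels=False)
    raised = False
except DuplicateLabelError:
    raised = True

print(default_flag, strict.flags.allows_duplicate_labels, raised)
```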

Comment From: topper-123

To add, the flag-based approach doesn't allow us to decide if we want label duplicates in the DataFrame constructor, which doesn't seem right. E.g. we'd want

>>> df = pd.DataFrame(data,
...     index=Index(..., allow_duplicates=True|False),
...     columns=Index(..., allow_duplicates=True|False),
... )

for precise control in the constructor. Also, a decision has to be made on whether non-duplicate labels also means non-duplicate label indexing, e.g. should we disallow df.loc[["a", "a"]] when we disallow duplicate labels?
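To make the per-axis idea concrete, here is a rough sketch using a hypothetical helper, `make_index`, as a stand-in for the proposed `Index(..., allow_duplicates=...)` parameter. That parameter does not exist in pandas today, and unlike a real implementation this helper only validates at construction time; nothing is enforced on later operations.

```python
import pandas as pd


def make_index(labels, allow_duplicates=True):
    """Hypothetical stand-in for the proposed Index(..., allow_duplicates=...)."""
    idx = pd.Index(labels)
    if not allow_duplicates and not idx.is_unique:
        dupes = idx[idx.duplicated()].unique().tolist()
        raise ValueError(f"duplicate labels not allowed: {dupes}")
    return idx


# Duplicate row labels are accepted, duplicate column labels are rejected.
df = pd.DataFrame(
    [[1, 2], [3, 4]],
    index=make_index(["x", "x"], allow_duplicates=True),
    columns=make_index(["a", "b"], allow_duplicates=False),
)

try:
    make_index(["a", "a"], allow_duplicates=False)
    rejected = False
except ValueError:
    rejected = True

print(df.index.is_unique, df.columns.is_unique, rejected)
```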

Comment From: tomhoq

Is this still to be implemented?