Feature Type

  • [ ] Adding new functionality to pandas

  • [x] Changing existing functionality in pandas

  • [ ] Removing existing functionality in pandas

Problem Description

The documentation for pandas.read_csv(usecols=[...]) says that the iterable of columns is treated like an unordered set (updated in https://github.com/pandas-dev/pandas/issues/18673 and #53763), so the returned DataFrame won't necessarily have the same column order. This differs from other pandas data reading methods (e.g., pandas.read_parquet(columns=[...])). I think the order should be preserved. Where usecols is converted to a set, I think it should instead be converted to an ordered set type or to the keys of collections.OrderedDict (or just a plain dict in Python 3.7+, where insertion order is guaranteed).

Feature Description

import pandas as pd

# Example CSV file (replace with your actual file)
csv_data = """
col1,col2,col3,col4
A,1,X,10
B,2,Y,20
C,3,Z,30
"""

with open("example.csv", "w") as f:
    f.write(csv_data)

# Desired column order
desired_order = ['col3', 'col1', 'col4']

# Read CSV with usecols (selects columns but doesn't order)
df = pd.read_csv("example.csv", usecols=desired_order)

print(df)  # columns come back in file order (col1, col3, col4), not the requested order

# Reindex DataFrame to enforce desired order (a popular workaround that I think shouldn't be required)
# One solution is to include this line in `read_csv`, when using `usecols` kwarg
df = df[desired_order]

print(df)  # correct column order

Alternative Solutions

Instead of converting usecols to a set, convert it to dict keys (e.g. dict.fromkeys(usecols)), which preserve insertion order in Python 3.7+.
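As a minimal sketch of the idea above (plain Python, independent of how pandas validates usecols internally), dict.fromkeys removes duplicates the way set does but keeps first-seen order. The list contents are made up for illustration.

```python
# Illustrative input: requested columns, with one duplicate
requested = ["col3", "col1", "col4", "col1"]

# dict.fromkeys drops duplicates and keeps first-seen order
# (insertion order is a language guarantee since Python 3.7)
ordered = list(dict.fromkeys(requested))
print(ordered)  # ['col3', 'col1', 'col4']

# A set holds the same members, but its iteration order is not guaranteed
as_set = set(requested)
same_members = as_set == set(ordered)
print(same_members)  # True
```

Both containers carry the same membership information, so the change would only affect ordering, not which columns are selected.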

Additional Context

No response

Comment From: amarvin

This is also an issue for pandas.read_excel(usecols=[...]).

Comment From: amarvin

Others are confused by the current feature too and have to do a workaround: https://stackoverflow.com/a/40024462/6068036

Comment From: eicchen

take

Comment From: AnkitPrasad364

Replace

if usecols: usecols = set(usecols)

With

if usecols: usecols = dict.fromkeys(usecols) # preserves order

Comment From: eicchen

@mroeschke could I get your opinion on this before I dig deeper into it? You were the last person to work with the function (_validate_usecols_arg) and I'm mainly worried about backwards compatibility rather than feasibility. But considering that pandas is having a major version update, it could be justifiable.

Comment From: Dr-Irv

IMHO, we shouldn't make this change, but I could be convinced otherwise. There are 2 reasons:

  1. We do document how to preserve the order (this was introduced in https://github.com/pandas-dev/pandas/pull/19746 )
  2. The "order" isn't clear if the argument is a callable.
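Both points can be illustrated with a short sketch. It uses an in-memory CSV (the column names are taken from the example in the issue body) and shows the index-after-read pattern for a list, then a callable, which only answers yes or no per column name and so carries no order of its own.

```python
import io

import pandas as pd

csv_data = "col1,col2,col3,col4\nA,1,X,10\nB,2,Y,20\nC,3,Z,30\n"
wanted = ["col3", "col1", "col4"]

# Point 1: select with usecols, then index with the same list to fix the order
df = pd.read_csv(io.StringIO(csv_data), usecols=wanted)[wanted]
print(list(df.columns))  # ['col3', 'col1', 'col4']

# Point 2: a callable is a per-name predicate, so there is no requested
# order to preserve; the selected columns come back in file order
df_callable = pd.read_csv(
    io.StringIO(csv_data),
    usecols=lambda name: name in {"col3", "col1"},
)
print(list(df_callable.columns))  # ['col1', 'col3']
```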

Comment From: eicchen

As promised during the sync meeting today, I went and compiled how various read functions handle columns being specified. Functions that take usecols (read_csv, read_clipboard, read_excel, and read_hdf (undocumented)) don't take input order into account, whereas functions that ask for columns instead do (hdf, feather, parquet, orc, stata, sql).

Finally, there are also some that straight up don't take column specifiers.

I'd expect functions that use usecols to be using the same function in the backend, but I'd have to verify it if we're planning to standardize the parameter.

Attached below is a CSV of the functions tested (those with both a read and a write function in pandas): does_it_use_order.csv

Comment From: eicchen

@Dr-Irv Do you think that it would still warrant further discussion? Or should I just go ahead and implement it?

I think adding an optional param in read_csv would solve this issue as all the other import functions which use "usecols" instead of "columns" seem to link back to read_csv in some way.
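A rough sketch of what such an opt-in could look like from the user's side is below. Both the wrapper read_csv_ordered and the flag name preserve_order are hypothetical and are not part of pandas; the sketch only reorders when usecols is a list of column labels and leaves callables and integer positions untouched.

```python
import io

import pandas as pd


def read_csv_ordered(source, usecols=None, preserve_order=False, **kwargs):
    """Hypothetical wrapper around pd.read_csv; not a pandas API."""
    df = pd.read_csv(source, usecols=usecols, **kwargs)
    # Default behaviour is unchanged, which keeps backwards compatibility
    if not preserve_order or usecols is None or callable(usecols):
        return df
    # Drop duplicate labels while keeping the caller's order
    labels = list(dict.fromkeys(usecols))
    if all(isinstance(label, str) for label in labels):
        return df[labels]
    # Integer positions are left as read in this sketch
    return df


csv_data = "col1,col2,col3,col4\nA,1,X,10\nB,2,Y,20\nC,3,Z,30\n"
result = read_csv_ordered(
    io.StringIO(csv_data),
    usecols=["col3", "col1", "col4"],
    preserve_order=True,
)
print(list(result.columns))  # ['col3', 'col1', 'col4']
```

Making the flag default to False is one way to address the backwards-compatibility concern raised earlier in the thread, at the cost of one more keyword argument.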

Comment From: Dr-Irv

Can you attend the dev meeting tomorrow (June 10) so we can discuss it there?

Comment From: eicchen

I do think that would be the best option, but unfortunately I have a flight during that time, so I can either hijack the end of the new contributor meeting next Wednesday or discuss it during the one after.