Pandas ENH:

Feature Type

[X] Adding new functionality to pandas
[ ] Changing existing functionality in pandas
[ ] Removing existing functionality in pandas

Problem Description

I wish I could request that Pandas use nullable extension dtypes when reading in data, rather than casting columns to float or object when NULLs are present. This would allow library authors who wrap Pandas and wish to use nullable extension types to do so without either requiring their users to input dtypes manually or requiring the library to convert the dtypes ex post.

Feature Description

The signature would look something like this:

pandas.read_csv(
  filepath_or_buffer,
  *,
  ...,
  use_nullable_dtypes=False,
)

Semantically, if use_nullable_dtypes=True, whenever a DataFrame column would be of type np.int{n} or np.bool_ except for the presence of NULLs, the column would instead be inferred as type pd.Int{n}Dtype() or pd.BooleanDtype(). Alternatively, the flag could simply convert all instances of np.int{n} or np.bool_ to the nullable types, regardless of whether they contain NULLs.
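
As a rough illustration of the second ("convert everything") variant, here is a minimal sketch applied to an already-loaded DataFrame. The helper name to_nullable is hypothetical and not part of pandas; it only shows the intended mapping from NumPy integer and boolean dtypes to their nullable counterparts.

```python
import numpy as np
import pandas as pd


def to_nullable(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: map NumPy int/bool columns to nullable dtypes."""
    out = df.copy()
    for name, dtype in df.dtypes.items():
        if not isinstance(dtype, np.dtype):
            continue  # already an extension dtype; leave it alone
        if dtype.kind == "b":
            out[name] = df[name].astype("boolean")
        elif dtype.kind == "i":
            # Keep the width: int32 -> Int32, int64 -> Int64, and so on.
            out[name] = df[name].astype(f"Int{dtype.itemsize * 8}")
    return out


df = pd.DataFrame(
    {
        "i": np.array([1, 2], dtype="int32"),
        "b": [True, False],
        "f": [1.5, 2.5],
    }
)
result = to_nullable(df)
print(result.dtypes)
```

Float columns are left untouched here, matching the request, which only mentions the integer and boolean types.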

Alternative Solutions

As far as I'm aware the only current solutions are

1. Add the dtypes that you want to convert to the dtype argument. This is problematic in the case of libraries wrapping Pandas, and requires the user to know not only the type of all of the columns but also their size if they are to be stored efficiently.
2. Parse all the columns as object and then try to convert them to the appropriate data type one by one. This has obvious efficiency problems.
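
For concreteness, the two workarounds might look roughly like the sketch below. The CSV content and column names are made up for illustration, and the per-column conversions in the second approach are just one possible way to do it.

```python
import io

import pandas as pd

# Illustrative data: each column has one missing value.
csv_text = "a,b\n1,True\n,False\n3,\n"

# Workaround 1: spell out the nullable dtype of every column up front.
explicit = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"a": "Int64", "b": "boolean"},
)

# Workaround 2: read everything as object, then convert column by column.
raw = pd.read_csv(io.StringIO(csv_text), dtype=object)
converted = pd.DataFrame(
    {
        "a": pd.to_numeric(raw["a"]).astype("Int64"),
        "b": raw["b"].map({"True": True, "False": False}).astype("boolean"),
    }
)

print(explicit.dtypes)
print(converted.dtypes)
```

Both approaches require the caller to know the target type of each column in advance, which is the burden the proposed flag is meant to remove.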

Additional Context

No response

Comment From: lithomas1

Can you give dtype_backend="numpy_nullable" a try?

Comment From: jacgoldsm

@lithomas1 that worked but I think the documentation is wrong, because it says that that's the default in 2.1.1:

dtype_backend : {‘numpy_nullable’, ‘pyarrow’}, default ‘numpy_nullable’

But when I run it I have to explicitly pass in the parameter

>>> import io
>>> import pandas as pd
>>> pd.read_csv(io.StringIO("1,1,NA\n"), header=None)
   0  1   2
0  1  1 NaN
>>> pd.read_csv(io.StringIO("1,1,NA\n"), header=None, dtype_backend="numpy_nullable")
   0  1     2
0  1  1  <NA>

Version info:

pd.__version__ : 2.1.1
sys.version : 3.9.2 (default, Feb 28 2021, 17:03:44) \n[GCC 10.2.1 20210110]
sys.platform : linux

Comment From: lithomas1

You're right, I think we need to add something saying that the default is to use numpy dtypes that will cast things like ints to floats.

Are you interested in putting up a PR?
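
The difference discussed in this thread can also be checked through the resulting dtypes rather than the printed frame. This is a small sketch that assumes a pandas version accepting the dtype_backend argument (2.0 or later); the CSV content is invented for illustration.

```python
import io

import pandas as pd

# Column "a" holds integers with one missing value.
csv_text = "a,b\n1,x\n,y\n3,z\n"

# Without dtype_backend, NumPy dtypes are used and the integers become floats.
default_df = pd.read_csv(io.StringIO(csv_text))

# With dtype_backend="numpy_nullable", the column keeps an integer dtype.
nullable_df = pd.read_csv(io.StringIO(csv_text), dtype_backend="numpy_nullable")

print(default_df["a"].dtype)
print(nullable_df["a"].dtype)
```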