Pandas version checks

  • [X] I have checked that this issue has not already been reported.

  • [X] I have confirmed this bug exists on the latest version of pandas.

  • [X] I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
from io import StringIO

x = """
    AB|000388907|abc|0150
    AB|101044572|abc|0150
    AB|000023607|abc|0205
    AB|100102040|abc|0205
"""

df_arrow = pd.read_csv(
    StringIO(x),
    delimiter="|",
    header=None,
    dtype=str,
    engine="pyarrow",
    keep_default_na=False,
)

df_python = pd.read_csv(
    StringIO(x),
    delimiter="|",
    header=None,
    dtype=str,
    engine="python",
    keep_default_na=False,
)

df_arrow
        0          1    2    3
0      AB     388907  abc  150
1      AB  101044572  abc  150
2      AB      23607  abc  205
3      AB  100102040  abc  205

df_python
        0          1    2     3
0      AB  000388907  abc  0150
1      AB  101044572  abc  0150
2      AB  000023607  abc  0205
3      AB  100102040  abc  0205

Issue Description

When I use engine="pyarrow" and set dtype to str, the leading zeros in my numeric-looking columns are removed, even though the resulting column dtype is 'O'. When I use the python engine, the leading zeros are preserved as expected.

Expected Behavior

I would expect that when all columns are treated as strings, the leading zeros are retained and the data is unmodified.

Installed Versions

INSTALLED VERSIONS
------------------
commit : bdc79c146c2e32f2cab629be240f01658cfb6cc2
python : 3.11.8.final.0
python-bits : 64
OS : Linux
OS-release : 6.5.0-17-generic
Version : #17-Ubuntu SMP PREEMPT_DYNAMIC Thu Jan 11 14:20:13 UTC 2024
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : None
LOCALE : en_US.UTF-8

pandas : 2.2.1
numpy : 1.26.4
pytz : 2024.1
dateutil : 2.8.2
setuptools : 69.1.1
pip : 24.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader : None
adbc-driver-postgresql : None
adbc-driver-sqlite : None
bs4 : None
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 15.0.0
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : 2.0.27
tables : None
tabulate : 0.9.0
xarray : None
xlrd : None
zstandard : None
tzdata : 2024.1
qtpy : None
pyqt5 : None

Comment From: benjaminbauer

can confirm this in 2.2.1

Comment From: mroeschke

Do you get the same result when you use the pyarrow.csv.read_csv method directly? https://arrow.apache.org/docs/python/generated/pyarrow.csv.read_csv.html

Comment From: benjaminbauer

seems to be a problem with the pyarrow engine in pandas (pandas 2.2.1, pyarrow 15.0.1)

pandas engine pandas dtype: 01
pyarrow engine pandas dtype: 1
pandas engine pyarrow dtype: 01
pyarrow engine pyarrow dtype: 1
pyarrow native: 01
import io

import pandas as pd
import pyarrow as pa
from pyarrow import csv

csv_file = io.BytesIO(
    """a
01""".encode()
)


print(f"pandas engine pandas dtype: {pd.read_csv(csv_file, dtype=str).iloc[0,0]}")

csv_file.seek(0)
print(
    f"pyarrow engine pandas dtype: {pd.read_csv(csv_file, dtype=str, engine='pyarrow').iloc[0,0]}"
)

csv_file.seek(0)
print(
    f"pandas engine pyarrow dtype: {pd.read_csv(csv_file, dtype='str[pyarrow]').iloc[0,0]}"
)

csv_file.seek(0)
print(
    f"pyarrow engine pyarrow dtype: {pd.read_csv(csv_file, dtype='str[pyarrow]', engine='pyarrow').iloc[0,0]}"
)

csv_file.seek(0)
convert_options = csv.ConvertOptions(column_types={"a": pa.string()})
print(
    f"pyarrow native: {csv.read_csv(csv_file, convert_options=convert_options).column(0).to_pylist()[0]}"
)

Comment From: kristinburg

take

Comment From: jorisvandenbossche

For context, this is happening because the dtype argument is currently handled only as a post-processing step after pyarrow has read the CSV file. We let pyarrow read the file while inferring the column types (numerical, in this case), and only afterwards cast the result to the specified dtype (str here). That explains why the leading zeros are lost.

https://github.com/pandas-dev/pandas/blob/e51039afe3cbdedbf5ffd5cefb5dea98c2050b88/pandas/io/parsers/arrow_parser_wrapper.py#L216

PyArrow does provide a column_types keyword to specify the dtype while reading: https://arrow.apache.org/docs/python/generated/pyarrow.csv.ConvertOptions.html#pyarrow.csv.ConvertOptions

So what we need to do is translate the dtype keyword in read_csv to the column_types argument for pyarrow, somewhere here:

https://github.com/pandas-dev/pandas/blob/e51039afe3cbdedbf5ffd5cefb5dea98c2050b88/pandas/io/parsers/arrow_parser_wrapper.py#L120-L134

As a first step, I would just try to enable specifying it for a specific column, like dtype={"a": str}. For something like dtype=str, which applies to all columns, you would need to know the column names up front with the current pyarrow API (pyarrow only accepts specifying types per column, not for all columns at once).

Comment From: dxdc

Still encountering this issue with pyarrow version: 20.0.0 and pandas version: 2.3.0

Comment From: dxdc

From my review of arrow_parser_wrapper.py (see @jorisvandenbossche's comment above), I think there are a few steps:

  1. In _get_pyarrow_options, self.dtype needs to be mapped to self.kwds["column_types"]. To do this effectively, we need:

     • a pandas-to-pyarrow type conversion (does this function exist?) so that we can supply native pyarrow types. Alternatively, we could pass the dtypes to pyarrow "as is" and let pyarrow raise exceptions as needed.

     • an ordered list of all columns (does this function exist?) for the case where no specific column names are provided, or where (I believe) columns are specified by index.

  2. self.convert_options needs to be amended to include column_types as an allowable keyword.

  3. In _finalize_pandas_output, the self.dtype handling should be amended to cover the index_col logic. We may need the ordered column list for this.

  4. In _finalize_pandas_output, the remaining self.dtype post-processing should be removed. It is causing bugs in the logic (e.g., stripping leading zeros).

I tested some of these steps locally, and it works.