Pandas version checks

  • [x] I have checked that this issue has not already been reported.

  • [x] I have confirmed this bug exists on the latest version of pandas.

  • [ ] I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
import gcsfs
fs = gcsfs.GCSFileSystem(project="***")
path = 'gs://***/2025-09-01T00_15_30_events.parquet'
df = pd.read_parquet(path, filesystem=fs)

Issue Description

When parquet files contain logical types like map, read_parquet behaves differently for a local parquet file than for the same file on GCS. The same parquet file can be loaded into a pandas DataFrame from local disk, but reading it from GCS raises: ArrowTypeError: Unable to merge: Field type has incompatible types: binary vs dictionary<values=string, indices=int32, ordered=0>
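
For reference, a minimal sketch of how a parquet file with a map logical type can be produced with pyarrow; the column names and values here are hypothetical, not taken from the actual file:

import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical data; the real file's schema is much larger
table = pa.table({
    "event_id": pa.array([1, 2]),
    # map<string, string> logical type, as described above
    "attrs": pa.array(
        [[("k1", "v1")], [("k2", "v2")]],
        type=pa.map_(pa.string(), pa.string()),
    ),
})
pq.write_table(table, "map_events.parquet")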

load from local file:

import pandas as pd
path = '2025-09-01T00_15_30_events.parquet'
df = pd.read_parquet(path)
print("Row count:", len(df))
---------------------------------------------------------------------------
Row count: 327827

load from gcs:

import pandas as pd
import gcsfs
fs = gcsfs.GCSFileSystem(project="***")
path = 'gs://***/2025-09-01T00_15_30_events.parquet'
df = pd.read_parquet(path, filesystem=fs)
---------------------------------------------------------------------------
ArrowTypeError                            Traceback (most recent call last)
Cell In[6], line 5
      3 fs = gcsfs.GCSFileSystem(project="***")
      4 path = 'gs://****/2025-09-01T00_15_30_events.parquet'
----> 5 df = pd.read_parquet(path, filesystem=fs)

File ~/Apps/python_virtual_envs/cursor/lib/python3.13/site-packages/pandas/io/parquet.py:669, in read_parquet(path, engine, columns, storage_options, use_nullable_dtypes, dtype_backend, filesystem, filters, **kwargs)
    666     use_nullable_dtypes = False
    667 check_dtype_backend(dtype_backend)
--> 669 return impl.read(
    670     path,
    671     columns=columns,
    672     filters=filters,
    673     storage_options=storage_options,
    674     use_nullable_dtypes=use_nullable_dtypes,
    675     dtype_backend=dtype_backend,
    676     filesystem=filesystem,
    677     **kwargs,
    678 )

File ~/Apps/python_virtual_envs/cursor/lib/python3.13/site-packages/pandas/io/parquet.py:265, in PyArrowImpl.read(self, path, columns, filters, use_nullable_dtypes, dtype_backend, storage_options, filesystem, **kwargs)
    258 path_or_handle, handles, filesystem = _get_path_or_handle(
    259     path,
    260     filesystem,
...
File ~/Apps/python_virtual_envs/cursor/lib/python3.13/site-packages/pyarrow/error.pxi:155, in pyarrow.lib.pyarrow_internal_check_status()

File ~/Apps/python_virtual_envs/cursor/lib/python3.13/site-packages/pyarrow/error.pxi:92, in pyarrow.lib.check_status()

ArrowTypeError: Unable to merge: Field type has incompatible types: binary vs dictionary<values=string, indices=int32, ordered=0>

Expected Behavior

read_parquet should load the same parquet file from GCS exactly as it does from a local file.

Installed Versions

INSTALLED VERSIONS
------------------
commit                : c888af6d0bb674932007623c0867e1fbd4bdc2c6
python                : 3.13.3
python-bits           : 64
OS                    : Darwin
OS-release            : 24.6.0
Version               : Darwin Kernel Version 24.6.0: Mon Jul 14 11:30:40 PDT 2025; root:xnu-11417.140.69~1/RELEASE_ARM64_T6041
machine               : arm64
processor             : arm
byteorder             : little
LC_ALL                : None
LANG                  : None
LOCALE                : en_US.UTF-8

pandas                : 2.3.1
numpy                 : 2.3.2
pytz                  : 2025.2
dateutil              : 2.9.0.post0
pip                   : 25.2
Cython                : None
sphinx                : None
IPython               : 9.5.0
adbc-driver-postgresql: None
...
zstandard             : 0.23.0
tzdata                : 2025.2
qtpy                  : None
pyqt5                 : None

Comment From: Alvaro-Kothe

Does it work if you use

df = pd.read_parquet(path, storage_options=fs.storage_options)

Comment From: legiondean

Does it work if you use

df = pd.read_parquet(path, storage_options=fs.storage_options)

Still the same error:

---------------------------------------------------------------------------
ArrowTypeError                            Traceback (most recent call last)
Cell In[2], line 5
      3 fs = gcsfs.GCSFileSystem(project="***")
      4 path = 'gs:/***/events.parquet'
----> 5 df = pd.read_parquet(path, storage_options=fs.storage_options)

File ~/Apps/python_virtual_envs/cursor/lib/python3.13/site-packages/pandas/io/parquet.py:669, in read_parquet(path, engine, columns, storage_options, use_nullable_dtypes, dtype_backend, filesystem, filters, **kwargs)
    666     use_nullable_dtypes = False
    667 check_dtype_backend(dtype_backend)
--> 669 return impl.read(
    670     path,
    671     columns=columns,
    672     filters=filters,
    673     storage_options=storage_options,
    674     use_nullable_dtypes=use_nullable_dtypes,
    675     dtype_backend=dtype_backend,
    676     filesystem=filesystem,
    677     **kwargs,
    678 )

File ~/Apps/python_virtual_envs/cursor/lib/python3.13/site-packages/pandas/io/parquet.py:265, in PyArrowImpl.read(self, path, columns, filters, use_nullable_dtypes, dtype_backend, storage_options, filesystem, **kwargs)
    258 path_or_handle, handles, filesystem = _get_path_or_handle(
    259     path,
    260     filesystem,
...
File ~/Apps/python_virtual_envs/cursor/lib/python3.13/site-packages/pyarrow/error.pxi:155, in pyarrow.lib.pyarrow_internal_check_status()

File ~/Apps/python_virtual_envs/cursor/lib/python3.13/site-packages/pyarrow/error.pxi:92, in pyarrow.lib.check_status()

ArrowTypeError: Unable to merge: Field type has incompatible types: binary vs dictionary<values=string, indices=int32, ordered=0>

Comment From: Alvaro-Kothe

I can't reproduce this on main, nor on 2.3.1. I uploaded a parquet file to GCS and tested this script:

import pandas as pd
import gcsfs

bucket = "my-bucket"
file_name = "test.parquet"
path = f"gcs://{bucket}/{file_name}"

fs = gcsfs.GCSFileSystem()
print(fs.ls(bucket))
# ['my-bucket/test.parquet']

df = pd.read_parquet(path, engine="pyarrow", filesystem=fs)
print(df)
#    a   b
# 0  1   2
# 1  2  22

You are probably missing some dependency. The most important one is fsspec. Here are my versions:

INSTALLED VERSIONS
------------------
commit                : f538741432edf55c6b9fb5d0d496d2dd1d7c2457
python                : 3.11.13
python-bits           : 64
OS                    : Linux
OS-release            : 6.16.8-200.fc42.x86_64
Version               : #1 SMP PREEMPT_DYNAMIC Fri Sep 19 17:47:18 UTC 2025
machine               : x86_64
processor             :
byteorder             : little
LC_ALL                : None
LANG                  : pt_BR.UTF-8
LOCALE                : pt_BR.UTF-8

pandas                : 2.2.0
numpy                 : 2.3.3
pytz                  : 2025.2
dateutil              : 2.9.0.post0
pip                   : 25.2
Cython                : 3.1.4
sphinx                : 8.2.3
IPython               : 9.5.0
adbc-driver-postgresql: None
adbc-driver-sqlite    : None
bs4                   : 4.14.0
blosc                 : None
bottleneck            : 1.6.0
dataframe-api-compat  : None
fastparquet           : 2024.11.0
fsspec                : 2025.9.0
html5lib              : 1.1
hypothesis            : 6.140.2
gcsfs                 : 2025.9.0
jinja2                : 3.1.6
lxml.etree            : 6.0.2
matplotlib            : 3.10.6
numba                 : 0.62.0
numexpr               : 2.13.0
odfpy                 : None
openpyxl              : 3.1.5
pandas_gbq            : None
psycopg2              : 2.9.10
pymysql               : 1.4.6
pyarrow               : 21.0.0
pyreadstat            : 1.3.1
pytest                : 8.4.2
python-calamine       : None
pyxlsb                : 1.0.10
s3fs                  : 2025.9.0
scipy                 : 1.16.2
sqlalchemy            : 2.0.43
tables                : 3.10.2
tabulate              : 0.9.0
xarray                : 2025.9.0
xlrd                  : 2.0.2
xlsxwriter            : 3.2.9
zstandard             : 0.25.0
tzdata                : 2025.2
qtpy                  : None
pyqt5                 : None
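
As a quick sanity check for the missing-dependency hypothesis, a minimal sketch (standard library only) that prints whether the relevant optional packages are importable:

import importlib

# fsspec/gcsfs are the fsspec stack pandas uses for gs:// paths;
# pyarrow is the parquet engine
for mod in ("fsspec", "gcsfs", "pyarrow"):
    try:
        print(mod, importlib.import_module(mod).__version__)
    except ImportError:
        print(mod, "MISSING")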

Comment From: Alvaro-Kothe

If you still can't read using the file URI, you can also get a file handle from gcsfs and use pandas to read it:

with fs.open(f"{bucket}/{file_name}") as fh:
    df = pd.read_parquet(fh, engine="pyarrow")
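
If the data is split across many fragments, the same workaround can be looped and concatenated. A sketch, assuming a hypothetical directory layout (note that pd.concat may still surface dtype mismatches between fragments):

import pandas as pd

# Assumed layout: many parquet fragments under one prefix
paths = fs.glob(f"{bucket}/events/**/*.parquet")
frames = []
for p in paths:
    with fs.open(p) as fh:
        frames.append(pd.read_parquet(fh, engine="pyarrow"))
df = pd.concat(frames, ignore_index=True)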

Comment From: legiondean

If you still can't read using the file URI, you can also get a file handle from gcsfs and use pandas to read it:

with fs.open(f"{bucket}/{file_name}") as fh: df = pd.read_parquet(fh, engine="pyarrow")

Thanks @Alvaro-Kothe, the with fs.open(f"{bucket}/{file_name}") as fh: approach does work. From what I can see, under the hood pd.read_parquet() also uses fs.open to stream the file, yet pd.read_parquet() still throws the above error with the same GCS parquet. This confused me; maybe something is off in there.

On the other hand, other frameworks like Dask also use pd.read_parquet() to load parquet files. That means I can't use Dask to handle large files until this issue is resolved.
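
To narrow down whether the failing merge happens in pandas or in pyarrow's dataset layer, a diagnostic sketch using pyarrow.dataset directly (path masked as in the report; the gs:// scheme is dropped because the filesystem is passed explicitly). If the path resolves to multiple fragments, this shows which ones disagree on a column type:

import pyarrow.dataset as ds

# Same discovery path pandas' pyarrow engine takes; inspect each
# fragment's physical schema to see where they diverge
dataset = ds.dataset("***/2025-09-01T00_15_30_events.parquet",
                     filesystem=fs, format="parquet")
for frag in dataset.get_fragments():
    print(frag.path, frag.physical_schema)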

Comment From: legiondean

I can't reproduce this on main, nor on 2.3.1. I uploaded a parquet file to GCS and tested this script:

import pandas as pd
import gcsfs

bucket = "my-bucket"
file_name = "test.parquet"
path = f"gcs://{bucket}/{file_name}"

fs = gcsfs.GCSFileSystem()
print(fs.ls(bucket))
# ['my-bucket/test.parquet']

df = pd.read_parquet(path, engine="pyarrow", filesystem=fs)
print(df)
#    a   b
# 0  1   2
# 1  2  22

You are probably missing some dependency. The most important one is fsspec. Here is my versions:

I also tested against parquet files that only have simple columns, and was not able to reproduce the error either. The issue occurs only with partitioned parquet files that contain complex column types like:

identifier ROW(
    wrt VARCHAR,
    cuid VARCHAR,
    ipv4 INTEGER,
    user_agent VARCHAR,
    account_info ROW(
      hashed_username VARCHAR,
      previous_order_count INTEGER,
      signup_method VARCHAR
    ),
    operating_system VARCHAR,

We generated those parquet files via Spark or Trino (which probably doesn't matter).
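
For what it's worth, the exact error text comes from Arrow schema unification; a minimal sketch that triggers the same failure (the field name user_agent is taken from the ROW type above just for illustration):

import pyarrow as pa

# Two fragments' views of the same field: plain binary vs
# dictionary-encoded string, as in the error message
s1 = pa.schema([pa.field("user_agent", pa.binary())])
s2 = pa.schema([pa.field("user_agent", pa.dictionary(pa.int32(), pa.string()))])

# Raises ArrowTypeError: Unable to merge: Field user_agent has
# incompatible types: binary vs dictionary<values=string, indices=int32, ordered=0>
pa.unify_schemas([s1, s2])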

Comment From: legiondean

fsspec is available in my env. Also, as mentioned above, the same parquet file produces no error when it is downloaded locally and processed there.

import pandas as pd

pd.show_versions()

INSTALLED VERSIONS
------------------
commit                : c888af6d0bb674932007623c0867e1fbd4bdc2c6
python                : 3.13.3
python-bits           : 64
OS                    : Darwin
OS-release            : 24.6.0
Version               : Darwin Kernel Version 24.6.0: Mon Jul 14 11:30:40 PDT 2025; root:xnu-11417.140.69~1/RELEASE_ARM64_T6041
machine               : arm64
processor             : arm
byteorder             : little
LC_ALL                : None
LANG                  : None
LOCALE                : en_US.UTF-8

pandas                : 2.3.1
numpy                 : 2.3.2
pytz                  : 2025.2
dateutil              : 2.9.0.post0
pip                   : 25.2
Cython                : None
sphinx                : None
IPython               : 9.5.0
adbc-driver-postgresql: None
adbc-driver-sqlite    : None
bs4                   : 4.13.4
blosc                 : None
bottleneck            : None
dataframe-api-compat  : None
fastparquet           : None
fsspec                : 2025.9.0
html5lib              : None
hypothesis            : None
gcsfs                 : 2025.9.0
jinja2                : 3.1.6
lxml.etree            : None
matplotlib            : 3.10.6
numba                 : None
numexpr               : None
odfpy                 : None
openpyxl              : None
pandas_gbq            : None
psycopg2              : 2.9.10
pymysql               : None
pyarrow               : 19.0.1
pyreadstat            : None
pytest                : None
python-calamine       : None
pyxlsb                : None
s3fs                  : None
scipy                 : None
sqlalchemy            : 2.0.43
tables                : None
tabulate              : None
xarray                : None
xlrd                  : None
xlsxwriter            : None
zstandard             : 0.23.0
tzdata                : 2025.2
qtpy                  : None
pyqt5                 : None