Pandas version checks
- [x] I have checked that this issue has not already been reported.
- [x] I have confirmed this bug exists on the latest version of pandas.
- [ ] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import pandas as pd
import gcsfs
fs = gcsfs.GCSFileSystem(project="***")
path = 'gs://***/2025-09-01T00_15_30_events.parquet'
df = pd.read_parquet(path, filesystem=fs)
Issue Description
When Parquet files contain a logical type such as map, read_parquet behaves differently for a local Parquet file than for a GCS file.
The same Parquet file loads into a pandas DataFrame from the local filesystem, but raises an error when loaded from GCS:
ArrowTypeError: Unable to merge: Field type has incompatible types: binary vs dictionary<values=string, indices=int32, ordered=0>
Load from local file:
import pandas as pd
path = '2025-09-01T00_15_30_events.parquet'
df = pd.read_parquet(path)
print("Row count:", len(df))
---------------------------------------------------------------------------
Row count: 327827
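A quick way to inspect the schema of the local copy with pyarrow, to see which field carries the binary vs dictionary types mentioned in the error (a minimal sketch, not part of the original report):
import pyarrow.parquet as pq
# Print the Arrow schema of the local copy of the same file.
pf = pq.ParquetFile('2025-09-01T00_15_30_events.parquet')
print(pf.schema_arrow)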
Load from GCS:
import pandas as pd
import gcsfs
fs = gcsfs.GCSFileSystem(project="***")
path = 'gs://***/2025-09-01T00_15_30_events.parquet'
df = pd.read_parquet(path, filesystem=fs)
---------------------------------------------------------------------------
ArrowTypeError Traceback (most recent call last)
Cell In[6], line 5
3 fs = gcsfs.GCSFileSystem(project="***")
4 path = 'gs://****/2025-09-01T00_15_30_events.parquet'
----> 5 df = pd.read_parquet(path, filesystem=fs)
File ~/Apps/python_virtual_envs/cursor/lib/python3.13/site-packages/pandas/io/parquet.py:669, in read_parquet(path, engine, columns, storage_options, use_nullable_dtypes, dtype_backend, filesystem, filters, **kwargs)
666 use_nullable_dtypes = False
667 check_dtype_backend(dtype_backend)
--> 669 return impl.read(
670 path,
671 columns=columns,
672 filters=filters,
673 storage_options=storage_options,
674 use_nullable_dtypes=use_nullable_dtypes,
675 dtype_backend=dtype_backend,
676 filesystem=filesystem,
677 **kwargs,
678 )
File ~/Apps/python_virtual_envs/cursor/lib/python3.13/site-packages/pandas/io/parquet.py:265, in PyArrowImpl.read(self, path, columns, filters, use_nullable_dtypes, dtype_backend, storage_options, filesystem, **kwargs)
258 path_or_handle, handles, filesystem = _get_path_or_handle(
259 path,
260 filesystem,
...
File ~/Apps/python_virtual_envs/cursor/lib/python3.13/site-packages/pyarrow/error.pxi:155, in pyarrow.lib.pyarrow_internal_check_status()
File ~/Apps/python_virtual_envs/cursor/lib/python3.13/site-packages/pyarrow/error.pxi:92, in pyarrow.lib.check_status()
ArrowTypeError: Unable to merge: Field type has incompatible types: binary vs dictionary<values=string, indices=int32, ordered=0>
Expected Behavior
read_parquet should load the same Parquet file from GCS just as it does from the local filesystem.
Installed Versions
Comment From: Alvaro-Kothe
Does it work if you use
df = pd.read_parquet(path, storage_options=fs.storage_options)
Comment From: legiondean
Does it work if you use
df = pd.read_parquet(path, storage_options=fs.storage_options)
Still the same error:
---------------------------------------------------------------------------
ArrowTypeError Traceback (most recent call last)
Cell In[2], line 5
3 fs = gcsfs.GCSFileSystem(project="***")
4 path = 'gs:/***/events.parquet'
----> 5 df = pd.read_parquet(path, storage_options=fs.storage_options)
File ~/Apps/python_virtual_envs/cursor/lib/python3.13/site-packages/pandas/io/parquet.py:669, in read_parquet(path, engine, columns, storage_options, use_nullable_dtypes, dtype_backend, filesystem, filters, **kwargs)
666 use_nullable_dtypes = False
667 check_dtype_backend(dtype_backend)
--> 669 return impl.read(
670 path,
671 columns=columns,
672 filters=filters,
673 storage_options=storage_options,
674 use_nullable_dtypes=use_nullable_dtypes,
675 dtype_backend=dtype_backend,
676 filesystem=filesystem,
677 **kwargs,
678 )
File ~/Apps/python_virtual_envs/cursor/lib/python3.13/site-packages/pandas/io/parquet.py:265, in PyArrowImpl.read(self, path, columns, filters, use_nullable_dtypes, dtype_backend, storage_options, filesystem, **kwargs)
258 path_or_handle, handles, filesystem = _get_path_or_handle(
259 path,
260 filesystem,
...
File ~/Apps/python_virtual_envs/cursor/lib/python3.13/site-packages/pyarrow/error.pxi:155, in pyarrow.lib.pyarrow_internal_check_status()
File ~/Apps/python_virtual_envs/cursor/lib/python3.13/site-packages/pyarrow/error.pxi:92, in pyarrow.lib.check_status()
ArrowTypeError: Unable to merge: Field type has incompatible types: binary vs dictionary<values=string, indices=int32, ordered=0>
Comment From: Alvaro-Kothe
I can't reproduce on main, nor on 2.3.1. I uploaded a parquet file to gcs and tested this script:
import pandas as pd
import gcsfs
bucket = "my-bucket"
file_name = "test.parquet"
path = f"gcs://{bucket}/{file_name}"
fs = gcsfs.GCSFileSystem()
print(fs.ls(bucket))
# ['my-bucket/test.parquet']
df = pd.read_parquet(path, engine="pyarrow", filesystem=fs)
print(df)
# a b
# 0 1 2
# 1 2 22
You are probably missing some dependency. The most important one is fsspec. Here are my versions:
Comment From: Alvaro-Kothe
If you still can't read using the file URI, you can also get a file handle from gcsfs and use pandas to read it:
with fs.open(f"{bucket}/{file_name}") as fh:
df = pd.read_parquet(fh, engine="pyarrow")
Comment From: legiondean
If you still can't read using the file URI, you can also get a file handle from gcsfs and use pandas to read it:
with fs.open(f"{bucket}/{file_name}") as fh:
    df = pd.read_parquet(fh, engine="pyarrow")
Thanks @Alvaro-Kothe, the with fs.open(f"{bucket}/{file_name}") as fh: approach does work.
From what I can see, under the hood pd.read_parquet() also uses fs.open to stream the file, yet pd.read_parquet() throws the above error with the same GCS Parquet file. This confuses me; maybe something is off in there.
On the other hand, other frameworks like Dask also use pd.read_parquet() to load Parquet files, which means I can't use Dask to handle large files until this issue is resolved.
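A minimal sketch (not from the report) that may help isolate whether the error comes from pandas or from pyarrow itself, by reading the same object once through an open handle and once through a path plus filesystem; bucket and object names are placeholders:
import gcsfs
import pyarrow.parquet as pq

fs = gcsfs.GCSFileSystem(project="***")
path = "***/2025-09-01T00_15_30_events.parquet"  # bucket/key, placeholder

# Single-file read through an open handle, analogous to the working workaround.
with fs.open(path) as fh:
    table_from_handle = pq.read_table(fh)

# Read via path + filesystem, the code path pandas takes and that appears
# to trigger the schema-merge error.
table_from_path = pq.read_table(path, filesystem=fs)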
Comment From: legiondean
I can't reproduce on main, nor on 2.3.1. I uploaded a parquet file to gcs and tested this script:
import pandas as pd
import gcsfs
bucket = "my-bucket"
file_name = "test.parquet"
path = f"gcs://{bucket}/{file_name}"
fs = gcsfs.GCSFileSystem()
print(fs.ls(bucket))
# ['my-bucket/test.parquet']
df = pd.read_parquet(path, engine="pyarrow", filesystem=fs)
print(df)
#    a   b
# 0  1   2
# 1  2  22
You are probably missing some dependency. The most important one is fsspec. Here are my versions:
I also tested against Parquet files that only have simple columns and was not able to reproduce the error either. The issue occurs only with partitioned Parquet files that contain complex column types like
identifier ROW(
wrt VARCHAR,
cuid VARCHAR,
ipv4 INTEGER,
user_agent VARCHAR,
account_info ROW(
hashed_username VARCHAR,
previous_order_count INTEGER,
signup_method VARCHAR
),
operating_system VARCHAR,
We generated those Parquet files via Spark or Trino (which probably doesn't matter).
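As a reproduction aid, here is a minimal sketch (not the reporter's actual data) that writes a small Parquet file with a nested struct column roughly shaped like the schema above, plus a map column; column names and values are illustrative:
import pyarrow as pa
import pyarrow.parquet as pq

# Illustrative schema: a nested struct plus a map column.
schema = pa.schema([
    ("identifier", pa.struct([
        ("wrt", pa.string()),
        ("cuid", pa.string()),
        ("account_info", pa.struct([
            ("hashed_username", pa.string()),
            ("previous_order_count", pa.int32()),
        ])),
    ])),
    ("attributes", pa.map_(pa.string(), pa.string())),
])

table = pa.table({
    "identifier": [{
        "wrt": "a", "cuid": "b",
        "account_info": {"hashed_username": "h", "previous_order_count": 1},
    }],
    "attributes": [[("k", "v")]],
}, schema=schema)

pq.write_table(table, "nested_map_example.parquet")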
Comment From: legiondean
fsspec is available in my environment. Also, as mentioned above, the same Parquet file raises no error if it is downloaded and processed locally.
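For reference, the download-then-read-locally workaround mentioned above might look like this (paths are placeholders):
import pandas as pd
import gcsfs

fs = gcsfs.GCSFileSystem(project="***")
# Copy the object to a local file, then read it with pandas.
fs.get("gs://***/2025-09-01T00_15_30_events.parquet", "events_local.parquet")
df = pd.read_parquet("events_local.parquet")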
import pandas as pd
pd.show_versions()
INSTALLED VERSIONS
------------------
commit : c888af6d0bb674932007623c0867e1fbd4bdc2c6
python : 3.13.3
python-bits : 64
OS : Darwin
OS-release : 24.6.0
Version : Darwin Kernel Version 24.6.0: Mon Jul 14 11:30:40 PDT 2025; root:xnu-11417.140.69~1/RELEASE_ARM64_T6041
machine : arm64
processor : arm
byteorder : little
LC_ALL : None
LANG : None
LOCALE : en_US.UTF-8
pandas : 2.3.1
numpy : 2.3.2
pytz : 2025.2
dateutil : 2.9.0.post0
pip : 25.2
Cython : None
sphinx : None
IPython : 9.5.0
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.13.4
blosc : None
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : 2025.9.0
html5lib : None
hypothesis : None
gcsfs : 2025.9.0
jinja2 : 3.1.6
lxml.etree : None
matplotlib : 3.10.6
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
psycopg2 : 2.9.10
pymysql : None
pyarrow : 19.0.1
pyreadstat : None
pytest : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : 2.0.43
tables : None
tabulate : None
xarray : None
xlrd : None
xlsxwriter : None
zstandard : 0.23.0
tzdata : 2025.2
qtpy : None
pyqt5 : None