Pandas version checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of pandas.
- [X] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
from io import StringIO

import pandas as pd

data = """a,b
1,1
,1
1582218195625938945,1
"""
# engine="pyarrow" requires pyarrow to be installed
result = pd.read_csv(StringIO(data), dtype={"a": "Int64"}, engine="pyarrow")
Issue Description
This outputs:
In [69]: result
Out[69]:
                     a  b
0                    1  1
1                 <NA>  1
2  1582218195625938944  1
Expected Behavior
In [69]: result
Out[69]:
                     a  b
0                    1  1
1                 <NA>  1
2  1582218195625938945  1
Installed Versions
Comment From: MarcoGorelli
This was hiding due to https://github.com/pandas-dev/pandas/issues/55882
cc @phofl
Comment From: phofl
Yeah, this is round-tripping through float and is a known issue.
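For reference, the precision loss can be reproduced without pandas or pyarrow. This is only a minimal sketch of what a float64 round trip does to the value from the example, not the actual code path in the reader; the variable names are made up for illustration.

```python
import math

original = 1582218195625938945

# Converting to a 64-bit float and back mimics the round trip.
round_tripped = int(float(original))

# Spacing between adjacent float64 values at this magnitude.
spacing = math.ulp(float(original))

print(round_tripped)  # 1582218195625938944
print(spacing)        # 256.0
```

At this magnitude float64 can only represent multiples of 256, so the trailing `945` is rounded to the nearest representable value, which ends in `944`.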
Comment From: jbrockmendel
I think I'm going to have to address this in order to fix a kludge in #62040. So, spitballing: the current logic is roughly
frame = arrow_table_to_pandas(table, dtype_backend=dtype_backend, null_to_int64=True)
return self._finalize_pandas_output(frame)
If self.dtype is not None, then _finalize_pandas_output will call frame.astype(self.dtype) (after some massaging). What if instead we do:
if self.dtype is None or dtype_backend == "pyarrow":
    frame = what_we_do_in_the_status_quo()
elif not isinstance(self.dtype, dict):
    frame = pretend_dtype_backend_is_pyarrow_since_astype_will_handle_it()
else:
    new_dtype = {item: what_we_would_get_from_passing_actual_dtype_backend(item) for item in table.columns}
    new_dtype.update(self.dtype)
    frame = pretend_dtype_backend_is_pyarrow_since_astype_will_handle_it()
Thoughts?
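To make the dict branch concrete, here is a rough sketch of just the dtype-merging step. `merge_dtypes`, `inferred`, and the example dtypes are hypothetical stand-ins for illustration, not pandas internals.

```python
# Hypothetical helper; none of these names are pandas APIs.
def merge_dtypes(columns, inferred, user_dtype):
    """Start from the dtype each column would get under the actual
    dtype_backend, then let user-specified entries override them."""
    new_dtype = {col: inferred[col] for col in columns}
    new_dtype.update(user_dtype)
    return new_dtype


columns = ["a", "b"]
inferred = {"a": "float64", "b": "int64"}  # assumed defaults, for illustration only
user_dtype = {"a": "Int64"}                # what the user passed as dtype=

merged = merge_dtypes(columns, inferred, user_dtype)
print(merged)  # {'a': 'Int64', 'b': 'int64'}
```

The idea is that every column ends up with an explicit target dtype, so the final astype can cast straight from the pyarrow-backed data without an intermediate float step.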
Comment From: jbrockmendel
Looks like the approach above works, and it fixes this and one other xfailed test. But it breaks 5 categorical tests. Three of these are just surfacing an unrelated bug which we can address separately (#62051). The other two are about getting the expected string dtype for dtype.categories.dtype. Mostly it's just ugly so far.