- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of pandas.
- [ ] (optional) I have confirmed this bug exists on the master branch of pandas.
Code Sample, a copy-pastable example
Note: This code sample requires pygeos, a Python interface to the GEOS library. I don't know how to reproduce this more minimally without pygeos, because the sample depends on the numpy ufunc mechanism, which I can't easily exercise with sample code. Maybe someone with more C experience could make a smaller reproduction.
>>> import pandas as pd
>>> import pygeos
>>> pd.array(["POINT (0 0)"])
<StringArray>
['POINT (0 0)']
Length: 1, dtype: string
>>> pygeos.from_wkt(pd.array(["POINT (0 0)"]))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/Caskroom/miniconda/base/envs/geopandas-dev/lib/python3.9/site-packages/pygeos/io.py", line 181, in from_wkt
return lib.from_wkt(geometry, **kwargs)
File "/usr/local/Caskroom/miniconda/base/envs/geopandas-dev/lib/python3.9/site-packages/pandas/core/arrays/numpy_.py", line 254, in __array_ufunc__
result = type(self)(result)
File "/usr/local/Caskroom/miniconda/base/envs/geopandas-dev/lib/python3.9/site-packages/pandas/core/arrays/string_.py", line 195, in __init__
self._validate()
File "/usr/local/Caskroom/miniconda/base/envs/geopandas-dev/lib/python3.9/site-packages/pandas/core/arrays/string_.py", line 200, in _validate
raise ValueError("StringArray requires a sequence of strings or pandas.NA")
ValueError: StringArray requires a sequence of strings or pandas.NA
Problem description
The pygeos.from_wkt function takes an array of strings and returns an array of geometry objects, but when the input array is a pandas StringArray we get an error. It seems that the PandasArray.__array_ufunc__ implementation assumes the results of the ufunc will go into the same type of array as the input, in this case a StringArray, which raises an error when the results are not strings. (Credit to @jorisvandenbossche in a comment on pygeos/pygeos#338.)
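The mechanism described above can be sketched without pandas or pygeos. The example below is only an illustration under stated assumptions: StrictStringBox is a hypothetical stand-in written for this sketch, not pandas code, and it copies just the two behaviours in question (validating that every element is a string, and unconditionally re-boxing ufunc results in the input's own type).

```python
import numpy as np


class StrictStringBox:
    """Hypothetical string-only container (illustration only, not pandas code)."""

    def __init__(self, values):
        values = np.asarray(values, dtype=object)
        # Mimic the validation step: every element must be a str.
        if not all(isinstance(v, str) for v in values):
            raise ValueError("StrictStringBox requires a sequence of strings")
        self._ndarray = values

    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        # Unwrap StrictStringBox inputs to plain ndarrays before calling the ufunc.
        raw = tuple(
            x._ndarray if isinstance(x, StrictStringBox) else x for x in inputs
        )
        result = getattr(ufunc, method)(*raw, **kwargs)
        # Unconditionally re-box in the input's type. This fails whenever
        # the ufunc returns something other than strings.
        return type(self)(result)


# A ufunc that maps strings to non-strings (their lengths).
str_len_ufunc = np.frompyfunc(len, 1, 1)

try:
    str_len_ufunc(StrictStringBox(["a", "bb"]))
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```

The ufunc itself succeeds; the error comes from the re-boxing step afterwards, which matches where the traceback above ends up.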
Expected Output
If we pass a numpy array into pygeos.from_wkt, it works fine and produces an output array of geometries.
>>> import numpy as np
>>> np.array(["POINT (0 0)"])
array(['POINT (0 0)'], dtype='<U11')
>>> pygeos.from_wkt(np.array(["POINT (0 0)"]))
array([<pygeos.Geometry POINT (0 0)>], dtype=object)
I would expect the same output using a StringArray.
>>> pygeos.from_wkt(pd.array(["POINT (0 0)"]))
array([<pygeos.Geometry POINT (0 0)>], dtype=object)
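Since plain numpy input works, one possible stopgap (an assumption, not something proposed in this report) is to convert the StringArray to a numpy object array before handing it to the ufunc. The sketch below uses a length ufunc built with np.frompyfunc in place of pygeos.from_wkt so that it runs without pygeos.

```python
import numpy as np
import pandas as pd

# Stand-in for a ufunc that maps strings to non-string results.
str_len_ufunc = np.frompyfunc(len, 1, 1)

arr = pd.array(["a", "bb"], dtype="string")

# Convert to a plain numpy array first, so that pandas'
# __array_ufunc__ (and its re-boxing step) is never involved.
lengths = str_len_ufunc(arr.to_numpy())

print(lengths)
```

The result is a plain object-dtype numpy array, so the string dtype is lost and missing values would need separate handling.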
Output of pd.show_versions()
Comment From: jorisvandenbossche
@johnflavin thanks for moving the issue here.
To be able to reproduce it without pygeos, we need a ufunc that works on string data. I am not sure there is a built-in one in numpy, but we can create one from a Python function using np.frompyfunc:
In [10]: str_len_ufunc = np.frompyfunc(lambda x: len(x), 1, 1)
In [11]: arr = pd.array(["a", "bb"], dtype="string")
In [12]: str_len_ufunc(arr)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-12-3738fc027ef2> in <module>
----> 1 str_len_ufunc(arr)
~/scipy/pandas/pandas/core/arrays/numpy_.py in __array_ufunc__(self, ufunc, method, *inputs, **kwargs)
180 if not lib.is_scalar(result):
181 # re-box array-like results, but not scalar reductions
--> 182 result = type(self)(result)
183 return result
184
~/scipy/pandas/pandas/core/arrays/string_.py in __init__(self, values, copy)
209 self._dtype = StringDtype() # type: ignore[assignment]
210 if not isinstance(values, type(self)):
--> 211 self._validate()
212
213 def _validate(self):
~/scipy/pandas/pandas/core/arrays/string_.py in _validate(self)
214 """Validate that we only store NA or strings."""
215 if len(self._ndarray) and not lib.is_string_array(self._ndarray, skipna=True):
--> 216 raise ValueError("StringArray requires a sequence of strings or pandas.NA")
217 if self._ndarray.dtype != "object":
218 raise ValueError(
ValueError: StringArray requires a sequence of strings or pandas.NA
Comment From: johnflavin
@jorisvandenbossche Perfect! Thank you for the simpler reproduction.
Comment From: gwerbin-tive
I just ran into this using recent versions of GeoPandas and Shapely, with the same from_wkt routine.
Poking around in the debugger, it looks like it's trying to use the input array type to wrap/re-box the output, which might be of a different type.
Is it just a matter of changing the StringArray.__array_ufunc__ implementation to not do that? For example, maybe it should use lib.is_string_array to determine whether to re-wrap or not.
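A rough sketch of that idea, under stated assumptions: LenientStringBox and _all_strings are hypothetical names made up for this illustration, and _all_strings is a pure-Python stand-in for the kind of check lib.is_string_array performs. This is not the pandas implementation, just the conditional re-wrapping logic in isolation.

```python
import numpy as np


def _all_strings(values):
    # Pure-Python stand-in for a "is this an array of strings?" check.
    flat = np.asarray(values, dtype=object).ravel()
    return all(isinstance(v, str) for v in flat)


class LenientStringBox:
    """Hypothetical string-only container (illustration only, not pandas code)."""

    def __init__(self, values):
        values = np.asarray(values, dtype=object)
        if not _all_strings(values):
            raise ValueError("LenientStringBox requires a sequence of strings")
        self._ndarray = values

    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        raw = tuple(
            x._ndarray if isinstance(x, LenientStringBox) else x for x in inputs
        )
        result = getattr(ufunc, method)(*raw, **kwargs)
        # Re-wrap only when the result is still an array of strings;
        # otherwise return the plain ndarray unchanged.
        if isinstance(result, np.ndarray) and _all_strings(result):
            return type(self)(result)
        return result


str_len_ufunc = np.frompyfunc(len, 1, 1)      # strings -> ints
upper_ufunc = np.frompyfunc(str.upper, 1, 1)  # strings -> strings

box = LenientStringBox(["a", "bb"])
lengths = str_len_ufunc(box)  # stays a plain ndarray
shouted = upper_ufunc(box)    # re-wrapped as LenientStringBox
```

With this approach a string-to-string ufunc keeps the container type, while a ufunc that returns other objects falls back to a plain ndarray instead of raising.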