- [x] I have checked that this issue has not already been reported.
- [x] I have confirmed this bug exists on the latest version of pandas.
- [x] (optional) I have confirmed this bug exists on the master branch of pandas.
Code Sample, a copy-pastable example
The sample at https://github.com/pandas-dev/pandas/blob/6210077d32a9e9675526ea896e6d1f9189629d4a/pandas/core/arrays/base.py#L1031-L1047 is buggy, since it doesn't handle scalars properly:
```python
import numpy as np

from pandas.api.extensions import ExtensionArray, ExtensionDtype


class MyDtype(ExtensionDtype):
    name = "name"


class MyArray(ExtensionArray):
    dtype = MyDtype()

    def __init__(self, data):
        self._data = data

    @classmethod
    def _from_sequence(cls, scalars, *, dtype=None, copy=False):
        return cls(np.array(scalars))

    def __getitem__(self, item):
        return self._data[item]

    def __len__(self):
        return len(self._data)

    def take(self, indices, allow_fill=False, fill_value=None):
        from pandas.core.algorithms import take

        # If the ExtensionArray is backed by an ndarray, then
        # just pass that here instead of coercing to object.
        data = self.astype(object)

        if allow_fill and fill_value is None:
            fill_value = self.dtype.na_value

        # fill value should always be translated from the scalar
        # type for the array, to the physical storage type for
        # the data, before passing to take.
        result = take(data, indices, fill_value=fill_value, allow_fill=allow_fill)
        return self._from_sequence(result, dtype=self.dtype)


a = MyArray._from_sequence([1, 2, 3])
result = a.take(0)
assert result == 1
```
Problem description
Expected Output
`.take(0)` should return the scalar 1, rather than trying to wrap it in a new `MyArray`.
Output of pd.show_versions()
Comment From: TomAugspurger
Hmm, I may have been mistaken. I confused myself, since my extension array isn't naturally a "scalar" and is (somewhere) being converted back to a length-2 ndarray.
Comment From: jorisvandenbossche
My expectation was that `take` always requires a list-like of indices, and shouldn't work for scalar indices (so that the return value is always a new EA of the same type).
That seems to be confirmed by the docstring, but then the example implementation should maybe check for that?
Comment From: jorisvandenbossche
On the other hand, numpy's `take` also works with scalar indices (although it is also documented to accept an array-like of indices).
And for example IntegerArray also doesn't give a proper error message for it:
```python
In [19]: arr = pd.array([1, 2, 3])

In [20]: arr.take(0)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-20-6b42d04f0add> in <module>
----> 1 arr.take(0)

~/scipy/pandas/pandas/core/arrays/masked.py in take(self, indexer, allow_fill, fill_value)
    324             mask = mask ^ fill_mask
    325
--> 326         return type(self)(result, mask, copy=False)
    327
    328     def copy(self: BaseMaskedArrayT) -> BaseMaskedArrayT:

~/scipy/pandas/pandas/core/arrays/integer.py in __init__(self, values, mask, copy)
    289         if not (isinstance(values, np.ndarray) and values.dtype.kind in ["i", "u"]):
    290             raise TypeError(
--> 291                 "values should be integer numpy array. Use "
    292                 "the 'pd.array' function instead"
    293             )

TypeError: values should be integer numpy array. Use the 'pd.array' function instead
```
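For reference, NumPy's `take` with a scalar index returns a scalar element rather than raising, which is the behavior being contrasted here:

```python
import numpy as np

arr = np.array([1, 2, 3])

# A scalar index returns a scalar element; a list-like of indices
# returns an array.
print(np.take(arr, 0))    # 1
print(np.take(arr, [0]))  # [1]
```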
So at least we should decide on the exact spec, and then update implementation or error checking / base tests / docs based on that.
Comment From: TomAugspurger
NumPy accepting scalars is somewhat new:

```
indices : array_like (Nj...)
    The indices of the values to extract.

    .. versionadded:: 1.8.0
        Also allow scalars for indices.
```
I don't have a strong preference, but it's probably best to match NumPy if it doesn't introduce any issues.
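If matching NumPy is the chosen spec, the dispatch is simple: a scalar index returns a single element, a list-like returns an array. A sketch with a hypothetical `take_like_numpy` helper (not a pandas API):

```python
import numpy as np


def take_like_numpy(data, indices):
    # Hypothetical sketch of the "match NumPy" option: scalar
    # indices return one element, list-likes return an array.
    if np.ndim(indices) == 0:
        return data[int(indices)]
    return np.take(data, indices)


data = np.array([1, 2, 3])
print(take_like_numpy(data, 0))       # 1
print(take_like_numpy(data, [0, 2]))  # [1 3]
```

An `ExtensionArray.take` following this spec would return `self[int(indices)]` for the scalar case instead of calling `_from_sequence` on the result.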
Comment From: jbrockmendel
1. No reason to support scalars, as they should never be passed.
2. No reason to add a check, since it's just perf overhead that should never be needed.
3. No harm in adding a note in the docs.