I'm currently implementing some algorithms on top of the IntegerArray class with numba. Therefore I need to pass in the two separate backing NumPy arrays instead of the pandas class. Using series.to_numpy() isn't helpful in this case as it returns an object-typed array. For now, I'll keep using series.values._data and series.values._mask, but I'm aware that using private variables is not a long-term solution.
Given that series.values._data may be undefined where the mask is True, I know that returning the raw data may be a bit controversial, but it is still needed. Thus the accessor for it should be named carefully.
Comment From: TomAugspurger
Yeah, but we have two optimizations planned that make this a bit tricky:
- Make ._mask optional. If there are no missing values then we can avoid allocating the array to store it.
- Make ._mask a bitmask rather than a bytemask.
series.values._mask
Just an FYI, series.array.isna() will give you _mask through a public API. Doesn't help you with ._data though :)
Actually... can you use a combination of these two?
data = array.to_numpy(na_value=0)
mask = array.isna()
I think that's 0-copy access to both components...
Edit: no, it's definitely not zero-copy, since we insert na_value for the NAs.
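A minimal sketch of the public-API combination suggested above, for illustration only. The sample values and the explicit dtype="int64" argument are assumptions added here (the dtype is passed so the result is not an object array); as the edit notes, this path copies the data.

```python
import numpy as np
import pandas as pd

# Sample nullable integer array (values are illustrative only).
arr = pd.array([1, None, 3], dtype="Int64")

# Public API only: fill missing slots with a placeholder and fetch the mask.
# This is NOT zero-copy, because na_value has to be written into a new array.
data = arr.to_numpy(dtype="int64", na_value=0)
mask = arr.isna()
```

Both results are plain NumPy arrays, so they can be handed to a numba function as two separate arguments.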
Comment From: xhochy
Thank you for the detailed answer!
I think array.isna() is a pretty decent interface for me as it will be stable with regard to the future performance optimization 2. When 2. is implemented, having a public interface to retrieve the bitmask would also be nice, but at least this way my code won't break on implementation changes (it will just get a bit slower).
Comment From: TomAugspurger
One more question @xhochy: do you want zero-copy access to the NumPy arrays, or would a pyarrow Array suffice? IntegerArray & BooleanArray do implement __arrow_array__, so you can get zero-copy access to it in a way that won't change if / when we update our implementation.
cc @jorisvandenbossche if you have thoughts.
Comment From: jorisvandenbossche
Note that array.isna() also gives a copy of the mask, so not necessarily optimal.
The __arrow_array__ is indeed zero-copy for the data, but not for the mask, since that gets converted into a bitmask. If you need it as a bitmask, that might be fine, but if you actually need a numpy boolean mask for numba, that would give a double conversion bytemask->bitmask->bytemask.
But, long term it would indeed be good to have some more official way to access this.
We could add a return_mask=True/False option to to_numpy()?
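To make the bytemask->bitmask->bytemask round trip mentioned in this comment concrete, here is a rough NumPy-only sketch. It assumes an Arrow-style validity bitmap (a set bit means "valid", least-significant bit first), which is the inverse of the pandas mask where True means "missing"; the sample mask is made up.

```python
import numpy as np

# pandas-style bytemask: one bool per element, True means "missing".
bytemask = np.array([False, True, False, False, True, False, False, False, True])

# bytemask -> bitmask: invert to "valid" bits and pack 8 elements per byte.
validity_bits = np.packbits(~bytemask, bitorder="little")

# bitmask -> bytemask: unpack, trim the padding bits, and invert back.
unpacked = np.unpackbits(validity_bits, count=len(bytemask), bitorder="little")
roundtrip = ~unpacked.astype(bool)
```

Each direction allocates a new array, which is the overhead being pointed out for consumers that need a boolean mask in the end.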
Comment From: jorisvandenbossche
Note that array.isna() also gives a copy of the mask,
Actually, we don't:
https://github.com/pandas-dev/pandas/blob/8ba9c627e2ad630d977ae504a3bf0e2ec5a9885a/pandas/core/arrays/masked.py#L225-L226
but that seems like a bug to me. A user mutating that BooleanArray afterwards should never update the original array.
Comment From: xhochy
As a bit of context here: contrary to most things I post here on the issue tracker, Arrow isn't involved in this, only pandas + numba. Using the __arrow_array__ interface for this would be, for me, on the same level as just accessing the private members of the array. I'm interested in converting some existing computations that so far were implemented with numba on numpy float+int arrays to also support the new integer extension type.
I would like to avoid copies if possible, thus I would adapt these computations to just use the mask as it is implemented by pandas. As long as it is a bytemask, take that, and when the internal implementation switches, take the bitmask.
Taking a step back, an alternative for this use case would be for pandas to provide an interface such as Series.apply_nonnull(numba_func_that_works_on_scalars), like it's done in Rolling.apply. This would solve my problem of needing access to the raw arrays while removing the need for the user to check for nulls on a byte- or bitmask, especially as in the case of a bitmask I have not yet found a simple and performant numba access pattern.
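As a rough illustration of the idea, the sketch below applies a scalar function to the non-null entries only. The helper apply_nonnull here is hypothetical (no such pandas method exists), it takes the data and a bytemask as separate NumPy arrays, and the loop is plain Python; how it would be compiled with numba is left open.

```python
import numpy as np


def apply_nonnull(data, mask, func):
    """Hypothetical helper: call func on every element whose mask is False."""
    out = np.zeros_like(data)  # masked slots keep a placeholder of 0
    for i in range(len(data)):
        if not mask[i]:
            out[i] = func(data[i])
    # The null positions are unchanged, so the same mask applies to the output.
    return out, mask.copy()


data = np.array([1, 99, 3], dtype="int64")  # 99 sits in a masked slot
mask = np.array([False, True, False])
out, out_mask = apply_nonnull(data, mask, lambda x: x * 2)
```

The caller only writes the scalar function; the null check lives in one place, regardless of how the mask is stored.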
Comment From: kkraus14
We've just upgraded cuDF to Pandas 1.0+ and we'd really love an API to get the buffers underneath IntegerArray / BooleanArray / etc. classes zero copy regardless of whether the mask is a bitmask or a bytemask.
For cuDF, if we're given a bytemask, we'd want to condense it down to a bitmask, but we'd want to do that on the GPU as opposed to the CPU.
cc @brandon-b-miller
Comment From: TomAugspurger
I think the best we can do for now is say "use ._mask and ._data with the caveats that"
- You should always check that ._mask is not None (to protect against it being optional in the future)
- You should always check that _mask.size / itemsize is what you want (to protect against it being a bitmask in the future)
And we'll agree among friends that pandas won't change the names?
We could offer a public .mask and .data, but I'm not sure what value those would have over just using the private versions, since we know that the implementation is going to change anyway.
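The two caveats above could be wrapped in a small defensive helper along these lines. This is only a sketch: masked_buffers is a hypothetical name, the None branch models the planned optional mask (not current behaviour), and ._data / ._mask are private attributes that may change.

```python
import numpy as np
import pandas as pd


def masked_buffers(arr):
    """Hypothetical helper: return (data, bytemask) from a masked array."""
    data = arr._data  # private attribute, no stability guarantee
    mask = arr._mask  # private attribute, no stability guarantee
    if mask is None:
        # Assumed future case: no mask is allocated when nothing is missing.
        mask = np.zeros(len(data), dtype=bool)
    elif mask.dtype != np.bool_ or mask.size != data.size:
        # Guards against the mask becoming a packed bitmask.
        raise TypeError("expected a bytemask with one bool per element")
    return data, mask


arr = pd.array([1, None, 3], dtype="Int64")
data, mask = masked_buffers(arr)
```

With the current bytemask layout both arrays are returned without copying, and the value in data at a masked position should be treated as undefined.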
Comment From: kkraus14
Thanks @TomAugspurger! We'll use ._data and ._mask for now.
FWIW on the cuDF side we have similar needs for this (to hand off to things like numba kernels / cupy functions / custom user shenanigans), and with bitmasks it becomes tricky to efficiently handle typical zero-copy operations like slicing. What we've done is have .data / .mask nullify the offset by doing the pointer arithmetic / copy construction (if needed) respectively, but also provide a .base_data and .base_mask for when the underlying allocation needs to be accessed zero-copy without handling the offset.
Comment From: jbrockmendel
The names _mask and _data haven't changed in the 5 years since this was opened. I don't think there's any interest in making them public, as users shouldn't be accessing them directly.