Pandas Should IntegerArray provide data / mask through an API?

I'm currently implementing some algorithms on top of the IntegerArray class with numba. For that I need to pass in the two separate backing NumPy arrays instead of the pandas class. Using series.to_numpy() isn't helpful in this case, as it returns an object-typed array. For now, I'll keep using series.values._data and series.values._mask, but I'm aware that using private variables is not a long-term solution.

Given that series.values._data may be undefined where the mask is True, I know that returning the raw data may be a bit controversial, but it is still needed. Thus the accessor for it should be named carefully.

Comment From: TomAugspurger

Yeah, but we have two optimizations planned that make this a bit tricky:

1. Make ._mask optional. If there are no missing values then we can avoid allocating the array to store it.
2. Make ._mask a bitmask rather than a bytemask.

series.values._mask

Just an FYI, series.array.isna() will give you _mask through a public API. Doesn't help you with ._data though :)

Actually... can you use a combination of these two?

data = array.to_numpy(na_value=0)
mask = array.isna()

I think that's 0-copy access to both components...

Edit: no it's definitely not zero-copy, since we insert na_value for the NAs.
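The public-API combination suggested above can be sketched as follows. This is a minimal illustration, not a recommended pattern: the explicit dtype="int64", the kernel name masked_sum, and the fallback that runs the kernel as plain Python when numba is not installed are assumptions added for this example. As noted in the edit above, to_numpy(na_value=0) copies the data.

```python
import numpy as np
import pandas as pd

try:
    from numba import njit
except ImportError:
    # Assumption for this sketch: run the kernel as plain Python without numba.
    def njit(func):
        return func


@njit
def masked_sum(data, mask):
    """Sum the entries of `data` whose `mask` entry is False (not missing)."""
    total = 0
    for i in range(data.shape[0]):
        if not mask[i]:
            total += data[i]
    return total


arr = pd.array([1, None, 3], dtype="Int64")

# Public API only: missing slots are filled with 0, which forces a copy.
data = arr.to_numpy(dtype="int64", na_value=0)
mask = np.asarray(arr.isna())

result = masked_sum(data, mask)
```

The kernel never reads the filler value, because every slot where mask is True is skipped.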

Comment From: xhochy

Thank you for the detailed answer!

I think array.isna() is a pretty decent interface for me, as it will remain stable under the planned performance optimization 2. Once 2 is implemented, a public interface to retrieve the bitmask would also be nice, but at least this way my code won't break on implementation changes (it will just get a bit slower).

Comment From: TomAugspurger

One more question @xhochy: do you want zero-copy access to the NumPy arrays, or would a pyarrow Array suffice? IntegerArray & BooleanArray do implement __arrow_array__, so you can get zero-copy access to it in a way that won't change if / when we update our implementation.

cc @jorisvandenbossche if you have thoughts.

Comment From: jorisvandenbossche

Note that array.isna() also gives a copy of the mask, so it is not necessarily optimal. The __arrow_array__ is indeed zero-copy for the data, but not for the mask, since that gets converted into a bitmask. If you need it as a bitmask, that might be fine, but if you actually need a numpy boolean mask for numba, that would give a double conversion bytemask->bitmask->bytemask.

But, long term it would indeed be good to have some more official way to access this. We could add a return_mask=True/False option to to_numpy()?
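To make the double conversion mentioned above concrete, here is a small NumPy-only sketch of a bytemask -> bitmask -> bytemask round trip. It assumes the Arrow convention for the packed form (a set bit means "valid", least-significant bit first); the variable names are made up for this example, and no pandas or pyarrow internals are involved.

```python
import numpy as np

# Bytemask in the pandas sense: one byte per element, True means missing.
bytemask = np.array(
    [False, True, False, False, True, False, False, False, True, False]
)

# Bytemask -> bitmask: invert to "valid" bits and pack 8 elements per byte.
validity_bits = np.packbits(~bytemask, bitorder="little")

# Bitmask -> bytemask: unpack, trim the padding bits, invert back to "missing".
restored = ~np.unpackbits(
    validity_bits, count=bytemask.size, bitorder="little"
).astype(bool)
```

Both steps allocate a new array, which is the cost being pointed out when a bytemask is what the consumer wanted in the first place.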

Comment From: jorisvandenbossche

"Note that array.isna() also gives a copy of the mask,"

Actually, we don't:

https://github.com/pandas-dev/pandas/blob/8ba9c627e2ad630d977ae504a3bf0e2ec5a9885a/pandas/core/arrays/masked.py#L225-L226

but that seems like a bug to me. A user mutating that BooleanArray afterwards should never update the original array.
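Until the copy semantics of isna() are guaranteed, a caller who wants to modify the mask can protect the original array by copying explicitly. A minimal sketch, assuming only that the result of isna() can be converted with np.array:

```python
import numpy as np
import pandas as pd

arr = pd.array([1, None, 3], dtype="Int64")

# Explicit copy: safe regardless of whether isna() returns a view or a copy.
mask = np.array(arr.isna(), copy=True)
mask[0] = True

# The original array still reports its first element as present.
first_is_missing = bool(arr.isna()[0])
```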

Comment From: xhochy

As a bit of context here: contrary to most things I post on this issue tracker, Arrow isn't involved in this, only pandas + numba. For me, using the __arrow_array__ interface would be on the same level as just accessing the private members of the array. I'm interested in converting some existing computations, so far implemented with numba on NumPy float and int arrays, to also support the new integer extension type.

I would like to avoid copies if possible, so I would adapt these computations to use the mask exactly as pandas implements it: as long as it is a bytemask, take that, and when the internal implementation switches, take the bitmask.

Taking a step back, an alternative for this use case would also be to provide an interface in pandas Series.apply_nonnull(numba_func_that_works_on_scalars) like it's done in Rolling.apply. This would solve my problem of having access to the raw arrays while removing the need for the user to check for nulls on byte- or bitmask. Especially as in the case of bitmask, I have not yet found a simple and performant numba access pattern.
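The idea above could look roughly like the following free function. Note that apply_nonnull is hypothetical: no such method exists on Series, and this sketch uses a plain Python loop and copies the data, whereas a real implementation would presumably compile the loop with numba and work on the backing arrays directly. It also assumes an Int64 input and an integer-valued scalar function.

```python
import numpy as np
import pandas as pd


def apply_nonnull(series, scalar_func):
    """Hypothetical helper: apply `scalar_func` to non-missing values only."""
    arr = series.array
    mask = np.array(arr.isna(), copy=True)
    data = arr.to_numpy(dtype="int64", na_value=0)

    out = data.copy()
    for i in range(out.shape[0]):
        if not mask[i]:
            # The scalar function never sees a missing value.
            out[i] = scalar_func(data[i])

    # Rebuild a nullable integer array from the values and the original mask.
    result = pd.arrays.IntegerArray(out, mask)
    return pd.Series(result, index=series.index, name=series.name)


s = pd.Series([1, None, 3], dtype="Int64")
tenfold = apply_nonnull(s, lambda x: x * 10)
```

The user-supplied function works on scalars only, so it does not need to know whether the mask is stored as bytes or bits.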

Comment From: kkraus14

We've just upgraded cuDF to Pandas 1.0+ and we'd really love an API to get the buffers underneath IntegerArray / BooleanArray / etc. classes zero copy regardless of whether the mask is a bitmask or a bytemask.

For cuDF, if we're given a bytemask, we'd want to condense it down to a bitmask, but we'd want to do that on the GPU as opposed to the CPU.

cc @brandon-b-miller

Comment From: TomAugspurger

I think the best we can do for now is say "use ._mask and ._data", with the caveats that:

You should always check that ._mask is not None (to protect against it being optional in the future)
You should always check that _mask.size / itemsize is what you want (to protect against it being a bitmask in the future)

And we'll agree among friends that pandas won't change the names?

We could offer a public .mask and .data, but I'm not sure what value those would have over just using the private versions, since we know that the implementation is going to change anyway.
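The two caveats listed above can be wrapped in a small guard. This is only a sketch: raw_buffers is a made-up helper name, _data and _mask are private attributes that pandas does not promise to keep, and the choice to synthesize an all-False mask when none is stored is an assumption of this example.

```python
import numpy as np
import pandas as pd


def raw_buffers(arr):
    """Hypothetical helper: return (data, bytemask) from a masked array."""
    data = arr._data  # private attribute
    mask = arr._mask  # private attribute

    if mask is None:
        # Guards against a future where the mask is optional.
        mask = np.zeros(data.shape, dtype=bool)
    elif mask.dtype != np.bool_ or mask.shape != data.shape:
        # Guards against a future where the mask is a packed bitmask.
        raise TypeError("expected a boolean bytemask matching the data shape")

    return data, mask


arr = pd.array([1, None, 3], dtype="Int64")
data, mask = raw_buffers(arr)
```

The values in data at positions where mask is True are unspecified and should not be read.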

Comment From: kkraus14

Thanks @TomAugspurger! We'll use ._data and ._mask for now.

FWIW on the cuDF side we have similar needs for this (to hand off to things like numba kernels / cupy functions / custom user shenanigans), and with bitmasks it becomes tricky to efficiently handle typical zero-copy operations like slicing. What we've done is have .data / .mask nullify the offset by doing the pointer arithmetic / copy construction (if needed) respectively, but also provide a .base_data and .base_mask for when the underlying allocation needs to be accessed zero copy without handling the offset.
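The offset handling described above can be illustrated with a toy CPU-side class. This is a NumPy stand-in written for this example, not cuDF's implementation: the class name BitmaskedColumn, the slice method, and the least-significant-bit-first packing are all assumptions. Only the split between data / mask and base_data / base_mask follows the description.

```python
import numpy as np


class BitmaskedColumn:
    """Toy column: values plus a packed validity bitmask and an offset."""

    def __init__(self, base_data, base_mask, offset=0, size=None):
        self.base_data = base_data  # full allocation, never copied
        self.base_mask = base_mask  # packed bits for the full allocation
        self.offset = offset
        self.size = len(base_data) - offset if size is None else size

    @property
    def data(self):
        # Offset applied by slicing: a zero-copy view (pointer arithmetic).
        return self.base_data[self.offset:self.offset + self.size]

    @property
    def mask(self):
        # A bit offset is usually not byte-aligned, so re-pack into a copy.
        bits = np.unpackbits(
            self.base_mask, count=len(self.base_data), bitorder="little"
        )
        window = bits[self.offset:self.offset + self.size]
        return np.packbits(window, bitorder="little")

    def slice(self, start, stop):
        # Slicing only adjusts offset and size; both buffers stay shared.
        return BitmaskedColumn(
            self.base_data, self.base_mask, self.offset + start, stop - start
        )


valid = np.array([1, 0, 1, 1, 0, 1, 1, 1, 0, 1], dtype=bool)
column = BitmaskedColumn(
    np.arange(10, dtype=np.int64), np.packbits(valid, bitorder="little")
)
part = column.slice(3, 8)
```

Here part.data is a view into the original allocation, while part.mask is a freshly packed bitmask that starts at bit 0.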

Comment From: jbrockmendel

The names _mask and _data haven't changed in the 5 years since this was opened. I don't think there's any interest in making them public, as users shouldn't be accessing them directly.