Pandas Practical steps towards a simplified BlockManager

In the mailing list discussion on a simplified (non-consolidating) BlockManager, @jbrockmendel brought up the relevant question of how we could get there (incrementally?): https://mail.python.org/pipermail/pandas-dev/2020-May/001223.html

Since that is a more practical discussion, moving that to an issue here.

Longer term, there are multiple options of how such a blockmanager could be enabled by the user (listing 2 here, but there are probably other options as well):

Go "all in" on extension arrays. We briefly discussed before the idea of using (nullable) extension dtypes for all dtypes by default in pandas 2.0. If we strive towards that, and assuming we keep the current 1D-restriction on ExtensionBlock, then we would "automatically" get a BlockManager with 1D blocks. And we could then focus on optimizing code paths specifically for the case of having all 1D ExtensionBlocks.
A "consolidation policy" option similarly as in the branch discussed in https://github.com/pandas-dev/pandas/issues/10556. Right now, that branch still uses 2D blocks (but separate 2D blocks of shape (1, n) per column) and not actually 1D blocks. So that might not be fully ideal. We could add 1D versions of our numeric blocks as well, but that would probably add a lot of complexity, although temporary, to the Blocks, so maybe not an ideal path forward.

For the "all extension arrays" option, we could also use light-weight EAs to store numpy arrays (like the "PandasArrays" we have now, only then actually using it), as long as we don't yet have actual extension arrays for all dtypes. For both, we could have a constructor keyword to enabled it, and/or a global config option to enable it.

Now, on the shorter term, there are probably some work items we could tackle to reduce the API surface / usage of blocks outside of the internals, to make things like the above more realistic to implement (and make it easier to experiment with alternative BlockManager implementations):

Continue removing cases where pandas code outside of the internals work directly on blocks (@jbrockmendel you already did a lot of work on this recently, do you have an idea / overview of which are still remaining and the potential stumble blocks to further remove those cases?)
Remove any "index label" related code in the internals, as the BlockManager in principle only needs to concern about integer locations. Although, @jbrockmendel, with your work in https://github.com/pandas-dev/pandas/pull/33052, https://github.com/pandas-dev/pandas/pull/33347, https://github.com/pandas-dev/pandas/pull/33332 (removal of BlockManager.get/set/delete) this might actually already be done?
Limit the API surface of the BlockManager. It might be good to get an idea of which methods on the BlockManager are used outside of the internals, clearly list those (and make others private), and see whether we can reduce this list. (Maybe actually reducing might not be possible, but the exercise can still be useful to get an explicit overview of what is needed to implement a BlockManager).

It might be that there are already more concrete open issues about some of those aspects.

cc @pandas-dev/pandas-core

Comment From: jbrockmendel

Continue removing cases where pandas code outside of the internals work directly on blocks [...] do you have an idea / overview of which are still remaining and the potential stumble blocks to further remove those cases

This is now easy to grep for because it is ._mgr instead of ._data. The places where I'd really like to avoid this are, in roughly the order in which they are code-smelly:

io.formats.csv L337
io.pytables L3877
core.groupby.generic._cython_agg_blocks
_libs.reduction (a whole other can of worms)
core.resample._take_new_index (just calls obj._mgr.reindex_indexer, so not that big a deal, but id really prefer if we could use a DataFrame method here)

Remove any "index label" related code in the internals, as the BlockManager in principle only needs to concern about integer locations. [...] this might actually already be done?

I lost momentum on this, would be happy to see it picked back up. IIRC BlockManager.insert was the next one I had planned to address but there wasn't a nice way to change it without a dramatic overhaul.

Limit the API surface of the BlockManager. It might be good to get an idea of which methods on the BlockManager are used outside of the internals, clearly list those (and make others private), and see whether we can reduce this list.

Comment From: mroeschke

do you have an idea / overview of which are still remaining and the potential stumble blocks to further remove those cases

rolling operations also operate block-wise, but removal would be fairly straightforward

Comment From: jbrockmendel

rolling operations also operate block-wise

where is that implemented? Does it go through BlockManager.apply?

Comment From: mroeschke

In rolling: https://github.com/pandas-dev/pandas/blob/master/pandas/core/window/rolling.py#L232

Which ultimately hits: https://github.com/pandas-dev/pandas/blob/master/pandas/core/generic.py#L5479

Which would be neat to get rid of since I think this is the only usage of this method in core/

Comment From: jbrockmendel

thanks, ill take a look

Comment From: mroeschke

I can volunteer to change rolling to column-wise

Comment From: jbrockmendel

I can volunteer to change rolling to column-wise

I think the thing to do is go through BlockManager.apply, so as to be agnostic to the ongoing 1D/2D issue