There are some optimizations I'd like to make in core.internals (mostly making the signatures stricter so we can do less validation at __init__-time). But that risks breaking changes for downstream packages that use the existing non-public APIs.
I'd like to start by asking downstream projects that access internals to identify what they actually use/need.
Once we identify what is used, my thought is to 1) encourage usage of public APIs where possible and 2) provide backward-compatible pseudo-public APIs (xref #40182).
cc @jorisvandenbossche (pyarrow), @TomAugspurger (dask), @mdurant (fastparquet), @shoyer (xarray), anyone else?
Comment From: jorisvandenbossche
As mentioned in https://github.com/pandas-dev/pandas/pull/40182#issuecomment-789832388, pyarrow makes use of the make_block() and BlockManager constructors.
Further, pyarrow accesses the CategoricalBlock, DatetimeTZBlock, etc. classes, but only as objects, e.g. for isinstance(block, ObjectBlock) checks (never calling the __init__ of those). Now, I suppose that we could remove this usage (instead of checking for CategoricalBlock, we could also check that block.values is a Categorical, for example).
Pyarrow also uses the Block.values attribute.
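A minimal sketch of the values-based check suggested above, using only public pandas API. The helper name holds_categorical is hypothetical, and Series.array stands in for Block.values here as an assumption, since this example does not touch the block classes themselves:

```python
import pandas as pd


def holds_categorical(values) -> bool:
    """Return True if `values` is a pandas Categorical (hypothetical helper)."""
    return isinstance(values, pd.Categorical)


df = pd.DataFrame({"a": pd.Categorical(["x", "y", "x"]), "b": [1, 2, 3]})

# Assumption: Series.array is used as a public stand-in for Block.values.
# The check looks at the array type rather than at a Block subclass.
result = {name: holds_categorical(df[name].array) for name in df.columns}
print(result)
```

Checking the type of the values rather than the type of the block means the check does not depend on whether a dedicated CategoricalBlock class exists.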
Comment From: jorisvandenbossche
As mentioned above, pyarrow uses the Block classes for isinstance checks. Removing the CategoricalBlock (https://github.com/pandas-dev/pandas/pull/40527) broke this, so I think we need to add back the CategoricalBlock class for now (and at the same time remove its usage in pyarrow, of course; I opened https://issues.apache.org/jira/browse/ARROW-12057 to track this).
Comment From: jorisvandenbossche
Another type of use case that e.g. both dask (partd) and pyarrow have: accessing Block.values.
For example, the change to return DatetimeArray from Datetime(like)Block instead of an ndarray[datetime64[ns]] breaks dask (because DatetimeArray is not a proper EA but a hybrid that carries an np.dtype).
Comment From: jbrockmendel
> breaks dask

Anything in particular?
Comment From: shoyer
Xarray does not rely upon pandas's private APIs.
Comment From: jorisvandenbossche
@jbrockmendel because DatetimeArray is not a proper EA, is_extension_array_dtype failed to catch the case of DatetimeArray, which resulted in taking a numpy-only code path with a DatetimeArray (see https://github.com/dask/partd/issues/48, https://github.com/dask/partd/pull/49/)
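A rough illustration of the pitfall described above, using only public pandas API. The helper is_pandas_array is hypothetical, and checking against ExtensionArray is just one possible workaround, not necessarily what the linked partd PR does:

```python
import numpy as np
import pandas as pd
from pandas.api.extensions import ExtensionArray
from pandas.api.types import is_extension_array_dtype

# A tz-naive DatetimeArray carries a plain numpy dtype, not an ExtensionDtype.
naive = pd.Series(pd.to_datetime(["2021-01-01", "2021-01-02"])).array
# A tz-aware DatetimeArray carries DatetimeTZDtype, which is an ExtensionDtype.
aware = pd.Series(pd.to_datetime(["2021-01-01"]).tz_localize("UTC")).array

naive_has_numpy_dtype = isinstance(naive.dtype, np.dtype)
naive_flagged = is_extension_array_dtype(naive)  # dtype-based check
aware_flagged = is_extension_array_dtype(aware)


def is_pandas_array(values) -> bool:
    """Hypothetical helper: test the array type instead of its dtype."""
    return isinstance(values, ExtensionArray)


print(naive_has_numpy_dtype, naive_flagged, aware_flagged, is_pandas_array(naive))
```

Because the dtype-based check looks only at the dtype, a tz-naive DatetimeArray is indistinguishable from an ndarray to it, whereas an isinstance check on the array itself still separates the two.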
Comment From: jbrockmendel
@jorisvandenbossche has any of pyarrow's usage of pandas internals changed in the last 2 years? In particular, I'm wondering if the introduction of ArrowDtype might make it easier for pyarrow to use public constructors.
Comment From: jorisvandenbossche
I don't think much has changed in that part of pyarrow. ArrowDtype also won't change anything in the short term, since pyarrow constructs blocks with default dtypes.
Comment From: jbrockmendel
The only thing left I see in internals.api that isn't deprecated is maybe_infer_ndim. Is anyone out there using this? Searching GH, all I'm seeing is people who have vendored pandas.