Pandas API: make CategoricalIndex._concat consistent with pd.concat

Index._concat (used by Index.append) is a thin wrapper around concat_compat. It is overridden by CategoricalIndex so that CategoricalDtype is retained more often than it is in concat_compat. We should make these match.
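A minimal sketch of the mismatch, assuming a pandas release where the CategoricalIndex override is still in place. The variable names are illustrative only, and the exact non-categorical dtype that pd.concat returns may differ between versions:

```python
import numpy as np
import pandas as pd

# Index.append goes through CategoricalIndex._concat, which keeps the
# categorical dtype when every appended value is an existing category.
ci = pd.CategoricalIndex(list("aabbca"), categories=list("abc"))
appended = ci.append(pd.Index(["c", "a"]))

# pd.concat goes through concat_compat, which falls back to a
# non-categorical dtype as soon as one input is not categorical.
s1 = pd.Series([1, 2, np.nan], dtype="category")
s2 = pd.Series([2, 1, 2])
concatenated = pd.concat([s1, s2], ignore_index=True)

print(type(appended).__name__, appended.dtype)
print(concatenated.dtype)
```

The same kind of input (one categorical, one non-categorical whose values all fit the categories) keeps the dtype in the first case and drops it in the second.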

If we just rip out CategoricalIndex._concat, we break 6 tests, all of which boil down to:

    def test_append_category_objects(self, ci):
        # with objects
        result = ci.append(Index(["c", "a"]))
        expected = CategoricalIndex(list("aabbcaca"), categories=ci.categories)
>       tm.assert_index_equal(result, expected, exact=True)

If we go the other way and change concat_compat, we break 6 different tests, all of which involve all-empty arrays or arrays that can be losslessly cast to the Categorical's dtype, e.g. (edited for legibility):

    def test_concat_empty_series_dtype_category_with_array(self):
        # GH#18515
        left = Series(np.array([]), dtype="category")
        right = Series(dtype="float64")
        result = concat([left, right])
>       assert result.dtype == "float64"


    def test_concat_categorical_coercion(self):
        # GH 13524

        # category + not-category => not-category
        s1 = Series([1, 2, np.nan], dtype="category")
        s2 = Series([2, 1, 2])

        exp = Series([1, 2, np.nan, 2, 1, 2], dtype="object")
>       tm.assert_series_equal(pd.concat([s1, s2], ignore_index=True), exp)
E       AssertionError: Attributes of Series are different
E       
E       Attribute "dtype" are different
E       [left]:  CategoricalDtype(categories=[1, 2], ordered=False)
E       [right]: object

Changing concat_compat results in much more convenient behavior, but it is textbook "values-dependent behavior", which in general we want to avoid (cc @jorisvandenbossche).

Comment From: TomAugspurger

I vaguely recall some discussions around changing the default behavior of pd.concat to union categories when provided with multiple CategoricalDtype objects, rather than casting to object. IMO, we should address that first (through a deprecation cycle). IIUC it'd then be easier to make the two consistent.
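For reference, a small sketch of the two behaviors being contrasted, assuming two categorical Series with different category sets. It uses the existing pandas.api.types.union_categoricals helper to show what a "union" result could look like; it is not a proposal for how pd.concat would implement it:

```python
import pandas as pd
from pandas.api.types import union_categoricals

s1 = pd.Series(["a", "b"], dtype="category")
s2 = pd.Series(["b", "c"], dtype="category")

# Different category sets: pd.concat does not keep the categorical dtype.
plain = pd.concat([s1, s2], ignore_index=True)

# Explicit union of the categories, done by hand before building the Series.
unioned = pd.Series(union_categoricals([s1, s2]))

print(plain.dtype)
print(unioned.dtype)
```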

Comment From: jbrockmendel

that looks similar but i think may be orthogonal. in all of the affected test cases i think we're dealing with one Categorical and one non-Categorical

Comment From: jreback

IIUC we should strive to improve concat_compat to make this do better inference, e.g.

If we go the other way and change concat_compat, we break 6 different tests, all of which involve all-empty arrays or arrays that can be losslessly cast to the Categorical's dtype, e.g (edited for legibility)

is what I would do. I think it is a strict improvement.

Comment From: jbrockmendel

@jorisvandenbossche want to weigh in here (before i get started on a PR)? one of the options here is value-dependent behavior

Comment From: jorisvandenbossche

I think I would opt for preserving the strict behaviour of Series. It is certainly tempting to make an exception, but having the behavior depend on which numbers are present (e.g. in the last test example) really doesn't sound ideal. The user can always cast to the dtype of the first object before doing the concat.

(the case of concatting with an empty other Series is something that could be addressed separately, IMO, eg by having a "null" dtype for empty Series)

Other idea: if we find it onerous for the user to cast all arguments passed to concat/append themselves to ensure consistent dtypes, we could also add a keyword argument to concat/append that would do that for you. But this would then be a more general solution (for all dtypes), instead of adding a special case only for categorical dtype.
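A rough sketch of the explicit-cast route, to make the trade-off concrete. concat_as_first is a hypothetical helper written for this comment, not an existing or proposed pandas API:

```python
import numpy as np
import pandas as pd


def concat_as_first(objs):
    """Hypothetical helper, not a pandas API: cast every input to the
    dtype of the first one, then concatenate."""
    target = objs[0].dtype
    return pd.concat([obj.astype(target) for obj in objs], ignore_index=True)


s1 = pd.Series([1, 2, np.nan], dtype="category")

# Every value of the second Series is an existing category: lossless cast.
lossless = concat_as_first([s1, pd.Series([2, 1, 2])])

# 3 is not a category of s1, so the explicit cast turns it into NaN.
lossy = concat_as_first([s1, pd.Series([2, 1, 3])])

print(lossless.dtype == s1.dtype, lossy.isna().sum())
```

With the cast spelled out, the result dtype is always that of the first object, and any loss of values is a consequence of a cast the user asked for rather than of which numbers happened to be present.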

Comment From: jbrockmendel

Possibly related: #12509, #14016, #15332, #24093, #24845, #25019, #37480, #44099, #42840

Comment From: jbrockmendel

Looking again at the tests that fail if we get rid of the special-casing, the basic issue is roughly:

    # test_loc_setitem_expansion_label_unused_category
    ci = pd.CategoricalIndex(['a', 'a', 'b', 'b', 'c', 'a'], categories=['a', 'b', 'c', 'e'])
    df = pd.DataFrame({"A": range(6)}, index=ci)
    df.loc["e"] = 20

When we do this setitem-with-expansion, we create a new DataFrame (inside _append_internal) with index=Index(["e"]) and then do a concat. But what we really want is CategoricalIndex(["e"], dtype=ci.dtype). So we get that by calling CategoricalIndex._concat later in the process.
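A sketch of the desired outcome at the Index level only, assuming the new label is wrapped in the existing dtype before concatenating. It does not reproduce what _append_internal does; the names new_label and expanded are illustrative:

```python
import pandas as pd

ci = pd.CategoricalIndex(list("aabbca"), categories=list("abce"))

# Building the new label with the existing dtype up front makes the
# subsequent append a same-dtype concatenation.
new_label = pd.CategoricalIndex(["e"], dtype=ci.dtype)
expanded = ci.append(new_label)

print(expanded.dtype == ci.dtype, list(expanded))
```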

We can construct examples for other EA dtypes where we don't get the casting we'd like:

    idx = pd.Index([pd.Timestamp(0).date()], dtype="date32[pyarrow]")
    df = pd.DataFrame({"A": range(1)}, index=idx)
    item = pd.Timestamp("1970-01-02").date()

    df.loc[item] = 1
    assert df.index.dtype == idx.dtype  # <- nope! it's object

I suspect that the general-case solution here is to use EA._cast_pointwise_result earlier in the process.