In https://github.com/numpy/numpy/pull/14995 I have tried to make numpy consistent with respect to coercing dataframes (and other array-likes which also implement the sequence protocol) to numpy arrays.

With the new behaviour from that PR, the __array__ interface is always preferred, so an object that is also sequence-like (with different behaviour through that protocol) no longer produces mixed/inconsistent results.
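To illustrate the ambiguity, here is a minimal sketch using a hypothetical class (`Both` is made up for this example) that, like a DataFrame, implements both the sequence protocol and `__array__`, with the two views disagreeing:

```python
import numpy as np

class Both:
    """Toy array-like (hypothetical, for illustration only): it is a
    sequence AND implements __array__, and the two views disagree."""

    def __len__(self):
        return 2

    def __getitem__(self, i):
        if not 0 <= i < 2:
            raise IndexError(i)
        # the "sequence view": rows as plain lists
        return [i, i + 1, i + 2]

    def __array__(self, dtype=None, copy=None):
        # the "__array__ view": a 2-D float array of zeros
        return np.zeros((2, 3))

# At the top level, __array__ wins:
top = np.array(Both())               # shape (2, 3), all zeros

# With the consistent behaviour, nesting inside a list also uses
# __array__ for each element, stacking the 2-D views:
nested = np.array([Both(), Both()])  # shape (2, 2, 3) on recent NumPy
```

The point of the change is that the second call goes through `__array__` as well, instead of partially falling back to the sequence protocol during dimension discovery.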

Unfortunately, pandas DataFrames have this behaviour, since they are sequence-like. It kicks in during DataFrame coercion, in the following case:

df1 = pd.DataFrame({"a": [1, 2, 3], "b": [3, 4, 5]})
df2 = pd.DataFrame([df1, df1])

Here df2 is currently coerced to a DataFrame that holds DataFrames as its values. This happens due to the following logic:

        try:
            if is_list_like(values[0]) or hasattr(values[0], 'len'):  # <-- is hit
                # following convert does nothing; `np.array()` then raises an error...
                values = np.array([convert(v) for v in values])
            elif isinstance(values[0], np.ndarray) and values[0].ndim == 0:
                # GH#21861
                values = np.array([convert(v) for v in values])
            else:
                values = convert(values)
        except (ValueError, TypeError):
            values = convert(values)  # <-- Ends up getting called and forces object array.

EDIT: additional code details: convert is a thin wrapper around:

def maybe_convert_platform(values):
    """ try to do platform conversion, allow ndarray or list here """

    if isinstance(values, (list, tuple, range)):
        values = construct_1d_object_array_from_listlike(values)
    # more logic

This takes the first branch (values is a list), which in turn forces a 1-D object array:

def construct_1d_object_array_from_listlike(values):
    # numpy will try to interpret nested lists as further dimensions, hence
    # making a 1D array that contains list-likes is a bit tricky:
    result = np.empty(len(values), dtype='object')
    result[:] = values
    return result
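The `np.empty` + slice-assignment detour exists because calling `np.array` directly on nested lists discovers extra dimensions, whereas assigning into a pre-shaped object array caps the result at 1-D. A standalone sketch with plain lists:

```python
import numpy as np

values = [[1, 2, 3], [4, 5, 6]]

# Direct coercion interprets the nesting as a second dimension:
direct = np.array(values)            # 2-D, shape (2, 3)

# The pandas trick: pre-allocate a 1-D object array and assign into it;
# the assignment limits discovery to the destination's ndim, so each
# element stays the original list object:
result = np.empty(len(values), dtype="object")
result[:] = values                   # 1-D, shape (2,), elements are lists
```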

Because np.array([df1, df1]) raises an error due to the inconsistencies within NumPy, the code ends up calling convert([df1, df1]), which in turn creates a NumPy dtype=object array with the two DataFrames inside. With the new/correct NumPy behaviour, however, np.array([df1, df1]) returns a 3-dimensional array, and that then raises an error because pandas refuses to coerce a 3-D array to a DataFrame.

It seems safest not to try to squeeze this into the upcoming NumPy release (planned in a few days), but I would like to change it in master soon after branching. I am not sure whether you consider the current behaviour important, but it would be nice if you could look into what the final intent should be here. If we (can) change this in NumPy, I am not sure there is a way for pandas to retain the old behaviour.

Comment From: jbrockmendel

However, the new/correct behaviour for NumPy would be that np.array([df1, df1]) will return a 3-dimensional array.

In this case you have df1 twice, but what if you had two dataframes of different shapes in there?

pandas refuses to coerce a 3D array to a DataFrame.

I guess we could coerce to an xarray object

Nested listlikes are a PITA, but it isn't clear that there's a better alternative. Is there something specific we need to fix here? Or is this a "be aware of" kind of thing?

Comment From: seberg

If they have different shapes, things become interesting, since NumPy will automatically give the result fewer dimensions (we are changing that).

Pandas has 3 tests (I think) that would fail if I just do this. The question is whether you see any issue with breaking this behaviour. It seems fairly useless to me, but we cannot really deprecate it, so if pandas users rely on it, it would suddenly break.

In other words: I expect there is nothing you need to do. Unless you want to use it as an excuse to start cleaning up the listlike coercion in general.

Comment From: jreback

I would expect

DataFrame([df1, df2]), no matter the shape of df1 and df2 (same or different), to raise a ValueError; the only way we would support this is if dtype=object is specified.

similar to how this is handled

In [2]: arr                                                                                                                                                                                                                 
Out[2]: 
array([[0, 1, 2, 3, 4],
       [5, 6, 7, 8, 9]])

In [3]: pd.DataFrame([arr, arr])                                                                                                                                                                                            
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-3-0a6417e1d1de> in <module>
----> 1 pd.DataFrame([arr, arr])

~/pandas/pandas/core/frame.py in __init__(self, data, index, columns, dtype, copy)
    467                     mgr = arrays_to_mgr(arrays, columns, index, columns, dtype=dtype)
    468                 else:
--> 469                     mgr = init_ndarray(data, index, columns, dtype=dtype, copy=copy)
    470             else:
    471                 mgr = init_dict({}, index, columns, dtype=dtype)

~/pandas/pandas/core/internals/construction.py in init_ndarray(values, index, columns, dtype, copy)
    155     # by definition an array here
    156     # the dtypes will be coerced to a single dtype
--> 157     values = prep_ndarray(values, copy=copy)
    158 
    159     if dtype is not None:

~/pandas/pandas/core/internals/construction.py in prep_ndarray(values, copy)
    279         values = values.reshape((values.shape[0], 1))
    280     elif values.ndim != 2:
--> 281         raise ValueError("Must pass 2-d input")
    282 
    283     return values

ValueError: Must pass 2-d input

this is just too magical

In [4]: pd.DataFrame([pd.DataFrame(arr), pd.DataFrame(arr)])                                                                                                                                                                
Out[4]: 
                                                   0
0     0  1  2  3  4
0  0  1  2  3  4
1  5  6  7  ...
1     0  1  2  3  4
0  0  1  2  3  4
1  5  6  7  ...

so I think we should actually deprecate / change the current behavior now.

Comment From: seberg

Just a heads up, I have rebased that change in NumPy gh-14995, and would hope that fixing up pandas for it will be simple enough. It would be nice to get it over with (supporting such weird behaviours is just a pain moving forward). If you have concerns or we end up merging it and it is hard to catch up, let me know and we can revert...

Comment From: jbrockmendel

Just ran the test suite on that branch and only found 2 failures, both of which look like we're doing something sketchy that can be fixed on our end without too much trouble. Thanks for the heads up.

Comment From: jbrockmendel

@seberg any chance this has resolved itself in the interim?