Pandas BUG: inconsistent behavior and crash in DataFrame.__setitem__ when >=3d ndarray is used

Pandas version checks

[X] I have checked that this issue has not already been reported.
[X] I have confirmed this bug exists on the latest version of pandas.
[X] I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
import numpy as np

# BUG Behavior 1:

df = pd.DataFrame(np.zeros((4, 1)))
# Raises, as expected:
# ValueError: Expected a 1D array, got an array with shape (4, 2)
try:
    df['A'] = np.zeros((4, 2))
except ValueError as err:
    print(err)
# No error here, which is not expected
df["A"] = np.zeros((4, 2, 3))
print(df)  # the exception is raised here

'''
>>> print(df)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/azuk/pandasdev/pandas/pandas/core/frame.py", line 1096, in __repr__
    return self.to_string(**repr_params)
  File "/home/azuk/pandasdev/pandas/pandas/core/frame.py", line 1273, in to_string
    return fmt.DataFrameRenderer(formatter).to_string(
  File "/home/azuk/pandasdev/pandas/pandas/io/formats/format.py", line 1099, in to_string
    string = string_formatter.to_string()
  File "/home/azuk/pandasdev/pandas/pandas/io/formats/string.py", line 30, in to_string
    text = self._get_string_representation()
  File "/home/azuk/pandasdev/pandas/pandas/io/formats/string.py", line 45, in _get_string_representation
    strcols = self._get_strcols()
  File "/home/azuk/pandasdev/pandas/pandas/io/formats/string.py", line 36, in _get_strcols
    strcols = self.fmt.get_strcols()
  File "/home/azuk/pandasdev/pandas/pandas/io/formats/format.py", line 614, in get_strcols
    strcols = self._get_strcols_without_index()
  File "/home/azuk/pandasdev/pandas/pandas/io/formats/format.py", line 878, in _get_strcols_without_index
    fmt_values = self.format_col(i)
  File "/home/azuk/pandasdev/pandas/pandas/io/formats/format.py", line 892, in format_col
    return format_array(
  File "/home/azuk/pandasdev/pandas/pandas/io/formats/format.py", line 1295, in format_array
    return fmt_obj.get_result()
  File "/home/azuk/pandasdev/pandas/pandas/io/formats/format.py", line 1328, in get_result
    fmt_values = self._format_strings()
  File "/home/azuk/pandasdev/pandas/pandas/io/formats/format.py", line 1576, in _format_strings
    return list(self.get_result_as_array())
  File "/home/azuk/pandasdev/pandas/pandas/io/formats/format.py", line 1543, in get_result_as_array
    formatted_values = format_values_with(float_format)
  File "/home/azuk/pandasdev/pandas/pandas/io/formats/format.py", line 1523, in format_values_with
    result = _trim_zeros_float(values, self.decimal)
  File "/home/azuk/pandasdev/pandas/pandas/io/formats/format.py", line 1972, in _trim_zeros_float
    while should_trim(trimmed):
  File "/home/azuk/pandasdev/pandas/pandas/io/formats/format.py", line 1969, in should_trim
    numbers = [x for x in values if is_number_with_decimal(x)]
  File "/home/azuk/pandasdev/pandas/pandas/io/formats/format.py", line 1969, in <listcomp>
    numbers = [x for x in values if is_number_with_decimal(x)]
  File "/home/azuk/pandasdev/pandas/pandas/io/formats/format.py", line 1959, in is_number_with_decimal
    return re.match(number_regex, x) is not None
  File "/home/azuk/.conda/envs/pandas-dev/lib/python3.10/re.py", line 190, in match
    return _compile(pattern, flags).match(string)
TypeError: cannot use a string pattern on a bytes-like object
'''

# BUG Behavior 2:

df = pd.DataFrame(np.zeros((4, 1)))
# OK
df['A'] = np.zeros((4, 1))
# No error, which is not expected here
# Expected: ValueError: Expected a 1D array, got an array with shape (4, 2)
df['A'] = np.zeros((4, 2))

Issue Description

The dimension of the input array is only checked in BlockManager.insert.

https://github.com/pandas-dev/pandas/blob/f5a5c8d7f0d1501e5d8ff31b3b5f24c916137d9c/pandas/core/internals/managers.py#L1404-L1409

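The check only covers 2-D ndarrays, so an ndarray with 3 or more dimensions can be assigned, and the DataFrame then crashes later (for example, when it is printed).
The check also runs only when a new column is inserted, so replacing the values of an existing column does not raise an exception.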
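
As a minimal user-side sketch of the two gaps described above (this is not the pandas implementation, and not the change made in any PR), a guard could reject the offending shapes before the assignment happens. The helper name `set_column_checked` is hypothetical, and accepting a single-column 2-D array is an assumption taken from the repro, where `np.zeros((4, 1))` is treated as fine.

```python
import numpy as np
import pandas as pd


def set_column_checked(df, key, value):
    """Hypothetical guard: validate ndarray shape, then assign df[key]."""
    if isinstance(value, np.ndarray):
        too_many_dims = value.ndim > 2
        multi_column_2d = value.ndim == 2 and value.shape[1] != 1
        if too_many_dims or multi_column_2d:
            raise ValueError(
                f"Expected a 1D array, got an array with shape {value.shape}"
            )
        if value.ndim == 2:
            # Collapse an (n, 1) array to 1-D so the stored column is unambiguous.
            value = value[:, 0]
    df[key] = value
    return df


df = pd.DataFrame(np.zeros((4, 1)))
set_column_checked(df, "A", np.zeros((4, 1)))  # accepted

for bad in (np.zeros((4, 2)), np.zeros((4, 2, 3))):
    try:
        # "A" already exists, so this is the replace path, not the insert path.
        set_column_checked(df, "A", bad)
    except ValueError as err:
        print(err)
```

Because the check runs before `df[key] = value`, it applies both when the column is new and when it already exists, which is the consistency the report asks for.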

This issue has the same root cause as #51925.

Expected Behavior

A ValueError should be raised in both situations.

Installed Versions

INSTALLED VERSIONS
------------------
commit : f5a5c8d7f0d1501e5d8ff31b3b5f24c916137d9c
python : 3.10.11.final.0
python-bits : 64
OS : Linux
OS-release : 5.15.0-70-generic
Version : #77-Ubuntu SMP Tue Mar 21 14:02:37 UTC 2023
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 2.1.0.dev0+828.gf5a5c8d7f0
numpy : 1.24.3
pytz : 2023.3
dateutil : 2.8.2
setuptools : 67.7.2
pip : 23.1.2
Cython : 0.29.33
pytest : 7.3.1
hypothesis : 6.75.3
sphinx : 6.2.1
blosc : None
feather : None
xlsxwriter : 3.1.1
lxml.etree : 4.9.2
html5lib : 1.1
pymysql : 1.0.3
psycopg2 : 2.9.3
jinja2 : 3.1.2
IPython : 8.13.2
pandas_datareader: None
bs4 : 4.12.2
bottleneck : 1.3.7
brotli :
fastparquet : 2023.4.0
fsspec : 2023.5.0
gcsfs : 2023.5.0
matplotlib : 3.7.1
numba : 0.57.0
numexpr : 2.8.4
odfpy : None
openpyxl : 3.1.2
pandas_gbq : None
pyarrow : 12.0.0
pyreadstat : 1.2.1
pyxlsb : 1.0.10
s3fs : 2023.5.0
scipy : 1.10.1
snappy :
sqlalchemy : 2.0.15
tables : 3.8.0
tabulate : 0.9.0
xarray : 2023.5.0
xlrd : 2.0.1
zstandard : 0.19.0
tzdata : 2023.3
qtpy : None
pyqt5 : None

Comment From: adrien-berchet

I had the same issue, anything new on this?

Also, note that it only happens for numpy.ndarray objects. Converting the array to a list works properly:

df = pd.DataFrame({"a": np.zeros(4)})
df["b"] = np.zeros((4, 2, 3))

crashes as reported while

df = pd.DataFrame({"a": np.zeros(4)})
df["b"] = np.zeros((4, 2, 3)).tolist()

works and the DF is:

In [2]: df
Out[2]: 
     a                                   b
0  0.0  [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
1  0.0  [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
2  0.0  [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
3  0.0  [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]

which is what I wanted to achieve before I got this issue.

Finally, the constructor detects the problem and raises a more understandable error:

pd.DataFrame({"a": np.zeros(4), "b": np.zeros((4, 2, 3))})
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[3], line 1
----> 1 pd.DataFrame({"a": np.zeros(4), "b": np.zeros((4, 2, 3))})

File ~/.virtualenvs/AxonSynthesis/lib/python3.10/site-packages/pandas/core/frame.py:664, in DataFrame.__init__(self, data, index, columns, dtype, copy)
    658     mgr = self._init_mgr(
    659         data, axes={"index": index, "columns": columns}, dtype=dtype, copy=copy
    660     )
    662 elif isinstance(data, dict):
    663     # GH#38939 de facto copy defaults to False only in non-dict cases
--> 664     mgr = dict_to_mgr(data, index, columns, dtype=dtype, copy=copy, typ=manager)
    665 elif isinstance(data, ma.MaskedArray):
    666     import numpy.ma.mrecords as mrecords

File ~/.virtualenvs/AxonSynthesis/lib/python3.10/site-packages/pandas/core/internals/construction.py:493, in dict_to_mgr(data, index, columns, dtype, typ, copy)
    489     else:
    490         # dtype check to exclude e.g. range objects, scalars
    491         arrays = [x.copy() if hasattr(x, "dtype") else x for x in arrays]
--> 493 return arrays_to_mgr(arrays, columns, index, dtype=dtype, typ=typ, consolidate=copy)

File ~/.virtualenvs/AxonSynthesis/lib/python3.10/site-packages/pandas/core/internals/construction.py:118, in arrays_to_mgr(arrays, columns, index, dtype, verify_integrity, typ, consolidate)
    115 if verify_integrity:
    116     # figure out the index, if necessary
    117     if index is None:
--> 118         index = _extract_index(arrays)
    119     else:
    120         index = ensure_index(index)

File ~/.virtualenvs/AxonSynthesis/lib/python3.10/site-packages/pandas/core/internals/construction.py:653, in _extract_index(data)
    651         raw_lengths.append(len(val))
    652     elif isinstance(val, np.ndarray) and val.ndim > 1:
--> 653         raise ValueError("Per-column arrays must each be 1-dimensional")
    655 if not indexes and not raw_lengths:
    656     raise ValueError("If using all scalar values, you must pass an index")

ValueError: Per-column arrays must each be 1-dimensional

(and again, casting to a list also works properly in this case)

EDIT: I can reproduce this issue with pandas==1.5.3 and pandas==2.2.1.

Comment From: determ1ne

The last code snippet, pd.DataFrame({"a": np.zeros(4), "b": np.zeros((4, 2, 3))}), worked properly and raised the corresponding error. Converting to a list works because pandas treats a list as a 1-D sequence, with each element stored as a single object.
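
To illustrate the point about lists, here is a small sketch (the variable names are only for illustration). It assumes that passing `list(arr)` instead of `arr.tolist()` is acceptable, so each cell holds a 2-D ndarray rather than nested Python lists.

```python
import numpy as np
import pandas as pd

arr = np.zeros((4, 2, 3))

df = pd.DataFrame({"a": np.zeros(4)})
# list(arr) is a plain Python list of four (2, 3) ndarrays, so pandas sees
# a 1-D sequence of length 4 and stores each sub-array as one object cell.
df["b"] = list(arr)

print(df["b"].dtype)
print(df["b"].iloc[0].shape)

# The original 3-D array can be rebuilt by stacking the cells again.
restored = np.stack(df["b"].tolist())
print(restored.shape)
```

The column ends up with object dtype, so vectorized numeric operations on it are not available; this is a storage workaround rather than real N-D support.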

I didn't dig into how pandas handles memory when I opened PR #53367, but the PR does block the invalid DataFrame.__setitem__ call, so that it is consistent with the ValueError you mentioned.

Comment From: ebo

I have started working with CryoSat-2 data, and the RADAR waveform is 3D. While I can convert it with tolist(), it would be nice if there were some way to use non-1-dimensional data within a DataFrame.

For reference, here is a trivial script to replicate. The data is available from ESA https://earth.esa.int/eogateway/missions/cryosat/data.

import pandas as pd
from netCDF4 import Dataset

fname = "CS_OFFL_SIR_SAR_1B_20220302T004121_20220302T004737_E001.nc"
with Dataset(fname, mode='r') as CS:
    test_df = pd.DataFrame({
        'Waveform' : CS.variables['pwr_waveform_20_ku'][:]
    })

This gives the error: "ValueError: Per-column arrays must each be 1-dimensional"

If anyone knows of any tricks to get Pandas to work with multidimensional data without casting to a list, please let me know.
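
One possible approach, sketched here with synthetic data rather than the CryoSat-2 file, is to flatten the trailing dimensions into columns under a MultiIndex. The array shape and the level names "i" and "j" are assumptions made for illustration; nothing here is specific to netCDF4 or to the `pwr_waveform_20_ku` variable.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a 3-D variable: 4 records, each a 2 x 3 block.
waveform = np.arange(24, dtype=float).reshape(4, 2, 3)
n_rows, n_i, n_j = waveform.shape

# One column per (i, j) position of the trailing dimensions.
columns = pd.MultiIndex.from_product(
    [range(n_i), range(n_j)], names=["i", "j"]
)
wide = pd.DataFrame(waveform.reshape(n_rows, -1), columns=columns)

# Select position (1, 2) of every record as an ordinary numeric column.
print(wide[(1, 2)])

# Reshape back to the original 3-D layout when needed.
restored = wide.to_numpy().reshape(waveform.shape)
```

This keeps the data numeric, so ordinary DataFrame operations still apply, at the cost of a wide frame when the trailing dimensions are large.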