Pandas BUG: setting column with 2D object array raises

Research

[x] I have searched the [pandas] tag on StackOverflow for similar questions.
[x] I have asked my usage related question on StackOverflow.

Link to question on StackOverflow

https://stackoverflow.com/questions/79457029/setting-pandas-dataframe-column-with-numpy-object-array-causes-error/

Question about pandas

I found that setting a pandas DataFrame column with a 2D NumPy array whose dtype is object causes a weird error. I wonder why this happens.

The code I ran is as follows:

import numpy as np
import pandas as pd

print(f"numpy version: {np.__version__}")
print(f"pandas version: {pd.__version__}")

data = pd.DataFrame({
    "c1": [1, 2, 3, 4, 5],
})

t1 = np.array([["A"], ["B"], ["C"], ["D"], ["E"]])
data["c1"] = t1 # This works well

t2 = np.array([["A"], ["B"], ["C"], ["D"], ["E"]], dtype=object)
data["c1"] = t2 # This throws an error

Result (some unrelated path removed):

numpy version: 2.2.3
pandas version: 2.2.3
Traceback (most recent call last):
  File "...\test.py", line 15, in <module>
    data["c1"] = t2 # This throws an error
    ~~~~^^^^^^
  File "...\Anaconda\envs\test\Lib\site-packages\pandas\core\frame.py", line 4311, in __setitem__
    self._set_item(key, value)
    ~~~~~~~~~~~~~~^^^^^^^^^^^^
  File "...\Anaconda\envs\test\Lib\site-packages\pandas\core\frame.py", line 4524, in _set_item
    value, refs = self._sanitize_column(value)
                  ~~~~~~~~~~~~~~~~~~~~~^^^^^^^
  File "...\Anaconda\envs\test\Lib\site-packages\pandas\core\frame.py", line 5267, in _sanitize_column
    arr = sanitize_array(value, self.index, copy=True, allow_2d=True)
  File "...\Anaconda\envs\test\Lib\site-packages\pandas\core\construction.py", line 606, in sanitize_array
    subarr = maybe_infer_to_datetimelike(data)
  File "...\Anaconda\envs\test\Lib\site-packages\pandas\core\dtypes\cast.py", line 1181, in maybe_infer_to_datetimelike
    raise ValueError(value.ndim)  # pragma: no cover
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: 2

I'm not sure whether it is the expected behaviour. I find it strange because simply adding dtype=object will cause the error.

Comment From: Abhibhav2003

Hey @tonyyuyiding,

By default, NumPy detects that all elements are strings and assigns dtype='<U1', meaning Unicode strings of length 1.

But when you explicitly set dtype=object, each element is treated as a general Python object rather than a NumPy-native type. Instead of storing the values directly in a contiguous block of memory, NumPy stores pointers (references) to Python objects. This makes dtype=object behave differently from other NumPy data types.

When you assign t2 with data["c1"] = t2, pandas expects a 1-D array. However, t2 is technically a nested structure (a 2D array where each element is a separate Python object holding a list-like value). This conflicts with pandas' column format, leading to an error.

What is the fix? You can flatten t2 into a 1D structure with the ravel() method.

Just like this:
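
The snippet that followed this comment was not preserved. A minimal sketch of the ravel() workaround, assuming the same data and t2 as in the question, might look like this:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({"c1": [1, 2, 3, 4, 5]})
t2 = np.array([["A"], ["B"], ["C"], ["D"], ["E"]], dtype=object)

# Flatten the (5, 1) object array to shape (5,) before assigning it.
data["c1"] = t2.ravel()
print(data)
```

This only sidesteps the error; it does not explain why the 2D object array is rejected in the first place.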

Comment From: tonyyuyiding

Thanks for the explanation! I have a further question. You mentioned that pandas expects a 1-D array, but I think t1 and t2 are both 2D. Why can we assign t1 to a column but not t2? Is it because "dtype=object behave differently from other NumPy data types"?
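
A quick check, sketched here with the t1 and t2 arrays from the question, suggests that both are 2D with shape (5, 1) and differ only in dtype, and that each element of t2 is a plain str:

```python
import numpy as np

t1 = np.array([["A"], ["B"], ["C"], ["D"], ["E"]])
t2 = np.array([["A"], ["B"], ["C"], ["D"], ["E"]], dtype=object)

# Both arrays have the same dimensionality and shape.
print(t1.ndim, t1.shape)
print(t2.ndim, t2.shape)

# Only the dtype differs: fixed-width unicode vs. generic Python objects.
print(t1.dtype)
print(t2.dtype)

# Indexing a single cell of t2 returns a str object.
print(type(t2[0, 0]))
```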

Comment From: Abhibhav2003

Even though t1 is 2D, it contains a single column. Pandas automatically reshapes it to 1D when assigning to a DataFrame column.

But in the case of t2, Pandas sees that each element in t2 is an arbitrary Python object (["A"], ["B"]), not a simple string. Since dtype=object, Pandas does NOT automatically reshape it. The shape mismatch causes an assignment error.

This is because t2 contains references to Python objects, not just direct values.

Comment From: rhshadrach

Thanks for the report!

> Pandas sees that each element in t2 is an arbitrary Python object (["A"], ["B"]), not a simple string. Since dtype=object, Pandas does NOT automatically reshape it.

@Abhibhav2003 - what are you basing this off of?

This looks like a bug to me. In the object case, pandas calls maybe_infer_to_datetimelike which raises on ndim != 1 with the comment # Caller is responsible. Further investigations are welcome!

Comment From: tonyyuyiding

I also think this looks like a bug now. At least the error message can be more informative. I'm reading through the source code and trying to find what's happening.

Comment From: chilin0525

FYI, I tested the following case, and it works when assigning values to multiple columns.

import numpy as np
import pandas as pd

data = pd.DataFrame({
    "c1": [1, 2, 3, 4, 5],
    "c2": [1, 2, 3, 4, 5],
})
t3 = np.array([["A", "F"], ["B", "G"], ["C", "H"], ["D", "I"], ["E", "J"]], dtype=object)
data[["c1", "c2"]] = t3

Comment From: tonyyuyiding

Thanks for the information!

I also found another strange behavior:

import numpy as np
import pandas as pd

data = pd.DataFrame({
    "c1": [1, 2, 3, 4, 5],
})

t = np.array([[["A"]], [["B"]], [["C"]], [["D"]], [["E"]]]) # shape: (5, 1, 1). dtype is not set to object
data["c1"] = t # error

Here's what I get:

Traceback (most recent call last):
  File ".../test.py", line 9, in <module>
    data["c1"] = t
    ~~~~^^^^^^
  File ".../site-packages/pandas/core/frame.py", line 4185, in __setitem__
    self._set_item(key, value)
  File ".../site-packages/pandas/core/frame.py", line 4391, in _set_item
    self._set_item_mgr(key, value, refs)
  File ".../site-packages/pandas/core/frame.py", line 4360, in _set_item_mgr
    self._iset_item_mgr(loc, value, refs=refs)
  File ".../site-packages/pandas/core/frame.py", line 4349, in _iset_item_mgr
    self._mgr.iset(loc, value, inplace=inplace, refs=refs)
  File ".../site-packages/pandas/core/internals/managers.py", line 1231, in iset
    raise AssertionError(
AssertionError: Shape of new values must be compatible with manager shape

I wonder whether this is the expected behavior. It seems the error message could be more meaningful when the array has 3 or more dimensions. Besides, I don't know whether the original design intended a 2D NumPy array to be accepted when setting a single column.
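
Until the behavior is settled, one user-side option is to normalize the array before assigning it. The sketch below uses a hypothetical helper, as_column_values (not part of pandas), which flattens arrays of shape (n, 1, ..., 1) to 1D and otherwise raises an error that names the offending shape:

```python
import numpy as np
import pandas as pd


def as_column_values(values, n_rows):
    """Hypothetical helper: flatten (n, 1, ..., 1) arrays to 1D or fail clearly."""
    arr = np.asarray(values)
    # Accept only arrays whose first axis matches the row count and whose
    # remaining axes all have length 1 (so the total size equals n_rows).
    if arr.ndim == 0 or arr.shape[0] != n_rows or arr.size != n_rows:
        raise ValueError(
            f"cannot set a column of length {n_rows} "
            f"from an array of shape {arr.shape}"
        )
    return arr.reshape(n_rows)


data = pd.DataFrame({"c1": [1, 2, 3, 4, 5]})
t = np.array([[["A"]], [["B"]], [["C"]], [["D"]], [["E"]]])  # shape (5, 1, 1)

# Assign the flattened 1D values instead of the 3D array.
data["c1"] = as_column_values(t, len(data))
print(data["c1"].tolist())
```

The same helper also accepts the (5, 1) object array from the original report, since it only looks at the shape and not the dtype.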

Comment From: tonyyuyiding

take