Research
- [x] I have searched the [pandas] tag on StackOverflow for similar questions.
- [x] I have asked my usage related question on StackOverflow.
Link to question on StackOverflow
https://stackoverflow.com/questions/79457029/setting-pandas-dataframe-column-with-numpy-object-array-causes-error/
Question about pandas
I found that setting a pandas DataFrame column with a 2D numpy array whose dtype is object causes a weird error. I wonder why this happens.
The code I ran is as follows:
import numpy as np
import pandas as pd
print(f"numpy version: {np.__version__}")
print(f"pandas version: {pd.__version__}")
data = pd.DataFrame({
"c1": [1, 2, 3, 4, 5],
})
t1 = np.array([["A"], ["B"], ["C"], ["D"], ["E"]])
data["c1"] = t1 # This works well
t2 = np.array([["A"], ["B"], ["C"], ["D"], ["E"]], dtype=object)
data["c1"] = t2 # This throws an error
Result (some unrelated path removed):
numpy version: 2.2.3
pandas version: 2.2.3
Traceback (most recent call last):
File "...\test.py", line 15, in <module>
data["c1"] = t2 # This throws an error
~~~~^^^^^^
File "...\Anaconda\envs\test\Lib\site-packages\pandas\core\frame.py", line 4311, in __setitem__
self._set_item(key, value)
~~~~~~~~~~~~~~^^^^^^^^^^^^
File "...\Anaconda\envs\test\Lib\site-packages\pandas\core\frame.py", line 4524, in _set_item
value, refs = self._sanitize_column(value)
~~~~~~~~~~~~~~~~~~~~~^^^^^^^
File "...\Anaconda\envs\test\Lib\site-packages\pandas\core\frame.py", line 5267, in _sanitize_column
arr = sanitize_array(value, self.index, copy=True, allow_2d=True)
File "...\Anaconda\envs\test\Lib\site-packages\pandas\core\construction.py", line 606, in sanitize_array
subarr = maybe_infer_to_datetimelike(data)
File "...\Anaconda\envs\test\Lib\site-packages\pandas\core\dtypes\cast.py", line 1181, in maybe_infer_to_datetimelike
raise ValueError(value.ndim) # pragma: no cover
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: 2
I'm not sure whether this is the expected behaviour. I find it strange because simply adding dtype=object causes the error.
Comment From: Abhibhav2003
Hey @tonyyuyiding,
By default, NumPy detects that all elements are strings and assigns dtype='<U1', meaning Unicode strings of length 1.
But when you explicitly set the dtype to object, each element is treated as a general Python object rather than a NumPy-native type. Instead of storing the values directly in a contiguous block of memory, NumPy stores pointers (references) to Python objects. This makes dtype=object behave differently from other NumPy data types.
When you assign t2 with data["c1"] = t2, pandas expects a 1-D array. However, t2 is technically a nested structure (a 2D array where each element is a separate Python object holding a list-like value). This conflicts with pandas' column format, leading to an error.
What is the fix? You can flatten t2 into a 1D structure by calling ravel() on it before the assignment.
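A minimal sketch of the ravel() workaround suggested above, reusing the arrays from the original report. The name flat is introduced here only for illustration; this sidesteps the error and does not explain it.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({"c1": [1, 2, 3, 4, 5]})
t2 = np.array([["A"], ["B"], ["C"], ["D"], ["E"]], dtype=object)

# ravel() flattens the (5, 1) array to shape (5,), so pandas receives 1D values.
# "flat" is an illustrative name, not something from the original report.
flat = t2.ravel()
data["c1"] = flat

print(flat.shape)
print(data["c1"].tolist())
```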
Comment From: tonyyuyiding
Thanks for the explanation! I have a further question. You mentioned that pandas expects a 1-D array, but I think t1 and t2 are both 2D. Why can we assign t1 to a column but not t2? Is it because "dtype=object behave differently from other NumPy data types"?
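To make the comparison concrete, here is a small check of what the two arrays actually look like. It only inspects NumPy attributes and makes no claim about what pandas does internally.

```python
import numpy as np

t1 = np.array([["A"], ["B"], ["C"], ["D"], ["E"]])
t2 = np.array([["A"], ["B"], ["C"], ["D"], ["E"]], dtype=object)

# Both arrays are 2D with shape (5, 1); only the dtype differs.
print(t1.shape, t1.ndim, t1.dtype)
print(t2.shape, t2.ndim, t2.dtype)

# Each element of t2 is a plain Python str, not a list.
print(type(t2[0, 0]))
```

So the two inputs have the same shape and dimensionality, and the only difference visible from the NumPy side is the dtype.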
Comment From: Abhibhav2003
Even though t1 is 2D, it contains a single column. Pandas automatically reshapes it to 1D when assigning to a DataFrame column.
But in the case of t2, Pandas sees that each element in t2 is an arbitrary Python object (["A"], ["B"]), not a simple string. Since dtype=object, Pandas does NOT automatically reshape it. The shape mismatch causes an assignment error.
This is because t2 contains references to Python objects, not just direct values.
Comment From: rhshadrach
Thanks for the report!
> Pandas sees that each element in t2 is an arbitrary Python object (["A"], ["B"]), not a simple string. Since dtype=object, Pandas does NOT automatically reshape it.
@Abhibhav2003 - what are you basing this off of?
This looks like a bug to me. In the object case, pandas calls maybe_infer_to_datetimelike which raises on ndim != 1 with the comment # Caller is responsible. Further investigations are welcome!
Comment From: tonyyuyiding
I also think this looks like a bug now. At the very least, the error message could be more informative. I'm reading through the source code to find out what's happening.
Comment From: chilin0525
FYI, I tested the following case, and it works when assigning values to multiple columns.
data = pd.DataFrame({
"c1": [1, 2, 3, 4, 5],
"c2": [1, 2, 3, 4, 5],
})
t3 = np.array([["A", "F"], ["B", "G"], ["C", "H"], ["D", "I"], ["E", "J"]], dtype=object)
data[["c1", "c2"]] = t3
Comment From: tonyyuyiding
Thanks for the information!
I also found another strange behavior:
import numpy as np
import pandas as pd
data = pd.DataFrame({
"c1": [1, 2, 3, 4, 5],
})
t = np.array([[["A"]], [["B"]], [["C"]], [["D"]], [["E"]]]) # shape: (5, 1, 1). dtype is not set to object
data["c1"] = t # error
Here's what I get:
Traceback (most recent call last):
File ".../test.py", line 9, in <module>
data["c1"] = t
~~~~^^^^^^
File ".../site-packages/pandas/core/frame.py", line 4185, in __setitem__
self._set_item(key, value)
File ".../site-packages/pandas/core/frame.py", line 4391, in _set_item
self._set_item_mgr(key, value, refs)
File ".../site-packages/pandas/core/frame.py", line 4360, in _set_item_mgr
self._iset_item_mgr(loc, value, refs=refs)
File ".../site-packages/pandas/core/frame.py", line 4349, in _iset_item_mgr
self._mgr.iset(loc, value, inplace=inplace, refs=refs)
File ".../site-packages/pandas/core/internals/managers.py", line 1231, in iset
raise AssertionError(
AssertionError: Shape of new values must be compatible with manager shape
I wonder whether it is the expected behavior. It seems that there can be more meaningful error messages when the array's dimension >= 3. Besides, I have no idea whether a 2D numpy array should be accepted when setting a column in the original design.
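Until that question is settled, one way to sidestep the shape problem is to collapse the array to 1D before assigning. This is only a sketch of a workaround, assuming the trailing axes all have length 1; the name flat is illustrative.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({"c1": [1, 2, 3, 4, 5]})
t = np.array([[["A"]], [["B"]], [["C"]], [["D"]], [["E"]]])  # shape: (5, 1, 1)

# Collapse the trailing length-1 axes into a single axis of length len(data).
# reshape raises a ValueError if the total number of elements does not match.
flat = t.reshape(len(data))
data["c1"] = flat

print(flat.shape)
print(data["c1"].tolist())
```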
Comment From: tonyyuyiding
take