Pandas version checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of pandas.
- [X] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import pandas as pd

# Case 1: list containing a string, dtype=int
try:
    s = pd.Series([1, 2, 'x', 4, 5], dtype=int)
    print("1. No exception")
except Exception as ex:
    print("1. Exception thrown", ex)

# Case 2: Series containing a string, dtype=int
s = pd.Series([1, 2, 'x', 4, 5])
try:
    s2 = pd.Series(s, dtype=int)
    print("2. No exception")
except Exception as ex:
    print("2. Exception thrown", ex)
print("s2 dtype", s2.dtype)
Issue Description
The behavior of the above code doesn't seem right.
If I try to create a Series with dtype=int using a list that contains a string, it throws a ValueError, as expected.
But if the input to the Series constructor is another Series that contains a string, there is no exception and it quietly returns a new Series with dtype=object.
Is this expected? If so, it's very counterintuitive. Perhaps the recommended way is s.astype(int), but it seems like the constructor should produce the same behavior.
Expected Behavior
In both of the cases above, it should throw a ValueError. Instead, only the first case does:
1. Exception thrown invalid literal for int() with base 10: 'x'
2. No exception
s2 dtype object
Installed Versions
Comment From: chaoyihu
@jonmooser Thanks for raising this issue! If I understand it correctly, you are trying to change the dtype of a Series that is already initialized.
I agree it is counter-intuitive that s2 = pd.Series(s, dtype=int) ignores the customized dtype silently without raising.
However, I don't think the constructor should be used for casting. As you have mentioned, a better way of doing this might be Series.astype, in which case a ValueError is raised as expected:
>>> s = pd.Series([1, 2, 'x', 4, 5]) # dtype of s: object
>>> s3 = s.astype('int64')
ValueError: invalid literal for int() with base 10: 'x'
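As a hedged illustration of the astype route described above, the sketch below wraps the cast in a small helper. The name `strict_cast` is hypothetical and not part of pandas; the sketch only assumes that `Series.astype` raises `ValueError` for values that cannot be parsed as integers, as the traceback above shows.

```python
import pandas as pd


def strict_cast(data, dtype):
    """Hypothetical helper (not a pandas API): build a Series first,
    then cast explicitly so that invalid values raise instead of being kept."""
    return pd.Series(data).astype(dtype)


s = pd.Series([1, 2, "x", 4, 5])  # mixed values, so dtype is object

try:
    strict_cast(s, "int64")
    raised = False
except ValueError as ex:
    # 'x' cannot be converted to an integer
    raised = True
    print("astype raised:", ex)

# A Series whose values are all convertible casts without error
converted = strict_cast(pd.Series([1.0, 2.0]), "int64")
print(converted.dtype)
```

Separating construction from casting keeps the failure explicit regardless of how the constructor treats the dtype argument.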
Comment From: jonmooser
Thanks for looking into this. Glad there is a workaround, but it would be great if future versions were more consistent and intuitive.
Comment From: taranarmo
The constructor doesn't use dtype keyword if Series passed in, source code link
elif isinstance(data, Series):
    if index is None:
        index = data.index
        data = data._mgr.copy(deep=False)
    else:
        data = data.reindex(index)
        copy = False
        data = data._mgr
I guess a warning could be raised if data is a Series and dtype is not None. If so, I can make a PR.
Personally, I'd cast data to the requested dtype so that the code fails explicitly, but I assume that's a breaking change and could cause problems for (some) users.
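To make the warning idea concrete, here is a minimal user-level sketch, not a patch to pandas itself. The wrapper name `checked_series` is hypothetical, and the choice of `UserWarning` is an assumption. Depending on the pandas version, the constructor call inside may raise on its own, in which case the warning branch is never reached.

```python
import warnings

import pandas as pd
from pandas.api.types import pandas_dtype


def checked_series(data, dtype=None):
    """Hypothetical wrapper (not a pandas API): construct a Series and
    warn when the resulting dtype differs from the requested one."""
    result = pd.Series(data, dtype=dtype)
    if dtype is not None and result.dtype != pandas_dtype(dtype):
        warnings.warn(
            f"requested dtype {pandas_dtype(dtype)} but got {result.dtype}",
            UserWarning,  # assumption: the warning category is illustrative
            stacklevel=2,
        )
    return result


# A float Series can be converted, so the requested dtype is honored
ok = checked_series(pd.Series([1.0, 2.0]), dtype="int64")
print(ok.dtype)
```

A warning keeps existing code running while surfacing the mismatch; casting and raising would be stricter but, as noted above, potentially breaking.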
Comment From: jorisvandenbossche
> The constructor doesn't use dtype keyword if Series passed in, source code link
The dtype is still used later on in that case:
https://github.com/pandas-dev/pandas/blob/39bd3d38ac97177c22e68a9259bf4f09f7315277/pandas/core/series.py#L502
So while the dtype argument indeed seems to be ignored in certain cases, as the example in the top post shows, it is not always ignored. Quick example:
>>> ser_float = pd.Series([1.0], dtype=float)
>>> ser_float
0 1.0
dtype: float64
>>> pd.Series(ser_float, dtype="int64")
0 1
dtype: int64
> I agree it is counter-intuitive that s2 = pd.Series(s, dtype=int) ignores the customized dtype silently without raising.
I fully agree this is counter-intuitive, and as shown above also inconsistent. So I think we should rather try to fix this.
When passing a dtype argument, some level of casting being performed in the constructor is unavoidable I think (although I agree that for the explicit action of casting a Series to a different dtype, using ser.astype(..) is the better and clearer option).
Comment From: taranarmo
True, I missed that later use of dtype. Though inside sanitize_array, if data is a Series, it is converted to a NumPy array (since hasattr(Series, "__array__") is true) and sanitize_array is called again, after which it goes into _try_cast (link):
if dtype == object:
    if not is_ndarray:
        subarr = construct_1d_object_array_from_listlike(arr)
        return subarr
    return ensure_wrapped_if_datetimelike(arr).astype(dtype, copy=copy)
I agree that this should be considered a bug, as the current behavior is more like "dtype might be ignored". My last change should be reverted, I suppose.
Comment From: taranarmo
Maybe we should make dtype be ignored if data is a Series/DataFrame. It is simple to fix and would steer users away from creating a Series from a Series while casting to a different dtype.
Comment From: jbrockmendel
I'm seeing both (correctly) raise on main and 2.3, but the OP behavior in 2.2. @jonmooser can you confirm?
Comment From: jonmooser
Hmm... I just upgraded to pandas 2.3.0 (and numpy 2.3.1) and still see no exception in the second case. Any idea why we might be seeing different behavior from the same version?
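When two people report different results on nominally the same version, it can help to post the versions actually loaded at runtime next to the observed outcome. The snippet below is a minimal sketch for that purpose; the function name `second_case_outcome` and the `report` dictionary are illustrative only, and the outcome string depends on the installed pandas.

```python
import numpy as np
import pandas as pd


def second_case_outcome():
    """Run the second case from the top post and describe what happened."""
    s = pd.Series([1, 2, "x", 4, 5])
    try:
        s2 = pd.Series(s, dtype=int)
    except Exception as ex:  # report whichever exception type is raised
        return f"raised {type(ex).__name__}: {ex}"
    return f"no exception, dtype={s2.dtype}"


# Versions of the modules actually imported in this interpreter
report = {
    "pandas": pd.__version__,
    "numpy": np.__version__,
    "second case": second_case_outcome(),
}
for key, value in report.items():
    print(f"{key}: {value}")
```

Running this in each environment makes it easier to tell whether a different installation is being picked up than the one that was upgraded.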