>>> import pandas as pd
>>> df = pd.DataFrame(
...     {
...         "B": [1, None, 3],
...         "C": pd.array([1, None, 3], dtype="Int64"),
...     }
... )
>>> result = df.skew()
>>> result
B    <NA>
C    <NA>
dtype: Float64
>>> df[["B"]].skew()
B    NaN
dtype: float64
Based on test_mixed_reductions. The presence of column "C" shouldn't affect the result we get for column "B".
Comment From: rhshadrach
> Expected Behavior
> NA
I would expect Float64 dtype with NaN and NA. One might also argue object, but so far it appears we coerce to the nullable dtypes.
import pandas as pd

df = pd.DataFrame(
    {
        "B": [1, 2, 3],
        "C": pd.array([1, 2, 3], dtype="Float64"),
    }
)
print(df.sum())
# B    6.0
# C    6.0
# dtype: Float64
Comment From: jbrockmendel
Hah, in the expected behavior section I put "NA" to mean "I'm not writing anything here", not pd.NA.
Comment From: jorisvandenbossche
I would say that the current behaviour is fine. We indeed currently coerce to the nullable dtype for the result if there are mixed nullable and non-nullable columns, and at that point converting NaN to NA seems the correct thing to do. (If you first cast the original non-nullable column to its nullable dtype, you get the same result; see the sketch below.)
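A minimal sketch of that point, reusing the skew frame from the top of this issue (output written as I'd expect it under the current casting behaviour described above):

import pandas as pd

# Same frame as in the original report.
df = pd.DataFrame(
    {
        "B": [1, None, 3],
        "C": pd.array([1, None, 3], dtype="Int64"),
    }
)

# Cast the non-nullable column to its nullable counterpart first,
# then reduce: both columns come back as <NA>, matching the
# mixed-frame result above.
print(df.astype({"B": "Float64"}).skew())
# B    <NA>
# C    <NA>
# dtype: Float64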
Comment From: jbrockmendel
> I would say that the current behaviour is fine
It's fine in a never-distinguish world. We're currently in a sometimes-distinguish world, in which I had previously thought the rule was "distinguish when NaNs are introduced through operations" (see the sketch below). We can make that rule more complicated: "distinguish when NaNs are introduced through operations but not reductions", but I do think that makes it a worse rule.
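To make the "introduced through operations" half of that rule concrete, a hedged sketch (this matches my understanding of current behaviour, where a FloatingArray can hold NaN in its values separately from the NA mask):

import pandas as pd

ser = pd.Series([0.0, None], dtype="Float64")

# 0.0 / 0 introduces a NaN in the values; the existing missing
# value stays masked as <NA>, so the two remain distinguishable.
print(ser / 0)
# 0     NaN
# 1    <NA>
# dtype: Float64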
> if you would first cast the original non-nullable column to its nullable dtype
In this case the user specifically didn't do that.
Comment From: rhshadrach
> We indeed currently coerce to the nullable dtype for the result if there are mixed nullable and non-nullable columns
Should coercion to Float64 change NaN to NA?
import numpy as np
import pandas as pd

ser = pd.Series([np.nan, pd.NA], dtype="object")
print(ser.astype("Float64"))
# 0    <NA>
# 1    <NA>
# dtype: Float64
Comment From: jbrockmendel
> Should coercion to Float64 change NaN to NA?
That is expected behavior ATM, yes. In an always-distinguish world it would not (xref #62040); see the sketch below.
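For concreteness, a sketch of the contrast; the second half is hypothetical, describing the "always distinguish" proposal rather than anything pandas does today:

import numpy as np
import pandas as pd

ser = pd.Series([np.nan, pd.NA], dtype="object")

# Current behaviour: the cast collapses both to <NA>.
print(ser.astype("Float64"))
# 0    <NA>
# 1    <NA>
# dtype: Float64

# Hypothetical "always distinguish" world (xref #62040): the same
# cast would preserve the NaN instead.
# 0     NaN
# 1    <NA>
# dtype: Float64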
Comment From: sharkipelago
take
Comment From: jorisvandenbossche
@sharkipelago this is not the best issue to try to tackle at this point, as there is not yet a clear action to be taken
Comment From: jorisvandenbossche
> > if you would first cast the original non-nullable column to its nullable dtype
>
> In this case the user specifically didn't do that.
The user indeed did not specifically cast themselves, but if you do a reduction operation on a DataFrame with multiple columns (and multiple dtypes), you are always doing an implicit cast. Also if you have an integer and float column, the integer result gets cast to float.
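A quick illustration of that implicit cast, entirely within the non-nullable world:

import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [1.5, 2.5, 3.5]})

# The int64 sum of "a" is upcast to float64 so the reduction can
# return a single Series with one dtype.
print(df.sum())
# a    6.0
# b    7.5
# dtype: float64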
If you consider a reduction on a DataFrame as reducing each column separately, and then a concat of the results, then again the current behaviour is kind of the expected behaviour, because concatting float64 and Float64 gives Float64, converting NaNs to NA:
>>> import numpy as np
>>> import pandas as pd
>>> pd.concat([pd.Series([np.nan], dtype="float64"), pd.Series([pd.NA], dtype="Float64")])
0    <NA>
0    <NA>
dtype: Float64
Here the float64 series gets cast to the common dtype, i.e. Float64, and casting float64 to nullable float64 converts NaNs to NA. You can of course question this casting behaviour, but as you mention in the comment above, this is the expected behaviour at the moment.
Comment From: sharkipelago
> @sharkipelago this is not the best issue to try to tackle at this point, as there is not yet a clear action to be taken
Understood. Thanks for letting me know. I had mistakenly thought the behaviour should be updated so that (for the skew example above) column "B" always becomes NaN regardless of whether column "C" is present in the DataFrame. Is there a way to un-assign myself?
Comment From: rhshadrach
Agree with the OP that users will find it surprising that the presence of another column changes the result. But unless we're going to go to object dtype, this coercion needs to be part of general reductions, since we go from horizontal (one dtype per column) to vertical (a single result Series with a single dtype). So I think the "right" user expectation is for reducers to compute and then coerce as necessary. Assuming this, the current behavior is correct.
I do find coercion converting NaN to NA surprising but that's for a separate issue.
Comment From: jbrockmendel
Since the current behavior is "correct", I'm updating the title from BUG to API. I maintain that if I was surprised, users will be surprised.
Comment From: jorisvandenbossche
The solution to all this surprising behaviour is of course to eventually only have dtypes that use NA for missing data, so we don't run into these "mixed" dataframes with some NA-based columns and some NaN-based columns. (Or, as I think someone suggested in an earlier related discussion, prohibit such mixed dataframes for now until we get there, but personally I don't think that is going to be practical.)
Comment From: jbrockmendel
> The solution to all this surprising behaviour is of course to eventually only have dtypes that use NA
That'll be great eventually. I would also consider the issue solved if we got to either "always distinguish" or "never distinguish" so there wouldn't be any confusion as to when we distinguish.