Pandas version checks

- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of pandas.
- [X] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
```python
~pd.Series([True, False])                                   # [False, True]
~pd.Series([True, False, pd.NA])                            # [-2, -1, <NA>]
~pd.Series([True, False, pd.NA], dtype=pd.BooleanDtype())   # [False, True, <NA>]
```
Issue Description
In this example it is of course possible to work around the issue by forcing the dtype, but with actual dataframes coming from real databases, NULLs still end up being handled by simply replacing them with empty strings and hoping for the best, because no NULL-like object seems to work well with ordinary string and filter operations.
As a side note, if the `pd.NA` is instead a `None`, a `TypeError` is thrown.
#59831 is a related issue.
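For reference, the `None` case can be sketched minimally (assuming a default, object-dtype Series):

```python
import pandas as pd

# None among bools yields an object-dtype Series, and unary ~ is not
# defined for Python's NoneType, hence the TypeError.
s = pd.Series([True, False, None])
try:
    ~s
except TypeError as exc:
    print(type(exc).__name__, exc)
```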
Expected Behavior
I'd expect the T/F values to flip and the NAs to remain by default, as they do when the dtype is forced.
Installed Versions
Comment From: rhshadrach
Thanks for the report. A possible resolution would be to infer `Int64` dtype from the list `[1, 2, pd.NA]`, but this would be a bit too fragile: you'd get `int64` if there was no `pd.NA`, but `Int64` when there is. That seems highly undesirable to me. When pandas switches to nullable arrays being the default, this issue will go away. Until then, I think the only resolution is to specify the dtype or use `Series.convert_dtypes`.

When you have `object` dtype, it seems correct to me to apply the negation (`~`) operation corresponding to the underlying Python object:

```python
print(~0)
# -1
```
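A minimal sketch of the `Series.convert_dtypes` route mentioned above:

```python
import pandas as pd

s = pd.Series([True, False, pd.NA])   # inferred as object dtype
b = s.convert_dtypes()                # converted to nullable boolean dtype
print(~b)
# 0    False
# 1     True
# 2     <NA>
# dtype: boolean
```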
> because no NULL-like object seems to work well with ordinary string and filter operations.

I do not understand this comment, can you not use `dtype=string`?
Comment From: RedAero
> I do not understand this comment, can you not use `dtype=string`?

Well, for a start because `string` would be undefined (you mean `str`), but the point is that if you have a Series with anything null-like, they have to be explicitly handled every time, otherwise errors are thrown. In SQL, if I query a table and use something like `WHERE LEFT(customer_name, 1) = 'E'`, it's not going to error out if the field has NULLs.
Pandas does:

```python
df.loc[df["customer_name"].str.startswith("E")]
# ValueError: Cannot mask with non-boolean array containing NA / NaN values
```
And there's no easy way to deal with this. Or, well, there is: `fillna("some-string-that-won't-appear-in-the-data")`, which should give everyone the ick. You can also fill with True/False instead, but that has its own, obvious problems.
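As an aside, `Series.str.startswith` accepts an `na=` argument that sidesteps both the `fillna` hack and the masking error; a minimal sketch with a made-up frame:

```python
import pandas as pd

df = pd.DataFrame({"customer_name": ["Eve", "Bob", None]})

# na=False makes missing values count as "does not match"
# instead of propagating NA into the boolean mask.
mask = df["customer_name"].str.startswith("E", na=False)
print(df.loc[mask])
```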
Comment From: rhshadrach
> Well, for a start because string would be undefined (you mean str)

Ah, sorry, I mean `dtype="string"`.
Comment From: simonjayhawkins
xref #32931