The table below gives an overview of the result of `missing_value in idx`, i.e. how `Index.__contains__` handles the various missing-value sentinels as input for the different index dtypes.
dtype | None | nan | \<NA> | NaT |
---|---|---|---|---|
object-none | True | False | False | False |
object-nan | False | True | False | False |
object-NA | False | False | True | False |
datetime | True | True | True | True |
period | True | True | True | True |
timedelta | True | True | True | True |
float64 | False | True | False | False |
categorical | True | True | True | True |
interval | True | True | True | False |
nullable_int | False | False | True | False |
nullable_float | False | False | True | False |
string-python | False | False | False | False |
string-pyarrow | False | False | False | False |
str-python | False | False | False | False |
The last three rows, without a single True, are especially problematic; this looks like a bug in the StringDtype.

But more generally, this is quite inconsistent:
- For object dtype, we require exact match
- For datetimelike and categorical, we match any missing-like
- For interval, we match any missing-like except NaT (not even for a datetime-like interval dtype)
- For float we only match NaN
- For nullable dtypes (int/float), we only match NA
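A few spot checks make the inconsistency concrete (a minimal sketch; the expected results in the comments are the observed values from the table above and may differ between pandas versions):

```python
import numpy as np
import pandas as pd

# object dtype: only the exact sentinel that is stored matches
obj_idx = pd.Index(["a", None], dtype=object)
print(None in obj_idx)    # True
print(np.nan in obj_idx)  # False

# datetimelike dtypes: any missing-value sentinel matches
dt_idx = pd.DatetimeIndex(["2024-01-01", "NaT"])
print(np.nan in dt_idx)   # True
print(pd.NA in dt_idx)    # True

# nullable dtypes: only pd.NA matches, even though the index was
# constructed with None
int_idx = pd.Index([2, None], dtype="Int64")
print(None in int_idx)    # False
print(pd.NA in int_idx)   # True
```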
The code to generate the table above:
```python
import numpy as np
import pandas as pd

# from conftest.py
indices_dict = {
    "object-none": pd.Index(["a", None], dtype=object),
    "object-nan": pd.Index(["a", np.nan], dtype=object),
    "object-NA": pd.Index(["a", pd.NA], dtype=object),
    "datetime": pd.DatetimeIndex(["2024-01-01", "NaT"]),
    "period": pd.PeriodIndex(["2024-01-01", None], freq="D"),
    "timedelta": pd.TimedeltaIndex(["1 days", "NaT"]),
    "float64": pd.Index([2.0, np.nan], dtype="float64"),
    "categorical": pd.CategoricalIndex(["a", None]),
    "interval": pd.IntervalIndex.from_tuples([(1, 2), np.nan]),
    "nullable_int": pd.Index([2, None], dtype="Int64"),
    "nullable_float": pd.Index([2.0, None], dtype="Float32"),
    "string-python": pd.Index(["a", None], dtype="string[python]"),
    "string-pyarrow": pd.Index(["a", None], dtype="string[pyarrow]"),
    "str-python": pd.Index(["a", None], dtype=pd.StringDtype("pyarrow", na_value=np.nan)),
}

results = []
for dtype, data in indices_dict.items():
    for val in [None, np.nan, pd.NA, pd.NaT]:
        res = val in data
        results.append((dtype, str(val), res))

df = pd.DataFrame(results, columns=["dtype", "val", "result"])
df_overview = df.pivot(columns="val", index="dtype", values="result").reindex(
    columns=df["val"].unique(), index=df["dtype"].unique()
)
print(df_overview.astype(str).to_markdown())
```
cc @jbrockmendel I would have expected there to be existing issues about this, but I didn't directly find anything.
Comment From: jbrockmendel
im not aware of a dedicated issue for this either. i think at one point I made a PR trying to make more of the EA subclasses use `is_valid_na_for`, but that got tabled pending the nan-vs-na topic.

For the datetimelike cases i think/hope that mismatched NaTs will return False (i.e. `np.timedelta64("NaT") in my_datetimeindex` should always be `False`). Also `Decimal("NaN")` should be handled correctly.
Comment From: jorisvandenbossche
> For the datetimelike cases i think/hope that mismatched NaTs will return False (i.e. `np.timedelta64("NaT") in my_datetimeindex` should always be `False`).

Indeed, `np.timedelta64("NaT")` and `np.datetime64("NaT")` only give True for a timedelta/datetime index, respectively, and all other index dtypes return False for those, with one exception: categorical.
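As a runnable illustration of that mismatched-NaT behavior (the expected results in the comments are the observed values from the expanded table below, so they may differ between pandas versions):

```python
import numpy as np
import pandas as pd

dt_idx = pd.DatetimeIndex(["2024-01-01", "NaT"])
td_idx = pd.TimedeltaIndex(["1 days", "NaT"])
cat_idx = pd.CategoricalIndex(["a", None])

# matching NaT type is recognized
print(np.datetime64("NaT") in dt_idx)    # True
print(np.timedelta64("NaT") in td_idx)   # True

# mismatched NaT types return False
print(np.timedelta64("NaT") in dt_idx)   # False
print(np.datetime64("NaT") in td_idx)    # False

# ... except for categorical, which matches any missing-like value
print(np.datetime64("NaT") in cat_idx)   # True
print(np.timedelta64("NaT") in cat_idx)  # True
```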
> Also `Decimal("NaN")` should be handled correctly.

In the sense that it is not matched in general (again, except for categorical ...). But it also seems not to be matched for object dtype containing such a decimal: `Decimal("NaN") in pd.Index([Decimal("2.0"), Decimal("NaN")], dtype=object)` gives `False`.
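That object-dtype case can be checked directly (observed behavior; presumably related to `Decimal("NaN")` comparing unequal to itself without being special-cased the way `np.nan` is, though that is an assumption):

```python
from decimal import Decimal

import numpy as np
import pandas as pd

# the exact Decimal("NaN") sentinel is stored, yet containment fails
dec_idx = pd.Index([Decimal("2.0"), Decimal("NaN")], dtype=object)
print(Decimal("NaN") in dec_idx)  # False

# contrast with np.nan in an object index, which does match itself
nan_idx = pd.Index(["a", np.nan], dtype=object)
print(np.nan in nan_idx)  # True
```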
Expanded table:
dtype | None | nan | \<NA> | NaT | np.datetime64('NaT') | np.timedelta64('NaT') | Decimal('NaN') |
---|---|---|---|---|---|---|---|
object-none | True | False | False | False | False | False | False |
object-nan | False | True | False | False | False | False | False |
object-NA | False | False | True | False | False | False | False |
object-decimal-NaN | False | False | False | False | False | False | False |
datetime | True | True | True | True | True | False | False |
period | True | True | True | True | False | False | False |
timedelta | True | True | True | True | False | True | False |
float64 | False | True | False | False | False | False | False |
categorical | True | True | True | True | True | True | True |
interval | True | True | True | False | False | False | False |
nullable_int | False | False | True | False | False | False | False |
nullable_float | False | False | True | False | False | False | False |
string-python | False | False | False | False | False | False | False |
string-pyarrow | False | False | False | False | False | False | False |
str-python | False | False | False | False | False | False | False |
```python
from decimal import Decimal

import numpy as np
import pandas as pd

# from conftest.py
indices_dict = {
    "object-none": pd.Index(["a", None], dtype=object),
    "object-nan": pd.Index(["a", np.nan], dtype=object),
    "object-NA": pd.Index(["a", pd.NA], dtype=object),
    "object-decimal-NaN": pd.Index(["a", Decimal("NaN")], dtype=object),
    "datetime": pd.DatetimeIndex(["2024-01-01", "NaT"]),
    "period": pd.PeriodIndex(["2024-01-01", None], freq="D"),
    "timedelta": pd.TimedeltaIndex(["1 days", "NaT"]),
    "float64": pd.Index([2.0, np.nan], dtype="float64"),
    "categorical": pd.CategoricalIndex(["a", None]),
    "interval": pd.IntervalIndex.from_tuples([(1, 2), np.nan]),
    "nullable_int": pd.Index([2, None], dtype="Int64"),
    "nullable_float": pd.Index([2.0, None], dtype="Float32"),
    "string-python": pd.Index(["a", None], dtype="string[python]"),
    "string-pyarrow": pd.Index(["a", None], dtype="string[pyarrow]"),
    "str-python": pd.Index(["a", None], dtype=pd.StringDtype("pyarrow", na_value=np.nan)),
}

results = []
for dtype, data in indices_dict.items():
    for val in [
        None, np.nan, pd.NA, pd.NaT,
        np.datetime64("NaT"), np.timedelta64("NaT"), Decimal("NaN"),
    ]:
        res = val in data
        results.append((dtype, repr(val), res))

df = pd.DataFrame(results, columns=["dtype", "val", "result"])
df_overview = df.pivot(columns="val", index="dtype", values="result").reindex(
    columns=df["val"].unique(), index=df["dtype"].unique()
)
print(df_overview.astype(str).to_markdown())