Pandas version checks
- [x] I have checked that this issue has not already been reported.
- [x] I have confirmed this issue exists on the latest version of pandas.
- [ ] I have confirmed this issue exists on the main branch of pandas.
Reproducible Example
df["feature"] = np.nan
for cluster in df["cluster"].unique():
    df.loc[df["cluster"] == cluster, "feature"] = "string"
Installed Versions
Prior Performance
Setup
Dataset: df with 148,858 rows
Task: Assign "string" to a new column "feature" based on unique values in the "cluster" column.
Environment: Running on LSF
Test 1: Initialize with np.nan
import numpy as np
df["feature"] = np.nan
for cluster in df["cluster"].unique():
    df.loc[df["cluster"] == cluster, "feature"] = "string"
Runtime: ~52.5 seconds
Warning:
FutureWarning: Setting an item of incompatible dtype is deprecated and will raise an error in a future version of pandas.
Value 'string' has dtype incompatible with float64, please explicitly cast to a compatible dtype first.
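For reference, the warning above is about writing a string into a float64 column. A minimal sketch of one way to avoid it, assuming the column is created with object dtype up front; the small synthetic df below is a stand-in for illustration, not the original data:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real df (assumption: integer cluster labels).
rng = np.random.default_rng(0)
df = pd.DataFrame({"cluster": rng.integers(0, 5, size=1_000)})

# Create the column as object dtype so that later string assignments
# do not hit the float64 incompatibility.
df["feature"] = pd.Series(np.nan, index=df.index, dtype="object")

for cluster in df["cluster"].unique():
    df.loc[df["cluster"] == cluster, "feature"] = "string"
```

This only addresses the dtype warning; it is not a claim about the runtime difference reported here.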
Test 2: Initialize with "None"
df["feature"] = "None"
for cluster in df["cluster"].unique():
    df.loc[df["cluster"] == cluster, "feature"] = "string"
Runtime: ~95 seconds (1 minute 35 seconds)
No warnings
Observation: Slower performance despite avoiding the dtype mismatch warning.
Comment From: rhshadrach
@muhannad125 - please provide a reproducible example. Set up an example df
with synthetic data.
You might be interested in using map if performance is a concern.
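A rough sketch of what that could look like, assuming a synthetic df of the same row count as in the report and a hypothetical per-cluster label (the `labels` dict and the `feature_<n>` values are made up for illustration; the original post assigns the literal "string" everywhere):

```python
import numpy as np
import pandas as pd

# Synthetic data: 148,858 rows with integer cluster ids (assumed shape).
rng = np.random.default_rng(0)
df = pd.DataFrame({"cluster": rng.integers(0, 10, size=148_858)})

# Hypothetical mapping from each cluster id to the value to assign.
labels = {cluster: f"feature_{cluster}" for cluster in df["cluster"].unique()}

# One vectorized lookup instead of one boolean-mask assignment per cluster.
df["feature"] = df["cluster"].map(labels)
```

Series.map builds the whole column in a single pass, so no placeholder column has to be initialized first and no dtype mismatch arises.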
Comment From: mroeschke
Closing as needing more information