Pandas version checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this issue exists on the latest version of pandas.
- [X] I have confirmed this issue exists on the main branch of pandas.
Reproducible Example
Hello, I would like to suggest a potential performance improvement in pandas. It concerns a performance issue similar to the one reported in #54550 (xref). The PR that addressed it, #54746, did not adjust the corresponding threshold for get_indexer_non_unique in the class MaskedIndexEngine. My suggestion is to revise this limit in line with the strategy adopted in #54746, specifically setting it to len(targets) < (n / (2 * n.bit_length())). In the timings reported in this issue, a batched df.loc lookup becomes much slower once the list reaches five labels, while looking up the same number of labels one at a time stays fast, so I believe this adjustment could improve performance.
I am willing to create a pull request for this if you believe it would be beneficial.
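For a sense of scale, the proposed limit can be evaluated for the 10,000,000-row index used in this report. The sketch below only illustrates the formula quoted above; the helper names are hypothetical and are not pandas internals.

```python
def proposed_threshold(n: int) -> float:
    """Evaluate the proposed cutoff n / (2 * n.bit_length()).

    Hypothetical helper for illustration only; n is the index length.
    """
    if n <= 0:
        # bit_length() is 0 for n == 0, so avoid dividing by zero.
        return 0.0
    return n / (2 * n.bit_length())


def would_use_binary_search(num_targets: int, n: int) -> bool:
    """Apply the suggested condition len(targets) < n / (2 * n.bit_length())."""
    return num_targets < proposed_threshold(n)


n = 10_000_000
print(n.bit_length())                     # 24
print(int(proposed_threshold(n)))         # 208333
print(would_use_binary_search(10, n))     # True
```

Under this formula, every list size tested in this report (1 to 10 labels) would fall well below the cutoff for an index of this length.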
import random
import time
import pandas as pd
import numpy as np

if __name__ == "__main__":
    # Create a large pandas dataframe with non-unique indexes and some NaN values
    table_size = 10_000_000
    num_index = 1_000_000
    data = [1] * table_size
    # Introduce NaNs into the data
    for _ in range(table_size // 10):  # Introduce NaNs in 10% of the data
        data[random.randint(0, table_size - 1)] = np.nan
    df = pd.DataFrame(data)
    index = random.choices(range(num_index), k=table_size)
    df.index = index
    df = df.sort_index()

    # Pre-query the index to force optimizations.
    df.loc[[5, 6, 7, 456, 65743]]
    df.loc[[1000]]

    # Testing 'df.loc' with all at once using a list of indexes, on masked data.
    for i in range(10):
        indexes = random.sample(list(df.index), k=i+1)
        start = time.monotonic()
        df.loc[indexes]
        measure = time.monotonic() - start
        print(f"With all at once (masked data): num_indexes={i+1} => {measure:.5f}s")

    print("---")

    # Testing 'df.loc' one at a time using a list of indexes, on masked data.
    for i in range(10):
        indexes = random.sample(list(df.index), k=i+1)
        start = time.monotonic()
        pd.concat([df.loc[[idx]] for idx in indexes])
        measure = time.monotonic() - start
        print(f"With one at a time (masked data): num_indexes={i+1} => {measure:.5f}s")
Printed result:
With all at once (masked data): num_indexes=1 => 0.00045s
With all at once (masked data): num_indexes=2 => 0.00048s
With all at once (masked data): num_indexes=3 => 0.00052s
With all at once (masked data): num_indexes=4 => 0.00050s
With all at once (masked data): num_indexes=5 => 0.64931s
With all at once (masked data): num_indexes=6 => 0.65066s
With all at once (masked data): num_indexes=7 => 0.65181s
With all at once (masked data): num_indexes=8 => 0.80003s
With all at once (masked data): num_indexes=9 => 0.65251s
With all at once (masked data): num_indexes=10 => 0.66629s
---
With one at a time (masked data): num_indexes=1 => 0.00081s
With one at a time (masked data): num_indexes=2 => 0.00134s
With one at a time (masked data): num_indexes=3 => 0.00114s
With one at a time (masked data): num_indexes=4 => 0.00173s
With one at a time (masked data): num_indexes=5 => 0.00132s
With one at a time (masked data): num_indexes=6 => 0.00191s
With one at a time (masked data): num_indexes=7 => 0.20084s
With one at a time (masked data): num_indexes=8 => 0.00180s
With one at a time (masked data): num_indexes=9 => 0.00201s
With one at a time (masked data): num_indexes=10 => 0.00169s
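The one-at-a-time timings suggest that a per-label binary search on the sorted index stays cheap for short lists. A minimal NumPy sketch of that idea follows. It is not the pandas implementation: the function name is hypothetical, and unlike get_indexer_non_unique it simply skips missing labels instead of reporting them.

```python
import numpy as np


def positions_via_binary_search(sorted_values, targets):
    """Return the positions of every occurrence of each target.

    Assumes sorted_values is sorted in ascending order. Each target costs
    two binary searches, so m targets cost roughly O(m log n) plus the
    size of the output, rather than a pass over all n values.
    """
    chunks = []
    for target in targets:
        # First position holding the target, and one past the last.
        left = np.searchsorted(sorted_values, target, side="left")
        right = np.searchsorted(sorted_values, target, side="right")
        chunks.append(np.arange(left, right, dtype=np.intp))
    if not chunks:
        return np.empty(0, dtype=np.intp)
    return np.concatenate(chunks)


sorted_values = np.array([1, 1, 2, 3, 3, 3, 5])
print(positions_via_binary_search(sorted_values, [3, 1]))  # [3 4 5 0 1]
```

With few targets relative to the index length, this kind of lookup touches only a small part of the index, which is the reasoning behind a threshold that grows with n.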
Installed Versions
Prior Performance
No response