Pandas version checks

  • [X] I have checked that this issue has not already been reported.

  • [X] I have confirmed this issue exists on the latest version of pandas.

  • [X] I have confirmed this issue exists on the main branch of pandas.

Reproducible Example

Hello, I would like to suggest a performance improvement in pandas. It concerns an issue similar to the one reported in #54550. The PR that addressed it, #54746, did not adjust the threshold used by get_indexer_non_unique in the MaskedIndexEngine class. I suggest revising that limit to match the strategy adopted in #54746, namely len(targets) < (n / (2 * n.bit_length())). In the timings below, a single df.loc call with a list of labels jumps from about 0.5 ms to about 0.65 s once the list reaches 5 labels, while looking up the same labels one at a time stays in the millisecond range, so I believe this adjustment could noticeably improve performance.

I am willing to create a pull request for this if you believe it would be beneficial.
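
One way to read the proposed condition: on a sorted index, locating one target by binary search costs on the order of 2 * log2(n) comparisons (one search for the left bound, one for the right), while a full scan costs on the order of n. Binary search should therefore pay off while len(targets) * 2 * n.bit_length() < n. The sketch below only evaluates that inequality for the sizes used in the reproducer; use_binary_search is a hypothetical helper written for illustration, not a pandas function.

```python
def use_binary_search(num_targets: int, index_len: int) -> bool:
    """Evaluate the condition len(targets) < n / (2 * n.bit_length()).

    Hypothetical helper for illustration only; it is not part of pandas.
    """
    if index_len == 0:
        # Nothing to search, and avoids dividing by a zero bit length.
        return False
    return num_targets < index_len / (2 * index_len.bit_length())


n = 10_000_000  # index length used in the reproducer
cutoff = n / (2 * n.bit_length())

print(n.bit_length())                 # 24
print(round(cutoff))                  # 208333
print(use_binary_search(5, n))        # True
print(use_binary_search(300_000, n))  # False
```

Under this condition, the 5 to 10 label lookups in the reproducer would stay far below the cutoff for an index of 10 million entries.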

import random
import time
import pandas as pd
import numpy as np

if __name__ == "__main__":
    # Create a large pandas dataframe with non-unique indexes and some NaN values
    table_size = 10_000_000
    num_index = 1_000_000
    data = [1] * table_size
    # Introduce NaNs into the data
    for _ in range(table_size // 10):  # Introduce NaNs in 10% of the data
        data[random.randint(0, table_size - 1)] = np.nan
    df = pd.DataFrame(data)
    index = random.choices(range(num_index), k=table_size)
    df.index = index
    df = df.sort_index()

    # Warm up: run a few lookups first so that one-time index setup is not included in the timings.
    df.loc[[5, 6, 7, 456, 65743]]
    df.loc[[1000]]

    # Time a single 'df.loc' call that looks up all i+1 labels at once.
    for i in range(10):
        indexes = random.sample(list(df.index), k=i+1)
        start = time.monotonic()
        df.loc[indexes]
        measure = time.monotonic() - start
        print(f"With all at once (masked data): num_indexes={i+1} => {measure:.5f}s")

    print("---")

    # Time the same kind of lookup one label at a time, concatenating the results.
    for i in range(10):
        indexes = random.sample(list(df.index), k=i+1)
        start = time.monotonic()
        pd.concat([df.loc[[idx]] for idx in indexes])
        measure = time.monotonic() - start
        print(f"With one at a time (masked data): num_indexes={i+1} => {measure:.5f}s")

Printed result:

With all at once (masked data): num_indexes=1 => 0.00045s
With all at once (masked data): num_indexes=2 => 0.00048s
With all at once (masked data): num_indexes=3 => 0.00052s
With all at once (masked data): num_indexes=4 => 0.00050s
With all at once (masked data): num_indexes=5 => 0.64931s
With all at once (masked data): num_indexes=6 => 0.65066s
With all at once (masked data): num_indexes=7 => 0.65181s
With all at once (masked data): num_indexes=8 => 0.80003s
With all at once (masked data): num_indexes=9 => 0.65251s
With all at once (masked data): num_indexes=10 => 0.66629s
---
With one at a time (masked data): num_indexes=1 => 0.00081s
With one at a time (masked data): num_indexes=2 => 0.00134s
With one at a time (masked data): num_indexes=3 => 0.00114s
With one at a time (masked data): num_indexes=4 => 0.00173s
With one at a time (masked data): num_indexes=5 => 0.00132s
With one at a time (masked data): num_indexes=6 => 0.00191s
With one at a time (masked data): num_indexes=7 => 0.20084s
With one at a time (masked data): num_indexes=8 => 0.00180s
With one at a time (masked data): num_indexes=9 => 0.00201s
With one at a time (masked data): num_indexes=10 => 0.00169s
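
The one-at-a-time timings staying in the millisecond range is consistent with a binary-search lookup on the sorted index. As a rough illustration of that technique (not the pandas implementation), the sketch below finds every position of each target in a sorted list with duplicates, using only the standard-library bisect module. The function name positions_by_binary_search is an assumption made for this example.

```python
from bisect import bisect_left, bisect_right


def positions_by_binary_search(sorted_values, targets):
    """Return, for each target, the positions of all its occurrences.

    Assumes sorted_values is sorted in ascending order. Each target costs
    two binary searches, independent of how many other values exist.
    """
    result = []
    for target in targets:
        left = bisect_left(sorted_values, target)    # first position >= target
        right = bisect_right(sorted_values, target)  # first position > target
        result.append(list(range(left, right)))      # empty if target is absent
    return result


values = [0, 0, 1, 3, 3, 3, 7]  # sorted, non-unique, like the index above
print(positions_by_binary_search(values, [3, 0, 5]))
# [[3, 4, 5], [0, 1], []]
```

Each target needs only two searches of logarithmic cost, which is why this path scales well for short target lists on a large sorted index.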

Installed Versions

commit : a671b5a8bf5dd13fb19f0e88edc679bc9e15c673
python : 3.9.18.final.0
python-bits : 64
OS : Linux
OS-release : 5.15.0-88-generic
Version : #98~20.04.1-Ubuntu SMP Mon Oct 9 16:43:45 UTC 2023
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 2.1.4
numpy : 1.26.3
pytz : 2023.3.post1
dateutil : 2.8.2
setuptools : 68.2.2
pip : 23.3.1
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : 8.18.1
pandas_datareader : None
bs4 : None
bottleneck : None
dataframe-api-compat: None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2023.4
qtpy : None
pyqt5 : None

Prior Performance

No response