Pandas version checks

  • [X] I have checked that this issue has not already been reported.

  • [X] I have confirmed this issue exists on the latest version of pandas.

  • [ ] I have confirmed this issue exists on the main branch of pandas.

Reproducible Example

This is not very important, but still quite surprising: unique should be the method to use here, and the faster one, yet it is twice as slow.

import pandas as pd

df = pd.DataFrame({"M": ["M1", "M2"], "P": ["P1", "P2"], "V": [1., 2.]})
i = df.set_index(['M', 'P']).index

In [6]: %timeit i.unique("M")
30.9 µs ± 958 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [7]: %timeit i.get_level_values('M').drop_duplicates()
16.1 µs ± 84 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
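A quick sanity check (my addition, assuming the df and i defined above) confirming the two calls are interchangeable and return the same result:

# Both should yield Index(['M1', 'M2'], dtype='object', name='M')
i.unique("M").equals(i.get_level_values("M").drop_duplicates())
# True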

Installed Versions

INSTALLED VERSIONS
------------------
commit                : 0691c5cf90477d3503834d983f69350f250a6ff7
python                : 3.12.7
python-bits           : 64
OS                    : Linux
OS-release            : 6.11.5-200.fc40.x86_64
Version               : #1 SMP PREEMPT_DYNAMIC Tue Oct 22 19:13:11 UTC 2024
machine               : x86_64
processor             :
byteorder             : little
LC_ALL                : None
LANG                  : en_US.UTF-8
LOCALE                : en_US.UTF-8
pandas                : 2.2.3
numpy                 : 2.1.3
pytz                  : 2024.2
dateutil              : 2.9.0.post0
pip                   : 23.3.2
Cython                : 3.0.9
sphinx                : None
IPython               : 8.23.0
adbc-driver-postgresql: None
adbc-driver-sqlite    : None
bs4                   : 4.12.3
blosc                 : None
bottleneck            : None
dataframe-api-compat  : None
fastparquet           : None
fsspec                : 2024.6.1
html5lib              : 1.1
hypothesis            : None
gcsfs                 : 2023.6.0+1.g7cc53d9
jinja2                : 3.1.4
lxml.etree            : 5.1.0
matplotlib            : None
numba                 : None
numexpr               : None
odfpy                 : None
openpyxl              : 3.1.2
pandas_gbq            : None
psycopg2              : 2.9.9
pymysql               : 1.4.6
pyarrow               : 17.0.0
pyreadstat            : None
pytest                : 7.4.3
python-calamine       : None
pyxlsb                : None
s3fs                  : None
scipy                 : 1.11.3
sqlalchemy            : 2.0.36
tables                : N/A
tabulate              : 0.9.0
xarray                : N/A
xlrd                  : 2.0.1
xlsxwriter            : 3.1.9
zstandard             : 0.22.0
tzdata                : 2024.2
qtpy                  : 2.4.1
pyqt5                 : None

Prior Performance

No response

Comment From: rhshadrach

Thanks for the report. On that size of data, you're just measuring overhead.

size = 100_000
df = pd.DataFrame({"M": ["M1", "M2"] * size, "P": ["P1", "P2"] * size, "V": [1., 2.] * size})
i = df.set_index(['M','P']).index

%timeit i.unique("M")
# 466 μs ± 3.4 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
%timeit i.get_level_values('M').drop_duplicates()
# 3.43 ms ± 12.8 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Here .unique is about 7 times faster.
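For anyone curious where the crossover lies, here is a minimal plain-script version of the %timeit comparison (my own sketch, not from the thread; the sizes are arbitrary) that runs both approaches over increasing index lengths:

import timeit

import pandas as pd

# Time both approaches at several sizes. At small sizes the fixed per-call
# overhead dominates; at large sizes unique() should pull ahead, per the
# measurements above.
for size in (1, 1_000, 100_000):
    df = pd.DataFrame({
        "M": ["M1", "M2"] * size,
        "P": ["P1", "P2"] * size,
        "V": [1.0, 2.0] * size,
    })
    i = df.set_index(["M", "P"]).index
    t_unique = timeit.timeit(lambda: i.unique("M"), number=100)
    t_drop = timeit.timeit(lambda: i.get_level_values("M").drop_duplicates(), number=100)
    print(f"size={size:>7}: unique={t_unique:.4f}s  drop_duplicates={t_drop:.4f}s")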