Pandas version checks

  • [X] I have checked that this issue has not already been reported.

  • [X] I have confirmed this bug exists on the latest version of pandas.

  • [ ] I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

python
import geopandas as gpd
from shapely.geometry import Polygon, MultiPolygon

gdf = gpd.GeoDataFrame([{"ID1": "3991b7ab", "ID2": nan, "geometry": POLYGON ((...)), "area_m": 10720.28326},
{"ID1": "3991b7ab", "ID2": "dc4772ed", "geometry": MULTIPOLYGON(((...)), "area": 0.24245}])

gdf.assign(area=lambda x: x.geometry.area).sort_values(by="area", ascending=False).query("ID1 == '3991b7ab'").groupby(by="ID1").first()

Issue Description

I have a (Geo)DataFrame with multiple UUID columns. After sorting the rows, the first row can have a None value in ID2. After grouping by ID1 and calling first(), the None value of ID2 is replaced with the next value in that column.

Expected Behavior

The value of ID2 should be None.
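The reported behavior can be sketched without geopandas. This is a minimal sketch, not the reporter's actual data: it assumes a plain pandas DataFrame is an adequate stand-in for the GeoDataFrame, drops the geometry column, and builds the rows already sorted by area in descending order.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the GeoDataFrame in the report:
# geometry is omitted and the rows are already sorted by area (descending).
df = pd.DataFrame(
    {
        "ID1": ["3991b7ab", "3991b7ab"],
        "ID2": [np.nan, "dc4772ed"],
        "area": [10720.28326, 0.24245],
    }
)

# With default arguments, first() returns the first non-missing value
# of each column within a group, not the first row of the group.
result = df.groupby(by="ID1").first()
print(result)
```

The result is assembled column by column, so ID2 is taken from the second row while area is taken from the first.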

Installed Versions

INSTALLED VERSIONS
------------------
commit : d9cdd2ee5a58015ef6f4d15c7226110c9aab8140
python : 3.11.8.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.22631
machine : AMD64
processor : Intel64 Family 6 Model 186 Stepping 2, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : de_DE.cp1252

pandas : 2.2.2
numpy : 1.26.3
pytz : 2024.1
dateutil : 2.9.0.post0
setuptools : 69.1.0
pip : 24.0
Cython : None
pytest : 8.2.0
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.3
IPython : 8.23.0
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.12.3
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : 2024.3.1
gcsfs : None
matplotlib : None
numba : 0.59.0
numexpr : None
odfpy : None
openpyxl : 3.1.2
pandas_gbq : None
pyarrow : None
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : 1.13.0
sqlalchemy : 2.0.30
tables : None
tabulate : None
xarray : 2024.1.1
xlrd : None
zstandard : None
tzdata : 2024.1
qtpy : None
pyqt5 : None

Comment From: rhshadrach

Thanks for the report. Can you simplify this? Produce the DataFrame that goes into the groupby explicitly (i.e. remove the use of assign, sort_values, and query), and show the result you get.

Comment From: asishm

DataFrameGroupBy.first has a skipna argument that defaults to True: https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.first.html

Try setting that to False.

In [190]: pd.DataFrame({'a': [1, 1, 2, 2], 'b': [np.nan, '2', '3', '4']}).groupby('a').first(skipna=False)
Out[190]:
     b
a
1  NaN
2    3

In [191]: pd.DataFrame({'a': [1, 1, 2, 2], 'b': [np.nan, '2', '3', '4']}).groupby('a').first()
Out[191]:
   b
a
1  2
2  3

Comment From: sehHeiden

Okay, my bad.

I made sure that I have pandas > 2.2.1 on all machines, and skipna did help. The problem was that when I searched for the documentation of first, I found DataFrame.first and not DataFrameGroupBy.first. I think this is a problem, but a different one.

Comment From: rhshadrach

@sehHeiden - it's not clear to me if your issue is resolved or not. If it isn't resolved, can you post a reproducible example (see my previous comment).

Comment From: sehHeiden

Solved. @asishm's solution works. Tested with 2.2.2; it does not work with 2.2.0, which does not have the skipna parameter.

Optional wish: link GroupBy.first from the DataFrame.first documentation.
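For pandas versions without the skipna parameter, such as the 2.2.0 mentioned above, one possible workaround is to take the first row of each group with head(1) instead of first(). This was not proposed in the thread; it is a sketch that reuses the example data from the earlier comment.

```python
import numpy as np
import pandas as pd

# Same example data as in the comment above.
df = pd.DataFrame({"a": [1, 1, 2, 2], "b": [np.nan, "2", "3", "4"]})

# head(1) keeps the first row of each group unchanged,
# so missing values in that row are preserved.
first_rows = df.groupby("a").head(1).set_index("a")
print(first_rows)
```

Unlike first(), head(1) does not move the grouping column into the index, hence the explicit set_index call.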