Pandas version checks

  • [X] I have checked that this issue has not already been reported.

  • [X] I have confirmed this bug exists on the latest version of pandas.

  • [X] I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
df = pd.DataFrame([['a',1.,2.],['b',3.,4.]])
df.loc[0,[1,2]].dtypes
df[[1,2]].loc[0].dtypes

Issue Description

df.loc[0,[1,2]] results in a Series of type dtype('O'), while df[[1,2]].loc[0] results in a Series of type dtype('float64').
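A minimal, runnable demonstration of the discrepancy (columns 1 and 2 are the float columns in the example frame; on affected versions such as pandas 2.2.3 the direct `.loc` path comes back as object):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([['a', 1., 2.], ['b', 3., 4.]])

# Selecting the float columns first, then the row, keeps float64.
chained = df[[1, 2]].loc[0]
print(chained.dtype)  # float64

# Selecting row and columns in one .loc call goes through the
# mixed-dtype row and returns object dtype on affected versions.
direct = df.loc[0, [1, 2]]
print(direct.dtype)
```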

Expected Behavior

I would expect df.loc[0,[1,2]] to be of type float64, same as df[[1,2]].loc[0]. The current behavior seems to encourage chaining instead of canonical referencing.

Installed Versions

INSTALLED VERSIONS
------------------
commit : 0691c5cf90477d3503834d983f69350f250a6ff7
python : 3.12.8
python-bits : 64
OS : Darwin
OS-release : 23.6.0
Version : Darwin Kernel Version 23.6.0: Thu Sep 12 23:35:10 PDT 2024; root:xnu-10063.141.1.701.1~1/RELEASE_ARM64_T6030
machine : arm64
processor : arm
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 2.2.3
numpy : 1.26.4
pytz : 2024.1
dateutil : 2.9.0.post0
pip : 24.3.1
Cython : None
sphinx : 8.1.3
IPython : None
adbc-driver-postgresql : None
adbc-driver-sqlite : None
bs4 : 4.12.3
blosc : None
bottleneck : 1.4.2
dataframe-api-compat : None
fastparquet : None
fsspec : 2024.12.0
html5lib : 1.1
hypothesis : None
gcsfs : None
jinja2 : 3.1.5
lxml.etree : 5.3.0
matplotlib : 3.10.0
numba : 0.60.0
numexpr : 2.10.2
odfpy : None
openpyxl : 3.1.5
pandas_gbq : None
psycopg2 : None
pymysql : None
pyarrow : 18.1.0
pyreadstat : None
pytest : 8.3.4
python-calamine : None
pyxlsb : None
s3fs : 2024.12.0
scipy : 1.14.1
sqlalchemy : 2.0.36
tables : 3.10.1
tabulate : 0.9.0
xarray : 2024.11.0
xlrd : None
xlsxwriter : None
zstandard : 0.23.0
tzdata : 2024.2
qtpy : N/A
pyqt5 : None

Comment From: rhshadrach

Thanks for the report. I'd hazard a guess that we are determining the dtype of the result prior to column selection. Further investigations are welcome!
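A quick NumPy sketch of that hypothesis (plain NumPy, not pandas internals): if the whole mixed-dtype row is materialized first, the later column selection inherits object dtype.

```python
import numpy as np

# The full first row of the example frame is mixed, so it is object dtype.
row = np.array(['a', 1.0, 2.0], dtype=object)

# Selecting the numeric positions *after* taking the row keeps object dtype...
after = row[[1, 2]]
print(after.dtype)  # object

# ...whereas selecting the numeric columns first yields float64.
before = np.array([1.0, 2.0])
print(before.dtype)  # float64
```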

Comment From: parthi-siva

take

Comment From: DarthKitten2130

take

Comment From: sanggon6107

Hi @parthi-siva and @DarthKitten2130, are you still working on this issue? I'd like to work on it if you don't mind.

Comment From: parthi-siva

Hi @sanggon6107 I'm still working on this..

Comment From: sanggon6107

> Hi @sanggon6107 I'm still working on this..

Well noted. Thanks for the quick reply.

Comment From: parthi-siva

For this input:

df = pd.DataFrame([['a',1.,2.],['b',3.,4.]])
df.loc[0,[1,2]].dtypes

the dtype of the resulting Series is determined in pandas.core.array_algos.take._take_preprocess_indexer_and_fill_value:131:

dtype, fill_value, mask_info = _take_preprocess_indexer_and_fill_value(
    arr, indexer, fill_value, allow_fill
)

with arr = ['a', 1.0, 2.0] (the full first row) and indexer = [1, 2].

Since arr contains a string, the dtype returned here can only be object.

An empty NumPy array is then created with that dtype, so it is also object (pandas.core.array_algos.take._take_preprocess_indexer_and_fill_value:155):

out = np.empty(out_shape, dtype=dtype)

and the slice is taken by a Cython function (pandas.core.array_algos.take._take_preprocess_indexer_and_fill_value:160):

func(arr, indexer, out, fill_value)

After the func(arr, indexer, out, fill_value) call, out is populated with the selected elements. However, the dtype of out does not match the dtype of the elements that were actually selected from arr.
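A standalone sketch of that mismatch (plain NumPy, mirroring the take path described above rather than calling pandas internals):

```python
import numpy as np

arr = np.array(['a', 1.0, 2.0], dtype=object)  # mixed row -> object dtype
indexer = [1, 2]

# The output buffer is allocated with the *input* dtype (object)...
out = np.empty(len(indexer), dtype=arr.dtype)
out[:] = arr[indexer]

# ...so even though every selected element is a float, out.dtype stays object.
print(out.dtype)               # object
print({type(v) for v in out})  # {<class 'float'>}
```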

I tried to add a step to check and adjust the dtype of out after the func call.

# Check whether the dtype of out matches the dtype of the elements in arr
if out.size > 0:  # only check if out is not empty
    first_element = out.flat[0]  # get the first element

    # If the first element is numeric but out.dtype is object, update the dtype
    if isinstance(first_element, (int, float, np.number)) and out.dtype == object:
        new_dtype = np.result_type(first_element)
        out = out.astype(new_dtype)
This fixes the OP's example, but other test cases fail, and I feel this is not the right way to address the issue.

So I'm not sure how to properly infer the dtype before this point for df.loc[0,[1,2]].

@rhshadrach @sanggon6107

Comment From: sanggon6107

Hi @parthi-siva, thanks for the comment.

I had also tried something similar, but it seems there could be side effects, including test failures, since many other pandas functions call take_nd(). I would rather change the code at a relatively outer level of the call stack so that we can minimize the impact. Since this issue only appears when the first axis key is an integer and the second is a list or slice (loc[int, list/slice]), I think we could re-interpret the dtype of the output at the end of _LocationIndexer._getitem_lowerdim().

Proposed solution

    @final
    def _getitem_lowerdim(self, tup: tuple):
        ...
                # This is an elided recursive call to iloc/loc
                out = getattr(section, self.name)[new_key]
                # Re-interpret dtype of out.values for loc/iloc[int, list/slice]  # GH60600
                if i == 0 and isinstance(key, int) and isinstance(new_key, (list, slice)):
                    inferred_dtype = np.array(out.values.tolist()).dtype
                    if inferred_dtype != out.dtype:
                        out = out.astype(inferred_dtype)
                return out

There was only one failing test when I ran pytest locally, and that test should be revised along with this change, since it currently expects loc[int, list] to produce an object-dtype result. My concern is that we have to create a new np.array only to re-interpret the dtype; I'm not sure if there's a more elegant way to infer the output's dtype.

Please let me know what you think about the proposal. I'd be glad to co-author a commit and make a PR if you don't mind.

cc @rhshadrach

Thanks!

Comment From: parthi-siva

Hi @sanggon6107 ,

Thanks for the reply.

Please proceed with your proposal. I'm good with it!

I spent some time on your concern about creating an np.array just to find the dtype.

Can we try using np.result_type?

either like this

from functools import reduce
inferred_dtype = reduce(np.result_type, out)

or like this

inferred_dtype = np.result_type(out.values.tolist())

Please let me know if it helps.
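For reference, a small sketch of how both variants behave on a float-only selection (hypothetical values; `reduce` needs a `functools` import):

```python
import numpy as np
from functools import reduce

# Object-dtype values, as produced by the buggy take path.
out = np.array([1.0, 2.0], dtype=object)

dt1 = reduce(np.result_type, out)    # pairwise promotion over the elements
dt2 = np.result_type(*out.tolist())  # promotion over all values at once

print(dt1, dt2)  # float64 float64
```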

Comment From: rhshadrach

@sanggon6107 - it's not clear to me what the proposal is. Best to open a PR I think.

Comment From: sanggon6107

Hi @parthi-siva , your suggestion helped a lot!

I've also found that we could simplify the code by using infer_objects(). I'll make a PR based on this discussion.
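For illustration, `Series.infer_objects()` applied to an object-dtype Series of floats (a minimal sketch of the idea, not the actual PR code):

```python
import pandas as pd

# Object-dtype Series holding only floats, like the buggy .loc result.
s = pd.Series([1.0, 2.0], dtype=object)
print(s.dtype)                  # object
print(s.infer_objects().dtype)  # float64
```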

Comment From: sanggon6107

take

Comment From: parthi-siva

> Hi @parthi-siva , your suggestion helped a lot!
>
> I've also found that we could simplify the code by using infer_objects(). I'll make a PR based on this discussion.

Sure @sanggon6107 :)