Pandas version checks

- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of pandas.
- [X] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
```python
import pandas as pd

df = pd.DataFrame([['a', 1., 2.], ['b', 3., 4.]])
df.loc[0, [1, 2]].dtypes
df[[1, 2]].loc[0].dtypes
```
Issue Description
`df.loc[0, [1, 2]]` results in a Series of dtype `object` (`dtype('O')`), while `df[[1, 2]].loc[0]` results in a Series of dtype `float64`.
Expected Behavior
I would expect `df.loc[0, [1, 2]]` to be of dtype `float64`, the same as `df[[1, 2]].loc[0]`. The current behavior seems to encourage chaining instead of canonical referencing.
Installed Versions
Comment From: rhshadrach
Thanks for the report. I'd hazard a guess that we are determining the dtype of the result prior to column selection. Further investigations are welcome!
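To illustrate the guess above, here is a minimal plain-NumPy sketch of the suspected mechanism (this is not the actual pandas code path; the `row` array is a hypothetical stand-in for the row extracted before column selection):

```python
import numpy as np

# Row 0 of the example frame, as it would look if extracted
# before selecting columns: the mixed values force object dtype.
row = np.array(['a', 1.0, 2.0], dtype=object)

# Selecting positions 1 and 2 afterwards keeps the object dtype,
# even though the selected values are all floats.
selected = row[[1, 2]]
print(selected.dtype)  # object
```

If the dtype is fixed at this point, the later column selection cannot recover `float64`, which matches the reported behavior.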
Comment From: parthi-siva
take
Comment From: DarthKitten2130
take
Comment From: sanggon6107
Hi @parthi-siva and @DarthKitten2130 , Are you still working on this issue? I would like to work on this one if you don't mind.
Comment From: parthi-siva
Hi @sanggon6107 I'm still working on this..
Comment From: sanggon6107
> Hi @sanggon6107 I'm still working on this..

Well noted. Thanks for the quick reply.
Comment From: parthi-siva
For this input:

```python
df = pd.DataFrame([['a', 1., 2.], ['b', 3., 4.]])
df.loc[0, [1, 2]].dtypes
```

in `pandas.core.array_algos.take._take_preprocess_indexer_and_fill_value:131` we get the data type for the resulting Series:

```python
dtype, fill_value, mask_info = _take_preprocess_indexer_and_fill_value(
    arr, indexer, fill_value, allow_fill
)
```

Here `arr = ['a', 1.0, 2.0]` and `indexer = [1, 2]`. Since `arr` contains a string, the dtype returned will be `object`.

Then an empty NumPy array is created using that dtype, so it will also be of type `object` (`pandas.core.array_algos.take._take_preprocess_indexer_and_fill_value:155`):

```python
out = np.empty(out_shape, dtype=dtype)
```

Then the take is performed by a Cython function (`pandas.core.array_algos.take._take_preprocess_indexer_and_fill_value:160`):

```python
func(arr, indexer, out, fill_value)
```

After the `func(arr, indexer, out, fill_value)` call, the `out` array is populated with the selected elements. However, the dtype of `out` does not match the dtype of the elements in `arr`.
I tried to add a step to check and adjust the dtype of `out` after the `func` call:

```python
# Check if the dtype of out matches the dtype of the elements in arr
if out.size > 0:  # Only check if out is not empty
    first_element = out.flat[0]  # Get the first element
    # If the first element is numeric but out.dtype is object, update the dtype
    if isinstance(first_element, (int, float, np.number)) and out.dtype == object:
        new_dtype = np.result_type(first_element)
        out = out.astype(new_dtype)
```

This fixed the OP's issue, but test cases are failing. I also feel this is not the right way to address the issue.
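One concrete reason this approach is fragile, shown with a small standalone sketch (hypothetical values, not pandas internals): inferring the target dtype from only the first element goes wrong whenever the selected values are still mixed.

```python
import numpy as np

# A hypothetical selection that is still mixed after the take.
out = np.array([1.0, 'b'], dtype=object)

# Looking only at out.flat[0] suggests float64, but casting the
# whole array then fails on the string element.
try:
    out.astype(np.float64)
except ValueError as err:
    print("cast failed:", err)
```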
So I'm not sure how to infer the dtype pragmatically before this point for `df.loc[0, [1, 2]]`.
@rhshadrach @sanggon6107
Comment From: sanggon6107
Hi @parthi-siva , thanks for the comment.
I had also tried something similar, but it seems there could be side effects, including test failures, since many other pandas functions call `take_nd()`.
I would rather change the code at a relatively outer level of the call stack, so that we can minimize the impact.
Since this issue only appears when the first axis key is an integer and the second one is a list or slice (`loc[int, list/slice]`), I think we could re-interpret the dtype of the output at the end of `_LocationIndexer._getitem_lowerdim()`.
Proposed solution

```python
@final
def _getitem_lowerdim(self, tup: tuple):
    ...
    # This is an elided recursive call to iloc/loc
    out = getattr(section, self.name)[new_key]
    # Re-interpret dtype of out.values for loc/iloc[int, list/slice]  # GH60600
    if i == 0 and isinstance(key, int) and isinstance(new_key, (list, slice)):
        inferred_dtype = np.array(out.values.tolist()).dtype
        if inferred_dtype != out.dtype:
            out = out.astype(inferred_dtype)
    return out
```
There was only one failing test when I ran `pytest` locally, and that case should be revised along with this change, since the test currently expects `loc[int, list]` to produce an object result.
My concern is that we have to create a new `np.array` only to re-interpret the dtype. I'm not sure if there's a more elegant way to infer the output's dtype.
Please let me know what you think about the proposal. I'd be glad to co-author a commit and make a PR if you don't mind.
cc @rhshadrach
Thanks!
Comment From: parthi-siva
Hi @sanggon6107 ,
Thanks for the reply.
Please proceed with your proposal. I'm good!
I spent some time on your concern about creating an `np.array` just to find the dtype.
Can we try using `np.result_type`, either like this:

```python
inferred_dtype = reduce(np.result_type, out)
```

or like this:

```python
inferred_dtype = np.result_type(out.values.tolist())
```

Please let me know if it helps.
Comment From: rhshadrach
@sanggon6107 - it's not clear to me what the proposal is. Best to open a PR I think.
Comment From: sanggon6107
Hi @parthi-siva , your suggestion helped a lot!
I've also found that we could simplify the code by using `infer_objects()`.
I'll make a PR based on this discussion.
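For reference, a quick demonstration of what `infer_objects()` does on an object-dtype Series (a standalone sketch mirroring the shape of the `loc[int, list]` result, not the actual pandas internals):

```python
import pandas as pd

# An object-dtype Series whose values are actually all floats,
# like the result of loc[int, list] on a mixed-type frame.
s = pd.Series([1.0, 2.0], dtype=object)

print(s.dtype)                  # object
print(s.infer_objects().dtype)  # float64
```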
Comment From: sanggon6107
take
Comment From: parthi-siva
> Hi @parthi-siva , your suggestion helped a lot!
> I've also found that we could simplify the code by using `infer_objects()`. I'll make a PR based on this discussion.
Sure @sanggon6107 :)