Pandas ENH: Enable nsmallest/nlargest on object dtype

Is it possible to implement a method DataFrame.nsorted that is equivalent to df.sort_values(by=["col1", "col2"], ascending=[True, False]).head(k), but with time complexity O(n log k)? I know about the nlargest and nsmallest methods, but unfortunately they only work with numeric columns.

Example:

df.nsorted(k=6, by=["col1", "col2"], ascending=[True, False])
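To make the request concrete, here is a minimal pure-Python sketch of how such a method could reach O(n log k) with a bounded heap. The function name nsorted and its signature follow the proposal above and do not exist in pandas. The comparator and the tie-break on row position are assumptions, and missing values are not handled.

```python
import heapq
from functools import cmp_to_key

import pandas as pd


def nsorted(df, k, by, ascending):
    """Hypothetical helper: first k rows ordered by `by`, one direction per column."""
    signs = [1 if asc else -1 for asc in ascending]

    def compare(a, b):
        # a and b are tuples of (column values..., row position).
        # zip stops after the column values because `signs` is one item shorter.
        for x, y, sign in zip(a, b, signs):
            if x != y:
                return sign if x > y else -sign
        # All sort columns tie: keep the original row order.
        return (a[-1] > b[-1]) - (a[-1] < b[-1])

    rows = zip(*(df[c].tolist() for c in by), range(len(df)))
    # heapq.nsmallest keeps at most k candidates at a time.
    best = heapq.nsmallest(k, rows, key=cmp_to_key(compare))
    return df.iloc[[row[-1] for row in best]]


df = pd.DataFrame({
    "col1": ["b", "a", "a", "c"],
    "col2": ["x", "y", "z", "w"],
})
result = nsorted(df, k=2, by=["col1", "col2"], ascending=[True, False])
expected = df.sort_values(by=["col1", "col2"], ascending=[True, False]).head(2)
print(result)
```

Because the comparator only relies on ==, > and <, this works for strings as well as numbers. A real implementation would presumably live in Cython rather than in Python-level comparisons.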

Comment From: rhshadrach

Thanks for the request. I see no reason nlargest and nsmallest cannot be made to work on object dtype. We'd need to add a Cython implementation of nlargest, as currently only nsmallest is implemented and we apply it to -values, which will not work with object dtype. Certainly this would be preferred over adding a new method that does the same thing.
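As a small illustration of why negating the values cannot carry over to object dtype, the snippet below negates a NumPy object array of strings. It is only a sketch of the underlying problem and is not taken from the pandas internals.

```python
import numpy as np

values = np.array(["b", "a", "c"], dtype=object)

try:
    negated = -values  # unary minus is applied to each str element
except TypeError as exc:
    negation_error = exc
else:
    negation_error = None

print(type(negation_error).__name__)
```

Strings have an ordering but no negation, so an nlargest for such data has to compare values directly instead of reusing nsmallest on -values.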

Further investigations and PRs to implement are welcome!

Comment From: Gri72

Thank you for the reply. If I understood correctly, in the end it might look something like this?

df.nlargest(n=6, columns=['col1', 'col2'], ascending=[True, False])

Comment From: rhshadrach

I do not think we should add an ascending argument; this can be accomplished by sorting the result.

Comment From: RutujaGhodake

take

Comment From: MartinBraquet

@rhshadrach I'd like to push back a little to make sure the issue is clearly understood, as it appears to me that the request extends beyond simply enabling object dtype in nsmallest/nlargest.

Indeed, I believe there is a second ENH request which solely concerns sorting, irrespective of the type (that is, it also concerns numeric values), which I'll try to explain below.

What is currently missing in the nlargest and nsmallest methods is a way to sort values where the order is ascending for one column and descending for another. One can see it as a combination of nlargest and nsmallest, depending on the column, which is what the OP proposes with a new nsorted method.

Here is an example that, I believe, can only be achieved via sort_values, which the OP would like to get more efficiently, and which neither nlargest nor nsmallest can produce, even with post-hoc sorting.

import pandas as pd

df = pd.DataFrame({
    'c1': [1, 2, 2],
    'c2': [1, 2, 3],
})

df.sort_values(['c1', 'c2'], ascending=[True, False]).head(2)

   c1  c2
0   1   1
2   2   3

For instance, nsmallest returns a different result. Because the desired row (index 2) is not among the selected rows, no subsequent operation on that result can recover the output above:

df.nsmallest(2, ['c1', 'c2'])

   c1  c2
0   1   1
1   2   2

Naturally, most users have worked around this issue by negating the columns for which they want nsmallest and then using nlargest on the modified dataframe; but this hack only works for numeric types, which, I assume, is the reason the OP posted this ENH request.
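For the example frame above, the negation workaround might look like the following sketch. It uses the mirror-image form (negate the descending column and call nsmallest), and neg_c2 is a hypothetical helper column name.

```python
import pandas as pd

df = pd.DataFrame({
    'c1': [1, 2, 2],
    'c2': [1, 2, 3],
})

# Negate the column that should be descending, then take the smallest rows.
workaround = (
    df.assign(neg_c2=-df['c2'])
      .nsmallest(2, ['c1', 'neg_c2'])
      .drop(columns='neg_c2')
)
expected = df.sort_values(['c1', 'c2'], ascending=[True, False]).head(2)
print(workaround)
```

This reproduces the sort_values result for numeric columns, but the -df['c2'] step has no counterpart for strings or other non-numeric data.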

Please let me know if I missed something.

If not, perhaps we should indeed have that nsorted method, with its ascending parameter, which would not only support non-numeric dtypes but also lead to cleaner code on the user end (without ad hoc negation of some columns).