Feature Request

Hi,

I use pandas a lot in my projects, and I got stuck on the problem of running the "mode" function (most common element) on a huge groupby object. There are a few solutions available using aggregate and scipy.stats.mode, but they are unbearably slow in comparison to e.g. groupby.mean(). Thus, I would like to make a feature request to add a cythonized version of a groupby.mode() operation.
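For reference, this is the kind of workaround I mean (a sketch; 'key' and 'value' are placeholder column names, and the scipy.stats.mode call, as it behaved at the time, is what makes it so slow):

import numpy as np
import pandas as pd
import scipy.stats

df = pd.DataFrame({'key': np.random.randint(0, 100, size=100000),
                   'value': np.random.randint(0, 10, size=100000)})

# per-group mode via aggregate; far slower than df.groupby('key').mean()
df.groupby('key')['value'].agg(lambda x: scipy.stats.mode(x)[0][0])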

Thanks in advance!

Jan Musial

Comment From: chris-b1

xref #19165 (master tracker)

A PR would be welcome! As you note, the key thing would be implementing a single-pass group mode function in Cython; you can look at the other group functions for examples.

https://github.com/pandas-dev/pandas/blob/aa9e002463d104b6dc37cdb32b2726bafcb5aa36/pandas/_libs/groupby_helper.pxi.in#L213
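For the record, a plain-Python sketch of the single-pass shape such a kernel might take (illustrative only: the names and tie-breaking here are my own, and the real implementation would be a Cython template over the supported dtypes, like the group code linked above):

import numpy as np

def group_mode(values, labels, ngroups):
    # One pass over the data: keep a running count per (group, value)
    # and track the current most frequent value for each group.
    counts = [{} for _ in range(ngroups)]
    modes = np.empty(ngroups, dtype=object)
    best = np.zeros(ngroups, dtype=np.int64)
    for val, lab in zip(values, labels):
        if lab < 0:  # label -1 marks rows excluded from grouping
            continue
        c = counts[lab].get(val, 0) + 1
        counts[lab][val] = c
        if c > best[lab]:
            best[lab] = c
            modes[lab] = val
    return modes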

Comment From: jreback

how is this different than .value_counts()?

Comment From: j-musial

Hi Jeff,

As you can see below, .value_counts() does not apply to a groupby object. Furthermore, returning sorted values and counts for thousands/millions of groups creates huge overhead when all you want is the most frequent value.

Best,

Jan

>>> pd.DataFrame({'a':['a','a','a','a','b','b','b','b'],'b':[1,1,2,3,1,2,2,3]}).groupby('a').value_counts()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/dist-packages/pandas/core/groupby.py", line 529, in __getattr__
    (type(self).__name__, attr))
AttributeError: 'DataFrameGroupBy' object has no attribute 'value_counts'
>>> pd.__version__
'0.19.2'

Comment From: jreback

.value_counts() is a Series method. I guess mode would simply give back the max per group.

In [32]: pd.options.display.max_rows=12

In [33]: ngroups = 100; N = 100000; np.random.seed(1234)

In [34]: df = pd.DataFrame({'key': np.random.randint(0, ngroups, size=N), 'value': np.random.randint(0, 10000, size=N)})

In [35]: df.groupby('key').value.value_counts()
Out[35]: 
key  value
0    5799     3
     7358     3
     8860     3
     185      2
     583      2
     872      2
             ..
99   9904     1
     9916     1
     9922     1
     9932     1
     9935     1
     9936     1
Name: value, Length: 95112, dtype: int64
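To actually reduce that to one row per group, something like this should work, since value_counts sorts counts in descending order within each group (a sketch, not a statement on what the API should be):

counts = df.groupby('key')['value'].value_counts()
counts.groupby(level='key').head(1)  # most frequent value (and its count) per key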

Comment From: j-musial

Thanks Jeff!

Unlike .value_counts(), .mode() would act like a reduction function, returning just one value per group. Returning the entire histogram just to get the most frequent value is a huge waste of computing resources.

Best,

Jan

Comment From: bfarrer

I have a DataFrame that includes a column of cities, and I want to know the most frequent city on the list. I'm pretty new to Python and pandas, so maybe there's an easy alternative, but I'm not aware of one.

Comment From: BrittonWinterrose

I am interested in this feature as well. +1

Comment From: rogerioluizsi

I'm interested too.

Comment From: Arregator

I can't find a good solution to the problem of populating missing values in my dataset with the most frequent value in a group (NOTE: not the mean, but the most frequent value).

I would have thought this is a very common intention for those handling missing values.

I have a dataset:

In [425]: df
Out[425]:
      brand     model fuelType  engineDisplacement
0      audi      a100   petrol              2000.0
1       bmw  3-series   diesel              2000.0
2  mercedes   e-class   petrol              3000.0
3  mercedes   e-class   petrol                 NaN
4    nissan      leaf  electro                 NaN
5     tesla   model x  electro                 NaN
6  mercedes   e-class   petrol              2000.0
7  mercedes   e-class   petrol              2000.0
8  mercedes   e-class   petrol              2000.0

and I want to fill the NaN for my mercedes e-class petrol with 2000, the most frequent value in the brand-model-fuelType group. I tried something like:

In [427]: df.groupby(['brand', 'model', 'fuelType'])['engineDisplacement'].transform(
     ...:     lambda x: x.fillna(x.mode())
     ...:     )
Out[427]:
0    2000.0
1    2000.0
2    3000.0
3       NaN
4       NaN
5       NaN
6    2000.0
7    2000.0
8    2000.0
Name: engineDisplacement, dtype: float64

You see? It does not fill the NaN with 2000 as I would expect :(

Again, it is important to have the most frequent value, because in many, many cases we have to deal with categorical values, not numeric ones, so we need this feature badly.
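Edit: the culprit seems to be index alignment. x.mode() returns a Series indexed from 0, so fillna aligns on that index instead of filling with a scalar. A sketch that behaves the way I intended (fill_group_mode is my own helper; it guards against groups that are entirely NaN, like the nissan leaf here):

def fill_group_mode(s):
    m = s.mode()  # a Series; empty if the whole group is NaN
    return s.fillna(m.iloc[0]) if not m.empty else s

df['engineDisplacement'] = (
    df.groupby(['brand', 'model', 'fuelType'])['engineDisplacement']
      .transform(fill_group_mode)
)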

Comment From: kashumi-m

+1 for this

Comment From: lithomas1

take

Comment From: arw2019

+1 - had someone ask about this

Comment From: lithomas1

Unassigning myself as I don't have time for this.

Comment From: rhshadrach

Edit: Replaced the implementation with one that is more efficient for both few categorical values (3 values, ~20% faster) and many categorical values (20k values, ~5x faster).

import pandas as pd

def gb_mode(df, keys, column):
    # Count each (keys + column) combination, sort the counts in
    # descending order (mergesort keeps ties in their original order),
    # then take the first, i.e. most frequent, value in each group.
    return (
        df.groupby(keys + [column], sort=False)
        .size()
        .sort_values(ascending=False, kind='mergesort')
        .reset_index(column)
        .groupby(keys)
        [column]
        .head(1)
        .sort_index()
    )

df = pd.DataFrame({'a': [1, 1, 1, 2, 2, 2], 'b': [3, 4, 3, 4, 3, 4]})
print(gb_mode(df, ['a'], 'b'))

produces

a
1    3
2    4
dtype: int64

It seems to have decent performance, at least when the categorical column ('b' here) has few values, but still +1 on adding a cythonized mode.

Comment From: lithomas1

I think there needs to be a discussion on the API for mode before we proceed with anything. The question being: what happens if a group is multimodal? (Should we raise a warning, return the last mode, return the smallest mode?) Also, I think there may be complexity around extension types when implementing this in Cython?

I think I was able to hack up something by reintroducing the tempita template for groupby and modifying _get_cythonized_result (which iterates over columns in Python?), but I'm not sure if this gives the right result.
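For context on the multimodal question: Series.mode already returns all modes as a Series, so there is precedent for "return all of them":

In [1]: pd.Series([1, 1, 2, 2, 3]).mode()
Out[1]: 
0    1
1    2
dtype: int64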

Comment From: trevor-pope

"What happens if multimodal?"

Why not all of them? It could just be an argument to the function.

keep='raise' could raise a warning, keep='smallest' or keep='largest' returns the smallest/largest, etc.

Something like df.groupby('col').mode(keep='all') would give all modes as a list when a group is multimodal (making the resulting dtype object). This might run into efficiency concerns, however. I've personally had this use case (getting all modes), but I am not sure how necessary it is to support when you could get by using .value_counts(), albeit with a bit more work and computation.
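(For the record, the value_counts route to all modes would look something like this; 'col' and 'val' are placeholder names:)

counts = df.groupby('col')['val'].value_counts()
max_counts = counts.groupby(level='col').transform('max')
all_modes = counts[counts == max_counts]  # every (group, value) pair tied for the max count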

I could try to implement this, but I am not sure where to do it. Is the tempita template still the way to go?

Comment From: wcheek

Hoping for this feature.