Feature Request
Hi,
I use pandas a lot in my projects, and I got stuck running the "mode" function (most common element) on a huge groupby object. There are a few solutions available using aggregate and scipy.stats.mode, but they are unbearably slow compared to e.g. groupby.mean(). Thus, I would like to make a feature request to add a cythonized groupby.mode() operator.
Thanks in advance!
Jan Musial
Comment From: chris-b1
xref #19165 (master tracker)
A PR would be welcome! As you note, the key thing would be implementing a single-pass group mode function in Cython; you can look at the other group functions for examples.
https://github.com/pandas-dev/pandas/blob/aa9e002463d104b6dc37cdb32b2726bafcb5aa36/pandas/_libs/groupby_helper.pxi.in#L213
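For intuition, the single-pass idea can be sketched in pure Python (this is not the Cython implementation; `group_mode`, `labels`, and `values` are illustrative names, not pandas internals):

```python
from collections import defaultdict

def group_mode(labels, values):
    """Return {group label: most frequent value} in a single scan."""
    counts = defaultdict(lambda: defaultdict(int))  # group -> value -> count
    best = {}                                       # group -> (count, value)
    for g, v in zip(labels, values):
        counts[g][v] += 1
        c = counts[g][v]
        # Keep the running winner so no second pass is needed.
        if g not in best or c > best[g][0]:
            best[g] = (c, v)
    return {g: v for g, (c, v) in best.items()}

print(group_mode(['a', 'a', 'a', 'b', 'b'], [1, 1, 2, 2, 3]))
# {'a': 1, 'b': 2}
```

A Cython version would replace the dict-of-dicts with typed arrays keyed by group id, but the control flow is the same.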
Comment From: jreback
How is this different than .value_counts()?
Comment From: j-musial
Hi Jeff,
As you can see below, .value_counts() is not available on a groupby object. Furthermore, returning sorted values and counts within thousands/millions of groups adds huge overhead, whereas all you want is the most frequent value.
Best,
Jan
>>> pd.DataFrame({'a': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
...               'b': [1, 1, 2, 3, 1, 2, 2, 3]}).groupby('a').value_counts()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/dist-packages/pandas/core/groupby.py", line 529, in __getattr__
    (type(self).__name__, attr))
AttributeError: 'DataFrameGroupBy' object has no attribute 'value_counts'
>>> pd.__version__
'0.19.2'
Comment From: jreback
.value_counts() is a Series method. I guess mode would simply give back the max per group.
In [32]: pd.options.display.max_rows=12
In [33]: ngroups = 100; N = 100000; np.random.seed(1234)
In [34]: df = pd.DataFrame({'key': np.random.randint(0, ngroups, size=N), 'value': np.random.randint(0, 10000, size=N)})
In [35]: df.groupby('key').value.value_counts()
Out[35]:
key value
0 5799 3
7358 3
8860 3
185 2
583 2
872 2
..
99 9904 1
9916 1
9922 1
9932 1
9935 1
9936 1
Name: value, Length: 95112, dtype: int64
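In the meantime, the per-group mode can be extracted with a Python-level lambda over each group; a minimal sketch on the same synthetic data (slow precisely because the lambda runs once per group, which is the motivation for a cythonized groupby.mode()):

```python
import numpy as np
import pandas as pd

np.random.seed(1234)
ngroups, N = 100, 100000
df = pd.DataFrame({'key': np.random.randint(0, ngroups, size=N),
                   'value': np.random.randint(0, 10000, size=N)})

# One mode per group: value_counts() sorts counts descending,
# so idxmax() picks the most frequent value in each group.
modes = df.groupby('key')['value'].agg(lambda x: x.value_counts().idxmax())
```

This avoids materializing the full per-group histogram in the result, but still builds one internally for every group.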
Comment From: j-musial
Thanks Jeff!
Unlike .value_counts(), .mode() would act like a reduction function, returning just one value per group. Returning the entire histogram just to get the most frequent value is a huge waste of computing resources.
Best,
Jan
Comment From: bfarrer
If I have a DataFrame that includes a column of cities, I want to know the most frequent city on the list. I'm pretty new to Python and pandas, so maybe there's an easy alternative, but I'm not aware of one.
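For what it's worth, the ungrouped case already works today via Series.mode(); only the grouped case lacks a fast path. A minimal sketch with made-up data (column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'state': ['NY', 'NY', 'NY', 'CA', 'CA', 'CA'],
                   'city': ['NYC', 'NYC', 'NYC', 'LA', 'SF', 'LA']})

# Most frequent city overall: Series.mode() already exists.
overall = df['city'].mode().iloc[0]

# Most frequent city per state: no cythonized groupby.mode() yet,
# but a Python-level lambda works (slowly) in the meantime.
per_state = df.groupby('state')['city'].agg(lambda s: s.mode().iloc[0])
print(overall)
print(per_state)
```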
Comment From: BrittonWinterrose
I am interested in this feature as well. +1
Comment From: rogerioluizsi
I am too.
Comment From: Arregator
I can't find a good solution for the problem of populating missing values in my dataset with the most frequent value in a group (NOTE: not the mean, but the most frequent value).
I would have thought that is a very common intention for those handling missing values.
I have a dataset:
In [425]: df
Out[425]:
brand model fuelType engineDisplacement
0 audi a100 petrol 2000.0
1 bmw 3-series diesel 2000.0
2 mercedes e-class petrol 3000.0
3 mercedes e-class petrol NaN
4 nissan leaf electro NaN
5 tesla model x electro NaN
6 mercedes e-class petrol 2000.0
7 mercedes e-class petrol 2000.0
8 mercedes e-class petrol 2000.0
and I want to fill the NaN for my mercedes e-class petrol row with 2000, the most frequent value in its brand-model-fuelType group. I try something like:
In [427]: df.groupby(['brand', 'model', 'fuelType'])['engineDisplacement'].transform(
...: lambda x: x.fillna(x.mode())
...: )
Out[427]:
0 2000.0
1 2000.0
2 3000.0
3 NaN
4 NaN
5 NaN
6 2000.0
7 2000.0
8 2000.0
Name: engineDisplacement, dtype: float64
You see? It does not fill NaN with 2000 as I would expect :(
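The reason it does not fill: Series.fillna with a Series argument aligns on the index, and x.mode() returns a Series indexed 0..k-1, not by the group's row labels, so nothing lines up. Taking the scalar first mode sidesteps the alignment; a minimal sketch with made-up data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'brand': ['mercedes'] * 5,
    'model': ['e-class'] * 5,
    'fuelType': ['petrol'] * 5,
    'engineDisplacement': [3000.0, np.nan, 2000.0, 2000.0, 2000.0],
})

# Fill each group's NaNs with that group's scalar mode; groups that
# are entirely NaN are passed through unchanged.
filled = df.groupby(['brand', 'model', 'fuelType'])['engineDisplacement'].transform(
    lambda x: x.fillna(x.mode().iloc[0]) if x.notna().any() else x
)
```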
Again, it is important to get the most frequent value, because in many, many cases we have to deal with categorical values, not numeric ones, so we need this feature badly.
Comment From: kashumi-m
+1 for this
Comment From: lithomas1
take
Comment From: arw2019
+1 - had someone ask about this
Comment From: lithomas1
Unassigning myself as I don't have time for this.
Comment From: rhshadrach
Edit: Replaced implementation with one that is more efficient on both few categorical values (3 values, ~20% faster) and many categorical values (20k values, ~5x faster).
def gb_mode(df, keys, column):
    # Count each (keys, column) combination, sort the counts descending
    # (stable mergesort keeps ties deterministic), then take the first
    # row per group, i.e. the most frequent value.
    return (
        df.groupby(keys + [column], sort=False)
        .size()
        .sort_values(ascending=False, kind='mergesort')
        .reset_index(column)
        .groupby(keys)[column]
        .head(1)
        .sort_index()
    )
df = pd.DataFrame({'a': [1, 1, 1, 2, 2, 2], 'b': [3, 4, 3, 4, 3, 4]})
print(gb_mode(df, ['a'], 'b'))
produces
a
1 3
2 4
dtype: int64
Seems to have decent performance, at least when the categorical ('b', here) has few values, but still +1 on adding a cythonized mode.
Comment From: lithomas1
I think there needs to be a discussion on the API for mode before we proceed with anything. The question being: what happens if a group is multimodal? (Should we raise a warning, return the last mode, return the smallest mode?) Also, I think there may be complexity around extension types when implementing this in Cython.
I think I was able to hack up something by reintroducing the tempita template for groupby and modifying _get_cythonized_result (which iterates over columns in Python?), but I'm not sure if this is the right approach.
Comment From: trevor-pope
"What happens if multimodal?"
Why not all of them? It could just be an argument to the function.
keep='raise'
could raise a warning, keep='smallest'
or keep='largest'
returns the smallest/largest, etc.
Something like df.groupby('col').mode(keep='all')
will give all modes as a list (if a category is multimodal, thus making the resulting dtype object
). This might run into efficiency concerns however. I've personally had this use case (getting all modes), but I am not sure how necessary it is to support when you could get by using .value_counts()
, albeit with a bit more work and computation.
I could try to implement this, but I am not sure where to do it at. Is the tempita template still the way to go?
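Those keep= semantics can be emulated on top of today's API (the keep parameter itself is hypothetical; this sketch just mimics the proposed behaviors with a Python-level function):

```python
import pandas as pd

def gb_mode(grouped, keep='all'):
    # Series.mode() returns all modes sorted ascending, so
    # 'smallest'/'largest' are simply its two ends.
    if keep == 'all':
        return grouped.apply(lambda s: s.mode().tolist())   # dtype: object
    if keep == 'smallest':
        return grouped.agg(lambda s: s.mode().iloc[0])
    if keep == 'largest':
        return grouped.agg(lambda s: s.mode().iloc[-1])
    raise ValueError(f"unknown keep: {keep!r}")

df = pd.DataFrame({'col': ['a', 'a', 'b', 'b'], 'val': [1, 1, 2, 3]})
print(gb_mode(df.groupby('col')['val'], keep='all'))
```

Group 'b' is multimodal here (2 and 3 each appear once), so keep='all' yields a list of both while keep='smallest'/'largest' pick one end.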
Comment From: wcheek
Hoping for this feature.