Unless I am missing something obvious, we can do this:
df.groupby('x').cumcount()
but not:
df.cumcount()
Obviously, there are workarounds, but it seems like we should be able to do this on a DataFrame/Series if we can do it on a groupby object?
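One such workaround, for illustration: grouping everything into a single dummy group makes the groupby version usable on the whole frame.

import numpy as np
import pandas as pd

df = pd.DataFrame({ 'x':[1,2,1,2] })
df.groupby(np.zeros(len(df))).cumcount()   # 0, 1, 2, 3 -- one group covering all rows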
Comment From: jreback
an Index already does this
pls show a usecase
Comment From: jorisvandenbossche
@johne13 you aren't confusing it with cumsum? Besides obtaining it from an index, you can also just do range(len(df))
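For example, given some DataFrame df (the column name n is just illustrative):

df['n'] = range(len(df))   # 0, 1, ..., len(df)-1, a plain positional count of the rows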
Comment From: johne13
It is not a super-common event, to be sure, but there are cases like this that I bump into somewhat regularly: http://stackoverflow.com/questions/36044890/how-to-speed-up-a-very-slow-pandas-apply-function/36047836#36047836
Simpler example:
df = pd.DataFrame({ 'x':[1,2,1,2] })
#df[ df.x == 2 ].cumcount() # doesn't work of course
df.loc[ df.x == 2, 'y'] = 1 # so I do a two step with cumsum as 2nd step
df.y.cumsum()
Out[428]:
0    NaN
1    1.0
2    NaN
3    2.0
So yeah, lots of easy ways to do this on ENTIRE dataset but as soon as you want to do a selection, it gets a little messier. Or again, maybe I am missing something obvious.
Also, I would think whatever justification there is for having this as a groupby method applies just as much to having it work for a dataframe or series, no?
Comment From: jreback
the common idiom is
.reset_index()
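Presumably in the sense that the fresh RangeIndex produced by a reset serves as the enumeration, e.g.:

df.reset_index().index   # RangeIndex(start=0, stop=len(df), step=1)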
Comment From: jreback
no, for a groupby this is expensive if u don't have an array impl (which is why it exists)
Comment From: johne13
I guess I am missing something. I couldn't do this, for example,
df['new'] = df[ df.x == 2 ].reset_index()
because obviously I have reset the index, so I can't correctly align it with the original dataset.
I should have been clearer about that aspect: I am generally trying to create a new column when I mistakenly attempt to use the nonexistent cumcount method.
Comment From: jreback
ok not averse to adding this method for consistency (among other cum. methods) but should return a Series
pretty trivial impl if u would like to submit a PR
Comment From: jreback
signature would be identical to .cum* (at the end of core/generic.py)
In [1]: pd.DataFrame.cumsum?
Signature: pd.DataFrame.cumsum(self, axis=None, dtype=None, out=None, skipna=True, **kwargs)
Docstring:
Return cumulative sum over requested axis.

Parameters
----------
axis : {index (0), columns (1)}
skipna : boolean, default True
    Exclude NA/null values. If an entire row/column is NA, the result
    will be NA

Returns
-------
sum : Series
File: ~/pandas/pandas/core/generic.py
Type: instancemethod
Comment From: max-sixty
Potentially not as fast (?), and handles NAs differently from the case above, but very idiomatic:
In [12]: df.where(df==2).expanding().count()
Out[12]:
     x
0  0.0
1  1.0
2  1.0
3  2.0
Comment From: max-sixty
Taken a step further: we could offer an argument for changing the NaN behavior and offer all the cum-x behavior with .expanding().
Comment From: jreback
@MaximilianR actually that's a good point. would you make a PR to:
- cross-link the doc-strings for .cum* to the .expanding().* versions (for sum, prod, max, min, which are the only ones supported).
- if you find a nice place in computation.rst where you might want to mention these similarities (I see you linked there), just review it then.
Comment From: jreback
maybe give a tiny example
In [7]: s = Series([1,2,np.nan,3,np.nan,4])

In [8]: s.cumsum()
Out[8]:
0     1.0
1     3.0
2     NaN
3     6.0
4     NaN
5    10.0
dtype: float64

In [9]: s.expanding().sum()
Out[9]:
0     1.0
1     3.0
2     3.0
3     6.0
4     6.0
5    10.0
dtype: float64

In [10]: s.cumsum().ffill()
Out[10]:
0     1.0
1     3.0
2     3.0
3     6.0
4     6.0
5    10.0
dtype: float64
Comment From: max-sixty
> @MaximilianR actually that's a good point. would you make a PR to:

I am jammed at the moment (although free enough to cruise the feed, it seems), but I can add that if no one else picks it up over the next few weeks.
Comment From: jreback
ok, let me make an issue so we don't forget.
Comment From: johne13
@MaximilianR I think you can get the same behavior as in my earlier example by using a boolean mask instead of where:

In [504]: df['new'] = df.x[ df.x == 2 ].expanding().count()

In [505]: df
Out[505]:
   x  new
0  1  NaN
1  2  1.0
2  1  NaN
3  2  2.0
Comment From: johne13
Here's a quick speed test; cumsum is about 4x faster than expanding().count():
In [515]: df = pd.DataFrame({ 'x':np.random.randn(10000) })

In [516]: %timeit df['new1'] = df.x[ df.x > 0 ].expanding().count()
100 loops, best of 3: 4.57 ms per loop

In [517]: %%timeit
     ...: df.loc[ df.x > 0, 'new2' ] = 1
     ...: df['new2'] = df['new2'].cumsum()
1000 loops, best of 3: 1.12 ms per loop
Comment From: jreback
try assigning to a range, it's much faster still. this is the impl.
we were not suggesting using expanding.count here; it's doing a lot more work and is very careful about min_periods and such
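Presumably the assign-to-a-range idea looks something like this (a sketch; new3 is a made-up column name):

mask = df.x > 0
df.loc[mask, 'new3'] = np.arange(1, mask.sum() + 1)   # positional count of just the selected rows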
Comment From: johne13
@jreback So do you still think cumcount is worthwhile, or is expanding().count() the way to go?
Sorry, I don't understand what you mean by "assigning to a range"?
If you still like the cumcount idea I'll give it a shot, though I have not done a PR before. Of course, if someone else wants to do it I'd gladly defer to them.
Comment From: jreback
I think it rounds out the cum functions a bit and would be ok
Comment From: johne13
OK, I think I have an initial attempt and will post the code and example in a moment, but first I want to list other similar functions as a point of comparison:
- cumsum(): outputs a Series or DataFrame, dtype=float
- (groupby) cumcount(): outputs a Series, numbered 0, 1, 2, ..., n-1, dtype=int
- expanding().count(): outputs a Series or DataFrame, numbered 1, 2, 3, ..., n, dtype=float

Starting from there, it made the most sense to me to aim for something roughly consistent with cumsum() and expanding().count() and not (groupby) cumcount(). Example to follow, see what you think.
Comment From: johne13
Here's the code, added to core/generic.py just after cummax (cumsum shown too, for comparison).

cls.cumcount = _make_cum_function(
    'count', name, name2, axis_descr, "cumulative count",
    lambda y, axis: np.cumsum(~np.isnan(y), axis).astype(float), np.nan, np.nan)

cls.cumsum = _make_cum_function(
    'sum', name, name2, axis_descr, "cumulative sum",
    lambda y, axis: y.cumsum(axis), 0., np.nan)

It's a little less natural than the other cum functions, but that was the best way I could come up with that fit the existing _make_cum_function() infrastructure. It's basically cumsum, but it needs extra code to essentially convert non-NaN values to ones while maintaining the same treatment of NaNs as cumsum.
In some quickie timings (1,000 rows and 2 columns) it is just a little slower than cumsum() and much faster than expanding().count(). Exactly as expected, I think.
df = pd.DataFrame({ 'x':[1,1,np.nan,1], 'y':range(4) })

df.cumsum()
Out[18]:
     x    y
0  1.0  0.0
1  2.0  1.0
2  NaN  3.0
3  3.0  6.0

df.cumcount()
Out[19]:
     x    y
0  1.0  1.0
1  2.0  2.0
2  NaN  3.0
3  3.0  4.0

df.expanding().count()
Out[20]:
     x    y
0  1.0  1.0
1  2.0  2.0
2  2.0  3.0
3  3.0  4.0

And here's groupby/cumcount for comparison:

df = pd.DataFrame({ 'x':[1,1,np.nan,1], 'y':range(4), 'grp':[1,1,1,1] })

df.groupby('grp').cumcount()
Out[40]:
0    0
1    1
2    2
3    3
dtype: int64
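For reference, the behavior above can also be sketched as a standalone function (hypothetical name; skipna=True path only):

def cumcount(obj):
    # running count of non-NaN values, re-masked so NaN positions stay NaN,
    # mirroring how cumsum treats NaNs with skipna=True
    mask = obj.isnull()
    result = (~mask).astype(float).cumsum()
    result[mask] = np.nan
    return result

cumcount(df['x'])   # 1.0, 2.0, NaN, 3.0 for x = [1, 1, NaN, 1]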
Comment From: jorisvandenbossche
The problem with making it consistent with groupby().cumcount() is that that function does not take the values into account (whether they are NaN or not); it just numbers the entries in the group regardless of their values. That is also the reason it returns a Series.
So if we want to make Series/DataFrame.cumcount consistent with the existing method, then it should also not disregard NaN values (as your example implementation does).
But I suppose this defeats part of the reason you want a Series/DataFrame.cumcount?
Comment From: johne13
@jorisvandenbossche I really hadn't thought about how exactly a Series/DataFrame.cumcount would work until I started comparing it to all the existing functions (count/cumsum/groupby.cumcount). Then I realized it couldn't be consistent with all of them, only some.
Out of all the comparison functions, it seems to me that it ought to work approximately the same as expanding().count(), right? And conversely, the way groupby.cumcount works makes the least sense to me, but of course I'll gladly defer to you and others on how to best make this consistent with the rest of pandas.
Comment From: jreback
This is almost a trivial function; it should be something like:

def cumcount(self, axis=0, skipna=True):
    ax = self._get_axis(axis)
    if skipna and ax.hasnans:
        # count only the non-null entries of the axis
        result = (~ax.isnull()).cumsum()
    else:
        result = range(len(ax))
    return Series(result, index=ax)

this is really just a range; if the index has NaNs then it's slightly different.
Comment From: jreback
I guess it technically should have an ascending arg to be consistent. Further, .groupby().cumcount() should have skipna.
This should mirror the .cum* signatures.
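For reference, the groupby version does already take ascending:

df.groupby('grp').cumcount(ascending=False)   # counts n-1, ..., 1, 0 within each group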
Comment From: jorisvandenbossche
@jreback But I would also find it very unfortunate if groupby().cumcount() and groupby().apply(lambda x: x.cumcount()) were not the same ..
And it is, as far as I can see, not possible to have both that and consistency with Series.cumsum and the others.
Comment From: jorisvandenbossche
Actually, to phrase the question in another way: should cumcount work on the index or on the values?
The example implementation of @johne13 works on the values (ignoring NaNs there), while the example implementation of @jreback works on the index.
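A small sketch of the difference on a series with a NaN (the expressions are illustrative, not proposed API):

s = pd.Series([1.0, np.nan, 3.0])

pd.Series(range(len(s)), index=s.index)                    # index-based: 0, 1, 2
(~s.isnull()).astype(float).cumsum().where(~s.isnull())    # values-based: 1.0, NaN, 2.0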
Comment From: jreback
@jorisvandenbossche actually that's a good point. it should work on the values, similar to cum*.
Comment From: johne13
Sorry, some of these details are over my head -- I have never really delved into pandas internals before. I tried to make it work as similarly to cum* as possible; that's why it is done via _make_cum_function like the other cum functions rather than as a standalone. The signature is comparable to cumsum.
In [28]: pd.DataFrame.cumcount?
Signature: pd.DataFrame.cumcount(self, axis=None, dtype=None, out=None, skipna=True, **kwargs)
Docstring:
Return cumulative count over requested axis.

Parameters
----------
axis : {index (0), columns (1)}
skipna : boolean, default True
    Exclude NA/null values. If an entire row/column is NA, the result
    will be NA

Returns
-------
count : Series
File: ~/pandas-johne13/pandas/core/generic.py
Type: function
Comment From: johne13
And also regarding groupby.cumcount(): it actually doesn't agree with groupby.count() in that its ending value is n-1, whereas count returns n.
Comment From: jreback
@johne13 no, .count() is the number of non-nans, whereas .cumcount() is an enumeration.
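Concretely, with the frame from the earlier example:

g = df.groupby('grp')
g.cumcount().iloc[-1]   # 3, i.e. n-1: a 0-based enumeration of the group members
g.count()               # non-NaN values per column: x -> 3, y -> 4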
Comment From: jorisvandenbossche
> @jorisvandenbossche actually that's a good point. it should work on the values, similar to cum*.

And that's the problem: groupby().cumcount works on the index (actually, it is not really defined there, but since it gives back a Series regardless of how many columns you have, it is as if it works on the index).
So in that sense, groupby().cumcount is not really a cumulative count, but rather a cumulative size .. (like you have both groupby().count and groupby().size)
Comment From: johne13
Right, I'm just trying to note what seems consistent and what doesn't. Along those lines, note that the following holds true for the relationship between sum and cumsum:

df.cumsum()[-1:] == df.sum()

The same relationship holds for prod/cumprod, max/cummax, and min/cummin, but not for the groupby versions of count/cumcount. So out of all these relationships, groupby().cumcount seems like the one that is somewhat inconsistent with the other pandas functions.
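A quick check of that identity (illustrative only):

df = pd.DataFrame({ 'x':[1.0, 2.0, 3.0] })
(df.cumsum().iloc[-1] == df.sum()).all()   # True -- the last cumulative value equals the total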
Comment From: johne13
Yeah, as Joris notes, groupby.cumcount seems more analogous to groupby.size than to groupby.count, except for the n vs n-1 aspect.
Comment From: johne13
@jreback @jorisvandenbossche FYI, I will abandon this for now, pending further instructions. I'm not quite sure what, if anything, you guys decided about how best to proceed. And this might not be a good first project for someone who has never done a PR and is not super familiar with all the underlying code, so if someone else wants to do something with it I'm happy to step aside.
Comment From: johne13
One other point of comparison for a hypothetical cumcount() is rank(method='first') on an array of ones. Just playing around with this a little, it seems to do approximately the same thing as expanding().count() but is much faster -- actually faster than cumsum() in my test example, but I imagine that doesn't hold in all cases since adding must generally be faster than sorting (?).
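For concreteness, a sketch of the rank-based variant (re-creating the frame from the timing example):

df = pd.DataFrame({ 'x':np.random.randn(10000) })
ones = pd.Series(1.0, index=df.index)[df.x > 0]
ones.rank(method='first')   # 1.0, 2.0, ... in row order, same as expanding().count() on those rows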
Comment From: jreback
@johne13 almost everything is faster than sorting :) sorting is O(n log n); people go to great lengths to avoid sorting.
Comment From: johne13
Right, except in this case it's "sorting" a column of ones, so it's just a matter of how long it takes the algorithm to confirm that it's already sorted, there is no actual sorting.
But I mainly mentioned rank() just as another function to consider in terms of consistency with how the other existing functions behave.
Comment From: jreback
So we have very fast routines to determine whether something is monotonic, which check whether it's 'sorted' very quickly:

In [2]: Index([1,2,3]).is_monotonic
Out[2]: True

In [3]: Index([3,2,1]).is_monotonic
Out[3]: False