Unless I am missing something obvious, we can do this:
df.groupby('x').cumcount()
but not:
df.cumcount()
Obviously, there are workarounds, but it seems like we should be able to do this on a DataFrame/Series if we can do it on a groupby object?
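One such workaround, for illustration: grouping everything into a single dummy group makes the groupby version usable on the whole frame.

import numpy as np
import pandas as pd

df = pd.DataFrame({ 'x':[1,2,1,2] })
df.groupby(np.zeros(len(df))).cumcount()   # 0, 1, 2, 3 -- one group covering all rows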
Comment From: jreback
an Index already does this
pls show a usecase
Comment From: jorisvandenbossche
@johne13 you aren't confusing it with cumsum? Besides obtaining it from an index, you can also just do range(len(df))
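For example, given some DataFrame df (the column name n is just illustrative):

df['n'] = range(len(df))   # 0, 1, ..., len(df)-1, a plain positional count of the rows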
Comment From: johne13
It is not a super-common event, to be sure, but there are cases like this that I bump into somewhat regularly: http://stackoverflow.com/questions/36044890/how-to-speed-up-a-very-slow-pandas-apply-function/36047836#36047836
Simpler example:
df = pd.DataFrame({ 'x':[1,2,1,2] })
#df[ df.x == 2 ].cumcount() # doesn't work of course
df.loc[ df.x == 2, 'y'] = 1 # so I do a two step with cumsum as 2nd step
df.y.cumsum()
Out[428]:
0    NaN
1    1.0
2    NaN
3    2.0
So yeah, lots of easy ways to do this on ENTIRE dataset but as soon as you want to do a selection, it gets a little messier. Or again, maybe I am missing something obvious.
Also, I would think whatever justification there is for having this as a groupby method applies just as much to having it work for a dataframe or series, no?
Comment From: jreback
the common idiom is
.reset_index()
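Presumably in the sense that the fresh RangeIndex produced by a reset serves as the enumeration, e.g.:

df.reset_index().index   # RangeIndex(start=0, stop=len(df), step=1)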
Comment From: jreback
no, for a groupby this is expensive if u don't have an array impl (which is why it exists)
Comment From: johne13
I guess I am missing something. I couldn't do this, for example,
df['new'] = df[ df.x == 2 ].reset_index()
because obviously I have reset the index, so I can't correctly align it with the original dataset.
I should have been clearer about that aspect: I am generally trying to create a new column when I mistakenly attempt to use the nonexistent cumcount method.
Comment From: jreback
ok not averse to adding this method for consistency (among other cum. methods) but should return a Series
pretty trivial impl if u would like to submit a PR
Comment From: jreback
signature would be identical to .cum* (at the end of core/generic.py)
In [1]: pd.DataFrame.cumsum?
Signature: pd.DataFrame.cumsum(self, axis=None, dtype=None, out=None, skipna=True, **kwargs)
Docstring:
Return cumulative sum over requested axis.

Parameters
----------
axis : {index (0), columns (1)}
skipna : boolean, default True
    Exclude NA/null values. If an entire row/column is NA, the result
    will be NA

Returns
-------
sum : Series
File: ~/pandas/pandas/core/generic.py
Type: instancemethod
Comment From: max-sixty
Potentially not as fast (?), and handles NAs differently from the case above, but very idiomatic:
In [12]: df.where(df==2).expanding().count()
Out[12]:
     x
0  0.0
1  1.0
2  1.0
3  2.0
Comment From: max-sixty
Taken a step further: we could offer an argument for changing the NaN behavior and offer all the cum-x behavior with .expanding().
Comment From: jreback
@MaximilianR actually that's a good point. would you make a PR to:
- cross-link the doc-strings for .cum* to the .expanding().* versions (for sum, prod, max, min, which are the only ones supported).
- if you find a nice place in computation.rst where you might want to mention these similarities (I see you linked there), just review it then.
Comment From: jreback
maybe give a tiny example
In [7]: s = Series([1,2,np.nan,3,np.nan,4])

In [8]: s.cumsum()
Out[8]:
0     1.0
1     3.0
2     NaN
3     6.0
4     NaN
5    10.0
dtype: float64

In [9]: s.expanding().sum()
Out[9]:
0     1.0
1     3.0
2     3.0
3     6.0
4     6.0
5    10.0
dtype: float64

In [10]: s.cumsum().ffill()
Out[10]:
0     1.0
1     3.0
2     3.0
3     6.0
4     6.0
5    10.0
dtype: float64
Comment From: max-sixty
> @MaximilianR actually that's a good point. would you make a PR to:

I am jammed at the moment (although free enough to cruise the feed, it seems), but I can add that if no one else picks it up over the next few weeks.
Comment From: jreback
ok, let me make an issue so we don't forget.
Comment From: johne13
@MaximilianR I think you can get the same behavior as in my earlier example by using a boolean mask instead of where:

In [504]: df['new'] = df.x[ df.x == 2 ].expanding().count()

In [505]: df
Out[505]:
   x  new
0  1  NaN
1  2  1.0
2  1  NaN
3  2  2.0
Comment From: johne13
Here's a quick speed test; cumsum is about 4x faster than expanding().count():
In [515]: df = pd.DataFrame({ 'x':np.random.randn(10000) })

In [516]: %timeit df['new1'] = df.x[ df.x > 0 ].expanding().count()
100 loops, best of 3: 4.57 ms per loop

In [517]: %%timeit
     ...: df.loc[ df.x > 0, 'new2' ] = 1
     ...: df['new2'] = df['new2'].cumsum()
1000 loops, best of 3: 1.12 ms per loop
Comment From: jreback
try assigning to a range, it's much faster still. this is the impl.
we were not suggesting using expanding.count here; it's doing a lot more work and is very careful about min_periods and such
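Presumably the assign-to-a-range idea looks something like this (a sketch; new3 is a made-up column name):

mask = df.x > 0
df.loc[mask, 'new3'] = np.arange(1, mask.sum() + 1)   # positional count of just the selected rows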
Comment From: johne13
@jreback So do you still think cumcount is worthwhile, or is expanding().count() the way to go?
Sorry, I don't understand what you mean by "assigning to a range"?
If you still like the cumcount idea I'll give it a shot, though I have not done a PR before. Of course, if someone else wants to do it I'd gladly defer to them.
Comment From: jreback
I think it rounds out the cum functions a bit and would be ok
Comment From: johne13
OK, I think I have an initial attempt and will post the code and example in a moment, but first I want to list other similar functions as a point of comparison:
- cumsum(): outputs a Series or DataFrame, dtype=float
- (groupby) cumcount(): outputs a Series, numbered 0, 1, 2, ..., n-1, dtype=int
- expanding().count(): outputs a Series or DataFrame, numbered 1, 2, 3, ..., n, dtype=float

Starting from there, it made the most sense to me to aim for something roughly consistent with cumsum() and expanding().count() and not (groupby) cumcount(). Example to follow, see what you think.
Comment From: johne13
Here's the code, added to core/generic.py just after cummax (cumsum shown too, for comparison).

cls.cumcount = _make_cum_function(
    'count', name, name2, axis_descr, "cumulative count",
    lambda y, axis: np.cumsum(~np.isnan(y), axis).astype(float), np.nan, np.nan)

cls.cumsum = _make_cum_function(
    'sum', name, name2, axis_descr, "cumulative sum",
    lambda y, axis: y.cumsum(axis), 0., np.nan)

It's a little less natural than the other cum functions, but that was the best way I could come up with that fit the existing _make_cum_function() infrastructure. It's basically cumsum, but it needs extra code to essentially convert non-NaN values to ones while maintaining the same treatment of NaNs as cumsum.
In some quickie timings (1,000 rows and 2 columns) it is just a little slower than cumsum() and much faster than expanding().count(). Exactly as expected, I think.
df = pd.DataFrame({ 'x':[1,1,np.nan,1], 'y':range(4) })

df.cumsum()
Out[18]:
     x    y
0  1.0  0.0
1  2.0  1.0
2  NaN  3.0
3  3.0  6.0

df.cumcount()
Out[19]:
     x    y
0  1.0  1.0
1  2.0  2.0
2  NaN  3.0
3  3.0  4.0

df.expanding().count()
Out[20]:
     x    y
0  1.0  1.0
1  2.0  2.0
2  2.0  3.0
3  3.0  4.0

And here's groupby/cumcount for comparison:

df = pd.DataFrame({ 'x':[1,1,np.nan,1], 'y':range(4), 'grp':[1,1,1,1] })

df.groupby('grp').cumcount()
Out[40]:
0    0
1    1
2    2
3    3
dtype: int64
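For reference, the behavior above can also be sketched as a standalone function (hypothetical name; skipna=True path only):

def cumcount(obj):
    # running count of non-NaN values, re-masked so NaN positions stay NaN,
    # mirroring how cumsum treats NaNs with skipna=True
    mask = obj.isnull()
    result = (~mask).astype(float).cumsum()
    result[mask] = np.nan
    return result

cumcount(df['x'])   # 1.0, 2.0, NaN, 3.0 for x = [1, 1, NaN, 1]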
Comment From: jorisvandenbossche
The problem with making it consistent with groupby().cumcount() is that that function does not take the values into account (whether they are NaN or not); it just numbers the entries in the group regardless of their values. That is also the reason it returns a Series.
So if we want to make Series/DataFrame.cumcount consistent with the existing method, then it should also not disregard NaN values (as your example implementation does).
But I suppose this defeats part of the reason you want a Series/DataFrame.cumcount?
Comment From: johne13
@jorisvandenbossche I really hadn't thought about how exactly a Series/DataFrame.cumcount would work until I started comparing it to all the existing functions (count/cumsum/groupby.cumcount). Then I realized it couldn't be consistent with all of them, only some.
Out of all the comparison functions, it seems to me that it ought to work approximately the same as expanding().count(), right? And conversely, the way groupby.cumcount works makes the least sense to me, but of course I'll gladly defer to you and others on how to best make this consistent with the rest of pandas.
Comment From: jreback
This is almost a trivial function; it should be something like:

def cumcount(self, axis=0, skipna=True):
    ax = self._get_axis(axis)
    if skipna and ax.hasnans:
        # count only the non-null entries of the axis
        result = (~ax.isnull()).cumsum()
    else:
        result = range(len(ax))
    return Series(result, index=ax)

this is really just a range; if the index has NaNs then it's slightly different.
Comment From: jreback
I guess it technically should have an ascending arg to be consistent. Further, .groupby().cumcount() should have skipna.
This should mirror the .cum* signatures.
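For reference, the groupby version does already take ascending:

df.groupby('grp').cumcount(ascending=False)   # counts n-1, ..., 1, 0 within each group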
Comment From: jorisvandenbossche
@jreback But I would also find it very unfortunate if groupby().cumcount() and groupby().apply(lambda x: x.cumcount()) were not the same ..
And it is, as far as I can see, not possible to have both that and consistency with Series.cumsum and the others.
Comment From: jorisvandenbossche
Actually, to phrase the question in another way: should cumcount work on the index or on the values?
The example implementation of @johne13 works on the values (ignoring NaNs there), while the example implementation of @jreback works on the index.
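A small sketch of the difference on a series with a NaN (the expressions are illustrative, not proposed API):

s = pd.Series([1.0, np.nan, 3.0])

pd.Series(range(len(s)), index=s.index)                    # index-based: 0, 1, 2
(~s.isnull()).astype(float).cumsum().where(~s.isnull())    # values-based: 1.0, NaN, 2.0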
Comment From: jreback
@jorisvandenbossche actually that's a good point. it should work on the values, similar to cum*.
Comment From: johne13
Sorry, some of these details are over my head -- I have never really delved into pandas internals before. I tried to make it work as similarly to cum* as possible; that's why it is done via _make_cum_function like the other cum functions rather than as a standalone. The signature is comparable to cumsum.
In [28]: pd.DataFrame.cumcount?
Signature: pd.DataFrame.cumcount(self, axis=None, dtype=None, out=None, skipna=True, **kwargs)
Docstring:
Return cumulative count over requested axis.

Parameters
----------
axis : {index (0), columns (1)}
skipna : boolean, default True
    Exclude NA/null values. If an entire row/column is NA, the result
    will be NA

Returns
-------
count : Series
File: ~/pandas-johne13/pandas/core/generic.py
Type: function
Comment From: johne13
And also regarding groupby.cumcount(): it actually doesn't agree with groupby.count() in that its ending value is n-1, whereas count returns n.
Comment From: jreback
@johne13 no, .count() is the number of non-nans, whereas .cumcount() is an enumeration.
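Concretely, with the frame from the earlier example:

g = df.groupby('grp')
g.cumcount().iloc[-1]   # 3, i.e. n-1: a 0-based enumeration of the group members
g.count()               # non-NaN values per column: x -> 3, y -> 4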
Comment From: jorisvandenbossche
> @jorisvandenbossche actually that's a good point. it should work on the values, similar to cum*.

And that's the problem: groupby().cumcount works on the index (actually, it is not really defined there, but since it gives back a Series regardless of how many columns you have, it is as if it works on the index).
So in that sense, groupby().cumcount is not really a cumulative count, but rather a cumulative size .. (like you have both groupby().count and groupby().size)
Comment From: johne13
Right, I'm just trying to note what seems consistent and what doesn't. Along those lines, note that the following holds true for the relationship between sum and cumsum:

df.cumsum()[-1:] == df.sum()

The same relationship holds for prod/cumprod, max/cummax, and min/cummin, but not for the groupby versions of count/cumcount. So out of all these relationships, groupby().cumcount seems like the one that is somewhat inconsistent with the other pandas functions.
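A quick check of that identity (illustrative only):

df = pd.DataFrame({ 'x':[1.0, 2.0, 3.0] })
(df.cumsum().iloc[-1] == df.sum()).all()   # True -- the last cumulative value equals the total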
Comment From: johne13
Yeah, as Joris notes, groupby.cumcount seems more analogous to groupby.size than to groupby.count, except for the n vs n-1 aspect.
Comment From: johne13
@jreback @jorisvandenbossche FYI, I will abandon this for now, pending further instructions. I'm not quite sure what, if anything, you guys decided about how best to proceed. And this might not be a good first project for someone who has never done a PR and is not super familiar with all the underlying code, so if someone else wants to do something with it I'm happy to step aside.
Comment From: johne13
One other point of comparison for a hypothetical cumcount() is rank(method='first') on an array of ones. Just playing around with this a little, it seems to do approximately the same thing as expanding().count() but is much faster -- actually faster than cumsum() in my test example, but I imagine that doesn't hold in all cases since adding must generally be faster than sorting (?).
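For concreteness, a sketch of the rank-based variant (re-creating the frame from the timing example):

df = pd.DataFrame({ 'x':np.random.randn(10000) })
ones = pd.Series(1.0, index=df.index)[df.x > 0]
ones.rank(method='first')   # 1.0, 2.0, ... in row order, same as expanding().count() on those rows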
Comment From: jreback
@johne13 almost everything is faster than sorting :) sorting is O(n log n); people go to great lengths to avoid sorting.
Comment From: johne13
Right, except in this case it's "sorting" a column of ones, so it's just a matter of how long it takes the algorithm to confirm that it's already sorted, there is no actual sorting.
But I mainly mentioned rank() just as another function to consider in terms of consistency with how the other existing functions behave.
Comment From: jreback
So we have very fast routines to determine whether something is monotonic, which check whether it's 'sorted' very quickly:

In [2]: Index([1,2,3]).is_monotonic
Out[2]: True

In [3]: Index([3,2,1]).is_monotonic
Out[3]: False