Pandas issue when shifting with Timedelta in a groupby

Hello the awesome Pandas team!

Consider the example below

data = pd.DataFrame({'mydate' : [pd.to_datetime('2016-06-06'),
                                 pd.to_datetime('2016-06-08'),
                                 pd.to_datetime('2016-06-09'),
                                 pd.to_datetime('2016-06-10'),
                                 pd.to_datetime('2016-06-12'),
                                 pd.to_datetime('2016-06-13')],
                     'myvalue' : [1, 2, 3, 4, 5, 6],
                     'group' : ['A', 'A', 'A', 'B', 'B', 'B']})

data.set_index('mydate', inplace = True)
Out[58]: 
           group  myvalue
mydate                   
2016-06-06     A        1
2016-06-08     A        2
2016-06-09     A        3
2016-06-10     B        4
2016-06-12     B        5
2016-06-13     B        6

Now I need to compute the difference between the current value of myvalue and its lagged value, where by lagged I actually mean lagged by 1 day (if possible).

So this returns a result, but its not what I need

data['delta_one'] = data.groupby('group').myvalue.transform(lambda x: x - x.shift(1))

data
Out[56]: 
           group  myvalue  delta_one
mydate                              
2016-06-06     A        1        nan
2016-06-08     A        2     1.0000
2016-06-09     A        3     1.0000
2016-06-10     B        4        nan
2016-06-12     B        5     1.0000
2016-06-13     B        6     1.0000

This is what I need, but it does not work

data['delta_two'] = data.groupby('group').myvalue.transform(lambda x: x - x.shift(1, pd.Timedelta('1 days')))

  File "C:\Users\john\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\internals.py", line 120, in __init__
    len(self.mgr_locs)))

ValueError: Wrong number of items passed 4, placement implies 3

Any ideas? Is this a bug? I think I am using the correct pandonic syntax here.

Thanks!

Comment From: chris-b1

I think apply works here?

In [14]: data.groupby('group').myvalue.apply(lambda x: x - x.shift(1, pd.Timedelta('1 days')))
Out[14]: 
group  mydate    
A      2016-06-06    NaN
       2016-06-07    NaN
       2016-06-08    NaN
       2016-06-09    1.0
       2016-06-10    NaN
B      2016-06-10    NaN
       2016-06-11    NaN
       2016-06-12    NaN
       2016-06-13    1.0
       2016-06-14    NaN
Name: myvalue, dtype: float64

Comment From: randomgambit

@chris-b1 thanks yes, I saw that but you cannot directly assign this data to the dataframe

data['delta_two'] = data.groupby('group').myvalue.apply(lambda x: x - x.shift(1, pd.Timedelta('1 days')))

  File "C:\Users\john\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\frame.py", line 2734, in reindexer
    raise TypeError('incompatible index of inserted column '

TypeError: incompatible index of inserted column with frame index

Comment From: mroeschke

You can do some reshaping and remerge the result of the groupby.apply to your original data

In [5]: res = data.groupby('group').myvalue.apply(lambda x: x - x.shift(1, pd.Timedelta('1 days')))

In [6]: res.name = 'delta_one'

In [7]: data.reset_index().merge(res.reset_index(), how='left', on=['mydate', 'group']).set_index('mydate')
Out[7]:
           group  myvalue  delta_one
mydate
2016-06-06     A        1        NaN
2016-06-08     A        2        NaN
2016-06-09     A        3        1.0
2016-06-10     B        4        NaN
2016-06-12     B        5        NaN
2016-06-13     B        6        1.0

Comment From: randomgambit

@mroeschke thanks buddy, yes I ended up doing something very much like that :) I wonder if this behavior is expected or not

Comment From: WillAyd

My $.02 is that if you wanted dates to be sequential by day then you should be reindexing and filling the group before any groupby. So something along the lines of:

>>> dt_rng = pd.date_range(data.index.min(), data.index.max())
>>> data = data.reindex(rng)
>>> data['group'] = data['group'].ffill()
>>> data.groupby('group')['myvalue'].transform(lambda x: x-x.shift())
2016-06-06    NaN
2016-06-07    NaN
2016-06-08    NaN
2016-06-09    1.0
2016-06-10    NaN
2016-06-11    NaN
2016-06-12    NaN
2016-06-13    1.0
Freq: D, Name: myvalue, dtype: float64

Probably a little more verbose than you want to be and gives a different result than what you have, but seems the most "pandonic" to me

Comment From: jreback

lots of good answers here. .shift() does care whether an index is complete or not, just that it is ordered. You must resample / reindex-fill yourself.

I suppose adding this example to the doc-string (or maybe in the docs might be easier) could be informative. Going to repurpose this issue for that enhancement.

Comment From: randomgambit

Hi @jeffreback thanks!

but what do you mean by shift does not care whether and index is ordered or not? Using timedelta works with shift if I am not mistaken.

essentially, I ended up writing a function that is very similar to what @chris-b1 is doing. I hope all is good

Comment From: jreback

@randomgambit

I said

does care whether an index is complete or not, just that it is ordered. You must resample / reindex-fill yourself.

it doesn't care whether its complete, just ordered.

Comment From: PyJay

I will add this example to the documentation

Comment From: PyJay

PR here - https://github.com/pandas-dev/pandas/pull/21039

Comment From: randomgambit

Hello guys, as the original OP I would be happy to contribute here!

From what I see above, you want to get rid of the pd.Timedelta shift? But that is the essence of the question! and the solution proposed by @mroeschke in #20492 works like a charm!

Here is a variant

df = pd.DataFrame({'mytime' : [pd.to_datetime('2018-01-01 14:34:12'),
                             pd.to_datetime('2018-01-01 14:34:13'),
                             pd.to_datetime('2018-01-01 14:34:15'),
                             pd.to_datetime('2018-01-01 14:34:16'),
                             pd.to_datetime('2018-01-01 14:34:28')],
                    'myvalue' : [1,2,np.NaN,3,1],
                    'mygroup' : ['a','a','a','b','b']})

df.set_index('mytime', inplace = True)

shifted_df =  df.groupby('mygroup').myvalue.apply(lambda x: x - x.shift(1, pd.Timedelta('1 sec')))
shifted_df.name = 'shifted'
df = df.reset_index().merge(shifted_df.reset_index(), on = ['mytime','mygroup'], how = 'left')

df
Out[108]: 
               mytime mygroup  myvalue  shifted
0 2018-01-01 14:34:12       a  1.00000      nan
1 2018-01-01 14:34:13       a  2.00000  1.00000
2 2018-01-01 14:34:15       a      nan      nan
3 2018-01-01 14:34:16       b  3.00000      nan
4 2018-01-01 14:34:28       b  1.00000      nan

Why dont you just use that in the docs? Pretty nice IMHO

Comment From: jbrockmendel

@randomgambit the OP example now doesn't raise, can you try it and see if the result it gives is what you expect