Hello the awesome Pandas team!
Consider the example below
data = pd.DataFrame({'mydate' : [pd.to_datetime('2016-06-06'),
pd.to_datetime('2016-06-08'),
pd.to_datetime('2016-06-09'),
pd.to_datetime('2016-06-10'),
pd.to_datetime('2016-06-12'),
pd.to_datetime('2016-06-13')],
'myvalue' : [1, 2, 3, 4, 5, 6],
'group' : ['A', 'A', 'A', 'B', 'B', 'B']})
data.set_index('mydate', inplace = True)
Out[58]:
group myvalue
mydate
2016-06-06 A 1
2016-06-08 A 2
2016-06-09 A 3
2016-06-10 B 4
2016-06-12 B 5
2016-06-13 B 6
Now I need to compute the difference between the current value of myvalue
and its lagged value, where by lagged I actually mean lagged by 1 day (if possible).
So this returns a result, but its not what I need
data['delta_one'] = data.groupby('group').myvalue.transform(lambda x: x - x.shift(1))
data
Out[56]:
group myvalue delta_one
mydate
2016-06-06 A 1 nan
2016-06-08 A 2 1.0000
2016-06-09 A 3 1.0000
2016-06-10 B 4 nan
2016-06-12 B 5 1.0000
2016-06-13 B 6 1.0000
This is what I need, but it does not work
data['delta_two'] = data.groupby('group').myvalue.transform(lambda x: x - x.shift(1, pd.Timedelta('1 days')))
File "C:\Users\john\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\internals.py", line 120, in __init__
len(self.mgr_locs)))
ValueError: Wrong number of items passed 4, placement implies 3
Any ideas? Is this a bug? I think I am using the correct pandonic syntax here.
Thanks!
Comment From: chris-b1
I think apply
works here?
In [14]: data.groupby('group').myvalue.apply(lambda x: x - x.shift(1, pd.Timedelta('1 days')))
Out[14]:
group mydate
A 2016-06-06 NaN
2016-06-07 NaN
2016-06-08 NaN
2016-06-09 1.0
2016-06-10 NaN
B 2016-06-10 NaN
2016-06-11 NaN
2016-06-12 NaN
2016-06-13 1.0
2016-06-14 NaN
Name: myvalue, dtype: float64
Comment From: randomgambit
@chris-b1 thanks yes, I saw that but you cannot directly assign this data to the dataframe
data['delta_two'] = data.groupby('group').myvalue.apply(lambda x: x - x.shift(1, pd.Timedelta('1 days')))
File "C:\Users\john\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\frame.py", line 2734, in reindexer
raise TypeError('incompatible index of inserted column '
TypeError: incompatible index of inserted column with frame index
Comment From: mroeschke
You can do some reshaping and remerge the result of the groupby.apply
to your original data
In [5]: res = data.groupby('group').myvalue.apply(lambda x: x - x.shift(1, pd.Timedelta('1 days')))
In [6]: res.name = 'delta_one'
In [7]: data.reset_index().merge(res.reset_index(), how='left', on=['mydate', 'group']).set_index('mydate')
Out[7]:
group myvalue delta_one
mydate
2016-06-06 A 1 NaN
2016-06-08 A 2 NaN
2016-06-09 A 3 1.0
2016-06-10 B 4 NaN
2016-06-12 B 5 NaN
2016-06-13 B 6 1.0
Comment From: randomgambit
@mroeschke thanks buddy, yes I ended up doing something very much like that :) I wonder if this behavior is expected or not
Comment From: WillAyd
My $.02 is that if you wanted dates to be sequential by day then you should be reindexing and filling the group before any groupby. So something along the lines of:
>>> dt_rng = pd.date_range(data.index.min(), data.index.max())
>>> data = data.reindex(rng)
>>> data['group'] = data['group'].ffill()
>>> data.groupby('group')['myvalue'].transform(lambda x: x-x.shift())
2016-06-06 NaN
2016-06-07 NaN
2016-06-08 NaN
2016-06-09 1.0
2016-06-10 NaN
2016-06-11 NaN
2016-06-12 NaN
2016-06-13 1.0
Freq: D, Name: myvalue, dtype: float64
Probably a little more verbose than you want to be and gives a different result than what you have, but seems the most "pandonic" to me
Comment From: jreback
lots of good answers here. .shift()
does care whether an index is complete or not, just that it is ordered. You must resample / reindex-fill yourself.
I suppose adding this example to the doc-string (or maybe in the docs might be easier) could be informative. Going to repurpose this issue for that enhancement.
Comment From: randomgambit
Hi @jeffreback thanks!
but what do you mean by shift does not care whether and index is ordered or not? Using timedelta works with shift if I am not mistaken.
essentially, I ended up writing a function that is very similar to what @chris-b1 is doing. I hope all is good
Comment From: jreback
@randomgambit
I said
does care whether an index is complete or not, just that it is ordered. You must resample / reindex-fill yourself.
it doesn't care whether its complete, just ordered.
Comment From: PyJay
I will add this example to the documentation
Comment From: PyJay
PR here - https://github.com/pandas-dev/pandas/pull/21039
Comment From: randomgambit
Hello guys, as the original OP I would be happy to contribute here!
From what I see above, you want to get rid of the pd.Timedelta shift? But that is the essence of the question! and the solution proposed by @mroeschke in #20492 works like a charm!
Here is a variant
df = pd.DataFrame({'mytime' : [pd.to_datetime('2018-01-01 14:34:12'),
pd.to_datetime('2018-01-01 14:34:13'),
pd.to_datetime('2018-01-01 14:34:15'),
pd.to_datetime('2018-01-01 14:34:16'),
pd.to_datetime('2018-01-01 14:34:28')],
'myvalue' : [1,2,np.NaN,3,1],
'mygroup' : ['a','a','a','b','b']})
df.set_index('mytime', inplace = True)
shifted_df = df.groupby('mygroup').myvalue.apply(lambda x: x - x.shift(1, pd.Timedelta('1 sec')))
shifted_df.name = 'shifted'
df = df.reset_index().merge(shifted_df.reset_index(), on = ['mytime','mygroup'], how = 'left')
df
Out[108]:
mytime mygroup myvalue shifted
0 2018-01-01 14:34:12 a 1.00000 nan
1 2018-01-01 14:34:13 a 2.00000 1.00000
2 2018-01-01 14:34:15 a nan nan
3 2018-01-01 14:34:16 b 3.00000 nan
4 2018-01-01 14:34:28 b 1.00000 nan
Why dont you just use that in the docs? Pretty nice IMHO
Comment From: jbrockmendel
@randomgambit the OP example now doesn't raise, can you try it and see if the result it gives is what you expect
Comment From: TheCheerfulCoder
take
Comment From: Jahanvi-vy
take
Comment From: kerrf
take