python - Pandas tshift slow in groups


Using pandas tshift is awesome, it's very fast:

    df = pd.DataFrame(index=pd.date_range(pd.datetime(1970, 1, 1), pd.datetime(1970, 2, 1)))
    df['data'] = .5

    %timeit df.sum()
    # 10000 loops, best of 3: 162 µs per loop

    %timeit df.tshift(-1)
    # 1000 loops, best of 3: 307 µs per loop  # x2 slower

But when I do a tshift within a groupby, it gets very slow:

    df = pd.DataFrame(index=pd.date_range(pd.datetime(1970, 1, 1), pd.datetime(1970, 2, 1)))
    df['data'] = .5
    df['A'] = randint(0, 2, len(df.index))

    %timeit df.groupby('A').sum()
    # 100 loops, best of 3: 2.72 ms per loop

    %timeit df.groupby('A').tshift(-1)
    # 10 loops, best of 3: 16 ms per loop  # x6 slower

Why is tshift so much slower when grouping?

Update:

My actual use case is closer to the code below. I suspect the slowdown factor depends on the number of groups.

    n_A = 50
    n_B = 5
    index = pd.MultiIndex.from_product(
        [arange(n_A), arange(n_B),
         pd.date_range(pd.datetime(1975, 1, 1), pd.datetime(2010, 1, 1), freq='5AS')],
        names=['A', 'B', 'Year'])
    df = pd.DataFrame(index=index)
    df['data'] = .5

    %timeit df.reset_index(['A', 'B']).groupby(['A', 'B']).tshift(-1, freq='5AS')
    # 10 loops, best of 3: 193 ms per loop  # x44 slowdown

And if we increase the number of groups:

    n_A = 500
    n_B = 50
    ...

    %timeit df.reset_index(['A', 'B']).groupby(['A', 'B']).sum()
    # 10 loops, best of 3: 35.8 ms per loop

    %timeit df.reset_index(['A', 'B']).groupby(['A', 'B']).tshift(-1, freq='5AS')
    # 1 loops, best of 3: 20.3 s per loop  # x567 slowdown

I wonder why the slowdown grows with the number of groups. Is there a clever way to do this faster?

tshift requires a freq argument (because the freq is not guaranteed to be regular and the same within each group), so df.groupby('A').tshift(-1) yields an empty frame (it is raising inside every group, which is also why it is slow).

    In [44]: %timeit df.groupby('A').tshift(-1, 'D')
    100 loops, best of 3: 3.57 ms per loop

    In [45]: %timeit df.groupby('A').sum()
    1000 loops, best of 3: 1.02 ms per loop

Aside from that, this issue is also waiting on a cythonized implementation of shift (and tshift) inside groupby, which would bring them on par with sum, which is cythonized. Contributions welcome!
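Until such an implementation lands, note that a plain positional shift inside a groupby already takes the cythonized fast path. A sketch on my own illustrative data; this moves the values rather than the index, so it only substitutes for tshift when every group's index is regular:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(index=pd.date_range("1970-01-01", "1970-02-01"))
df["data"] = np.arange(len(df), dtype=float)
df["A"] = np.random.randint(0, 2, len(df.index))

# Cythonized path: shift values by one position within each group.
# Unlike tshift, the index stays put and a NaN appears at the end
# of each group.
fast = df.groupby("A")["data"].shift(-1)
```

The NaNs at group boundaries are the price of shifting values instead of timestamps.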

Using your second dataset (the larger one), you can do this instead:

    In [59]: def f(df):
       ....:     x = df.reset_index()
       ....:     x['Year_ts'] = pd.DatetimeIndex(x['Year']) - pd.offsets.YearBegin(5)
       ....:     return x.drop(['Year'], axis=1).rename(
       ....:         columns={'Year_ts': 'Year'}).set_index(['A', 'B', 'Year'])
       ....:

    In [60]: result = df.reset_index(['A', 'B']).groupby(['A', 'B']).tshift(-1, '5AS')

    In [61]: %timeit df.reset_index(['A', 'B']).groupby(['A', 'B']).tshift(-1, '5AS')
    1 loops, best of 3: 10.8 s per loop

    In [62]: result2 = f(df)

    In [63]: %timeit f(df)
    1 loops, best of 3: 2.51 s per loop

    In [64]: result.equals(result2)
    Out[64]: True

So constructing the shifted datetimes outside of the groupby is about four times faster. This (plus caching) would be a first step toward making grouped tshift faster.
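A self-contained sketch of the same workaround for current pandas: sizes are scaled down for illustration, and "5YS" stands in for the question's "5AS" year-start alias (renamed in pandas 2.2):

```python
import numpy as np
import pandas as pd

# Rebuild the question's MultiIndex frame at a small, illustrative size.
n_A, n_B = 5, 3
index = pd.MultiIndex.from_product(
    [np.arange(n_A), np.arange(n_B),
     pd.date_range("1975-01-01", "2010-01-01", freq="5YS")],
    names=["A", "B", "Year"],
)
df = pd.DataFrame({"data": 0.5}, index=index)

def shift_years(df):
    # One vectorized datetime subtraction over the whole frame replaces
    # a per-group tshift(-1): every Year moves back one 5-year period.
    x = df.reset_index()
    x["Year"] = pd.DatetimeIndex(x["Year"]) - pd.offsets.YearBegin(5)
    return x.set_index(["A", "B", "Year"])

result = shift_years(df)
```

Because the offset arithmetic never touches the groups, its cost depends only on the total number of rows, not on the number of groups.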
