04 May 2018
timeseries lag

Data Munging for TimeSeries: Lagging Variables Across Multiple Groups

Concepts: TimeSeries, Method Chaining, GroupBy, Lagged columns, and Index manipulation

Modeling time series data can be challenging, so it makes sense that some data enthusiasts (including myself) put off learning this topic until they absolutely have to. Before you can apply machine learning models to time series data, you have to transform it to an “ingestible” format for your models, and this often involves calculating lagged variables, which can measure auto-correlation i.e. how past values of a variable influence its future values, thus unlocking predictive value. Below are 3 different approaches I have used recently to generate lagged variables in Pandas:

  • Lag one or more variables across one group — using shift method
  • Lag one variable across multiple groups — using unstack method
  • Lag multiple variables across multiple groups — with groupby

First, let’s generate some dummy time series data as it would appear “in the wild” and put it into two dataframes for illustrative purposes.

In [1]:
import pandas as pd
import numpy as np

np.random.seed(0) # ensures the same set of random numbers are generated
date = ['2019-01-01']*3 + ['2019-01-02']*3 + ['2019-01-03']*3
var1, var2 = np.random.randn(9), np.random.randn(9)*20
group = ["group1", "group2", "group3"]*3 # to assign the groups for the multiple group case

df_manygrp = pd.DataFrame({"date": date, "group":group, "var1": var1}) # one var, many groups
df_combo = pd.DataFrame({"date": date, "group":group, "var1": var1, "var2": var2}) # many vars, many groups
df_onegrp = df_manygrp[df_manygrp["group"]=="group1"] # one var, one group
In [216]:
df_onegrp # first dataframe
Out[216]:
date group var1
0 2019-01-01 group1 1.764052
3 2019-01-02 group1 2.240893
6 2019-01-03 group1 0.950088
In [153]:
df_manygrp # second dataframe
Out[153]:
date group var1
0 2019-01-01 group1 1.764052
1 2019-01-01 group2 0.400157
2 2019-01-01 group3 0.978738
3 2019-01-02 group1 2.240893
4 2019-01-02 group2 1.867558
5 2019-01-02 group3 -0.977278
6 2019-01-03 group1 0.950088
7 2019-01-03 group2 -0.151357
8 2019-01-03 group3 -0.103219
In [154]:
df_combo # third dataframe
Out[154]:
date group var1 var2
0 2019-01-01 group1 1.764052 8.211970
1 2019-01-01 group2 0.400157 2.880871
2 2019-01-01 group3 0.978738 29.085470
3 2019-01-02 group1 2.240893 15.220755
4 2019-01-02 group2 1.867558 2.433500
5 2019-01-02 group3 -0.977278 8.877265
6 2019-01-03 group1 0.950088 6.673487
7 2019-01-03 group2 -0.151357 29.881581
8 2019-01-03 group3 -0.103219 -4.103165
In [155]:
for d in [df_onegrp, df_manygrp, df_combo]: # loop to apply the change to both dfs
    d["date"] = pd.to_datetime(d['date']) # date column to datetime
    print("Column changed to: ", d.date.dtype.name)
Column changed to:  datetime64[ns]
Column changed to:  datetime64[ns]
Column changed to:  datetime64[ns]
C:\Users\tichmangono\Anaconda3\lib\site-packages\ipykernel_launcher.py:2: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy

One group, one variable

In [156]:
df_onegrp.set_index(["date"]).shift(1)
Out[156]:
group var1
date
2019-01-01 NaN NaN
2019-01-02 group1 1.764052
2019-01-03 group1 2.240893

Many groups, one variable

In [202]:
df = df_manygrp.set_index(["date", "group"])
df = dd.unstack().shift(1)
df= dd.stack(dropna=False)

dd.reset_index().sort_values("group")
Out[202]:
date group var1
0 2019-01-01 group1 1.764052
1 2019-01-01 group2 0.400157
2 2019-01-01 group3 0.978738
3 2019-01-02 group1 2.240893
4 2019-01-02 group2 1.867558
5 2019-01-02 group3 -0.977278
6 2019-01-03 group1 0.950088
7 2019-01-03 group2 -0.151357
8 2019-01-03 group3 -0.103219

Many groups, Many variables

In [205]:
grouped_df = df_combo.groupby(["group"])

def lag_by_group(key, value_df):
    df = value_df.assign(group = key) # this pandas method returns a copy of the df, with group columns assigned the key value
    return (df.sort_values(by=["date"], ascending=True)
        .set_index(["date"])
        .shift(1)
               ) # the parenthesis allow you to chain methods and avoid intermediate variable assignment

dflist = [lag_by_group(g, grouped_df.get_group(g)) for g in grouped_df.groups.keys()]
pd.concat(dflist, axis=0).reset_index()

Tutorials 2 timeseries 1 , tutorial 2 , python 4 , data analysis 2 , pandas 2 ,

blog comments powered by Disqus