Data Munging for TimeSeries: Lagging Variables Across Multiple Groups
Concepts: TimeSeries, Method Chaining, GroupBy, Lagged columns, and Index manipulation
Modeling time series data can be challenging, so it makes sense that some data enthusiasts (including myself) put off learning this topic until they absolutely have to. Before you can apply machine learning models to time series data, you have to transform it into an “ingestible” format for your models. This often involves calculating lagged variables, which let a model capture autocorrelation, i.e. how past values of a variable relate to its future values, and so unlock predictive value. Below are three approaches I have used recently to generate lagged variables in pandas:
- Lag one or more variables across one group, using the shift method
- Lag one variable across multiple groups, using the unstack method
- Lag multiple variables across multiple groups, using groupby
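To make the idea of a lagged variable concrete before working with the grouped data, here is a minimal sketch on a made-up five-point series. The values and the names `s`, `frame` and `y_lag1` are illustrative assumptions, not part of the data used in the rest of the post. `shift(1)` moves each value down one row, and `autocorr(lag=1)` is the correlation between the series and that one-step lag.

```python
import pandas as pd

# Hypothetical daily series, for illustration only
s = pd.Series(
    [1.0, 2.0, 4.0, 7.0, 11.0],
    index=pd.date_range("2019-01-01", periods=5, freq="D"),
    name="y",
)

# Value at t-1 aligned with time t; the first row has no predecessor, so it becomes NaN
lagged = s.shift(1)
frame = pd.DataFrame({"y": s, "y_lag1": lagged})

# Lag-1 autocorrelation, computed two equivalent ways
lag1_corr = s.autocorr(lag=1)
manual_corr = frame["y"].corr(frame["y_lag1"])  # NaN pairs are dropped before correlating
```

The `y_lag1` column is the kind of feature a model can consume directly: each row carries the previous period's value next to the current one.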
First, let’s generate some dummy time series data as it would appear “in the wild” and put it into three dataframes for illustrative purposes.
import pandas as pd
import numpy as np
np.random.seed(0) # fix the seed so the same random numbers are generated on every run
date = ['2019-01-01']*3 + ['2019-01-02']*3 + ['2019-01-03']*3
var1, var2 = np.random.randn(9), np.random.randn(9)*20
group = ["group1", "group2", "group3"]*3 # to assign the groups for the multiple group case
df_manygrp = pd.DataFrame({"date": date, "group":group, "var1": var1}) # one var, many groups
df_combo = pd.DataFrame({"date": date, "group":group, "var1": var1, "var2": var2}) # many vars, many groups
df_onegrp = df_manygrp[df_manygrp["group"]=="group1"].copy() # one var, one group; copy() avoids a SettingWithCopyWarning when the date column is modified later
df_onegrp # first dataframe
df_manygrp # second dataframe
df_combo # third dataframe
for d in [df_onegrp, df_manygrp, df_combo]: # loop to apply the change to all three dfs
    d["date"] = pd.to_datetime(d["date"]) # convert the date column to datetime
    print("Column changed to: ", d.date.dtype.name)
One group, one variable
df_onegrp.set_index(["date"]).shift(1)
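If you want the lagged values next to the original column, or more than one lag length, the same `shift` call can be repeated per lag. The sketch below uses a small made-up frame; the name `one_group`, the lag lengths in `lags` and the `var1_lag{k}` column naming are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical single-group frame indexed by date
one_group = pd.DataFrame(
    {"var1": [10.0, 20.0, 30.0, 40.0]},
    index=pd.date_range("2019-01-01", periods=4, freq="D", name="date"),
)

lags = [1, 2]  # illustrative choice of lag lengths

# Build one shifted copy of var1 per lag and attach them as new columns
lagged_cols = {f"var1_lag{k}": one_group["var1"].shift(k) for k in lags}
with_lags = one_group.assign(**lagged_cols)
```

Each lag of k periods leaves k missing values at the start of the series, which usually have to be dropped or imputed before modeling.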
Many groups, one variable
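df = df_manygrp.set_index(["date", "group"]) # index by date and group
df = df.unstack().shift(1) # one column per group, then lag every column by one period
df = df.stack(dropna=False) # back to long format; dropna=False keeps the NaN rows created by the shift (the newer future_stack=True implementation does not accept dropna)
df.reset_index().sort_values("group")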
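The reason the unstack approach works is easier to see on a tiny example. The sketch below uses a made-up two-group, two-date frame (`long_df` and its values are assumptions for illustration) and stops at the intermediate wide shape, where each group has its own column and `shift` therefore cannot leak values from one group into another.

```python
import pandas as pd

# Hypothetical long-format data: two dates, two groups
long_df = pd.DataFrame({
    "date": pd.to_datetime(["2019-01-01"] * 2 + ["2019-01-02"] * 2),
    "group": ["a", "b"] * 2,
    "var1": [1.0, 2.0, 3.0, 4.0],
})

# One row per date, one column per group:
# group         a    b
# date
# 2019-01-01  1.0  2.0
# 2019-01-02  3.0  4.0
wide = long_df.set_index(["date", "group"])["var1"].unstack("group")

# Shifting the wide frame lags each group's column independently
wide_lagged = wide.shift(1)
```

Shifting the long frame directly would instead pull the previous row's value, which belongs to a different group on the same date.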
Many groups, many variables
grouped_df = df_combo.groupby("group")
def lag_by_group(key, value_df):
    return (value_df.sort_values(by=["date"], ascending=True)
            .set_index(["date"])
            .shift(1)
            .assign(group=key) # assign returns a copy; the group label is re-set after the shift so it is not lagged along with the values
            ) # the parentheses allow you to chain methods and avoid intermediate variable assignment
dflist = [lag_by_group(key, value_df) for key, value_df in grouped_df] # iterating over a groupby yields (key, sub-dataframe) pairs
pd.concat(dflist, axis=0).reset_index()
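As a point of comparison, pandas also exposes `shift` directly on a groupby object, which gives a more compact route to the same kind of result. This is an alternative technique, not the method used above. The sketch rebuilds a frame with the same layout as the combined dataframe so that it runs on its own; the names `toy`, `result` and the `_lag1` suffix are assumptions for illustration.

```python
import numpy as np
import pandas as pd

np.random.seed(0)
date = pd.to_datetime(["2019-01-01"] * 3 + ["2019-01-02"] * 3 + ["2019-01-03"] * 3)
toy = pd.DataFrame({
    "date": date,
    "group": ["group1", "group2", "group3"] * 3,
    "var1": np.random.randn(9),
    "var2": np.random.randn(9) * 20,
})

# Sort so that, within each group, rows are in chronological order
toy = toy.sort_values(["group", "date"])

# Lag both variables by one period within each group; the original row index is preserved
lagged = toy.groupby("group")[["var1", "var2"]].shift(1)

# Attach the lagged columns to the identifying columns, aligned on the row index
result = toy[["date", "group"]].join(lagged.add_suffix("_lag1"))
```

Because the result keeps the original row index, the lagged columns can be joined back onto the source frame without any stacking or concatenation.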