Data Munging for TimeSeries: Lagging Variables Across Multiple Groups¶

Concepts: TimeSeries, Method Chaining, GroupBy, Lagged columns, and Index manipulation

Modeling time series data can be challenging, so it makes sense that some data enthusiasts (including myself) put off learning this topic until they absolutely have to. Before you can apply machine learning models to time series data, you have to transform it to an “ingestible” format for your models, and this often involves calculating lagged variables, which can measure auto-correlation i.e. how past values of a variable influence its future values, thus unlocking predictive value. Below are 3 different approaches I have used recently to generate lagged variables in Pandas:

Lag one or more variables across one group — using shift method
Lag one variable across multiple groups — using unstack method
Lag multiple variables across multiple groups — with groupby

First, let’s generate some dummy time series data as it would appear “in the wild” and put it into two dataframes for illustrative purposes.

import pandas as pd
import numpy as np

np.random.seed(0) # ensures the same set of random numbers are generated
date = ['2019-01-01']*3 + ['2019-01-02']*3 + ['2019-01-03']*3
var1, var2 = np.random.randn(9), np.random.randn(9)*20
group = ["group1", "group2", "group3"]*3 # to assign the groups for the multiple group case

df_manygrp = pd.DataFrame({"date": date, "group":group, "var1": var1}) # one var, many groups
df_combo = pd.DataFrame({"date": date, "group":group, "var1": var1, "var2": var2}) # many vars, many groups
df_onegrp = df_manygrp[df_manygrp["group"]=="group1"] # one var, one group

df_onegrp # first dataframe

df_manygrp # second dataframe

df_combo # third dataframe

for d in [df_onegrp, df_manygrp, df_combo]: # loop to apply the change to both dfs
    d["date"] = pd.to_datetime(d['date']) # date column to datetime
    print("Column changed to: ", d.date.dtype.name)

Column changed to:  datetime64[ns]
Column changed to:  datetime64[ns]
Column changed to:  datetime64[ns]

C:\Users\tichmangono\Anaconda3\lib\site-packages\ipykernel_launcher.py:2: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy

One group, one variable¶

df_onegrp.set_index(["date"]).shift(1)

Many groups, one variable¶

df = df_manygrp.set_index(["date", "group"])
df = dd.unstack().shift(1)
df= dd.stack(dropna=False)

dd.reset_index().sort_values("group")

Many groups, Many variables¶

grouped_df = df_combo.groupby(["group"])

def lag_by_group(key, value_df):
    df = value_df.assign(group = key) # this pandas method returns a copy of the df, with group columns assigned the key value
    return (df.sort_values(by=["date"], ascending=True)
        .set_index(["date"])
        .shift(1)
               ) # the parenthesis allow you to chain methods and avoid intermediate variable assignment

dflist = [lag_by_group(g, grouped_df.get_group(g)) for g in grouped_df.groups.keys()]
pd.concat(dflist, axis=0).reset_index()

	date	group	var1
0	2019-01-01	group1	1.764052
3	2019-01-02	group1	2.240893
6	2019-01-03	group1	0.950088

	group	var1
date
2019-01-01	NaN	NaN
2019-01-02	group1	1.764052
2019-01-03	group1	2.240893

	date	group	var1	var2
0	2019-01-01	group1	1.764052	8.211970
1	2019-01-01	group2	0.400157	2.880871
2	2019-01-01	group3	0.978738	29.085470
3	2019-01-02	group1	2.240893	15.220755
4	2019-01-02	group2	1.867558	2.433500
5	2019-01-02	group3	-0.977278	8.877265
6	2019-01-03	group1	0.950088	6.673487
7	2019-01-03	group2	-0.151357	29.881581
8	2019-01-03	group3	-0.103219	-4.103165

Time Series Data Munging Lagging variables across multiple groups timeseries, lag variables, method-chaining, stack/unstack

Data Munging for TimeSeries: Lagging Variables Across Multiple Groups¶

One group, one variable¶

Many groups, one variable¶

Many groups, Many variables¶