In this post we will take a very brief look at how to interpret linear regression coefficients. We will then move on to how to visualize interaction terms for continuous variables, and finally how to read interaction coefficients.
Note: unlike most of the other posts on this blog, this post is written in Python.
The first step is to load the usual suspect modules:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm  # needed for sm.datasets below
import statsmodels.formula.api as smf
%matplotlib inline
Next up is to load the (famous!) dataset:
mtcars = sm.datasets.get_rdataset("mtcars", "datasets", cache=True).data
mtcars.head() #take a quick look
| | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb |
---|---|---|---|---|---|---|---|---|---|---|---|
Mazda RX4 | 21.0 | 6 | 160 | 110 | 3.90 | 2.620 | 16.46 | 0 | 1 | 4 | 4 |
Mazda RX4 Wag | 21.0 | 6 | 160 | 110 | 3.90 | 2.875 | 17.02 | 0 | 1 | 4 | 4 |
Datsun 710 | 22.8 | 4 | 108 | 93 | 3.85 | 2.320 | 18.61 | 1 | 1 | 4 | 1 |
Hornet 4 Drive | 21.4 | 6 | 258 | 110 | 3.08 | 3.215 | 19.44 | 1 | 0 | 3 | 1 |
Hornet Sportabout | 18.7 | 8 | 360 | 175 | 3.15 | 3.440 | 17.02 | 0 | 0 | 3 | 2 |
Our goal is to explore how the weight of a car and the number of cylinders are related to the mileage of the car. Additionally, we are interested to know how the weight affects the relationship between the number of cylinders and the mileage. We will answer both of these questions below, but first let’s plot the data to get a sense of how it looks:
sns.lmplot(x='wt', y='mpg', hue='cyl', data=mtcars, fit_reg=False, palette='viridis', height=5, aspect=2.5)
plt.ylabel("Miles per Gallon")
plt.xlabel("Vehicle Weight");
This dataset shows a number of parameters for different cars. The goal is to predict miles per gallon (mpg) given number of cylinders (cyl) and weight (wt). That is, we wish to regress mpg on wt and cyl.
Using the statsmodels module we are able to use R style formula input (which includes an intercept term by default):
model = smf.ols(formula='mpg ~ wt + cyl', data=mtcars).fit()
summary = model.summary()
summary.tables[1]
| | coef | std err | t | P>\|t\| | [95.0% Conf. Int.] |
---|---|---|---|---|---|
Intercept | 39.6863 | 1.715 | 23.141 | 0.000 | 36.179 43.194 |
wt | -3.1910 | 0.757 | -4.216 | 0.000 | -4.739 -1.643 |
cyl | -1.5078 | 0.415 | -3.636 | 0.001 | -2.356 -0.660 |
The resulting model is written as:
\[ mpg = \beta_0 + \beta_1\cdot{wt} + \beta_2\cdot{cyl} \]
\[ mpg = 39.69 - 3.19\cdot{wt} - 1.51\cdot{cyl} \]
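As a quick sanity check on the fitted equation, the sketch below plugs the Mazda RX4 row from the table at the top of the post (wt = 2.620, cyl = 6, observed mpg = 21.0) into the coefficients printed in the summary. The function name `predict_additive` is our own and is not part of statsmodels.

```python
# Coefficients copied from the summary table above
INTERCEPT, B_WT, B_CYL = 39.6863, -3.1910, -1.5078


def predict_additive(wt, cyl):
    """Predicted mpg from the additive model mpg ~ wt + cyl."""
    return INTERCEPT + B_WT * wt + B_CYL * cyl


# Mazda RX4: wt = 2.620, cyl = 6, observed mpg = 21.0
pred = predict_additive(2.620, 6)
print(round(pred, 2))  # 22.28
```

With the fitted result in hand, `model.predict(pd.DataFrame({'wt': [2.62], 'cyl': [6]}))` should give the same number up to rounding.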
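The coefficients of the model can be read as follows:
For every 1 unit increase in weight, mpg decreases by \(3.19\) (holding cylinders constant)
For every 1 unit increase in cylinders, mpg decreases by \(1.51\) (holding weight constant)
At 0 weight and 0 cylinders, we expect mpg to be 39.69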
We also note that the coefficients for wt and cyl have significant p-values (at the common alpha level of 0.05). Each coefficient's p-value is the p-value for the hypothesis test where:
\(H_0: \beta_i = 0\)
\(H_a: \beta_i \neq 0\)
Thus in both cases we reject the null hypothesis.
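The t column in the summary is each coefficient divided by its standard error, and the p-value is derived from that ratio. A minimal check using the rounded values printed in the table (so small discrepancies in the third decimal are expected):

```python
# (coef, std err, reported t) copied from the summary table above
rows = {
    "wt": (-3.1910, 0.757, -4.216),
    "cyl": (-1.5078, 0.415, -3.636),
}

t_stats = {}
for name, (coef, se, reported_t) in rows.items():
    t_stats[name] = coef / se
    # The inputs are rounded, so we only expect agreement to about two decimals
    print(f"{name}: computed t = {t_stats[name]:.3f}, reported t = {reported_t}")
```

A large absolute t value means the estimate is many standard errors away from 0, which is why both p-values are small.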
Finally, this model has an adjusted \(R^2\) of 0.819 - we will look to improve this with an interaction term.
By claiming there may be an interaction between weight and cylinders, we are saying that we believe the relationship between the weight of the vehicle and the mpg is different for vehicles with different numbers of cylinders.
We can add this to our model as follows:
model_interaction = smf.ols(formula='mpg ~ wt + cyl + wt:cyl', data=mtcars).fit()
summary = model_interaction.summary()
summary.tables[1]
| | coef | std err | t | P>\|t\| | [95.0% Conf. Int.] |
---|---|---|---|---|---|
Intercept | 54.3068 | 6.128 | 8.863 | 0.000 | 41.755 66.858 |
wt | -8.6556 | 2.320 | -3.731 | 0.001 | -13.408 -3.903 |
cyl | -3.8032 | 1.005 | -3.784 | 0.001 | -5.862 -1.745 |
wt:cyl | 0.8084 | 0.327 | 2.470 | 0.020 | 0.138 1.479 |
This time, the adjusted \(R^2\) of our model is 0.846, an improvement over the previous value without the interaction term (0.819). We also see that the coefficients on both wt and cyl have changed but remain significant, and that the interaction term itself is significant. This is evidence that there is an interaction between the variables.
There is no magic in interaction terms. The term is equivalent to multiplying the two variables together and using the product as a new predictor:
#Create new variable
mtcars['wt_cyl'] = mtcars.wt * mtcars.cyl
model_multiply = smf.ols(formula='mpg ~ wt + cyl + wt_cyl', data=mtcars).fit()
summary = model_multiply.summary()
summary.tables[1]
| | coef | std err | t | P>\|t\| | [95.0% Conf. Int.] |
---|---|---|---|---|---|
Intercept | 54.3068 | 6.128 | 8.863 | 0.000 | 41.755 66.858 |
wt | -8.6556 | 2.320 | -3.731 | 0.001 | -13.408 -3.903 |
cyl | -3.8032 | 1.005 | -3.784 | 0.001 | -5.862 -1.745 |
wt_cyl | 0.8084 | 0.327 | 2.470 | 0.020 | 0.138 1.479 |
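The formula interface also has a shorthand for this: `wt * cyl` expands to `wt + cyl + wt:cyl`. The sketch below checks that the two spellings give the same fit. It uses a small synthetic data frame (made-up values, not mtcars, and named `toy` here) so that it runs without downloading anything; only the column names are borrowed from the post.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data (NOT mtcars): made-up values so the block runs offline
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "wt": rng.uniform(1.5, 5.5, size=40),
    "cyl": rng.choice([4, 6, 8], size=40),
})
toy["mpg"] = (50 - 8 * toy.wt - 3 * toy.cyl
              + 0.8 * toy.wt * toy.cyl
              + rng.normal(0, 1, size=40))

# Explicit main effects plus interaction vs. the `*` shorthand
explicit = smf.ols("mpg ~ wt + cyl + wt:cyl", data=toy).fit()
shorthand = smf.ols("mpg ~ wt * cyl", data=toy).fit()

same_terms = list(explicit.params.index) == list(shorthand.params.index)
same_values = np.allclose(explicit.params.values, shorthand.params.values)
print(same_terms, same_values)
```

Writing the main effects explicitly, as this post does, keeps it obvious which terms are in the model. The shorthand is convenient once the expansion is familiar.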
In contrast to the previous formulation, our new formulation is:
\[ mpg = \beta_0 + \beta_1\cdot{wt} + \beta_2\cdot{cyl} + \beta_3\cdot{wt}\cdot{cyl} \]
\[ mpg = 54.31 - 8.66\cdot{wt} - 3.80\cdot{cyl} + 0.81\cdot{wt}\cdot{cyl} \]
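One way to see what the interaction term does is to regroup the equation by wt. This is only a rearrangement of the formula above, with no new assumptions:

\[ mpg = (\beta_0 + \beta_2\cdot{cyl}) + (\beta_1 + \beta_3\cdot{cyl})\cdot{wt} \]

For a fixed number of cylinders the model is still a straight line in wt, but both its intercept and its slope depend on cyl. By symmetry, the slope with respect to cyl is \(\beta_2 + \beta_3\cdot{wt}\).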
But how do we interpret these new coefficients? We say:
For every 1 unit increase in weight, mpg decreases by \(8.66\) (holding cylinders at 0)
More generally, for every 1 unit increase in weight, mpg changes by \(-8.66 + 0.81\cdot{cyl}\)
For every 1 unit increase in cylinders, mpg decreases by \(3.80\) (holding weight at 0)
More generally, for every 1 unit increase in cylinders, mpg changes by \(-3.80 + 0.81\cdot{wt}\)
At 0 weight and 0 cylinders, we expect mpg to be 54.31
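To put numbers on these statements, the following sketch evaluates the slope of mpg with respect to wt at the three cylinder counts that occur in mtcars (4, 6 and 8), using the coefficients from the interaction model's summary table. The helper name `wt_slope` is ours.

```python
# Coefficients copied from the interaction model summary above
B_WT, B_INTERACTION = -8.6556, 0.8084


def wt_slope(cyl):
    """Change in predicted mpg per 1 unit increase in wt, at a given cyl."""
    return B_WT + B_INTERACTION * cyl


slopes = {cyl: round(wt_slope(cyl), 2) for cyl in (4, 6, 8)}
print(slopes)  # {4: -5.42, 6: -3.81, 8: -2.19}
```

The slope is negative at all three cylinder counts but shrinks in magnitude as cyl grows, so extra weight costs a 4-cylinder car more mpg than it costs an 8-cylinder car.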
We can see the interaction by splitting one of the interacting variables at its median, and then plotting the response variable against the other variable in the pair:
# Label each car by whether its cylinder count is above the median
mtcars['cyl_med'] = np.where(mtcars.cyl > mtcars.cyl.median(), "Above Median", "Below Median")
sns.lmplot(x='wt', y='mpg', hue='cyl_med', data=mtcars, ci=None, height=5, aspect=2.5);
This plot shows that when the cylinder value is small (i.e. at or below the median value), the relationship between mpg and wt is strongly negative. Conversely, at higher cylinder values, there is a much weaker relationship between mpg and wt!
In a plot such as this, the larger the difference in slopes, the larger the interaction effect.