Welcome

Hey there! This is the second guide in my Modeling Guide with R series. In the previous guide, we performed some preliminary explorations on our dataset and took a look at a few of the variables. We cleaned the dataset up a bit and exported it so we could read it again in later guides.

This page will focus on both simple and multiple linear regression. Although, I cannot cover all of the details and fine use cases of linear regression, we will explore some of the key ideas to focus on when creating and analyzing a linear regression model. We’ll start by importing our data and identifying variables to use in the model then move to creating, interpreting, and testing the model. In both simple and linear regression, we’ll use train/test validation and we’ll also look at several candidate models.

We will be using the same dataset for both the simple and multiple regression cases so let’s get that set up first. As mentioned in the previous, guide, I prefer to use the tidyverse family of packages. We’ll also be using the tidymodels collection of packages to set up the models and perform our train/test validation and GGally to make a scatterplot matrix. The performance package helps us check the model after we’ve created it and the others are helpful for rendering good tables.

The code below imports our packages and the data we cleaned in the previous guide. It also splits the data into a train and test set. We will train our models using the training set and will test its performance using the test set. We do this to simulate how the model will perform in the “real world”.

I will be setting the seed for the random generator so we can get the same results each time I start a new R session and split the data. You may or may not wish to do this in your work. I want to make sure we get a roughly equal amount of each store in the split sets so I will take a stratified sample to make the split (strata = store).

library(tidyverse)
library(tidymodels)
library(GGally)
library(performance)
library(knitr)
library(kableExtra)

retail <- read_csv(here::here("data/retail_clean.csv"))

set.seed(52319)
retail_split <- initial_split(retail, prop = .8, strata = store)

train <- training(retail_split)
test <- testing(retail_split)

Alright! Now that we have that set, we’ll use the train data for all the model creation and will only use the test data when we are analyzing our models.

Simple Linear Regression (SLR)

Although I’ll avoid any complex math throughout this guide, it is important to think about how regression works. The formula for simple linear regression for a population is

\[ y = \beta_0 + \beta_1 x_1 + \varepsilon \]

which states that we can represent the relationship between our predictor variable and our response variable as a straight line plus some natural error (\(\varepsilon\)). The problem is, we don’t know the information about our population (i.e., we don’t have data from all transactions from all stores). Instead, we have a subset of that data (some transactions froms ome stores) and are looking to predict the population values using our sample. As a result, we are creating a model of the form

\[ \hat{y} = \hat{\beta_0} + \hat{\beta_1} x_1 \] There is a difference here. First, we’ve lost the error term since we are now in prediction mode (there is no error since we know the data we collected). You also now see hats above a few of the values. This indicates that they are estimates of the true values.

Requirements

When might we want to use SLR? Well, first we need to make sure we have the correct variable types:

  • The response variable (the variable we are predicting) must be continuous.
  • The predictor variable (the variable we are using to predict) can be either continuous or categorical with two levels.
  • We only get one predictor variable and one response variable.

Second, we have to meet certain criteria. See, simple linear regression can be somewhat limiting as it requires us to make certain assumptions about our data and the population we are hoping to model due to the mathematical formulas and matrix algebra used. This is what we should pay attention to:

  • The relationship between the predictor variable and the response variable(s) should be linear
  • The errors of our predictions should be approximately normally distributed
  • There should be a constant variability in our errors
  • Our errors should be independent (we will not focus on this here)

These will be explored later on when we investigate our variables.

Selecting Variables

Given all of that information, let’s choose the variables we want to use in the model. Suppose we want to have a way to predict the rating of a transaction. The rating variable is continuous so it is acceptable to use as a response variable. To try and predict rating, we will consider a few candidate models (each with one predictor variable). Let’s start with total, and unit_price as possible predictors.

At this point, it’s a good idea to write down your predictions about what might happen. This is important as you want to make sure you don’t create a prediction based on your results so you don’t create any bias. At this point, I am thinking that unit_price will have the most effect on rating since I think that when items cost more, people are more likely to be grumpy and give a lower rating.

Using the formula above, we can create two possible models that we want to use to predict rating:

\[ \widehat{\text{rating}} = \hat{\beta_0} + \hat{\beta_1} * \text{total} \]

\[ \widehat{\text{rating}} = \hat{\beta_0} + \hat{\beta_1} * \text{unit_price} \] In each case, we are trying to find the values for \(\hat{\beta_0}\) and \(\hat{\beta_1}\) using our predictor variable.

Building the Model

Let’s build the model! We will use the tidymodels family of packages for this. We first set up the regression engine, then fit the models with our data. Based on the work done in the first guide, we saw that total has some right skew to it due to some outliers to the right. As a result, we are going to take the log of total.

train_subset <- train %>% 
  select(rating, unit_price, total) %>% 
  mutate(log_total = log(total))

lm_spec <- linear_reg() %>%
  set_mode("regression") %>%
  set_engine("lm")

slr_mod_price <- lm_spec %>% 
  fit(rating ~ unit_price, data = train_subset)

slr_mod_total <- lm_spec %>% 
  fit(rating ~ log_total, data = train_subset)

Above, I fit two different models, one with unit_price and one with total. Let’s take a look at the output for both of them.

tidy(slr_mod_price) %>% 
  kable() %>% 
  kable_styling(c("striped", "responsive"))
term estimate std.error statistic p.value
(Intercept) 6.987286 0.1422717 49.1122698 0.0000000
unit_price 0.000548 0.0023023 0.2380066 0.8119372
tidy(slr_mod_total) %>% 
  kable() %>% 
  kable_styling(c("striped", "responsive"))
term estimate std.error statistic p.value
(Intercept) 6.9220710 0.366991 18.8616934 0.0000000
log_total 0.0176607 0.066701 0.2647747 0.7912515

We can use this output to create a regression line equation using the above formulas:

\[ \widehat{\text{rating}} = 6.99 + 0.000548 * \text{unit price} \]

\[ \widehat{\text{rating}} = 6.92 + 0.0177 * \log{\text{(total)}} \] In other words, because the slopes for both of the lines are very small, we have very little evidence of a linear relationship between rating and both unit_price and the log of total.

Assessing the Model

Let’s see how the models perform! My guess is not very well based on the information so far.

model_performance(slr_mod_price) %>% 
  kable()
AIC AICc BIC R2 R2_adjusted RMSE Sigma
3137.269 3137.299 3151.319 7.11e-05 -0.0011835 1.716958 1.71911
model_performance(slr_mod_total) %>% 
  kable()
AIC AICc BIC R2 R2_adjusted RMSE Sigma
3137.255 3137.286 3151.305 8.8e-05 -0.0011666 1.716943 1.719096

We can get a good idea of this by looking at \(R^2\), or the proportion of the variance in rating we have explained. When we use both unit_price and total, we can explain less than 1% of the variance in rating. Not so good! There is very little difference between the two models. The AIC, or Akaike’s information criteria assess our error. We want to see it decrease as we look at different models. The same goes for the BIC, or Bayesian information criterion. We also want to see \(R^2\) increase.

Due to the similar models, I will be finishing the SLR section with only one model: the one with log(total) which does seem to be the slightly “better” model here.

Assumptions

The performance packages offers us a nice glance at some plots to help answer if we have met the assumptions.

diagnostics <- plot(check_model(slr_mod_total, panel = FALSE))

Linearity

First, can we assume linearity between the response and the predictor?

diagnostics[[2]] +
  my_theme

Dot plot of predicted values vs. errors. Most points lie towards the higher end of the plot and are sparser towards the lower end. Predicted values range from 6.96 to 7.05 and errors from -3 to 3.

Well, this is a great example of how most real life data is not going to be perfectly linear. This data certainly shows no linear pattern; in fact, this looks more like a random scatter with no pattern than anything else. It’s close, but I wouldn’t say that we have satisfied this condition. We can see as well that the reference line in the plot is not flat for lower fitted values.

cor(train$rating, log(train$total)) %>% 
  round(5) %>% 
  kable() %>% 
  kable_styling(full_width = FALSE)
x
0.00938

We can also see that the correlation between the two variables is very close to zero indicating that there is low evidence of a strong relationship between rating and the log of total.

Constant Variability of Errors

diagnostics[[3]] +
  my_theme