Welcome

Hey there! This is the third guide in my Modeling Guide with R series. In the previous guide, we performed some preliminary explorations on our dataset and took a look at a few of the variables. We cleaned the dataset up a bit and exported it so we could read it again in later guides.

This page will focus on both simple and multiple linear regression. Although, I cannot cover all of the details and fine use cases of linear regression, we will explore some of the key ideas to focus on when creating and analyzing a linear regression model. We’ll start by importing our data and identifying variables to use in the model then move to creating, interpreting, and testing the model. In both simple and linear regression, we’ll use train/test validation and we’ll also look at several candidate models.

We will be using the same dataset for both the simple and multiple regression cases so let’s get that set up first. As mentioned in the previous, guide, I prefer to use the tidyverse family of packages. We’ll also be using the tidymodels collection of packages to set up the models and perform our train/test validation and GGally to make a scatterplot matrix.

The code below imports our packages and the data we cleaned in the previous guide. It also splits the data into a train and test set. We will train our models using the training set and will test its performance using the test set. We do this to simulate how the model will perform in the “real world”.

I will be setting the seed for the random generator so we can get the same results each time I start a new R session and split the data. You may or may not wish to do this in your work. I want to make sure we get a roughly equal amount of each store in the split sets so I will take a stratified sample to make the split (strata = store).

library(tidyverse)
library(tidymodels)
library(GGally)
library(performance)
library(knitr)
library(kableExtra)

retail <- read_csv(here::here("data/retail_clean.csv"))

set.seed(52319)
retail_split <- initial_split(retail, prop = .8, strata = store)

train <- training(retail_split)
test <- testing(retail_split)

Alright! Now that we have that set, we’ll use the train data for all the model creation and will only use the test data when we are analyzing our models.

Multiple Linear Regression

Earlier, we weren’t able to predict rating with one variable alone. What if we could use multiple variables in regression model? Maybe then we can better predict rating.

Requirements

When can we use multiple linear regression (MLR)?

  • The response variable (the variable we are predicting) must be continuous.
  • The predictor variable (the variable we are using to predict) can be either continuous or categorical with any amount of levels.
  • We only get one response variable but can have as many predictor variables as we wish.

We also still have to meet the regression assumptions:

  • The relationship between the predictor variable and the response variables should be linear
  • The errors of our predictions should be approximately normally distributed
  • There should be a constant variability in our errors
  • NEW: Our predictor variables should not be correlated with each other
  • Our errors should be independent (we will not focus on this here)

Selecting Variables

The tricky part about working with MLR is that you want to find the right balance of variables. We do want to choose variables we think might have an effect on rating but we don’t want to include any unnecessary variables or variables that don’t provide much benefit to the model.

As a result, MLR is an iterative process. I like to start out by creating a full model that has every single predictor variable that I think could be useful in predicting my response variable. I then use the output from the model to try and pull out the variables that are not pulling their weight.

So let’s start with that full model. These are the variables that I think might be useful (remember that in MLR we can use categorical variables):

  • Store
  • Customer Type
  • Gender
  • Total
  • Unit Price
  • Pay Method

It is also important to think about how these predictor variables interact with each other. We should consider adding in any interaction terms if we think two variables influence each other. I think it is possible that customer member status might influence their pay method and perhaps gender and purchase total interact. However, we shouldn’t rely solely on our instinct. Before jumping right into making a model, we should plot the variables and visualize any possible interactions.

Looking at the Variables

train_subset <- train %>% 
  select(rating, store, customer_type, gender, total, unit_price, pay_method)
test_subset <- test %>% 
  select(rating, store, customer_type, gender, total, unit_price, pay_method)
ggpairs(train_subset)

Plot matrix of all variables plotted against each other

This is a large plot that shows the relationships between each pair of variables. There isn’t a lot that stands out here but I do want to notice a few things:

  • rating, our response variable, is approximately normal
  • total is skewed right
  • There is a moderately strong correlation between unit_price and total
  • customer_type and gender may be related
  • store and gender may be related
  • store and pay_method may be related

Based on those observations I would like to make a few changes. I will be taking the log of total to help correct the skew and I will be adding interaction variables between the related pairs mentioned above.

Building the Model

Because we are still working in linear regression, we can use the same lm_spec engine from SLR to drive our model building. This time, we’ll use a different formula to fit the model which will include the log of total and some interaction terms.

train_subset <- train_subset %>% 
  mutate(log_total = log(total))
test_subset <- test_subset %>% 
  mutate(log_total = log(total))

lm_spec <- linear_reg() %>%
  set_mode("regression") %>%
  set_engine("lm")

mlr_mod_full <- lm_spec %>% 
  fit(rating ~ store + customer_type + gender + log_total + unit_price + pay_method + unit_price*log_total + customer_type*gender + store*gender + store*pay_method, data = train_subset)

Once we have done that, we can look at the output:

tidy(mlr_mod_full) %>% 
  mutate(p.value = format.pval(p.value, digits = 4)) %>% 
  kable() %>% 
  kable_styling(c("striped", "responsive"))
term estimate std.error statistic p.value
(Intercept) 5.7370356 0.8000191 7.1711235 1.726e-12
storeB 0.4076251 0.3028263 1.3460690 0.17867
storeC 0.2997884 0.2909336 1.0304357 0.30312
customer_typeNormal 0.1256181 0.1741896 0.7211577 0.47103
genderMale 0.4039030 0.2456883 1.6439652 0.10059
log_total 0.1875390 0.1572396 1.1926951 0.23335
unit_price 0.0187970 0.0141310 1.3301941 0.18384
pay_methodCredit card 0.0192168 0.2636199 0.0728958 0.94191
pay_methodEwallet 0.0897699 0.2521303 0.3560458 0.72190
log_total:unit_price -0.0034288 0.0025746 -1.3317756 0.18332
customer_typeNormal:genderMale -0.0946899 0.2452180 -0.3861457 0.69949
storeB:genderMale -0.5277196 0.2992870 -1.7632560 0.07825
storeC:genderMale -0.3982474 0.3012989 -1.3217686 0.18663
storeB:pay_methodCredit card -0.1173402 0.3729626 -0.3146166 0.75314
storeC:pay_methodCredit card 0.1632780 0.3710021 0.4400999 0.65999
storeB:pay_methodEwallet -0.4414774 0.3663241 -1.2051553 0.22851
storeC:pay_methodEwallet -0.1200978 0.3607408 -0.3329201 0.73928

We see that R has created dummy variables for us for each of our categorical variables. However, because of this, we have a very large number of predictor variables. Our goal is to reduce the amount of predictor variables down to the essentials. We want to predict rating with the most accuracy but with the fewest variables we can.

One way to judge this is to look at the p-values for each predictor which essentially looks at if the addition of one variable contributes something to the model (by comparing a model with the variable to one without). However, I would caution against relying solely on p-values. They are great to help us get an idea of the tendencies of the model and can give us a direction about where to go next but I personally say that they do not and in most cases should not drive decision making. I will not be ignoring p-values but I also will not be using them as a hard and fast rule for including or not including a variable.

For instance, look at our model. Not a single variable is significant, at the typical cutoff of 0.05. To me, this does not mean that we should scratch every variable and try again. It does not mean that these variables are useless at predicting rating. This indicates to me that maybe we do not have enough information in the data to predict rating accurately or maybe we have too many variables.

I think it’s also important to look at the estimates. Some of the estimates have larger values, like Store B. Here, if a certain transaction was done at Store B and all other variables were held constant, we would predict the rating of a transaction to increase by 0.411. Other estimates have very small values, such as the interaction between unit_price and total. With this term, for every dollar increase of the product of unit_price and total, we would expect the predicted rating to decrease by 0.00343 points. This does not necessarily mean that this interaction variable has a marginally small effect on rating. Remember that the values for the log of total and unit price can be pretty high and can influence the prediction. The highest unit_price value in train_subset is 99.96 with a corresponding total of 735. Multiplying the log of the total by the unit price gives 659.7231. If we insert that into our model, we can say that for this observation we would expect the predicted rating to decrease by \(659.7231*0.00343 = 2.263\) points!

So it’s important to consider the practical piece of your model. Even if a term doesn’t meet the cutoff for its p-value, it may be close enough to still merit inclusion in the final model, especially if you think that it makes sense to include in the context of the dataset.

In the end, it is up to you which variables to use. There are variable selection procedures out there (forward selection, backward selection, and stepwise selection) that can help you with this if you choose to use them. You may also be interested in dimension reduction techniques such as Principal Components Analysis (PCA) and other factor analysis procedures. These are all out of the scope of this guide but can be useful in certain cases. I would recommend treating each method with some skepticism as I believe that no method will be able to tell you with 100% certainty which variables to include (and some of the methods produce models that are tricky to interpret). With the model here, I will be using my judgement by looking at the p-values, values for the estimates, and thinking about the context of the data to determine if I think a variable is useful.

I’ll detail my thoughts below:

  • We saw a fairly high correlation between the unit price of an item and the log of the total. The interaction term has a fairly low p-value, an influential estimate value, and the individual variables have a noticeable estimate and low p-values. Furthermore, it does make some logical sense for this to be used to predict rating: higher prices make customers unhappy.
  • I’m not liking the p-values of customer_type and payment_method and most of the interactions including it. They don’t seem significant and the estimate values are pretty small.
  • We may be getting somewhere with the gender and store interaction. I see some nice p-values and some larger estimates. This isn’t an interaction I was expecting but, if true, this could be very useful information to the company to help them with advertising.
  • Recommended practice for regression is, if including an interaction variable, you should also include the individual variables that make up that interaction, even if they are not significant or don’t have a large effect.

With all this being said, let’s make a reduced model!

mlr_mod_reduced <- lm_spec %>% 
  fit(rating ~ store + gender + log_total + unit_price + store*gender + log_total*unit_price, data = train_subset)

tidy(mlr_mod_reduced) %>% 
  mutate(p.value = format.pval(p.value, digits = 4)) %>% 
  kable() %>% 
  kable_styling(c("striped", "responsive"))
term estimate std.error statistic p.value
(Intercept) 5.7255485 0.7818191 7.3233673 5.969e-13
storeB 0.2086592 0.2138560 0.9756993 0.32951
storeC 0.3082288 0.2123498 1.4515141 0.14703
genderMale 0.3533736 0.2093694 1.6877995 0.09184
log_total 0.2111048 0.1555520 1.3571331 0.17513
unit_price 0.0206734 0.0139899 1.4777356 0.13988
storeB:genderMale -0.5281460 0.2975677 -1.7748764 0.07630
storeC:genderMale -0.4069470 0.2988545 -1.3616893 0.17368
log_total:unit_price -0.0037972 0.0025461 -1.4913614 0.13627

I’m liking this model better. We have similar results (low p-values and decently-sized estimates) and we are using less variables. Let’s stick with this model for the rest of the guide.

Assessing the Model

Now let’s assess the model and see if regression was a good choice. We’ll start with overall performance and then check the assumptions.

model_performance(mlr_mod_full) %>% 
  kable()
AIC AICc BIC R2 R2_adjusted RMSE Sigma
3157.101 3157.978 3241.402 0.0127147 -0.0074855 1.706068 1.724512
model_performance(mlr_mod_reduced) %>% 
  kable()
AIC AICc BIC R2 R2_adjusted RMSE Sigma
3144.37 3144.65 3191.204 0.0086671 -0.0013717 1.709562 1.719272

We will look at the AIC and BIC as we did with SLR and I’d also like introduce the Root Mean Square Error (RMSE). The RMSE is the standard deviation of our prediction errors. We’d like this to be small as that means our errors are less spread out (and therefore smaller).

We can’t interpret any of these statistics alone; we use them to compare between models which is why I ran the model performance function on both our full model and our reduced model. We do see both the AIC and the BIC decrease slightly which is good. Our RMSE actually increases here but by a very little amount. My biggest takeaway is that the reduced model and the full model are very similar in terms of their performance. In cases like this, we would prefer to choose the model with the fewest variables. Why use more when we can do similarly with few?

Another thing to note is that our RMSE is pretty large compared to the range of the response variable, rating. Most of our errors are around 1.7 with some extending beyond that (bigger errors) and some less. Given that rating only spans from 4 to 10, an error of 1.7 could mean the difference between a neutral customer and a satisfied customer. Because the error is so big compared to the response, I don’t have much confidence in the accuracy of this model.

Now on to the assumptions. Most of our assumptions are the same as SLR; however, we do have one more which involves collinearity (correlation between predictors).

diagnostics <- plot(check_model(mlr_mod_reduced, panel = FALSE))
## Model has interaction terms. VIFs might be inflated.
##   You may check multicollinearity among predictors of a model without
##   interaction terms.

Linearity

First, can we assume linearity between the response and the predictor?

diagnostics[[2]] +
  my_theme