Welcome

Hey there! This is the third guide in my Modeling Guide with R series. In the previous guide, we performed some preliminary explorations on our dataset and took a look at a few of the variables. We cleaned the dataset up a bit and exported it so we could read it again in later guides.

This page will focus on both simple and multiple linear regression. Although I cannot cover every detail and fine use case of linear regression, we will explore some of the key ideas to focus on when creating and analyzing a linear regression model. We’ll start by importing our data and identifying variables to use in the model, then move to creating, interpreting, and testing the model. For both simple and multiple linear regression, we’ll use train/test validation and look at several candidate models.

We will be using the same dataset for both the simple and multiple regression cases, so let’s get that set up first. As mentioned in the previous guide, I prefer to use the tidyverse family of packages. We’ll also be using the tidymodels collection of packages to set up the models and perform our train/test validation, and GGally to make a scatterplot matrix. (The performance, knitr, and kableExtra packages are also loaded below for model diagnostics and table formatting.)

The code below imports our packages and the data we cleaned in the previous guide. It also splits the data into a train and test set. We will train our models using the training set and test their performance using the test set. We do this to simulate how the model will perform in the “real world”.

I will be setting the seed for the random number generator so we can get the same results each time I start a new R session and split the data. You may or may not wish to do this in your own work. I also want to make sure we get a roughly equal proportion of each store in the split sets, so I will take a stratified sample to make the split (strata = store).

library(tidyverse)
library(tidymodels)
library(GGally)
library(performance)
library(knitr)
library(kableExtra)

retail <- read_csv(here::here("data/retail_clean.csv"))

set.seed(52319)
retail_split <- initial_split(retail, prop = .8, strata = store)

train <- training(retail_split)
test <- testing(retail_split)

Alright! Now that we have that set, we’ll use the train data for all the model creation and will only use the test data when we are analyzing our models.

Multiple Linear Regression

Earlier, we weren’t able to predict rating with one variable alone. What if we could use multiple variables in a regression model? Maybe then we could better predict rating.

Requirements

When can we use multiple linear regression (MLR)?

  • The response variable (the variable we are predicting) must be continuous.
  • The predictor variables (the variables we are using to predict) can be either continuous or categorical with any number of levels.
  • We only get one response variable but can have as many predictor variables as we wish.

We also still have to meet the regression assumptions:

  • The relationship between the predictor variables and the response variable should be linear.
  • The errors of our predictions should be approximately normally distributed.
  • There should be constant variability in our errors.
  • NEW: Our predictor variables should not be correlated with each other.
  • Our errors should be independent (we will not focus on this here).
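Written out, the model we are trying to build has the familiar linear form

\[
\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_p x_p
\]

where \(\hat{y}\) is the predicted rating, each \(x_i\) is one of our predictor variables (with categorical predictors entering through dummy variables), and the \(b_i\) are the coefficients the model estimates.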

Selecting Variables

The tricky part about working with MLR is that you want to find the right balance of variables. We do want to choose variables we think might have an effect on rating but we don’t want to include any unnecessary variables or variables that don’t provide much benefit to the model.

As a result, MLR is an iterative process. I like to start out by creating a full model that has every single predictor variable that I think could be useful in predicting my response variable. I then use the output from the model to try and pull out the variables that are not pulling their weight.

So let’s start with that full model. These are the variables that I think might be useful (remember that in MLR we can use categorical variables):

  • Store
  • Customer Type
  • Gender
  • Total
  • Unit Price
  • Pay Method

It is also important to think about how these predictor variables interact with each other. We should consider adding in interaction terms if we think two variables influence each other. I think it is possible that a customer’s member status might influence their pay method, and perhaps gender and purchase total interact. However, we shouldn’t rely solely on our instinct. Before jumping right into making a model, we should plot the variables and visualize any possible interactions.
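As a quick note on syntax before we get to the plots: in an R model formula, a:b adds only the interaction between a and b, while a*b is shorthand for a + b + a:b. A small sketch using two of our variables:

# These two formulas fit the same model:
# gender*total expands to gender + total + gender:total
lm(rating ~ gender + total + gender:total, data = train)
lm(rating ~ gender * total, data = train)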

Looking at the Variables

train_subset <- train %>% 
  select(rating, store, customer_type, gender, total, unit_price, pay_method)
test_subset <- test %>% 
  select(rating, store, customer_type, gender, total, unit_price, pay_method)
ggpairs(train_subset)

Plot matrix of all variables plotted against each other

This is a large plot that shows the relationships between each pair of variables. There isn’t a lot that stands out here, but I do want to note a few things:

  • rating, our response variable, is approximately normal
  • total is skewed right
  • There is a moderately strong correlation between unit_price and total
  • customer_type and gender may be related
  • store and gender may be related
  • store and pay_method may be related

Based on those observations I would like to make a few changes. I will be taking the log of total to help correct the skew and I will be adding interaction variables between the related pairs mentioned above.

Building the Model

Because we are still working with linear regression, we can use the same lm_spec engine from SLR to drive our model building. This time, we’ll fit the model with a different formula, one that includes the log of total and some interaction terms.

train_subset <- train_subset %>% 
  mutate(log_total = log(total))
test_subset <- test_subset %>% 
  mutate(log_total = log(total))

lm_spec <- linear_reg() %>%
  set_mode("regression") %>%
  set_engine("lm")

mlr_mod_full <- lm_spec %>% 
  fit(rating ~ store + customer_type + gender + log_total + unit_price +
        pay_method + log_total:unit_price + customer_type:gender +
        store:gender + store:pay_method,
      data = train_subset)

Once we have done that, we can look at the output:

tidy(mlr_mod_full) %>% 
  mutate(p.value = format.pval(p.value, digits = 4)) %>% 
  kable() %>% 
  kable_styling(c("striped", "responsive"))
term estimate std.error statistic p.value
(Intercept) 5.7370356 0.8000191 7.1711235 1.726e-12
storeB 0.4076251 0.3028263 1.3460690 0.17867
storeC 0.2997884 0.2909336 1.0304357 0.30312
customer_typeNormal 0.1256181 0.1741896 0.7211577 0.47103
genderMale 0.4039030 0.2456883 1.6439652 0.10059
log_total 0.1875390 0.1572396 1.1926951 0.23335
unit_price 0.0187970 0.0141310 1.3301941 0.18384
pay_methodCredit card 0.0192168 0.2636199 0.0728958 0.94191
pay_methodEwallet 0.0897699 0.2521303 0.3560458 0.72190
log_total:unit_price -0.0034288 0.0025746 -1.3317756 0.18332
customer_typeNormal:genderMale -0.0946899 0.2452180 -0.3861457 0.69949
storeB:genderMale -0.5277196 0.2992870 -1.7632560 0.07825
storeC:genderMale -0.3982474 0.3012989 -1.3217686 0.18663
storeB:pay_methodCredit card -0.1173402 0.3729626 -0.3146166 0.75314
storeC:pay_methodCredit card 0.1632780 0.3710021 0.4400999 0.65999
storeB:pay_methodEwallet -0.4414774 0.3663241 -1.2051553 0.22851
storeC:pay_methodEwallet -0.1200978 0.3607408 -0.3329201 0.73928

We see that R has created dummy variables for us for each of our categorical variables. However, because of this, we have a very large number of predictor variables. Our goal is to reduce the number of predictor variables down to the essentials. We want to predict rating with the most accuracy but with the fewest variables we can.
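If you’d like to see the dummy coding R uses under the hood, model.matrix() from base R will show it (here on just the first few rows and the store variable):

# each level of store (after the first) gets its own 0/1 indicator column
model.matrix(~ store, data = head(train_subset))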

One way to judge this is to look at the p-value for each predictor, which essentially tests whether adding that variable contributes something to the model (by comparing a model with the variable to one without). However, I would caution against relying solely on p-values. They are great for getting a sense of the tendencies of the model and can give us a direction for where to go next, but in my opinion they should not drive decision making on their own. I will not be ignoring p-values, but I also will not be using them as a hard and fast rule for including or excluding a variable.

For instance, look at our model. Not a single variable is significant at the typical cutoff of 0.05. To me, this does not mean that we should scratch every variable and try again, and it does not mean that these variables are useless at predicting rating. It indicates that maybe we do not have enough information in the data to predict rating accurately, or maybe we have too many variables.

I think it’s also important to look at the estimates. Some of the estimates have larger values, like Store B. Here, if a certain transaction was done at Store B and all other variables were held constant, we would predict the rating of the transaction to increase by about 0.408. Other estimates have very small values, such as the interaction between unit_price and log_total. With this term, for every one-unit increase in the product of unit_price and the log of total, we would expect the predicted rating to decrease by 0.00343 points. This does not necessarily mean that this interaction variable has a marginally small effect on rating. Remember that the values for the log of total and unit price can be pretty high and can influence the prediction. The highest unit_price value in train_subset is 99.96 with a corresponding total of 735. Multiplying the log of the total by the unit price gives 659.7231. If we insert that into our model, we can say that for this observation we would expect the predicted rating to decrease by \(659.7231 \times 0.00343 = 2.263\) points!
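We can verify that arithmetic quickly in R, using the values quoted above:

# product of the log of the largest total and its unit price
log(735) * 99.96          # about 659.72
# multiplied by the interaction estimate from the model output
659.7231 * 0.0034288      # about a 2.26 point drop in predicted rating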

So it’s important to consider the practical piece of your model. Even if a term doesn’t meet the cutoff for its p-value, it may be close enough to still merit inclusion in the final model, especially if you think that it makes sense to include in the context of the dataset.

In the end, it is up to you which variables to use. There are variable selection procedures out there (forward selection, backward selection, and stepwise selection) that can help you with this if you choose to use them. You may also be interested in dimension reduction techniques such as Principal Components Analysis (PCA) and other factor analysis procedures. These are all out of the scope of this guide but can be useful in certain cases. I would recommend treating each method with some skepticism as I believe that no method will be able to tell you with 100% certainty which variables to include (and some of the methods produce models that are tricky to interpret). With the model here, I will be using my judgement by looking at the p-values, values for the estimates, and thinking about the context of the data to determine if I think a variable is useful.
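As a rough sketch of what one of those selection procedures looks like, here is backward selection with base R’s step(). Note that step() expects a plain lm() fit rather than a parsnip model object, so this refits the full set of main effects with lm() first:

# refit the main effects with lm() so step() can work with the object
lm_full <- lm(rating ~ store + customer_type + gender + log_total +
                unit_price + pay_method,
              data = train_subset)

# drop terms one at a time, keeping the change that lowers AIC the most
step(lm_full, direction = "backward", trace = 0)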

I’ll detail my thoughts below:

  • We saw a fairly high correlation between the unit price of an item and the log of the total. The interaction term has a fairly low p-value, an influential estimate value, and the individual variables have a noticeable estimate and low p-values. Furthermore, it does make some logical sense for this to be used to predict rating: higher prices make customers unhappy.
  • I’m not liking the p-values of customer_type and pay_method or most of the interactions that include them. They don’t seem significant, and the estimate values are pretty small.
  • We may be getting somewhere with the gender and store interaction. I see some nice p-values and some larger estimates. This isn’t an interaction I was expecting but, if true, this could be very useful information to the company to help them with advertising.
  • Recommended practice for regression is that, if you include an interaction term, you should also include the individual variables that make up that interaction, even if they are not significant or don’t have a large effect.

With all this being said, let’s make a reduced model!

mlr_mod_reduced <- lm_spec %>% 
  fit(rating ~ store + gender + log_total + unit_price +
        store:gender + log_total:unit_price,
      data = train_subset)

tidy(mlr_mod_reduced) %>% 
  mutate(p.value = format.pval(p.value, digits = 4)) %>% 
  kable() %>% 
  kable_styling(c("striped", "responsive"))
term estimate std.error statistic p.value
(Intercept) 5.7255485 0.7818191 7.3233673 5.969e-13
storeB 0.2086592 0.2138560 0.9756993 0.32951
storeC 0.3082288 0.2123498 1.4515141 0.14703
genderMale 0.3533736 0.2093694 1.6877995 0.09184
log_total 0.2111048 0.1555520 1.3571331 0.17513
unit_price 0.0206734 0.0139899 1.4777356 0.13988
storeB:genderMale -0.5281460 0.2975677 -1.7748764 0.07630
storeC:genderMale -0.4069470 0.2988545 -1.3616893 0.17368
log_total:unit_price -0.0037972 0.0025461 -1.4913614 0.13627

I’m liking this model better. We have similar results (low p-values and decently sized estimates) and we are using fewer variables. Let’s stick with this model for the rest of the guide.

Assessing the Model

Now let’s assess the model and see if regression was a good choice. We’ll start with overall performance and then check the assumptions.

model_performance(mlr_mod_full) %>% 
  kable()
AIC AICc BIC R2 R2_adjusted RMSE Sigma
3157.101 3157.978 3241.402 0.0127147 -0.0074855 1.706068 1.724512
model_performance(mlr_mod_reduced) %>% 
  kable()
AIC AICc BIC R2 R2_adjusted RMSE Sigma
3144.37 3144.65 3191.204 0.0086671 -0.0013717 1.709562 1.719272

We will look at the AIC and BIC as we did with SLR, and I’d also like to introduce the Root Mean Square Error (RMSE). The RMSE is the standard deviation of our prediction errors. We’d like this to be small, as that means our errors are less spread out (and therefore smaller).
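For reference, the RMSE is calculated as

\[
\text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}
\]

where \(y_i\) is the observed rating for transaction \(i\) and \(\hat{y}_i\) is the model’s prediction for it.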

We can’t interpret any of these statistics alone; we use them to compare between models, which is why I ran the model performance function on both our full model and our reduced model. We do see both the AIC and the BIC decrease slightly, which is good. Our RMSE actually increases here, but by a very small amount. My biggest takeaway is that the reduced model and the full model are very similar in terms of their performance. In cases like this, we would prefer the model with the fewest variables. Why use more when we can do similarly with fewer?
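The performance package also has a compare_performance() function that puts the models side by side, which saves some scrolling (assuming it accepts the same model objects as model_performance()):

# side-by-side comparison; rank = TRUE sorts the models by overall performance
compare_performance(mlr_mod_full, mlr_mod_reduced, rank = TRUE)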

Another thing to note is that our RMSE is pretty large compared to the range of the response variable, rating. A typical error is around 1.7 points, with some errors extending beyond that and some smaller. Given that rating only spans from 4 to 10, an error of 1.7 could mean the difference between a neutral customer and a satisfied customer. Because the error is so big compared to the response, I don’t have much confidence in the accuracy of this model.

Now on to the assumptions. Most of our assumptions are the same as SLR; however, we do have one more which involves collinearity (correlation between predictors).

diagnostics <- plot(check_model(mlr_mod_reduced, panel = FALSE))
## Model has interaction terms. VIFs might be inflated.
##   You may check multicollinearity among predictors of a model without
##   interaction terms.

Linearity

First, can we assume linearity between the response and the predictor?

diagnostics[[2]] +
  my_theme

Predicted values versus residuals. Mostly random scatter with a few random points out in the left of the plot.

We are looking for a random scatter of points with no general pattern, and that is what we see here. This plot looks much better than the one we saw for our SLR model. Even though there is a little lift in the line toward the lower end of the predicted values, I’m not concerned.

Constant Variability of Errors

diagnostics[[3]] +
  my_theme

Predicted values versus square root of standardized residuals. Most values of the predicted rating have similar variability but less variability is present at the tail ends.

As with SLR, we want to see a flat reference line, which would indicate that our errors have similar variability (spread along the y-axis) for all predicted values (the x-axis). We don’t want a rainbow or a cone shape. It is true that we have a larger clump of points above the solid line, which bows the line up slightly, but I don’t see much reason for concern here.

Normal Errors

We can make a histogram of the errors to check for the normality assumption.

train_aug <- augment(mlr_mod_reduced, new_data = train_subset)

ggplot(data = train_aug, aes(x = .resid)) +
  geom_histogram(binwidth = 0.25, fill = "#099392", color = "black") +
  xlab("Residuals") +
  my_theme

Once again, histograms can be slightly deceiving. This plot does show a decent bell-curve shape, with most observations in the center and fewer out in the tails. However, it would be nice to see smaller bumps around -2 and 2. Let’s also look at the diagnostic plot.

diagnostics[[6]] +
  my_theme

Sample points should lie flat along the horizontal line at 0. Points are above the line on the left and below the line on the right, making an S shape.

Not so great! The tail ends of the dots stray far from the line, and the confidence band (the shaded grey part) has a very large range. Based on this plot, I would not say we have met this assumption, and we should question the results of our model.
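If you want a numeric companion to these plots, one quick option is a Shapiro-Wilk test on the residuals. It is sensitive to sample size (with enough data, even tiny departures from normality get flagged), so treat it as a rough guide rather than a verdict:

# small p-values suggest the residuals depart from normality
shapiro.test(train_aug$.resid)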

Collinearity Between Predictors

We also should make sure that our predictor variables are not correlated with each other. Recall earlier how we did see some slight correlation between some of our variables and we created an interaction term to account for this. Using interaction terms is one way of helping to meet this assumption. Another way is to simply remove one of the correlated variables from the model.
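The standard way to quantify collinearity is the variance inflation factor (VIF). For the \(j\)th predictor,

\[
\text{VIF}_j = \frac{1}{1 - R_j^2}
\]

where \(R_j^2\) is the \(R^2\) from regressing the \(j\)th predictor on all of the other predictors. A VIF of 1 means no collinearity at all, and values around 5 to 10 are commonly treated as cause for concern.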

We can check the VIFs for our model with the performance package. However, you may have noticed above, when we generated our diagnostic plots, that a warning message was printed. It said

Model has interaction terms. VIFs might be inflated. You may check multicollinearity among predictors of a model without interaction terms.

R has recognized that we have an interaction term. Naturally, we can expect the interaction term and the variables that make up that interaction term to be correlated. In fact, if we look at the plot here,

diagnostics[[5]] +
  my_theme

High VIF for unit price and unit price times log of total, moderate VIF for store and store times gender.

check_collinearity(mlr_mod_reduced)
## Model has interaction terms. VIFs might be inflated.
##   You may check multicollinearity among predictors of a model without
##   interaction terms.
## # Check for Multicollinearity
## 
## Moderate Correlation
## 
##       Term  VIF     VIF 95% CI Increased SE Tolerance Tolerance 95% CI
##  log_total 5.44 [ 4.81,  6.17]         2.33      0.18     [0.16, 0.21]
## 
## High Correlation
## 
##        Term   VIF     VIF 95% CI Increased SE Tolerance Tolerance 95% CI
##  unit_price 36.92 [32.27, 42.25]         6.08      0.03     [0.02, 0.03]

we see that the interaction between the log of the total and the unit price as well as the unit price variable itself have some potential for high correlation with the other variables. This is a problem, but let’s try checking this assumption on the model without any interactions.

mlr_mod_noint <- lm_spec %>% 
  fit(rating ~ store + gender + log(total) + unit_price, data = train_subset)

plot(check_model(mlr_mod_noint, panel = FALSE))[[5]] +
  my_theme

Low VIF values for all variables when interaction is removed.

check_collinearity(mlr_mod_noint)
## # Check for Multicollinearity
## 
## Low Correlation
## 
##        Term  VIF   VIF 95% CI Increased SE Tolerance Tolerance 95% CI
##  log(total) 1.58 [1.45, 1.75]         1.26      0.63     [0.57, 0.69]
##  unit_price 1.59 [1.46, 1.76]         1.26      0.63     [0.57, 0.69]

Removing our interactions has helped with our VIFs, so I would consider this assumption to be met, although I do want to keep an eye on store due to the large margin of error on its VIF.

Extreme Observations

It’s also a good idea to look at any extreme observations. It is possible that one or a few observations are driving the analysis and are skewing our results.

diagnostics[[4]] +
  my_theme

No leverage exceeds 0.05 or the contour lines at 0.9 and -0.9.

We can look at the plot above to find observations with high leverage, meaning observations with unusual predictor values. I do not see any points outside of the dashed lines, so we are OK here. The plot uses Cook’s \(d\), or Cook’s distance, to quantify how influential a point is, and we want to keep all values inside of the cutoff values for Cook’s \(d\) (the dashed lines).
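For the curious, Cook’s distance for observation \(i\) can be written as

\[
D_i = \frac{\sum_{j=1}^{n}\left(\hat{y}_j - \hat{y}_{j(i)}\right)^2}{p\, s^2}
\]

where \(\hat{y}_{j(i)}\) is the prediction for observation \(j\) when the model is refit without observation \(i\), \(p\) is the number of model parameters, and \(s^2\) is the model’s mean squared error. In other words, it measures how much all of the predictions move when one observation is left out.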

Testing Data

Let’s also take a quick peek at our testing data. How well does the model perform?

test_aug <- augment(mlr_mod_reduced, new_data = test_subset)

ggplot(data = test_aug, aes(x = .pred, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  labs(title = "Errors vs. Predicted",
       subtitle = "MLR Reduced Model Testing Dataset") +
  xlab("Fitted values") +
  ylab("Errors") +
  my_theme

It looks like we still see a big range in our errors (between -3 and 3). We also see again that the fitted values only range from about 6.8 to 7.3, which does not represent the rating variable accurately.
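To put a number on the test-set performance, we can compute the RMSE on test_aug with yardstick’s rmse() (loaded as part of tidymodels) and compare it to the training RMSE of about 1.71 we saw earlier:

# RMSE on the held-out test set; truth is the observed rating,
# estimate is the model's prediction
test_aug %>% 
  rmse(truth = rating, estimate = .pred)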

Wrap Up

All in all, regression was not the best model to use to predict rating. The variables in the dataset simply do not hold enough information to make good predictions. We saw very low \(R^2\) values, large prediction errors, a violated assumption, and predictions that do not reflect the full range of possible values for our response. Although we won’t move forward with this example, it would be a good idea to explore the variables more. Perhaps we could create a variable of our own to use in the model. Perhaps we need to collect more data (more observations and more variables) to improve our accuracy. Perhaps linear regression just isn’t the best model choice for predicting rating. There is definitely more work that can be done here to try to predict a customer’s rating of a transaction (and hence their satisfaction).

References

Dykes, Bradford. n.d. “STA 631.” GitHub. Accessed April 18, 2024. https://github.com/gvsu-sta631.
James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2013. An Introduction to Statistical Learning: With Applications in R. 1st ed. 2013, Corr. 7th printing 2017 edition. New York: Springer.
Paye, Aung. n.d. “Supermarket Sales.” Accessed April 18, 2024. https://www.kaggle.com/datasets/aungpyaeap/supermarket-sales.