Hey there! This is the third guide in my Modeling Guide with R series. In the previous guide, we performed some preliminary explorations on our dataset and took a look at a few of the variables. We cleaned the dataset up a bit and exported it so we could read it again in later guides.

This page will focus on both simple and multiple linear regression. Although I cannot cover every detail and fine use case of linear regression, we will explore some of the key ideas to focus on when creating and analyzing a linear regression model. We’ll start by importing our data and identifying variables to use in the model, then move to creating, interpreting, and testing the model. In both the simple and multiple regression cases, we’ll use train/test validation and we’ll also look at several candidate models.

We will be using the same dataset for both the simple and multiple regression cases, so let’s get that set up first. As mentioned in the previous guide, I prefer to use the `tidyverse` family of packages. We’ll also be using the `tidymodels` collection of packages to set up the models and perform our train/test validation, and `GGally` to make a scatterplot matrix.

The code below imports our packages and the data we cleaned in the previous guide. It also splits the data into a train and test set. We will train our models using the training set and test their performance using the test set. We do this to simulate how the model will perform in the “real world”.

I will be setting the seed for the random number generator so we can get the same results each time I start a new R session and split the data. You may or may not wish to do this in your own work. I want to make sure we get a roughly equal amount of each store in the split sets, so I will take a stratified sample to make the split (`strata = store`).

```
library(tidyverse)
library(tidymodels)
library(GGally)
library(performance)
library(knitr)
library(kableExtra)
retail <- read_csv(here::here("data/retail_clean.csv"))
set.seed(52319)
retail_split <- initial_split(retail, prop = .8, strata = store)
train <- training(retail_split)
test <- testing(retail_split)
```

Alright! Now that we have that set, we’ll use the `train` data for all the model creation and will only use the `test` data when we are analyzing our models.

Earlier, we weren’t able to predict `rating` with one variable alone. What if we could use multiple variables in a regression model? Maybe then we can better predict `rating`.

When can we use multiple linear regression (MLR)?

- The response variable (the variable we are predicting) must be continuous.
- The predictor variables (the variables we are using to predict) can be either continuous or categorical with any number of levels.
- We only get one response variable but can have as many predictor variables as we wish.

We also still have to meet the regression assumptions:

- The relationship between each predictor variable and the response variable should be linear
- The errors of our predictions should be approximately normally distributed
- There should be a constant variability in our errors
- NEW: Our predictor variables should not be correlated with each other
- Our errors should be independent (we will not focus on this here)

The tricky part about working with MLR is that you want to find the right balance of variables. We *do* want to choose variables we think might have an effect on `rating` but we *don’t* want to include any unnecessary variables or variables that don’t provide much benefit to the model.

As a result, MLR is an iterative process. I like to start out by creating a full model that has every single predictor variable that I think could be useful in predicting my response variable. I then use the output from the model to try and pull out the variables that are not pulling their weight.

So let’s start with that full model. These are the variables that I think might be useful (remember that in MLR we can use categorical variables):

- Store
- Customer Type
- Gender
- Total
- Unit Price
- Pay Method

It is also important to think about how these predictor variables interact with each other. We should consider adding in any interaction terms if we think two variables influence each other. I think it is possible that customer member status might influence their pay method and perhaps gender and purchase total interact. However, we shouldn’t rely solely on our instinct. Before jumping right into making a model, we should plot the variables and visualize any possible interactions.

```
train_subset <- train %>%
  select(rating, store, customer_type, gender, total, unit_price, pay_method)

test_subset <- test %>%
  select(rating, store, customer_type, gender, total, unit_price, pay_method)
```

```
ggpairs(train_subset)
```

This is a large plot that shows the relationships between each pair of variables. There isn’t a lot that stands out here, but I do want to note a few things:

- `rating`, our response variable, is approximately normal
- `total` is skewed right
- There is a moderately strong correlation between `unit_price` and `total`
- `customer_type` and `gender` may be related
- `store` and `gender` may be related
- `store` and `pay_method` may be related

Based on those observations, I would like to make a few changes. I will be taking the log of `total` to help correct the skew and I will be adding interaction variables between the related pairs mentioned above.
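One hedged way to eyeball a possible interaction, such as the one between store and gender, is an interaction plot of group means: strongly non-parallel lines suggest an interaction term may be worth including. This is a sketch assuming the `train_subset` data frame created above.

```r
# Interaction plot sketch: mean rating by store, split by gender.
# Roughly parallel lines suggest little interaction; crossing or
# diverging lines hint that a store:gender term may matter.
train_subset %>%
  group_by(store, gender) %>%
  summarise(mean_rating = mean(rating), .groups = "drop") %>%
  ggplot(aes(x = store, y = mean_rating, colour = gender, group = gender)) +
  geom_line() +
  geom_point()
```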

Because we are still working in linear regression, we can use the same `lm_spec` engine from SLR to drive our model building. This time, we’ll use a different formula to fit the model, which will include the log of total and some interaction terms.

```
train_subset <- train_subset %>%
  mutate(log_total = log(total))

test_subset <- test_subset %>%
  mutate(log_total = log(total))

lm_spec <- linear_reg() %>%
  set_mode("regression") %>%
  set_engine("lm")

mlr_mod_full <- lm_spec %>%
  fit(rating ~ store + customer_type + gender + log_total + unit_price +
        pay_method + unit_price*log_total + customer_type*gender +
        store*gender + store*pay_method,
      data = train_subset)
```
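A quick note on formula syntax: in R, `a*b` expands to `a + b + a:b`, so listing the main effects alongside the `*` interaction terms (as in the formula above) is harmless but technically redundant. A tiny standalone illustration:

```r
# In an R model formula, a*b is shorthand for a + b + a:b,
# so the design matrix contains both main effects and the interaction.
df <- data.frame(a = c(1, 2, 3), b = c(4, 5, 6))
colnames(model.matrix(~ a * b, df))
# "(Intercept)" "a" "b" "a:b"
```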

Once we have done that, we can look at the output:

```
tidy(mlr_mod_full) %>%
  mutate(p.value = format.pval(p.value, digits = 4)) %>%
  kable() %>%
  kable_styling(c("striped", "responsive"))
```

term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 5.7370356 | 0.8000191 | 7.1711235 | 1.726e-12 |
storeB | 0.4076251 | 0.3028263 | 1.3460690 | 0.17867 |
storeC | 0.2997884 | 0.2909336 | 1.0304357 | 0.30312 |
customer_typeNormal | 0.1256181 | 0.1741896 | 0.7211577 | 0.47103 |
genderMale | 0.4039030 | 0.2456883 | 1.6439652 | 0.10059 |
log_total | 0.1875390 | 0.1572396 | 1.1926951 | 0.23335 |
unit_price | 0.0187970 | 0.0141310 | 1.3301941 | 0.18384 |
pay_methodCredit card | 0.0192168 | 0.2636199 | 0.0728958 | 0.94191 |
pay_methodEwallet | 0.0897699 | 0.2521303 | 0.3560458 | 0.72190 |
log_total:unit_price | -0.0034288 | 0.0025746 | -1.3317756 | 0.18332 |
customer_typeNormal:genderMale | -0.0946899 | 0.2452180 | -0.3861457 | 0.69949 |
storeB:genderMale | -0.5277196 | 0.2992870 | -1.7632560 | 0.07825 |
storeC:genderMale | -0.3982474 | 0.3012989 | -1.3217686 | 0.18663 |
storeB:pay_methodCredit card | -0.1173402 | 0.3729626 | -0.3146166 | 0.75314 |
storeC:pay_methodCredit card | 0.1632780 | 0.3710021 | 0.4400999 | 0.65999 |
storeB:pay_methodEwallet | -0.4414774 | 0.3663241 | -1.2051553 | 0.22851 |
storeC:pay_methodEwallet | -0.1200978 | 0.3607408 | -0.3329201 | 0.73928 |

We see that R has created dummy variables for us for each of our categorical variables. However, because of this, we have a very large number of predictor variables. Our goal is to reduce the number of predictor variables down to the essentials. We want to predict `rating` with the most accuracy but with the fewest variables we can.

One way to judge this is to look at the p-value for each predictor, which essentially asks whether the addition of a variable contributes something to the model (by comparing a model with the variable to one without). However, I would caution against relying solely on p-values. They are great for getting an idea of the tendencies of the model and can give us a direction about where to go next, but in my opinion they should not drive decision making on their own. I will not be ignoring p-values, but I also will not be using them as a hard and fast rule for including or excluding a variable.
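For orientation only (not as a decision rule), it can help to sort the tidy output by p-value to see which terms look most promising. A sketch, assuming the `mlr_mod_full` fit from above:

```r
# Rank model terms by p-value purely to get a feel for the model,
# not to mechanically drop variables.
tidy(mlr_mod_full) %>%
  select(term, estimate, p.value) %>%
  arrange(p.value)
```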

For instance, look at our model. Not a single variable is significant at the typical cutoff of 0.05. To me, this does not mean that we should scratch every variable and try again. It does not mean that these variables are useless at predicting `rating`. This indicates to me that maybe we do not have enough information in the data to predict `rating` accurately, or maybe we have too many variables.

I think it’s also important to look at the estimates. Some of the estimates have larger values, like Store B. Here, if a certain transaction was done at Store B and all other variables were held constant, we would predict the rating of a transaction to increase by 0.408. Other estimates have very small values, such as the interaction between `unit_price` and `log_total`. With this term, for every one-unit increase in the product of `unit_price` and `log_total`, we would expect the predicted rating to decrease by 0.00343 points. This does not necessarily mean that this interaction variable has a marginally small effect on rating. Remember that the values for the log of total and unit price can be pretty high and can influence the prediction. The highest `unit_price` value in `train_subset` is 99.96, with a corresponding total of 735. Multiplying the log of the total by the unit price gives 659.7231. If we insert that into our model, we can say that for this observation we would expect the predicted rating to decrease by \(659.7231*0.00343 = 2.263\) points!
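That arithmetic can be checked directly in R, using the values quoted above:

```r
# Largest unit_price in train_subset (per the text) and its total.
unit_price_max <- 99.96
total_max <- 735

# Value of the log_total:unit_price interaction for this observation.
interaction_value <- log(total_max) * unit_price_max  # ~659.72

# Its contribution to the predicted rating, using the estimate -0.0034288.
interaction_value * 0.0034288                         # ~2.26 point decrease
```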

So it’s important to consider the practical piece of your model. Even if a term doesn’t meet the cutoff for its p-value, it may be close enough to still merit inclusion in the final model, especially if you think that it makes sense to include in the context of the dataset.

In the end, it is up to you which variables to use. There are variable selection procedures out there (forward selection, backward selection, and stepwise selection) that can help you with this if you choose to use them. You may also be interested in dimension reduction techniques such as Principal Components Analysis (PCA) and other factor analysis procedures. These are all out of the scope of this guide but can be useful in certain cases. I would recommend treating each method with some skepticism as I believe that no method will be able to tell you with 100% certainty which variables to include (and some of the methods produce models that are tricky to interpret). With the model here, I will be using my judgement by looking at the p-values, values for the estimates, and thinking about the context of the data to determine if I think a variable is useful.
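If you do want to try an automated procedure, base R’s `step()` performs stepwise selection by AIC. A minimal sketch, assuming the `train_subset` data from above (refit here with plain `lm()`, since `step()` expects an `lm` object rather than a tidymodels fit):

```r
# Backward selection by AIC with base R's step().
full_lm <- lm(rating ~ store + gender + log_total * unit_price + store:gender,
              data = train_subset)

# trace = 0 suppresses the step-by-step printout.
reduced_lm <- step(full_lm, direction = "backward", trace = 0)
summary(reduced_lm)
```

As noted above, treat the result as a suggestion, not a verdict.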

I’ll detail my thoughts below:

- We saw a fairly high correlation between the unit price of an item and the log of the total. The interaction term has a fairly low p-value and an influential estimate value, and the individual variables have noticeable estimates and low p-values. Furthermore, it does make some logical sense for this to be used to predict rating: higher prices make customers unhappy.
- I’m not liking the p-values of `customer_type` and `pay_method`, or of most of the interactions including them. They don’t seem significant and the estimate values are pretty small.
- We may be getting somewhere with the gender and store interaction. I see some nice p-values and some larger estimates. This isn’t an interaction I was expecting but, if true, this could be very useful information to the company to help them with advertising.
- Recommended practice for regression is that, if you include an interaction variable, you should also include the individual variables that make up that interaction, even if they are not significant or don’t have a large effect.

With all this being said, let’s make a reduced model!

```
mlr_mod_reduced <- lm_spec %>%
  fit(rating ~ store + gender + log_total + unit_price +
        store*gender + log_total*unit_price,
      data = train_subset)

tidy(mlr_mod_reduced) %>%
  mutate(p.value = format.pval(p.value, digits = 4)) %>%
  kable() %>%
  kable_styling(c("striped", "responsive"))
```

term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 5.7255485 | 0.7818191 | 7.3233673 | 5.969e-13 |
storeB | 0.2086592 | 0.2138560 | 0.9756993 | 0.32951 |
storeC | 0.3082288 | 0.2123498 | 1.4515141 | 0.14703 |
genderMale | 0.3533736 | 0.2093694 | 1.6877995 | 0.09184 |
log_total | 0.2111048 | 0.1555520 | 1.3571331 | 0.17513 |
unit_price | 0.0206734 | 0.0139899 | 1.4777356 | 0.13988 |
storeB:genderMale | -0.5281460 | 0.2975677 | -1.7748764 | 0.07630 |
storeC:genderMale | -0.4069470 | 0.2988545 | -1.3616893 | 0.17368 |
log_total:unit_price | -0.0037972 | 0.0025461 | -1.4913614 | 0.13627 |

I’m liking this model better. We have similar results (low p-values and decently-sized estimates) and we are using fewer variables. Let’s stick with this model for the rest of the guide.

Now let’s assess the model and see if regression was a good choice. We’ll start with overall performance and then check the assumptions.

```
model_performance(mlr_mod_full) %>%
  kable()
```

AIC | AICc | BIC | R2 | R2_adjusted | RMSE | Sigma |
---|---|---|---|---|---|---|
3157.101 | 3157.978 | 3241.402 | 0.0127147 | -0.0074855 | 1.706068 | 1.724512 |

```
model_performance(mlr_mod_reduced) %>%
  kable()
```

AIC | AICc | BIC | R2 | R2_adjusted | RMSE | Sigma |
---|---|---|---|---|---|---|
3144.37 | 3144.65 | 3191.204 | 0.0086671 | -0.0013717 | 1.709562 | 1.719272 |

We will look at the AIC and BIC as we did with SLR, and I’d also like to introduce the Root Mean Square Error (RMSE). The RMSE is the standard deviation of our prediction errors. We’d like this to be small, as that means our errors are less spread out (and therefore smaller).

We can’t interpret any of these statistics alone; we use them to compare models, which is why I ran the model performance function on both our full model and our reduced model. We do see both the AIC and the BIC decrease slightly, which is good. Our RMSE actually increases here, but by a very small amount. My biggest takeaway is that the reduced model and the full model are very similar in terms of their performance. In cases like this, we would prefer to choose the model with the fewest variables. Why use more when we can do similarly with fewer?

Another thing to note is that our RMSE is pretty large compared to the range of the response variable, `rating`. Most of our errors are around 1.7, with some extending beyond that (bigger errors) and some less. Given that `rating` only spans from 4 to 10, an error of 1.7 could mean the difference between a neutral customer and a satisfied customer. Because the error is so big compared to the response, I don’t have much confidence in the accuracy of this model.
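The RMSE above came from the training data; since we held out a test set, we can also compute a test-set RMSE. A sketch assuming the objects created earlier (`predict()` on a tidymodels fit returns a `.pred` column, and `rmse()` comes from yardstick, loaded with tidymodels):

```r
# Test-set RMSE for the reduced model.
preds <- predict(mlr_mod_reduced, new_data = test_subset) %>%
  bind_cols(test_subset)

rmse(preds, truth = rating, estimate = .pred)

# Equivalent by hand: the root of the mean squared prediction error.
sqrt(mean((preds$rating - preds$.pred)^2))
```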

Now on to the assumptions. Most of our assumptions are the same as SLR; however, we do have one more which involves collinearity (correlation between predictors).

```
diagnostics <- plot(check_model(mlr_mod_reduced, panel = FALSE))
```

```
## Model has interaction terms. VIFs might be inflated.
## You may check multicollinearity among predictors of a model without
## interaction terms.
```
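As the message suggests, one way to assess collinearity is to refit the model without interaction terms and inspect the variance inflation factors (VIFs) with `check_collinearity()` from the performance package (already loaded above). A hedged sketch, using a plain `lm()` refit on `train_subset`:

```r
# Main-effects-only refit for a cleaner VIF check.
main_effects <- lm(rating ~ store + gender + log_total + unit_price,
                   data = train_subset)
check_collinearity(main_effects)
```

VIFs below about 5 are commonly taken to indicate low concern.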

First, can we assume linearity between the response and the predictor?

```
diagnostics[[2]] +
  my_theme
```