Hey there! This is the second guide in my Modeling Guide with R series. In the previous guide, we performed some preliminary explorations on our dataset and took a look at a few of the variables. We cleaned the dataset up a bit and exported it so we could read it again in later guides.

This page will focus on both simple and multiple linear regression. Although, I cannot cover all of the details and fine use cases of linear regression, we will explore some of the key ideas to focus on when creating and analyzing a linear regression model. We’ll start by importing our data and identifying variables to use in the model then move to creating, interpreting, and testing the model. In both simple and linear regression, we’ll use train/test validation and we’ll also look at several candidate models.

We will be using the same dataset for both the simple and multiple
regression cases so let’s get that set up first. As mentioned in the
previous, guide, I prefer to use the `tidyverse`

family of
packages. We’ll also be using the `tidymodels`

collection of
packages to set up the models and perform our train/test validation and
`GGally`

to make a scatterplot matrix. The
`performance`

package helps us check the model after we’ve
created it and the others are helpful for rendering good tables.

The code below imports our packages and the data we cleaned in the previous guide. It also splits the data into a train and test set. We will train our models using the training set and will test its performance using the test set. We do this to simulate how the model will perform in the “real world”.

I will be setting the seed for the random generator so we can get the
same results each time I start a new R session and split the data. You
may or may not wish to do this in your work. I want to make sure we get
a roughly equal amount of each store in the split sets so I will take a
stratified sample to make the split (`strata = store`

).

```
library(tidyverse)
library(tidymodels)
library(GGally)
library(performance)
library(knitr)
library(kableExtra)
retail <- read_csv(here::here("data/retail_clean.csv"))
set.seed(52319)
retail_split <- initial_split(retail, prop = .8, strata = store)
train <- training(retail_split)
test <- testing(retail_split)
```

Alright! Now that we have that set, we’ll use the `train`

data for all the model creation and will only use the `test`

data when we are analyzing our models.

Although I’ll avoid any complex math throughout this guide, it is important to think about how regression works. The formula for simple linear regression for a population is

\[ y = \beta_0 + \beta_1 x_1 + \varepsilon \]

which states that we can represent the relationship between our predictor variable and our response variable as a straight line plus some natural error (\(\varepsilon\)). The problem is, we don’t know the information about our population (i.e., we don’t have data from all transactions from all stores). Instead, we have a subset of that data (some transactions froms ome stores) and are looking to predict the population values using our sample. As a result, we are creating a model of the form

\[ \hat{y} = \hat{\beta_0} + \hat{\beta_1} x_1 \] There is a difference here. First, we’ve lost the error term since we are now in prediction mode (there is no error since we know the data we collected). You also now see hats above a few of the values. This indicates that they are estimates of the true values.

When might we want to use SLR? Well, first we need to make sure we have the correct variable types:

- The response variable (the variable we are predicting) must be continuous.
- The predictor variable (the variable we are using to predict) can be either continuous or categorical with two levels.
- We only get one predictor variable and one response variable.

Second, we have to meet certain criteria. See, simple linear regression can be somewhat limiting as it requires us to make certain assumptions about our data and the population we are hoping to model due to the mathematical formulas and matrix algebra used. This is what we should pay attention to:

- The relationship between the predictor variable and the response variable(s) should be linear
- The errors of our predictions should be approximately normally distributed
- There should be a constant variability in our errors
- Our errors should be independent (we will not focus on this here)

These will be explored later on when we investigate our variables.

Given all of that information, let’s choose the variables we want to
use in the model. Suppose we want to have a way to predict the rating of
a transaction. The `rating`

variable is continuous so it is
acceptable to use as a response variable. To try and predict
`rating`

, we will consider a few candidate models (each with
one predictor variable). Let’s start with `total`

, and
`unit_price`

as possible predictors.

At this point, it’s a good idea to write down your predictions about
what might happen. This is important as you want to make sure you don’t
create a prediction based on your results so you don’t create any bias.
At this point, I am thinking that `unit_price`

will have the
most effect on `rating`

since I think that when items cost
more, people are more likely to be grumpy and give a lower rating.

Using the formula above, we can create two possible models that we
want to use to predict `rating`

:

\[ \widehat{\text{rating}} = \hat{\beta_0} + \hat{\beta_1} * \text{total} \]

\[ \widehat{\text{rating}} = \hat{\beta_0} + \hat{\beta_1} * \text{unit_price} \] In each case, we are trying to find the values for \(\hat{\beta_0}\) and \(\hat{\beta_1}\) using our predictor variable.

Let’s build the model! We will use the `tidymodels`

family
of packages for this. We first set up the regression engine, then fit
the models with our data. Based on the work done in the first guide, we
saw that `total`

has some right skew to it due to some
outliers to the right. As a result, we are going to take the log of
`total`

.

```
train_subset <- train %>%
select(rating, unit_price, total) %>%
mutate(log_total = log(total))
lm_spec <- linear_reg() %>%
set_mode("regression") %>%
set_engine("lm")
slr_mod_price <- lm_spec %>%
fit(rating ~ unit_price, data = train_subset)
slr_mod_total <- lm_spec %>%
fit(rating ~ log_total, data = train_subset)
```

Above, I fit two different models, one with `unit_price`

and one with `total`

. Let’s take a look at the output for
both of them.

```
tidy(slr_mod_price) %>%
kable() %>%
kable_styling(c("striped", "responsive"))
```

term | estimate | std.error | statistic | p.value |
---|---|---|---|---|

(Intercept) | 6.987286 | 0.1422717 | 49.1122698 | 0.0000000 |

unit_price | 0.000548 | 0.0023023 | 0.2380066 | 0.8119372 |

```
tidy(slr_mod_total) %>%
kable() %>%
kable_styling(c("striped", "responsive"))
```

term | estimate | std.error | statistic | p.value |
---|---|---|---|---|

(Intercept) | 6.9220710 | 0.366991 | 18.8616934 | 0.0000000 |

log_total | 0.0176607 | 0.066701 | 0.2647747 | 0.7912515 |

We can use this output to create a regression line equation using the above formulas:

\[ \widehat{\text{rating}} = 6.99 + 0.000548 * \text{unit price} \]

\[
\widehat{\text{rating}} = 6.92 + 0.0177 * \log{\text{(total)}}
\] In other words, because the slopes for both of the lines are
very small, we have very little evidence of a linear relationship
between `rating`

and both `unit_price`

and the log
of `total`

.

Let’s see how the models perform! My guess is not very well based on the information so far.

```
model_performance(slr_mod_price) %>%
kable()
```

AIC | AICc | BIC | R2 | R2_adjusted | RMSE | Sigma |
---|---|---|---|---|---|---|

3137.269 | 3137.299 | 3151.319 | 7.11e-05 | -0.0011835 | 1.716958 | 1.71911 |

```
model_performance(slr_mod_total) %>%
kable()
```

AIC | AICc | BIC | R2 | R2_adjusted | RMSE | Sigma |
---|---|---|---|---|---|---|

3137.255 | 3137.286 | 3151.305 | 8.8e-05 | -0.0011666 | 1.716943 | 1.719096 |

We can get a good idea of this by looking at \(R^2\), or the proportion of the variance in
`rating`

we have explained. When we use both
`unit_price`

and `total`

, we can explain less than
1% of the variance in `rating`

. Not so good! There is very
little difference between the two models. The AIC, or Akaike’s
information criteria assess our error. We want to see it decrease as we
look at different models. The same goes for the BIC, or Bayesian
information criterion. We also want to see \(R^2\) increase.

Due to the similar models, I will be finishing the SLR section with
only one model: the one with `log(total)`

which does seem to
be the slightly “better” model here.

The `performance`

packages offers us a nice glance at some
plots to help answer if we have met the assumptions.

`diagnostics <- plot(check_model(slr_mod_total, panel = FALSE))`

First, can we assume linearity between the response and the predictor?

```
diagnostics[[2]] +
my_theme
```

Well, this is a great example of how most real life data is not going to be perfectly linear. This data certainly shows no linear pattern; in fact, this looks more like a random scatter with no pattern than anything else. It’s close, but I wouldn’t say that we have satisfied this condition. We can see as well that the reference line in the plot is not flat for lower fitted values.

```
cor(train$rating, log(train$total)) %>%
round(5) %>%
kable() %>%
kable_styling(full_width = FALSE)
```

x |
---|

0.00938 |

We can also see that the correlation between the two variables is
very close to zero indicating that there is low evidence of a strong
relationship between `rating`

and the log of
`total`

.

```
diagnostics[[3]] +
my_theme
```