Welcome

Hey there! This is the seventh guide in my Introduction to Modeling with R series. In the previous guide, we looked at generalized linear models and Poisson regression which involved a count response variable. This page will focus on bootstrapping, which is a method of resampling (with replacement). As will all the previous guides, we’ll start by importing our data. I’ll then discuss a little more about the bootstrapping process and then I’ll demonstrate how it works.

The code below imports our packages and the data we cleaned in the previous guide. It also splits the data into a train and test set. We will train our models using the training set and will test its performance using the test set. We do this to simulate how the model will perform in the “real world”.

library(tidyverse)
library(tidymodels)
library(GGally)
library(rsample)
library(knitr)
library(kableExtra)

retail <- read_csv(here::here("data/retail_clean.csv")) %>% 
  mutate(hour = hour(time)) %>% 
  select(-c(id, date, time))

Bootstrapping

What exactly is bootstrapping? In simple terms, it’s a way to study the population values of your model without doing a formal hypothesis test. Bootstrapping uses the data that you have already collected to create many, many samples, as if you had sampled your population many, many times to get many, many datasets. Remember, the purpose of modeling (and by extension, hypothesis tests) is to use a subset of a population to predict something about that population. We want to be able to use a sample to make educated guesses about features of a population. In our case, we want to use a subset of customers from a set of stores from a company to make predictions about how all customers from the all stores (in the company) shop.

In “traditional” statistics we can attempt to model the data using equations. We have seen some of these models earlier (SLR, MLR, logistic regression, etc.) and there are other options as well depending on what you are predicting and what your data looks like. Instead of using one model on one sample dataset, bootstrapping uses resampling to create many models on many sample datasets and summarizing the results. How does it do this? Well, it takes the original dataset and creates many samples from it.

Think about the primary dataset as a pool of possible values for bootstrapping to pull from. Here, the dataset named retail contains all of the possible values that could appear in a new dataset. When we bootstrap, we are pulling values from that pool to create another dataset of the same size. retail has 1000 observations and 11 variables. Each dataset that the bootstrapping process creates will also have 1000 observations and 11 variables. When creating a dataset, bootstrapping will randomly select a value from the pool for each variable.

For example, in our pool there are

retail %>% 
  select(store) %>% 
  group_by(store) %>% 
  count() %>% 
  kable() %>% 
  kable_styling(c("striped", "responsive"), full_width = FALSE)
store n
A 340
B 332
C 328

340 “A”s, 332 “B”s, and 328 “C”s to choose from.

Let’s suppose we were manually doing bootstrapping and creating a new dataset ourselves. How would we do it? Well, let’s make one observation at a time. Our first observation’s first variable is store. To pick this observation’s store value, we will randomly select a store value from the pool. Each value in the pool has an equal chance of being selected.

set.seed(2024)
sample(retail$store, 1)
## [1] "C"

Alright, that is the value for our first observation’s store. Next we’ll grab a value for customer_type.

sample(retail$customer_type, 1)
## [1] "Normal"

And there is our value for customer_type.

We’ll continue this process for each variable. Then we’ll do it again for the second observation. And then again. We’ll do that process 1000 times in total to create a full dataset with new values. It is important to note that we will sample with replacement. Each value that we select for each observation is put back into the pool and could be selected again later in the dataset.

Once we have one dataset, we’ll do it again. And again. And again. Bootstrapping is powerful in that it does this dataset creation process for us and it can do it very quickly. We can create as many resampled datasets as we want but typically you will see the number of these datasets get into the thousands.

Once we have a collection of datasets, we will create a model for each of them. Essentially, we will pick a model (e.g., MLR, multinomial regression, etc.) and use that same model on each of the thousands of datasets. Then we look at the results of all of the models and see the variety of results that we get. We can then see how common certain results are and create a confidence interval for our model results. We can then use this confidence interval to see the likely values for a model of the true population values.

This was a long explanation with a lot of words. Let’s use bootstrapping on our dataset to see how it works.

Looking at Variables

It’s always a good idea to look at our variables first to get an idea of how they related to one another. We’ve done this a bit in previous guides, but let’s get in the habit of doing it.

retail %>% 
  ggpairs() +
  my_theme

Plot matrix of all pairs of varibles.

This makes a very large plot with hard-to-read axes but gives us a general idea of how our data interacts. We can also see the distributions of all of the variables. For more insights about this plot, see the first guide which goes over generalities regarding the data such as skewness and interactions.

Model and Variable Selection

Before actually starting to bootstrap, we have to pick a response variable and a corresponding model. Let’s revisit the model we made when covering multiple linear regression. There, we were trying to predict the rating a customer would give a transaction in an effort to determine what aspects of the transaction lead to negative vs. positive ratings. Our model wasn’t very good at predicting rating but what if that was because our specific dataset wasn’t very good? Let’s try to use bootstrapping and generate many datasets and see what our results are. First, we set up the model (we’ll use the reduced model we used in the MLR guide).

retail_subset <- retail %>% 
  mutate(log_total = log(total)) %>% 
  select(rating, store, gender, log_total, unit_price)

lm_spec <- linear_reg() %>%
  set_mode("regression") %>%
  set_engine("lm")

mlr_mod_reduced <- lm_spec %>% 
  fit(rating ~ store + gender + log_total + unit_price + store*gender + log_total*unit_price, data = retail_subset)

tidy(mlr_mod_reduced) %>% 
  kable() %>% 
  kable_styling(c("striped", "responsive"))
term estimate std.error statistic p.value
(Intercept) 5.8190113 0.6956637 8.3646903 0.0000000
storeB 0.0397158 0.1908554 0.2080935 0.8351987
storeC 0.3231238 0.1865013 1.7325555 0.0834858
genderMale 0.3388920 0.1866309 1.8158411 0.0696968
log_total 0.2031692 0.1385256 1.4666546 0.1427873
unit_price 0.0227171 0.0125426 1.8112040 0.0704120
storeB:genderMale -0.4768790 0.2650869 -1.7989534 0.0723302
storeC:genderMale -0.5291815 0.2663658 -1.9866720 0.0472338
log_total:unit_price -0.0042275 0.0022734 -1.8596093 0.0632370

Now, we can create our many resampled datasets. The code below takes 2000 random samples (with replacement) from our original dataset to create 2000 new datasets of size 1000 x 5.

set.seed(2024)

# Generate the 2000 bootstrap samples
boot_samps <- retail_subset %>% 
  bootstraps(times = 2000)

This actually creates two datasets for each iteration. One is an assessment dataset and another is the analysis dataset. We will only be working with the analysis datasets here. We can view the first analysis dataset to see what it looks like.

boot_samps$splits[[1]] %>% 
  analysis() %>% 
  head() %>% 
  kable() %>% 
  kable_styling(c("striped", "responsive"))
rating store gender log_total unit_price
6.2 C Male 4.893607 31.77
4.2 B Female 5.600586 51.54
9.5 B Female 5.019159 72.04
8.0 C Male 6.931228 97.50
8.9 A Male 6.204073 58.90
6.8 A Female 6.662121 93.12

R has randomly selected values for each row. The pool of possible values it could select from was contained in the original dataset. This dataset is one of 2000 datasets created using our original dataset and represents a possible dataset we could have collected from the real world.

Ok, so now we have a ton of datasets. The next step is to apply our linear model onto each of those 2000 datasets. This will give us separate parameter estimates and performance metrics for each of the datasets. This step can take a second to run. Remember, we are creating model fits for 2000 models, which is quite a lot.

fit_model <- function(split) {
  
  lm(rating ~ store + gender + log_total + unit_price + store*gender + log_total*unit_price, data = analysis(split))
  
}

boot_models <- boot_samps %>% 
  mutate(
    model = map(splits, fit_model),
    coef_info = map(model, tidy)
    )

boots_coefs <- boot_models %>% 
  unnest(coef_info)

boots_coefs %>% 
  select(-c(splits, model)) %>% 
  head(n = 18) %>% 
  kable() %>% 
  kable_styling(c("striped", "responsive"), fixed_thead = TRUE)
id term estimate std.error statistic p.value
Bootstrap0001 (Intercept) 6.9943424 0.7122561 9.8199828 0.0000000
Bootstrap0001 storeB -0.3371571 0.1948487 -1.7303534 0.0838785
Bootstrap0001 storeC -0.2271109 0.1963723 -1.1565323 0.2477422
Bootstrap0001 genderMale -0.0721000 0.1909760 -0.3775347 0.7058571
Bootstrap0001 log_total -0.0023181 0.1412043 -0.0164168 0.9869052
Bootstrap0001 unit_price 0.0109540 0.0126999 0.8625204 0.3886099
Bootstrap0001 storeB:genderMale 0.0721744 0.2698356 0.2674755 0.7891587
Bootstrap0001 storeC:genderMale 0.1497817 0.2772830 0.5401762 0.5891969
Bootstrap0001 log_total:unit_price -0.0015062 0.0023282 -0.6469506 0.5178137
Bootstrap0002 (Intercept) 6.5903414 0.6605340 9.9772929 0.0000000
Bootstrap0002 storeB -0.0607334 0.1923560 -0.3157346 0.7522705
Bootstrap0002 storeC 0.2305168 0.1908272 1.2079871 0.2273404
Bootstrap0002 genderMale 0.4044513 0.1824238 2.2170968 0.0268428
Bootstrap0002 log_total 0.0339787 0.1337317 0.2540812 0.7994855
Bootstrap0002 unit_price 0.0097506 0.0124286 0.7845285 0.4329176
Bootstrap0002 storeB:genderMale -0.3999014 0.2638026 -1.5159116 0.1298606
Bootstrap0002 storeC:genderMale -0.3755111 0.2652330 -1.4157780 0.1571547
Bootstrap0002 log_total:unit_price -0.0016329 0.0022585 -0.7230132 0.4698424

The output from boots_coefs contains all of the liner model output from each model and is 18,000 rows long. This is because each model has 9 rows (intercept through the interaction variables) and \(9*2000 = 18000\). I printed the first two models so we can see an example of what we are looking at. The parameter estimates and p-values are slightly different between each of the models. By taking 2000 samples, we are hoping that we get a good representation of all the possible datasets we could have collected.

However, the data in this form is not very helpful. Recall the purpose of modeling: we want to use data from a sample (selection of observations) to help predict or describe attributes about a population (full group of observations). Bootstrapping also attempts to do this. We have so many different models from many different datasets and we can use the values from each of the models to create a confidence interval. We can’t predict the population values exactly but we can use our 2000 models to get a good guess of where the population values fall. We’ll use the following code to generate a 95% confidence interval for each of our parameter estimates (i.e., one interval for the intercept, one for Store B, etc.).

boot_int <- int_pctl(boot_models, statistics = coef_info, alpha = 0.05)
boot_int %>% 
  kable() %>% 
  kable_styling(c("striped", "responsive"))
term .lower .estimate .upper .alpha .method
(Intercept) 4.3959979 5.7979442 7.1124693 0.05 percentile
genderMale -0.0404578 0.3341792 0.7062825 0.05 percentile
log_total -0.0579252 0.2087550 0.4904392 0.05 percentile
log_total:unit_price -0.0087193 -0.0042508 0.0001455 0.05 percentile
storeB -0.3661242 0.0409941 0.4152146 0.05 percentile
storeB:genderMale -0.9768234 -0.4783746 0.0549766 0.05 percentile
storeC -0.0382787 0.3173136 0.6717086 0.05 percentile
storeC:genderMale -1.0640932 -0.5212512 -0.0018762 0.05 percentile
unit_price -0.0017764 0.0227762 0.0469491 0.05 percentile

So, for instance, we are 95% confident that if we had built a model on the true population, we would have gotten a paramter value for the log of total of between -0.0579 and 0.490. Similarily, we are 95% confident that if we had built a model on the true population we would have gotten a parameter value for the interaction between Store C and the male gender between -1.06 and -0.00188.

Something to notice here is that nearly all of the intervals include zero. This meanas that zero is a possible value we could have gotten from the population. A value of zero for a coefficient estimate would indicate that the respective variable has zero effect on the response! For instance, an estimate value of zero for Store B would mean that if a customer shopped at Store B, we would predict their transaction rating to not change at all. Essentially, since the intervals contain zero, we can’t rule out 0 as a potential value for the true population which means that we can’t be confident that any of these variables have much of an effect on rating. This certainly contributes to our poor model performance we saw in previous guides.

We can visualize these limits as well.

ggplot(boots_coefs, aes(x = estimate)) +
  geom_histogram(bins = 30) +
  facet_wrap( ~ term, scales = "free") +
  geom_vline(data = boot_int, aes(xintercept = .lower), col = "#099392") +
  geom_vline(data = boot_int, aes(xintercept = .upper), col = "#099392") +
  labs(title = "Confidence Intervals for All Parameters in the Model") +
  my_theme

Histograms of confidence intervals for the values of all parameters in the model.

It is somewhat comforting to notice that 0 is, for many of the variables, at the tails of the distribution and is therefore less likely to obtain. However, it still is in the interval and this forces us to question the reliability of the variables in the dataset and the model as a whole.

All in all, the bootstrapping process has once again produced a model that isn’t very good at predicting rating. However, this is useful information to us. I have a few takeaways from this:

  • rating is more complex than just the variables in this dataset
  • The variables in this dataset interact with each other and should not be treated separately
  • We need more data and more variables in order to predict rating

All of these points, as well as our results from the previous guides, can be reported to the company to help them plan out future data collection projects and to help understand the complexity and variability of how a customer rates a transaction.

References

Dykes, Bradford. n.d. STA 631.” GitHub. Accessed April 18, 2024. https://github.com/gvsu-sta631.
Frost, Jim. 2018. “Introduction to Bootstrapping in Statistics with an Example.” Statistics By Jim. http://statisticsbyjim.com/hypothesis-testing/bootstrapping/.