Hey there! This is the seventh guide in my Introduction to Modeling with R series. In the previous guide, we looked at generalized linear models and Poisson regression, which involved a count response variable. This page will focus on bootstrapping, which is a method of resampling with replacement. As with all the previous guides, we’ll start by importing our data. I’ll then discuss the bootstrapping process a little more and demonstrate how it works.
The code below imports our packages and the data we cleaned in the previous guide. It also splits the data into a training set and a test set. We will train our models using the training set and test their performance using the test set. We do this to simulate how the models will perform in the “real world”.
library(tidyverse)
library(tidymodels)
library(GGally)
library(rsample)
library(knitr)
library(kableExtra)
retail <- read_csv(here::here("data/retail_clean.csv")) %>%
  mutate(hour = hour(time)) %>%
  select(-c(id, date, time))
What exactly is bootstrapping? In simple terms, it’s a way to study the population values of your model without doing a formal hypothesis test. Bootstrapping uses the data that you have already collected to create many, many samples, as if you had sampled your population many, many times to get many, many datasets. Remember, the purpose of modeling (and by extension, hypothesis tests) is to use a subset of a population to predict something about that population. We want to be able to use a sample to make educated guesses about features of a population. In our case, we want to use a subset of customers from a set of the company’s stores to make predictions about how all customers from all of the company’s stores shop.
In “traditional” statistics we can attempt to model the data using equations. We have seen some of these models earlier (SLR, MLR, logistic regression, etc.) and there are other options as well, depending on what you are predicting and what your data looks like. Instead of using one model on one sample dataset, bootstrapping uses resampling to fit many models on many sample datasets and then summarizes the results. How does it do this? Well, it takes the original dataset and creates many samples from it.
Think about the primary dataset as a pool of possible values for bootstrapping to pull from. Here, the dataset named retail contains all of the possible values that could appear in a new dataset. When we bootstrap, we are pulling values from that pool to create another dataset of the same size. retail has 1000 observations and 11 variables. Each dataset that the bootstrapping process creates will also have 1000 observations and 11 variables. When creating a dataset, bootstrapping will randomly select a value from the pool for each variable.
For example, in our pool there are
retail %>%
  select(store) %>%
  group_by(store) %>%
  count() %>%
  kable() %>%
  kable_styling(c("striped", "responsive"), full_width = FALSE)
| store | n |
|---|---|
| A | 340 |
| B | 332 |
| C | 328 |
340 “A”s, 332 “B”s, and 328 “C”s to choose from.
Let’s suppose we were manually doing bootstrapping and creating a new dataset ourselves. How would we do it? Well, let’s make one observation at a time. Our first observation’s first variable is store. To pick this observation’s store value, we will randomly select a store value from the pool. Each value in the pool has an equal chance of being selected.
set.seed(2024)
sample(retail$store, 1)
## [1] "C"
Alright, that is the value for our first observation’s store. Next we’ll grab a value for customer_type.
sample(retail$customer_type, 1)
## [1] "Normal"
And there is our value for customer_type.
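We’ll continue this process for each variable. Then we’ll do it again for the second observation. And then again. We’ll do that process 1000 times in total to create a full dataset with new values. It is important to note that we sample with replacement: each value that we select is put back into the pool and could be selected again later in the dataset. (In practice, bootstrapping functions such as bootstraps() from rsample draw whole observations at once, so the values in a row stay together. Picking one variable at a time here is a simplification to show the idea.)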
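To make the "put it back in the pool" idea concrete, here is a minimal sketch of building one resampled dataset by hand in base R. It assumes the common row-wise version of bootstrapping, where whole observations are drawn with replacement. The toy data frame and its column names are made up for illustration and stand in for retail.

```r
# Toy stand-in for `retail`; the columns and values are made up for illustration.
set.seed(2024)
toy <- data.frame(
  store = sample(c("A", "B", "C"), 10, replace = TRUE),
  total = round(runif(10, 10, 100), 2)
)

# Draw row numbers with replacement: every row has an equal chance on every draw,
# so the same row can be picked more than once.
idx <- sample(nrow(toy), size = nrow(toy), replace = TRUE)

# One bootstrap dataset, the same size as the original.
boot_one <- toy[idx, ]

nrow(boot_one)        # same number of rows as `toy`
sum(duplicated(idx))  # how many draws repeated a row that was already picked
```

Because the draws are made with replacement, some rows typically show up several times in boot_one while others are left out entirely.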
Once we have one dataset, we’ll do it again. And again. And again. Bootstrapping is powerful in that it does this dataset creation process for us and it can do it very quickly. We can create as many resampled datasets as we want but typically you will see the number of these datasets get into the thousands.
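As a rough sketch of what "do it again and again" can look like in code, the rsample package has a bootstraps() function that creates many resampled datasets in one call. The toy data frame below is made up for illustration, and 1000 resamples is just an example number, not a recommendation from this guide.

```r
library(rsample)

# Toy stand-in for `retail`; the columns and values are made up for illustration.
set.seed(2024)
toy <- data.frame(
  store = sample(c("A", "B", "C"), 50, replace = TRUE),
  total = runif(50, 10, 100)
)

# Create 1000 resamples; each one is drawn from `toy` with replacement.
boots <- bootstraps(toy, times = 1000)

# analysis() pulls the resampled data frame out of a single split.
first_boot <- analysis(boots$splits[[1]])
dim(first_boot)  # same dimensions as `toy`
```

Each row of boots holds one resample, so we can loop over boots$splits when we want to fit a model to every dataset.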
Once we have a collection of datasets, we will fit a model to each of them. Essentially, we will pick a model (e.g., MLR, multinomial regression, etc.) and use that same model on each of the thousands of datasets. Then we look at the results of all of the models and see the variety of results that we get. We can then see how common certain results are and create a confidence interval for our model results. That interval gives us a range of plausible values for the true population quantity, such as a regression coefficient.
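Here is a small base R sketch of that last step: fitting the same model to every resampled dataset and turning the collection of results into an interval. The data are simulated, the variable names are made up, and the percentile interval used here is only one of several ways to build a bootstrap confidence interval, so treat it as an illustration and not necessarily the approach used later in this guide.

```r
# Simulated data with a made-up linear relationship between x and y.
set.seed(2024)
n <- 200
toy <- data.frame(x = rnorm(n))
toy$y <- 2 + 0.5 * toy$x + rnorm(n)

# Fit the same linear model to each of B resampled datasets and keep the slope.
B <- 1000
slopes <- replicate(B, {
  idx <- sample(n, n, replace = TRUE)         # resample rows with replacement
  coef(lm(y ~ x, data = toy[idx, ]))[["x"]]   # slope from this resample
})

# Percentile interval: the middle 95% of the bootstrap slopes.
ci <- quantile(slopes, c(0.025, 0.975))
ci
```

A histogram of slopes shows how much the estimate moves from one resample to the next, and ci summarizes that spread in two numbers.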
This was a long explanation with a lot of words. Let’s use bootstrapping on our dataset to see how it works.
It’s always a good idea to look at our variables first to get an idea of how they relate to one another. We’ve done this a bit in previous guides, but let’s get in the habit of doing it.
retail %>%
  ggpairs() +
  my_theme