Hey there! This is the fourth guide in my Modeling Guide with R series. In the previous guide, we looked at multiple linear regression. This page will focus on logistic regression. Although I cannot cover all of the details and fine use cases of regression, we will explore some of the key ideas to focus on when creating and analyzing a regression model. We’ll start by importing our data and identifying variables to use in the model, then move to creating, interpreting, and testing the model. As with SLR and MLR, we’ll use train/test validation and we’ll also look at a couple of candidate models.
We will be using the same dataset we have been using, which covers transaction data. As mentioned in the previous guide, I prefer to use the tidyverse family of packages. We’ll also be using the tidymodels collection of packages to set up the models and perform our train/test validation.
The code below imports our packages and the data we cleaned in the first guide. It also splits the data into a train and test set. We will train our models using the training set and test their performance using the test set. We do this to simulate how the model will perform in the “real world”.
I will be setting the seed for the random number generator so we can get the same results each time I start a new R session and split the data. You may or may not wish to do this in your work. I also want to make sure we get a roughly equal amount of each store in the split sets, so I will take a stratified sample to make the split (strata = store).
library(tidyverse)
library(tidymodels)
library(GGally)
library(gvsu215)
library(car)
library(performance)
library(knitr)
library(kableExtra)

# import the data we cleaned in the first guide
retail <- read_csv(here::here("data/retail_clean.csv"))

# set the seed so the split is reproducible, then take an 80/20 split
# stratified by store
set.seed(52319)
retail_split <- initial_split(retail, prop = .8, strata = store)
train <- training(retail_split)
test <- testing(retail_split)
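If you want to confirm the stratification did its job, a quick optional check (not part of the original workflow) is to compare the proportion of each store across the two sets; the proportions should be nearly identical:

# store proportions in the training set
train %>%
  count(store) %>%
  mutate(prop = n / sum(n))

# and in the test set
test %>%
  count(store) %>%
  mutate(prop = n / sum(n))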
Alright! Now that we have that set, we’ll use the train data for all the model creation and will only use the test data when we are analyzing our models.
Although most of the processes for logistic regression will be the same as or very similar to SLR and MLR, there are a few key differences. The key piece is what our response variable looks like. Remember, for SLR and MLR, our response variable had to be continuous. But what if you want to predict a variable that isn’t continuous? Well, logistic regression may come to the rescue.
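As a bit of standard background (this is the textbook form, not anything specific to our data yet): logistic regression models the log-odds of the response being a 1 as a linear combination of the predictors,

$$\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k$$

where p is the probability that the response equals 1. Solving for p always produces a value between 0 and 1, which is exactly what we need when predicting a two-level outcome.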
In order to get valid results from logistic regression, we must meet certain assumptions: the response has exactly two levels, the observations are independent of one another, the log-odds are linearly related to any numeric predictors, and the predictors are not highly collinear with one another.
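As a preview of how we’ll check that last assumption (the model here is a hypothetical placeholder, and member is the binary response we create below), the car and performance packages we loaded earlier both offer collinearity diagnostics:

# a placeholder fit just to demonstrate the diagnostics; we'll build
# the real model later in this guide
fit <- glm(member ~ rating + total, data = train, family = binomial)

# variance inflation factors; values well above 5 hint at collinearity
car::vif(fit)

# a tidier report of the same idea
performance::check_collinearity(fit)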
First, let’s decide what we want to predict. Again, we need a variable that has two levels. We have two options: customer_type and gender. I would like to focus on customer type and see if we can predict the probability that a certain customer is a member or not given information about their transaction. Do members shop differently than non-members? I am inclined to say so.
We should also define our “base” level for customer type, or which type we want to compare everything to. I am going to create a true binary variable for customer type which takes on a value of 1 if a customer is a member and 0 if they are not. We also need to coerce this variable into an R factor in order for the code later on to work.
# encode customer type as a binary factor: 1 = Member, 0 = not a member
train <- train %>%
  mutate(member = as.factor(ifelse(customer_type == "Member", 1, 0)))

test <- test %>%
  mutate(member = as.factor(ifelse(customer_type == "Member", 1, 0)))
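A quick optional sanity check that the recode behaved is to cross-tabulate the old and new columns; every Member row should land in member = 1:

train %>%
  count(customer_type, member)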
Technically, we can split up logistic regression into simple logistic regression (one predictor variable) and multiple logistic regression (more than one predictor variable). However, I am going to jump right into multiple logistic regression to create a more detailed model. As I did with MLR, I am going to start by choosing every variable that I think might play a role in predicting whether a customer is a member (the full candidate model is sketched just after the list below) and use the regression output to narrow things down.
I would like to start with:
store
gender
product_class
pay_method
rating
total
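In glm() terms, that full candidate model would look something like the call below. This is only a sketch of where we’re headed; we won’t fit or interpret anything until after we’ve inspected the variables:

# full candidate model: every predictor listed above
# family = binomial is what makes this logistic regression
member_full <- glm(member ~ store + gender + product_class + pay_method +
                     rating + total,
                   data = train, family = binomial)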
Before beginning, we should look at our variables to see if we need to apply any transformations or add any interaction variables.
# keep only the response and the candidate predictors
train_subset <- train %>%
  select(member, store, gender, product_class, pay_method, rating, total)

test_subset <- test %>%
  select(member, store, gender, product_class, pay_method, rating, total)

# plot every variable against every other variable
ggpairs(train_subset) +
  my_theme
My main observation here is that total is skewed right, which is causing some outliers with the other variables. The values for total can also get quite large. If we’re not careful, these large values could influence our results.
Here is that distribution up close:
train_subset %>%
ggplot(aes(total)) +
geom_histogram(binwidth = 20, color = "black", fill = "#099392") +
labs(x = "Total ($)",
y = "Count",
title = "Distribution of Transaction Totals") +
my_theme
Previously, we used the log transformation to help correct this. Let’s try something different here and take the square root of total to help bring down those extreme values.
train_subset %>%
  ggplot(aes(sqrt(total))) +
  geom_histogram(color = "black", fill = "#099392") +
  labs(x = "Square Root of Total",
       y = "Count",
       title = "Distribution of the Square Root of Transaction Totals") +
  my_theme
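Assuming we like how that looks, one convenient option (a sketch, since we haven’t settled on a final model yet) is to apply the transformation directly in the model formula instead of creating a new column:

# sqrt(total) is evaluated on the fly, so predict() on the test set
# will apply the same transformation automatically
member_sqrt <- glm(member ~ store + gender + product_class + pay_method +
                     rating + sqrt(total),
                   data = train, family = binomial)

Keeping the transform inside the formula means we don’t have to remember to mutate both the train and test sets.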