Welcome

Hey there! This is the fourth guide in my Modeling Guide with R series. In the previous guide, we looked at multiple linear regression. This page will focus on logistic regression. Although I cannot cover every detail and use case of regression, we will explore some of the key ideas to focus on when creating and analyzing a regression model. We’ll start by importing our data and identifying variables to use in the model, then move on to creating, interpreting, and testing the model. As with SLR and MLR, we’ll use train/test validation, and we’ll also look at a couple of candidate models.

We will be using the same transaction dataset we have used in the earlier guides. As mentioned in the previous guide, I prefer to use the tidyverse family of packages. We’ll also be using the tidymodels collection of packages to set up the models and perform our train/test validation.

The code below imports our packages and the data we cleaned in the first guide. It also splits the data into a training set and a test set. We will train our models on the training set and test their performance on the test set. We do this to simulate how the models will perform in the “real world”.

I will be setting the seed for the random number generator so we get the same results each time I start a new R session and split the data. You may or may not wish to do this in your work. I want roughly the same proportion of each store in both sets, so I will take a stratified sample to make the split (strata = store).

library(tidyverse)
library(tidymodels)
library(GGally)
library(gvsu215)
library(car)
library(performance)
library(knitr)
library(kableExtra)

retail <- read_csv(here::here("data/retail_clean.csv"))

set.seed(52319)
retail_split <- initial_split(retail, prop = .8, strata = store)

train <- training(retail_split)
test <- testing(retail_split)

Alright! Now that we have that set, we’ll use the train data for all the model creation and will only use the test data when we are analyzing our models.
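If you want to confirm that stratifying did what we hoped, one option is to compare the share of each store in the two sets. The sketch below uses a made-up data frame (toy, with hypothetical stores A, B, and C) rather than the retail data, so the names and numbers are assumptions for illustration only.

```r
library(rsample)  # part of tidymodels; provides initial_split()

set.seed(52319)

# Hypothetical toy data: three stores of unequal size
toy <- data.frame(
  store = rep(c("A", "B", "C"), times = c(500, 300, 200)),
  total = runif(1000, min = 10, max = 1000)
)

# Same call pattern as the split above, applied to the toy data
toy_split <- initial_split(toy, prop = .8, strata = store)
toy_train <- training(toy_split)
toy_test <- testing(toy_split)

# Share of each store in each set; the two rows should be close
train_props <- prop.table(table(toy_train$store))
test_props <- prop.table(table(toy_test$store))
round(rbind(train = train_props, test = test_props), 3)
```

The same comparison should work on train and test by swapping in those data frames.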

Logistic Regression

Although most of the processes for logistic regression will be the same as or very similar to SLR and MLR, there are a few key differences. The key piece is what our response variable looks like. Remember, for SLR and MLR, our response variable had to be continuous. But what if you want to predict a variable that isn’t continuous? Well, logistic regression may come to the rescue.

Requirements

In order to get valid results from logistic regression, we must meet certain assumptions.

  • The response variable must be categorical with exactly two levels. We might also call this a binary variable.
  • All of the observations should be independent.
  • Our predictor variable(s) should not be highly correlated with each other (no multicollinearity).
  • We must have a linear relationship between the continuous independent variable(s) and the log odds (more on this later).
  • All possible combinations of the levels of the categorical variables in the model must be represented in the data.
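Since the log odds show up in the requirements, here is a quick sketch of how a probability, the odds, and the log odds relate. The probabilities are made up for illustration. qlogis() and plogis() are the base R logit and inverse logit functions.

```r
# Made-up probabilities of being a member
p <- c(0.1, 0.5, 0.8)

# Odds: probability of the event divided by probability of no event
odds <- p / (1 - p)

# Log odds (the logit): natural log of the odds
log_odds <- log(odds)

# qlogis() computes the same log odds directly from the probabilities
all.equal(log_odds, qlogis(p))

# plogis() reverses the transformation and returns the probabilities
plogis(log_odds)
```

A probability of 0.5 corresponds to odds of 1 and log odds of 0. Probabilities above 0.5 give positive log odds and probabilities below 0.5 give negative log odds.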

Selecting Variables

First, let’s decide what we want to predict. Again, we need a variable that has two levels. We have two options: customer_type and gender. I would like to focus on customer type and see if we can predict the probability that a customer is a member given information about their transaction. Do members shop differently than non-members? I am inclined to say so.

We should also define our “base” level for customer type, or which type we want to compare everything to. I am going to create a true binary variable for customer type which takes on a value of 1 if a customer is a member and 0 if they are not. We also need to coerce this variable into an R factor in order for the code later on to work.

train <- train %>% 
  mutate(member = as.factor(ifelse(customer_type == "Member", 1, 0)))

test <- test %>% 
  mutate(member = as.factor(ifelse(customer_type == "Member", 1, 0)))
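It can be worth checking how the new factor is coded before modeling. The sketch below uses a small made-up vector (the "Normal" label for non-members is an assumption, not taken from the data) and the same ifelse() recoding as above.

```r
# Hypothetical customer types; "Normal" stands in for non-members
customer_type <- c("Member", "Normal", "Member", "Normal", "Normal")

# Same recoding as above: 1 for members, 0 otherwise, stored as a factor
member <- as.factor(ifelse(customer_type == "Member", 1, 0))

# The level order matters: "0" comes first, "1" second
levels(member)

# Cross-tabulate to confirm every "Member" maps to 1
table(customer_type, member)
```

With a factor response, glm() with a binomial family treats the first level as the "failure", so this coding models the probability of being a member. Some tidymodels metric functions treat the first level as the event by default, so it is a good idea to keep the level order in mind when evaluating the model.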

Technically, we can split up logistic regression into simple logistic regression (one predictor variable) and multiple logistic regression (more than one predictor variable). However, I am going to jump right into multiple logistic regression to create a more detailed model. As I did with MLR, I am going to start by choosing every variable that I think might play a role in predicting whether a customer is a member and use the regression output to narrow things down.

I would like to start with:

  • store
  • gender
  • product_class
  • pay_method
  • rating
  • total

Exploring the Variables

Before beginning, we should look at our variables to see if we need to apply any transformations or add any interaction variables.

train_subset <- train %>% 
  select(member, store, gender, product_class, pay_method, rating, total)

test_subset <- test %>% 
  select(member, store, gender, product_class, pay_method, rating, total)

ggpairs(train_subset) +
  my_theme

Plot matrix of each variable plotted against the others.

My main observation here is that total is skewed right, which produces some outliers in its plots against the other variables. The values for total can also get quite large. If we’re not careful, these large values could influence our results.

Here is that distribution up close:

train_subset %>% 
  ggplot(aes(total)) +
  geom_histogram(binwidth = 20, color = "black", fill = "#099392") +
  labs(x = "Total ($)",
       y = "Count",
       title = "Distribution of Transaction Totals") +
  my_theme

Skewed right distribution of total. Median around $200, max around $1100.

Previously we used the log to help correct this. Let’s try something different here and take the square root of total to help bring down those extreme values.

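# Histogram of the square root of total; bins = 30 matches the default
# and avoids the "pick a better binwidth" message
train_subset %>% 
  ggplot(aes(sqrt(total))) +
  geom_histogram(bins = 30, color = "black", fill = "#099392") +
  labs(x = "Square Root of Total",
       y = "Count",
       title = "Distribution of the Square Root of Transaction Totals") +
  my_theme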
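If you would like a number to go with the picture, one option is to compare a simple skewness measure before and after the transformation. The sketch below uses simulated right-skewed values (toy_total) rather than the real totals, and skewness() is a small helper defined here, not a function from one of the packages loaded above.

```r
set.seed(1)

# Simulated right-skewed "totals" standing in for the real column
toy_total <- rlnorm(1000, meanlog = 5.3, sdlog = 0.8)

# Helper: sample skewness (values near 0 indicate a roughly symmetric shape)
skewness <- function(x) {
  mean((x - mean(x))^3) / sd(x)^3
}

# Compare the raw values with the square root and log versions
skews <- c(
  raw = skewness(toy_total),
  sqrt = skewness(sqrt(toy_total)),
  log = skewness(log(toy_total))
)
round(skews, 2)
```

On right-skewed data like this, the square root pulls in the long tail less aggressively than the log does, so it sits between the raw values and the log in terms of skewness.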