Welcome

Hey there! This is the fourth guide in my Modeling Guide with R series. In the previous guide, we looked at multiple linear regression. This page will focus on logistic regression. Although I cannot cover all of the details and fine use cases of regression, we will explore some of the key ideas to focus on when creating and analyzing a regression model. We’ll start by importing our data and identifying variables to use in the model, then move on to creating, interpreting, and testing the model. As with SLR and MLR, we’ll use train/test validation, and we’ll also look at a couple of candidate models.

We will be using the same dataset we have been using, which covers transaction data. As mentioned in the previous guide, I prefer to use the tidyverse family of packages. We’ll also be using the tidymodels collection of packages to set up the models and perform our train/test validation.

The code below imports our packages and the data we cleaned in the first guide. It also splits the data into a training set and a testing set. We will train our models using the training set and test their performance using the testing set. We do this to simulate how the model will perform in the “real world”.

I will be setting the seed for the random number generator so we get the same results each time I start a new R session and split the data. You may or may not wish to do this in your own work. I want to make sure we get a roughly equal proportion of transactions from each store in the split sets, so I will take a stratified sample to make the split (strata = store).

library(tidyverse)
library(tidymodels)
library(GGally)
library(gvsu215)
library(car)
library(performance)
library(knitr)
library(kableExtra)

retail <- read_csv(here::here("data/retail_clean.csv"))

# set a seed so the split is reproducible; stratify by store
set.seed(52319)
retail_split <- initial_split(retail, prop = .8, strata = store)

train <- training(retail_split)
test <- testing(retail_split)

Alright! Now that we have that set, we’ll use the train data for all the model creation and will only use the test data when we are analyzing our models.

Logistic Regression

Although most of the processes for logistic regression will be the same as or very similar to SLR and MLR, there are a few key differences. The key piece is what our response variable looks like. Remember, for SLR and MLR, our response variable had to be continuous. But what if you want to predict a variable that isn’t continuous? Well, logistic regression may come to the rescue.

Requirements

In order to get valid results from logistic regression, we must meet certain assumptions.

  • The response variable must be categorical with two levels, no more, no less (we might also call this a binary variable)
  • All of the observations should be independent
  • Our predictor variable(s) should not be correlated with each other
  • We must have a linear relationship between the predictor variable(s) and the log odds (more on this later)
  • All possible combinations of the variables in the model must be represented in the data

Selecting Variables

First, let’s decide what we want to predict. Again, we need a variable that has two levels. We have two options: customer_type and gender. I would like to focus on customer type and see if we can predict the probability that a certain customer is a member or not given information about their transaction. Do members shop differently than non-members? I am inclined to say so.

We should also define our “base” level for customer type, or which type we want to compare everything to. I am going to create a true binary variable for customer type, member, which takes on a value of 1 if a customer is a member and 0 if they are not. We also need to coerce this variable into an R factor in order for the code later on to work.

train <- train %>% 
  mutate(member = as.factor(ifelse(customer_type == "Member", 1, 0)))

test <- test %>% 
  mutate(member = as.factor(ifelse(customer_type == "Member", 1, 0)))

Technically, we can split up logistic regression into simple logistic regression (one predictor variable) and multiple logistic regression (more than one predictor variable). However, I am going to jump right into multiple logistic regression to create a more detailed model. As I did with MLR, I am going to start by choosing every variable that I think might play a role in predicting the member status of a customer and use the regression output to narrow things down.

I would like to start with:

  • store
  • gender
  • product_class
  • pay_method
  • rating
  • total

Exploring the Variables

Before beginning, we should look at our variables to see if we need to apply any transformations or add any interaction variables.

train_subset <- train %>% 
  select(member, store, gender, product_class, pay_method, rating, total)

test_subset <- test %>% 
  select(member, store, gender, product_class, pay_method, rating, total)

ggpairs(train_subset) +
  my_theme

Plot matrix of each variable plotted against the others.

My main observation here is that total is skewed right which is causing some outliers with the other variables. The values for total can also get quite large. If we’re not careful, these large values could influence our results.

Here is that distribution up close:

train_subset %>% 
  ggplot(aes(total)) +
  geom_histogram(binwidth = 20, color = "black", fill = "#099392") +
  labs(x = "Total ($)",
       y = "Count",
       title = "Distribution of Transaction Totals") +
  my_theme

Skewed right distribution of total. Median around $200, max around $1100.

Previously, we used the log to help correct this. Let’s try something different here and take the square root of total to help bring down those extreme values.

train_subset %>% 
  ggplot(aes(sqrt(total))) +
  geom_histogram(color = "black", fill = "#099392") +
  labs(x = "Total",
       y = "Count",
       title = "Distribution of the Square Root of Transaction Totals") +
  my_theme

Approximately normal distribution of the square root of total.

That looks better! The transformation is not ideal since it affects our interpretation of results, but it does help fix the skewness. I don’t see much evidence for creating an interaction variable for this model. However, I originally considered making one for store vs. gender and/or pay method vs. gender. I don’t think they’ll be necessary, but I will include them in our first model nonetheless.

Building the Model

Before we actually build the model, let’s take a look at how logistic regression differs from SLR and MLR.

Instead of predicting the value of a continuous response, we are predicting the probability of being in a certain class: here, whether a customer is a member or not. Since we chose to have “Member” represented by a 1, we are essentially finding the odds of a certain customer being a member, given the values of the other variables. We can define the odds of a “success” (a customer being a member) as:

\[ \frac{Pr(Y_i = 1)}{Pr(Y_i = 0)} = \frac{p_i}{1-p_i} \] which is the probability of a given person being a member divided by one minus that same probability.

The math behind predicting this can get pretty involved. However, I would like to show that if we take the log of the above odds, we get a formula like this (which is derived from the more complicated logistic equation):

\[ \begin{equation*} \log\left(\frac{p_i}{1-p_i}\right) = \beta_0 + \beta_1 X_1 + ... \end{equation*} \]

This should look familiar from linear regression! We have a list of predictors, and we would like to estimate the parameters that tell us how likely it is that a customer is a member. This response is formally called the “log odds” or “logit”.
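To make the logit scale concrete, here is a minimal sketch (using a made-up probability of 0.7, not a value from our data) of how base R moves between probability, odds, and log odds:

p <- 0.7

odds <- p / (1 - p)   # 2.33: the odds of being a member
log_odds <- log(odds) # 0.85: the logit, same as qlogis(p)

plogis(log_odds)      # back to 0.7: plogis() inverts the logit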

To build the model, we use code similar to before. This time, we use the logistic_reg() function rather than the linear_reg() function.

log_spec <- logistic_reg() %>%
  set_engine("glm")

log_mod_full <- log_spec %>%
  fit(member ~ store + gender + product_class + pay_method + rating +
        store*gender + pay_method*gender,
      data = train_subset, family = "binomial")

tidy(log_mod_full) %>% 
  kable() %>% 
  kable_styling(c("striped", "responsive"))
term estimate std.error statistic p.value
(Intercept) -0.1695912 0.3994688 -0.4245419 0.6711706
storeB 0.2453408 0.2515595 0.9752794 0.3294217
storeC 0.2404212 0.2501605 0.9610677 0.3365181
genderMale 0.0203514 0.3171467 0.0641702 0.9488347
product_classFashion accessories 0.0552697 0.2425452 0.2278738 0.8197443
product_classFood and beverages 0.3305222 0.2451912 1.3480184 0.1776525
product_classHealth and beauty 0.1288623 0.2520721 0.5112120 0.6092026
product_classHome and lifestyle 0.2538179 0.2497428 1.0163173 0.3094783
product_classSports and travel 0.3327949 0.2463026 1.3511623 0.1766434
pay_methodCredit card 0.2423094 0.2484766 0.9751796 0.3294712
pay_methodEwallet -0.0155909 0.2475790 -0.0629733 0.9497878
rating -0.0226503 0.0419418 -0.5400406 0.5891690
storeB:genderMale -0.3077068 0.3501435 -0.8788020 0.3795086
storeC:genderMale -0.2153436 0.3516872 -0.6123157 0.5403289
genderMale:pay_methodCredit card 0.0563912 0.3549803 0.1588573 0.8737813
genderMale:pay_methodEwallet -0.0468723 0.3479201 -0.1347214 0.8928321

The tricky part about logistic regression is the interpretation. Since we are predicting the log odds of obtaining a customer that is a member, each value in the estimate column above can be interpreted as the change in the log odds of finding a member for a one-unit increase in the respective \(X\) variable.

So if a customer shops at Store C (rather than Store A, the base level), their predicted log odds of being a member increase by 0.240, and therefore their predicted odds of being a member are multiplied by \(e^{0.240} = 1.271\).
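If you would rather read the output on the odds scale directly, one option is to exponentiate the estimate column of the tidy() output; a quick sketch:

# odds ratios: the multiplicative change in the odds for a one-unit increase
tidy(log_mod_full) %>% 
  mutate(odds_ratio = exp(estimate)) %>% 
  select(term, estimate, odds_ratio) %>% 
  kable() %>% 
  kable_styling(c("striped", "responsive"))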

One thing to notice here is that none of the p-values for the estimates are statistically significant at the typical 0.05 level. I mentioned this in the previous guide, but it is important to note the difference between statistical significance and practical significance. It may be that we don’t meet the 0.05 cutoff, but sometimes the p-value may be small enough to still mean something to us. At this moment, we are looking at the different variables that may help us predict if someone is a member or not. We don’t necessarily need statistical significance as long as we are still getting good insight into our question.

We can also explore our variables a little bit further. Using the following plot, we can look at the distribution of each of our variables separated by the member status of a customer. I will be dropping the interaction variables since they didn’t seem to have much of an effect, as expected.

ggbivariate(train, outcome = "customer_type", explanatory = c("store", "gender", "product_class", "pay_method", "rating")) +
  scale_fill_brewer(palette = "Dark2") +
  my_theme

My biggest takeaway here is that we have a pretty even split between members and non-members. In other words, both members and non-members engage in similar transactions. It may not be possible to accurately predict member status using the information provided in this dataset.
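As a quick check of that takeaway, we can count the two levels of member directly:

# proportion of members vs. non-members in the training data
train %>% 
  count(member) %>% 
  mutate(prop = n / sum(n))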

With that being said, I would like to at least try a reduced model to see what the results are. I will use the p-values from above as well as the parameter estimates to guide my decision making and to create a reduced model. Some of these variables don’t seem to be very useful at predicting member status so let’s remove them from the model and try again, this time with just product_class and store.

log_mod_reduced <- log_spec %>%
  fit(member ~ product_class + store, data = train_subset, family = "binomial")

tidy(log_mod_reduced) %>% 
  kable() %>% 
  kable_styling(c("striped", "responsive"))
term estimate std.error statistic p.value
(Intercept) -0.2541328 0.1980084 -1.2834442 0.1993365
product_classFashion accessories 0.0570579 0.2401610 0.2375820 0.8122053
product_classFood and beverages 0.3517674 0.2436186 1.4439266 0.1487596
product_classHealth and beauty 0.1242045 0.2495210 0.4977716 0.6186450
product_classHome and lifestyle 0.2477315 0.2481198 0.9984347 0.3180686
product_classSports and travel 0.3583982 0.2447623 1.4642709 0.1431199
storeB 0.1041356 0.1736868 0.5995593 0.5488000
storeC 0.1333407 0.1742284 0.7653210 0.4440804

This model isn’t much better, but we do see some evidence that certain product classes and stores are more associated with members than others. For instance, if a customer purchases a food/beverage item, the log odds of them being a member increase by about 0.352. We can also see this with sports and travel goods. Perhaps members get a slight discount on items in these areas so they purchase more. I would be interested to explore this idea more, especially if I had more data and could explore purchase tendencies over time.

In general, we don’t seem to do much better than the full model, but there are some small improvements, such as smaller p-values and larger estimates. Let’s stick with this reduced model moving forward.

Assessing the Model

I’m not a huge fan of this model for use in inference and predicting future values. We just don’t have enough information about the customers and transactions in order to make a confident prediction. However, for educational purposes, I would like to see how well the model performs when making predictions. We also should check our assumptions.

log_aug <- augment(log_mod_reduced, type.predict = "response", 
                   type.residuals = "deviance", new_data = train_subset) %>% 
  mutate(id = row_number())

ggplot(data = log_aug, aes(x = .pred_class, y = member)) + 
  geom_jitter() +
  labs(x = "Predicted Member Status", 
       y = "Actual Member Status", 
       title = "Member Status: Predicted vs. Actual",
       subtitle = "Training Data") +
  my_theme

This plot shows us how well the model did. It compares the actual member status with what the model would predict a customer to be. I did add a little bit of jitter to the plot so we could see each observation better (otherwise they all would lie on top of each other).

The ideal situation is 100% prediction accuracy where all of the points lie at (0, 0), meaning we predicted a customer to not be a member and they actually weren’t, or at (1, 1), meaning we predicted a customer to be a member and they actually were. In the real world, we will have some error and will misclassify some observations, which appear at (0, 1) and (1, 0). Looking at this plot, we see that we did in fact predict the member status correctly for many customers!

My main concern is that there is a roughly equal number of observations in each quadrant. The model seems to do little better than simply guessing the member status, in which case we would expect about 50% accuracy. Let’s see the actual numbers for our predictions.

log_aug %>% 
  rename("Predicted" = .pred_class,
         "Actual" = member) %>% 
  tbl_2var(Predicted~Actual,
           caption = "Predicted Member Status vs. Actual Member Status")
Predicted Member Status vs. Actual Member Status
Predicted Missing: 0 | Actual Missing: 0
NAs Removed: No

                 Actual
Predicted     0     1   Total
0           214   199     413
1           183   203     386
Total       397   402     799

We see that we correctly predicted \(214 + 203 = 417\) observations, which means that we incorrectly predicted 382. We were \(\frac{417}{799} \times 100 = 52.19\%\) accurate. Not much better than randomly guessing.
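Rather than tallying by hand, yardstick (loaded with tidymodels) can compute this for us; a minimal sketch using the augmented training data:

# proportion of rows where .pred_class matches the true member status
log_aug %>% 
  accuracy(truth = member, estimate = .pred_class)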

It’s also a good idea to check our prediction accuracy with the testing dataset, although I don’t have much hope.

log_aug_test <- augment(log_mod_reduced, type.predict = "response", 
                        type.residuals = "deviance", new_data = test)

ggplot(data = log_aug_test, aes(x = .pred_class, y = member)) + 
  geom_jitter() +
  labs(x = "Predicted Member Status", 
       y = "Actual Member Status", 
       title = "Member Status: Predicted vs. Actual",
       subtitle = "Testing Data") +
  my_theme

log_aug_test %>% 
  rename("Predicted" = .pred_class,
         "Actual" = member) %>% 
  tbl_2var(Predicted~Actual,
           caption = "Predicted Member Status vs. Actual Member Status")
Predicted Member Status vs. Actual Member Status
Predicted Missing: 0 | Actual Missing: 0
NAs Removed: No

                 Actual
Predicted     0     1   Total
0            53    47     100
1            49    52     101
Total       102    99     201

We see similar results to the training data. We did correctly predict some observations, but there are still quite a few that we missed. To quantify it, we correctly predicted \(53 + 52 = 105\) out of 201 customer member statuses, which gives an accuracy rate of \(\frac{105}{201} \times 100 = 52.24\%\).
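yardstick can also build this confusion matrix for us; a quick sketch on the augmented test data, before we compare some overall performance metrics for the two models:

# rows are predictions, columns are the truth
log_aug_test %>% 
  conf_mat(truth = member, estimate = .pred_class)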

model_performance(log_mod_full) %>% 
  kable()
AIC AICc BIC R2_Tjur RMSE Sigma Log_loss Score_log Score_spherical PCP
1129.119 1129.814 1204.052 0.0130321 0.4967318 1 0.6865573 -280.8741 0.0019635 0.5065354
model_performance(log_mod_reduced) %>% 
  kable()
AIC AICc BIC R2_Tjur RMSE Sigma Log_loss Score_log Score_spherical PCP
1119.157 1119.339 1156.624 0.0055728 0.4985954 1 0.6903358 -279.2792 0.0062764 0.5028059

Our full and reduced models are very similar. The AIC and BIC values differ very little between the two, although both have decreased with the reduced model. We also see a slight increase in the RMSE, which is not ideal, but an increase that small isn’t concerning. I do want to look at the Tjur’s \(R^2\) value, which gives us a “typical” \(R^2\) for regression. This number is technically called the coefficient of discrimination but can be interpreted like a standard \(R^2\). So we explain about 1.3% of the variance in customer_type with our full model and only 0.6% with the reduced model, giving further evidence that our model is not great.
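The performance package can also put the two candidate models side by side with compare_performance(); a sketch, assuming it accepts our tidymodels fits the same way model_performance() did above (if not, extract_fit_engine() would get us the underlying glm objects):

# side-by-side metrics for the full and reduced models
compare_performance(log_mod_full, log_mod_reduced) %>% 
  kable()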

Assumptions

diagnostics <- plot(check_model(log_mod_reduced, panel = FALSE, type = "discrete_dots"))

Variable Types

We met this when we chose variables earlier. Our response variable is binary.

Independence of Observations

We are assuming this to be true, trusting the person/service that collected this data. It is possible, however, that one person is represented more than once or that one transaction influenced another.

No Correlation Between Predictors

We used output from ggbivariate to explore this a bit above. I don’t think we have evidence to say that the predictors are correlated with each other.

However, we can use variance inflation factors (VIFs) to get a better idea of how correlated the predictor variables are.

log_mod_reduced %>% 
  extract_fit_engine() %>% 
  check_collinearity()
## # Check for Multicollinearity
## 
## Low Correlation
## 
##           Term  VIF   VIF 95% CI Increased SE Tolerance Tolerance 95% CI
##  product_class 1.01 [1.00, 5.21]         1.01      0.99     [0.19, 1.00]
##          store 1.01 [1.00, 5.21]         1.01      0.99     [0.19, 1.00]

The output states that we have low correlation between our predictors which is great! We can also verify this by looking at the VIF column. All values are much less than 10, the common cutoff for evidence of multicollinearity.

Linearity Between Log Odds and Predictors

We can check this by plotting each continuous variable against the log odds. I am going to skip a formal check since I am not confident in the model and since I will not be using the model for any inference.
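Still, for reference, here is a sketch of how the check could look for the full model, which contains the continuous predictor rating: plot the predictor against the logit of the fitted probabilities and look for a roughly straight-line pattern.

# empirical logit vs. a continuous predictor (full model, sketch only)
augment(log_mod_full, new_data = train_subset) %>% 
  mutate(logit = log(.pred_1 / (1 - .pred_1))) %>% 
  ggplot(aes(x = rating, y = logit)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "loess") +
  labs(x = "Rating", y = "Log Odds of Being a Member") +
  my_theme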

We might also want to check on our errors.

diagnostics[[5]] +
  my_theme

We definitely don’t meet this condition as our points do not lie along the reference line. We should automatically question the validity of our results.

All Combinations of Variables are Accounted For

This is met. We have actually already verified this by looking at tables and counts. There are no combinations of variables such that we have a zero count. For instance, the number of members who went to Store A is non-zero, as is the number of non-members who bought a food/beverage.
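A quick programmatic check: count every observed combination of the reduced model’s variables and compare against the number of possible combinations (2 member levels × 3 stores × 6 product classes = 36).

# a missing (zero-count) combination would make this less than 36
train_subset %>% 
  count(member, store, product_class) %>% 
  nrow()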

Wrap Up

As we saw throughout this guide, using logistic regression to predict a customer’s member status did not turn out as well as we had hoped. Our accuracy was just about as good as guessing randomly (a 50/50 chance). That being said, I do think we gained some valuable insight into customer tendencies, even if we are not able to accurately predict member status. We were able to look at some of the product classes that members are more likely to purchase and found that members were more likely to make purchases using a credit card. It is also worth noting that the range of transaction ratings is roughly the same regardless of whether someone is a member or not. This could be useful to the company as it allows them to focus on improving the quality of all transactions without needing to worry about whether someone is a member or not. All in all, it is important to remember that logistic regression is best used when the response variable is binary (has exactly two levels) with predictor variables that can be either continuous or categorical.

References

“9 Logistic Regression.” Regression Diagnostics with R. n.d. Accessed April 18, 2024. https://sscc.wisc.edu/sscc/pubs/RegDiag-R/logistic-regression.html.
Dykes, Bradford. n.d. “STA 631.” GitHub. Accessed April 18, 2024. https://github.com/gvsu-sta631.
EmilHvitfeldt. 2021. “Answer to ‘Verify Model Assumptions with Tidymodels’.” Stack Overflow. https://stackoverflow.com/a/69081682/17783912.
James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2013. An Introduction to Statistical Learning: With Applications in R. 1st ed. 2013, Corr. 7th printing 2017 edition. New York: Springer.
Lüdecke, Daniel, Mattan S. Ben-Shachar, Indrajeet Patil, Philip Waggoner, and Dominique Makowski. 2021. “performance: An R Package for Assessment, Comparison and Testing of Statistical Models.” Journal of Open Source Software 6 (60): 3139. https://doi.org/10.21105/joss.03139.
Paye, Aung. n.d. “Supermarket Sales.” Accessed April 18, 2024. https://www.kaggle.com/datasets/aungpyaeap/supermarket-sales.
Polo. 2021. “Verify Model Assumptions with Tidymodels.” Forum post. Stack Overflow. https://stackoverflow.com/q/69074437/17783912.