Welcome

Hey there! This is the fifth guide in my Introduction to Modeling with R series. In the previous guide, we looked at logistic regression, which involved a binary response variable. This page will focus on multinomial regression, an extension of logistic regression. Although I cannot cover all of the details and finer use cases of regression, we will explore some of the key ideas to focus on when creating and analyzing a regression model. We’ll start by importing our data and identifying variables to use in the model, then move to creating, interpreting, and testing the model. As before, we’ll use train/test validation and we’ll also look at a few candidate models.

I will again start by importing our dataset as well as the packages we will be using. As mentioned in the previous guide, I prefer to use the tidyverse family of packages. We’ll also be using the tidymodels collection of packages to set up the models and perform our train/test validation.

The code below imports our packages and the data we cleaned in the previous guide. It also splits the data into a train and test set. We will train our models using the training set and will test their performance using the test set. We do this to simulate how the models will perform in the “real world”.

I will be setting the seed for the random number generator so we get the same results each time I start a new R session and split the data. You may or may not wish to do this in your own work. I also want to make sure each store is represented in roughly equal proportions in both sets, so I will take a stratified sample to make the split (strata = store).

library(tidyverse)
library(tidymodels)
library(lubridate)
library(performance)
library(GGally)
library(gvsu215)
library(car)
library(knitr)
library(kableExtra)

retail <- read_csv(here::here("data/retail_clean.csv"))

# fix the seed so the split is reproducible across R sessions
set.seed(52319)

# 80/20 train/test split, stratified by store
retail_split <- initial_split(retail, prop = .8, strata = store)

train <- training(retail_split)
test <- testing(retail_split)
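Since we stratified by store, a quick, optional sanity check is to compare the store proportions in the two sets; they should be nearly identical. A minimal sketch:

# store proportions should match closely between train and test
train %>% count(store) %>% mutate(prop = n / sum(n))
test %>% count(store) %>% mutate(prop = n / sum(n))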

Alright! Now that we have that all set, we’ll use the train data for all the model creation and will only use the test data when we are analyzing our models.

Multinomial Regression

Although most of the processes for multinomial regression will be the same as or very similar to those for SLR, MLR, and logistic regression, there are a few key differences. The key piece is what our response variable looks like. Remember, for SLR and MLR, our response variable had to be continuous. In logistic regression, the response variable had to be binary (i.e., had to have exactly two levels). What if the response variable has more than two levels?

Requirements

In order to get valid results from multinomial regression, we must meet certain assumptions.

  • The response variable must be categorical with more than two levels. If we had exactly two, we’d be doing logistic regression, and if we had only one, we’d have 100% accuracy.
  • Alternative categories/outcomes are independent
  • Our predictor variable(s) should not be correlated with each other (see the sketch after this list)
  • We must have a linear relationship between any continuous independent variables and the log odds of the response
  • The errors must be independent (no obvious clusters in the data)
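To eyeball the correlation requirement before modeling, one option is a pairs plot of the continuous candidate predictors. This is just a sketch using the GGally package we loaded earlier, and it assumes the candidate variables we settle on below:

# pairwise plots and correlations for the continuous candidate predictors
train %>% 
  select(qty, total, rating) %>% 
  ggpairs()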

Selecting Variables

First, let’s decide what we want to predict. Again, we need a variable that has more than two levels. We have a few choices: store, product_class, or pay_method. I am most interested in seeing if we can predict which store a customer will shop at. This can be useful for a company to help with product ordering. If, for instance, we find that we can predict a customer’s destination based on the product class, we could instruct stores to order more products of certain classes to avoid running out of stock.

Technically, we can split up multinomial regression into simple multinomial regression (one predictor variable) and multiple multinomial regression (more than one predictor variable). However, I am going to jump right into multiple multinomial regression and dive into a model, as predicting with one variable is not very exciting. As I did previously, I am going to start by choosing every variable that I think might play a role in predicting the store a customer will shop at and use the regression output to narrow things down.

I would like to start with:

  • customer_type
  • gender
  • product_class
  • qty
  • total
  • pay_method
  • rating

This time, I also want to create a new variable hour that describes the hour of the day a customer visited a store. The following code creates a subset of the data (and also codes our response variable store, along with qty and hour, as factors).

# recode the response (store) and predictors as factors and extract the
# hour of the transaction time
train_subset <- train %>% 
  mutate(store = factor(store),
         qty = factor(qty),
         hour = factor(hour(time))) %>% 
  select(store, customer_type, gender, product_class, qty, total, pay_method, rating, hour)

# apply the exact same recoding to the test set
test_subset <- test %>% 
  mutate(store = factor(store),
         qty = factor(qty),
         hour = factor(hour(time))) %>% 
  select(store, customer_type, gender, product_class, qty, total, pay_method, rating, hour)
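If you want to double-check the recoding, a quick glimpse() shows the column types and confirms the new hour variable came through as intended:

# verify the factor conversions and the new hour column
glimpse(train_subset)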

Exploring the Variables

Now that we are predicting a response variable with multiple levels, we have to rethink our model equation. Essentially, we are trying to predict the likelihood of a certain store being shopped at given the values of our predictor variables.

Like binomial logistic regression, multinomial regression also requires us to compare against a “base” level. Since our response variable has three levels, we are essentially creating two logistic regression equations, each of which compares one level to the “base” level. Recall that we have three stores: A, B, and C. If we choose store A to be our comparison level, we are actually running logistic regression with the following odds ratios

\[ \frac{Pr(Y_i = \text{Store B})}{Pr(Y_i = \text{Store A})} = \frac{p_{\text{Store B}}}{p_{\text{Store A}}} \] and

\[ \frac{Pr(Y_i = \text{Store C})}{Pr(Y_i = \text{Store A})} = \frac{p_{\text{Store C}}}{p_{\text{Store A}}} \] This can make interpretation tricky as all of our calculations are being done in comparison to Store A.

The math behind predicting this can get pretty involved. However, I would like to show that if we take the log of the above odds, we get a formula like this (which is derived from the more complicated logistic equation):

\[ \begin{equation*} \log\left(\frac{Pr(Y_i = \text{Store B})}{Pr(Y_i = \text{Store A})}\right) = \beta_0 + \beta_1 X_1 + ... \end{equation*} \] where the beta coefficients are estimated specifically for the Store B vs. Store A comparison (the Store C equation gets its own set of coefficients).

This should look pretty familiar. We had a similar situation with logistic regression when we took the log of the odds. Here, too, taking the log of the odds gives us a linear relationship between the predictors and the log odds, or logit.
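If we ever need the individual probabilities rather than the log odds, they can be recovered from the two linear predictors. Writing \(\eta_B\) and \(\eta_C\) for the right-hand sides of the Store B and Store C equations above, we have

\[ Pr(Y_i = \text{Store B}) = \frac{e^{\eta_B}}{1 + e^{\eta_B} + e^{\eta_C}}, \qquad Pr(Y_i = \text{Store A}) = \frac{1}{1 + e^{\eta_B} + e^{\eta_C}} \] with the Store C formula mirroring Store B’s. The three probabilities always sum to one, which is what lets two equations describe a three-level response.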

Before officially building the model, let’s make a few plots to visualize our variables.

train_subset %>% 
  ggplot(aes(x = store, fill = customer_type)) +
  geom_bar() +
  scale_fill_brewer(palette = "Dark2") +
  labs(title = "Visits to Stores by Member Status",
       x = "Store",
       y = "Count",
       fill = "Member Status") +
  my_theme

It looks like we have very similar results per store here. Store A seems to attract more non-members, but this may be because it has more customers overall. We may not see an effect from this variable later on in the model.
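To separate the membership effect from overall traffic, one quick check (a sketch using the same training subset) is to compare within-store proportions rather than raw counts:

# member status as a proportion of each store's transactions
train_subset %>% 
  count(store, customer_type) %>% 
  group_by(store) %>% 
  mutate(prop = n / sum(n))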

train_subset %>% 
  ggplot(aes(x = store, fill = gender)) +
  geom_bar() +
  scale_fill_brewer(palette = "Dark2") +
  labs(title = "Visits to Stores by Gender",
       x = "Store",
       y = "Count",
       fill = "Gender") +
  my_theme

Similar to the above plot, we see more males visiting store A but, again, this could be due to the fact that more customers visit store A (although not by much).

train_subset %>% 
  ggplot(aes(x = store, fill = product_class)) +
  geom_bar() +
  scale_fill_brewer(palette = "Dark2") +
  labs(title = "Store vs. Product Class",
       x = "Store",
       y = "Count",
       fill = "Product Class") +
  my_theme

We may end up seeing some sort of effect in the model with this variable. I’m just eyeballing the plot here, but it looks like Store A sells more Home/Lifestyle and Electronic goods and fewer Fashion Accessories, while Store B sells more Sports/Travel items and Store C sells more Food/Beverages.

train_subset %>% 
  ggplot(aes(x = store, fill = qty)) +
  geom_bar() +
  scale_fill_brewer(palette = "Set3") +
  labs(title = "Store Location vs. Quantity Sold",
       subtitle = "One Unique Item Per Transaction",
       x = "Store",
       y = "Count",
       fill = "Quantity") +
  my_theme

It is hard to tell if there is a pattern here. I do see that Store A has more products sold in the middle quantities (4, 5, 6, 7), while Store B is more even across the board and Store C has more on the higher end. However, this is just a rough impression, and I’m not sure if we will keep this variable.
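When raw counts make patterns hard to see, plotting within-store proportions can help. Here is a sketch that reuses the plot above with position = "fill" so each bar sums to one:

# same plot, but each store's bar shows proportions instead of counts
train_subset %>% 
  ggplot(aes(x = store, fill = qty)) +
  geom_bar(position = "fill") +
  scale_fill_brewer(palette = "Set3") +
  labs(title = "Quantity Mix Within Each Store",
       subtitle = "Proportion of Transactions per Quantity",
       x = "Store",
       y = "Proportion",
       fill = "Quantity") +
  my_theme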

train_subset %>% 
  ggplot(aes(x = total, fill = store)) +
  geom_boxplot() +
  scale_fill_brewer(palette = "Dark2") +
  labs(title = "Visits to Stores by Transaction Total",
       x = "Total",
       y = "Store",
       fill = "Store") +
  my_theme +
  # blank the meaningless y-axis labels after my_theme so the theme
  # doesn't override these settings
  theme(axis.text.y = element_blank(),
        axis.ticks.y = element_blank())

Totals seem to be very similar across stores and skewed right, with the higher totals being less common than lower totals. Store C does seem to have higher purchase totals, but this may be because of a few higher values rather than a general tendency.
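To check whether Store C’s apparent edge is driven by a few large transactions, a quick numeric summary (a sketch) comparing the median to the mean can help; a mean well above the median suggests right skew:

# medians resist outliers, so a large mean/median gap signals skew
train_subset %>% 
  group_by(store) %>% 
  summarize(median_total = median(total),
            mean_total = mean(total),
            max_total = max(total))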

train_subset %>% 
  ggplot(aes(x = store, fill = pay_method)) +
  geom_bar() +
  scale_fill_brewer(palette = "Dark2") +
  labs(title = "Store vs. Pay Method",
       x = "Store",
       y = "Count",
       fill = "Pay Method") +
  my_theme