Welcome

Hey there! This is the fifth guide in my Introduction to Modeling with R series. In the previous guide, we looked at logistic regression which involved a binary response variable. This page will focus on multinomial regression, an extension of logistic regression. Although, I cannot cover all of the details and fine use cases of regression, we will explore some of the key ideas to focus on when creating and analyzing a regression model. We’ll start by importing our data and identifying variables to use in the model then move to creating, interpreting, and testing the model. As before, we’ll use train/test validation and we’ll also look at a few candidate models.

I again will start by importing our dataset as well as the packages we will be using. As mentioned in the previous guide, I prefer to use the tidyverse family of packages. We’ll also be using the tidymodels collection of packages to set up the models and perform our train/test validation.

The code below imports our packages and the data we cleaned in the previous guide. It also splits the data into a train and test set. We will train our models using the training set and will test its performance using the test set. We do this to simulate how the model will perform in the “real world”.

I will be setting the seed for the random generator so we can get the same results each time I start a new R session and split the data. You may or may not wish to do this in your work. I want to make sure we get a roughly equal amount of each store in the split sets so I will take a stratified sample to make the split (strata = store).

library(tidyverse)
library(tidymodels)
library(lubridate)
library(performance)
library(GGally)
library(gvsu215)
library(car)
library(knitr)
library(kableExtra)

retail <- read_csv(here::here("data/retail_clean.csv"))

set.seed(52319)
retail_split <- initial_split(retail, prop = .8, strata = store)

train <- training(retail_split)
test <- testing(retail_split)

Alright! Now that we have that all set, we’ll use the train data for all the model creation and will only use the test data when we are analyzing our models.

Multinomial Regression

Although most of the processes for multinomial regression will be the same as or very similar to SLR, MLR, ad logistic regression, there are a few key differences. The key piece is what our response variable looks like. Remember, for SLR and MLR, our response variable had to be continuous. In logistic regression, the response variable had to be binary (or had to have two levels). What if the response variable has more than 2 levels?

Requirements

In order to get valid results from multinomial regression, we must meet certain assumptions.

  • The response variable must be categorical with more than two levels. If we have exactly two, we’d be doing logistic regression and if we had one, we’d have 100% accuracy.
  • Alternative categories/outcomes are independent
  • Our predictor variable(s) should not be correlated with each other
  • We must have a linear relationship between any continuous independent variables and the log odds of the response
  • The errors must be independent (no obvious clusters in the data)

Selecting Variables

First, let’s decide what we want to predict. Again, we need a variable that has more than two levels. We have a few choices: store, product_class, or pay_method. I am most interested in looking to see if we can predict which store a customer will shop at. This can be useful for a company to help in product ordering. If, for instance, we find that we can predict a customer’s destination based on the product class, we could instruct stores to order more products of certain classes to avoid out of stocks.

Technically, we can split up multinomial regression into simple multinomial regression (one predictor variable) and multiple multinomial regression (more than one predictor variable). However, I am going to jump right into multiple multinomial regression and dive into a model as predicting with one variable is not very exciting. As I did previously, I am going to start by choosing every variable that I think might play a role in predicting the gender of a customer and use the regression output to narrow things down.

I would like to start with:

  • customer_type
  • gender
  • product_class
  • qty
  • total
  • pay_method
  • rating

This time, I also want to create a new variable hour that describes the hour of the day a customer visited a store. The following code creates a subset of the data (and also codes our response variable as a factor).

train_subset <- train %>% 
  mutate(store = factor(store),
         qty = factor(qty),
         hour = factor(hour(time))) %>% 
  select(store, customer_type, gender, product_class, qty, total, pay_method, rating, hour)

test_subset <- test %>% 
  mutate(store = factor(store),
         qty = factor(qty),
         hour = factor(hour(time))) %>% 
  select(store, customer_type, gender, product_class, qty, total, pay_method, rating, hour)

Exploring the Variables

Now that we are predicting a response variable with multiple levels, we have to rethink our model equation. Essentially, we are trying to predict the likelihood of a certain store being shopped at given the values of our predictor variables.

Like binomial regression, multinomial regression also requires us to compare against a “base” level. Since our response variable has three levels, we are essentially creating two logistic regression equations which allow us to compare to the “base” level. Recall that we have three stores: A, B, and C. If we choose store A to be our comparison level, we are actually running logistic regression with the following odds ratio

\[ \frac{Pr(Y_i = \text{Store B})}{Pr(Y_i = \text{Store A})} = \frac{p_{\text{Store B}}}{p_{\text{Store A}}} \] and

\[ \frac{Pr(Y_i = \text{Store C})}{Pr(Y_i = \text{Store A})} = \frac{p_{\text{Store C}}}{p_{\text{Store A}}} \] This can make interpretation tricky as all of our calculations are being done in comparison to Store A.

The math behind predicting this can get pretty involved. However, I would like to show that if we take the log of the above odds, we get a formula like this (which is derived from the more complicated logistic equation):

\[ \begin{equation*} \log\left(\frac{Pr(Y_i = \text{Store B})}{Pr(Y_i = \text{Store A})}\right) = \beta_0 + \beta_1 X_1 + ... \end{equation*} \] where the predictions for the beta coefficients are derived from the data from Store B.

This should look pretty familiar. We had a similar situation with logistic regression when we took the log of the odds. Here too we can take the log of the odds and we have a linear relationship between the predictors and the log odds, or logit.

Before officially building the model, let’s make a few plots to visualize our variables.

train_subset %>% 
  ggplot(aes(x = store, fill = customer_type)) +
  geom_bar() +
  scale_fill_brewer(palette = "Dark2") +
  labs(title = "Visits to Stores by Member Status",
       x = "Store",
       y = "Count",
       fill = "Member Status") +
  my_theme

It looks like we have very similar results per store here. Store A seems to attract more non-members but this may be because it has more customers overall. We may not see an effect here later on in the model.

train_subset %>% 
  ggplot(aes(x = store, fill = gender)) +
  geom_bar() +
  scale_fill_brewer(palette = "Dark2") +
  labs(title = "Visits to Stores by Gender",
       x = "Store",
       y = "Count",
       fill = "Gender") +
  my_theme

Similar to the above plot, we see more males visiting store A but, again, this could be due to the fact that more customers visit store A (although not by much).

train_subset %>% 
  ggplot(aes(x = store, fill = product_class)) +
  geom_bar() +
  scale_fill_brewer(palette = "Dark2") +
  labs(title = "Store vs. Product Class",
       x = "Store",
       y = "Count",
       fill = "Product Class") +
  my_theme

We may end up seeing some sort of effect in the model with this variable. I’m just eyeballing the plot here but it looks like Store A sells more Home/Lifestyle and Electronic goods and less Fashion Accessories while Store B sells more Sports/Travel items and Store C sells more Food/Beverages.

train_subset %>% 
  ggplot(aes(x = store, fill = qty)) +
  geom_bar() +
  scale_fill_brewer(palette = "Set3") +
  labs(title = "Store Location vs. Quantity Sold",
       subtitle = "One Unique Item Per Transaction",
       x = "Store",
       y = "Count",
       fill = "Quantity") +
  my_theme

It is hard to tell if there is a pattern here. I do see that Store A has more products sold in the middle areas (4, 5, 6, 7) while Store B is more even across the board and Store C has more on the higher end. However this is just a generality and I’m not sure if we will keep this variable.

train_subset %>% 
  ggplot(aes(x = total, fill = store)) +
  geom_boxplot() +
  scale_fill_brewer(palette = "Dark2") +
  labs(title = "Visits to Stores by Transaction Total",
       x = "Total",
       y = "Store",
       fill = "Store") +
  theme(axis.text.y = element_blank(),
        axis.ticks.y = element_blank()) +
  my_theme

Totals seem to be very similar across stores and skewed right with the higher totals being less common than lower totals. Store C does seem to have higher purchase totals but this may be because of a few higher values rather than an general tendency.

train_subset %>% 
  ggplot(aes(x = store, fill = pay_method)) +
  geom_bar() +
  scale_fill_brewer(palette = "Dark2") +
  labs(title = "Store vs. Pay Method",
       x = "Store",
       y = "Count",
       fill = "Pay Method") +
  my_theme

Once again, results are similar across stores; however, I do see that Store A has more customers paying with Ewallet, Store B more with credit cards, and Store C more with cash. We will see if this difference is large enough to pop up in our model, but this may be useful information regardless.

train_subset %>% 
  ggplot(aes(x = rating, fill = store)) +
  geom_boxplot() +
  scale_fill_brewer(palette = "Dark2") +
  labs(title = "Visits to Stores by Transaction Rating",
       x = "Rating",
       y = "Store",
       fill = "Store") +
  theme(axis.text.y = element_blank(),
        axis.ticks.y = element_blank()) +
  my_theme

It looks like all stores have high and low ratings but, on average Store B is rated lower than the other two. It is unlikely that this will have a big effect in the model but I think this helps the company realize they should focus on increasing customer satisfaction at all stores, especially Store B.

train_subset %>% 
  ggplot(aes(x = store, fill = hour)) +
  geom_bar() +
  scale_fill_brewer(palette = "Set3") +
  labs(title = "Store vs. Hour Visited",
       x = "Store",
       y = "Count",
       fill = "Hour") +
  my_theme

It looks like all stores are open between the hours of 10am and 8pm. Store A attracts customers right at open and near the early afternoon but becomes less busy as the day goes on. Store B starts off pretty slow, gets a small rush early afternoon and finishes the day strong with the most customers arriving around 6 and 7 in the evening. Store C has roughly the same amount of traffic all day with small peaks at noon and 5pm. This information can be vital to the company to help them better prepare for rushes and make sure the shelves are stocked in preparation.

To wrap up, it’s a good idea to look at all of the variables together (as we have done before) to look for any potential interaction terms or transformations (e.g., square root, log, square, cube, etc.).

ggpairs(train_subset) +
  my_theme

We are using the total variable in this model so we will want to apply a transformation to it here, such as the square root. I’d also like to explore a potential interaction between product class and payment method.

Building the Model

Now that we have an understanding of what the multinomial model looks like, let’s use R to build one!

train_subset <- train_subset %>% 
  mutate(sqrt_total = sqrt(total))
test_subset <- test_subset %>% 
  mutate(sqrt_total = sqrt(total))

multi_spec <- multinom_reg() %>%
  set_engine("nnet")

multi_mod_full <- multi_spec %>%
  fit(store ~ customer_type + gender + product_class + qty + sqrt_total + pay_method + rating + hour + product_class*pay_method, data = train_subset) %>% 
  repair_call(data = train_subset) # this is necessary because we are not using cross-validation

tidy(multi_mod_full) %>% 
  kable() %>% 
  kable_styling(c("striped", "responsive"), fixed_thead = TRUE)
y.level term estimate std.error statistic p.value
B (Intercept) 0.1230925 0.6639102 0.1854054 0.8529111
B customer_typeNormal -0.1022425 0.1806509 -0.5659671 0.5714161
B genderMale -0.2173637 0.1826443 -1.1900926 0.2340100
B product_classFashion accessories 0.5175410 0.5569480 0.9292448 0.3527622
B product_classFood and beverages -0.9130753 0.5954924 -1.5333115 0.1251991
B product_classHealth and beauty -0.7073235 0.5587781 -1.2658398 0.2055704
B product_classHome and lifestyle -0.9997502 0.5450608 -1.8341992 0.0666244
B product_classSports and travel 0.0762935 0.4935285 0.1545878 0.8771463
B qty2 0.4222929 0.4121049 1.0247221 0.3054943
B qty3 0.2278580 0.4128217 0.5519527 0.5809808
B qty4 0.2963050 0.4139034 0.7158796 0.4740657
B qty5 -0.3161466 0.4301869 -0.7349051 0.4623973
B qty6 0.3926758 0.4539686 0.8649846 0.3870472
B qty7 -0.1393764 0.4667160 -0.2986321 0.7652208
B qty8 0.5507095 0.5071525 1.0858855 0.2775297
B qty9 0.6045089 0.5060219 1.1946298 0.2322317
B qty10 0.0715738 0.4979165 0.1437465 0.8857006
B sqrt_total -0.0042119 0.0197610 -0.2131419 0.8312163
B pay_methodCredit card -0.4048292 0.5226017 -0.7746419 0.4385513
B pay_methodEwallet -0.4762357 0.5294266 -0.8995311 0.3683699
B rating -0.0167801 0.0526306 -0.3188279 0.7498570
B hour11 0.2861487 0.3999898 0.7153900 0.4743681
B hour12 -0.2638694 0.4159754 -0.6343389 0.5258597
B hour13 0.2386122 0.3928605 0.6073714 0.5436045
B hour14 0.4756167 0.4222767 1.1263154 0.2600320
B hour15 0.2044245 0.4001754 0.5108372 0.6094651
B hour16 -0.1895534 0.4403074 -0.4305024 0.6668302
B hour17 -0.0257888 0.4319457 -0.0597039 0.9523915
B hour18 0.2614289 0.3980759 0.6567313 0.5113537
B hour19 0.7217362 0.3941421 1.8311575 0.0670770
B hour20 0.3407303 0.4357730 0.7818986 0.4342742
B product_classFashion accessories:pay_methodCredit card -0.2332280 0.7788595 -0.2994481 0.7645982
B product_classFood and beverages:pay_methodCredit card 1.2963956 0.7825805 1.6565652 0.0976074
B product_classHealth and beauty:pay_methodCredit card 1.5547850 0.7905627 1.9666814 0.0492199
B product_classHome and lifestyle:pay_methodCredit card 1.0702993 0.7905656 1.3538399 0.1757875
B product_classSports and travel:pay_methodCredit card 0.4670240 0.7256503 0.6435937 0.5198390
B product_classFashion accessories:pay_methodEwallet -0.0881054 0.7555997 -0.1166033 0.9071745
B product_classFood and beverages:pay_methodEwallet 1.0499598 0.8043774 1.3053074 0.1917882
B product_classHealth and beauty:pay_methodEwallet 1.1756288 0.7756213 1.5157253 0.1295888
B product_classHome and lifestyle:pay_methodEwallet 1.0617632 0.7560930 1.4042759 0.1602367
B product_classSports and travel:pay_methodEwallet 0.0704697 0.7370418 0.0956116 0.9238291
C (Intercept) 0.3740907 0.6443545 0.5805666 0.5615326
C customer_typeNormal -0.1035452 0.1813948 -0.5708279 0.5681163
C genderMale -0.3161325 0.1837407 -1.7205360 0.0853351
C product_classFashion accessories -0.0406714 0.5478844 -0.0742335 0.9408246
C product_classFood and beverages -0.0241859 0.4875570 -0.0496064 0.9604361
C product_classHealth and beauty -0.5513391 0.5070656 -1.0873133 0.2768984
C product_classHome and lifestyle -1.0130234 0.5046434 -2.0074045 0.0447066
C product_classSports and travel -0.9204218 0.5189377 -1.7736654 0.0761185
C qty2 0.0864847 0.4047179 0.2136913 0.8307878
C qty3 -0.4552864 0.4236196 -1.0747527 0.2824854
C qty4 -0.3009137 0.4107135 -0.7326609 0.4637653
C qty5 -0.7120384 0.4271518 -1.6669447 0.0955254
C qty6 -0.3797564 0.4609512 -0.8238537 0.4100227
C qty7 -0.2263241 0.4472796 -0.5060014 0.6128557
C qty8 0.0489264 0.4954365 0.0987542 0.9213335
C qty9 -0.2185088 0.5059906 -0.4318437 0.6658550
C qty10 -0.1735220 0.4873218 -0.3560728 0.7217860
C sqrt_total 0.0186506 0.0197400 0.9448127 0.3447545
C pay_methodCredit card -1.2569920 0.5514493 -2.2794334 0.0226413
C pay_methodEwallet -0.6660243 0.5080869 -1.3108470 0.1899094
C rating 0.0258671 0.0527362 0.4905001 0.6237800
C hour11 -0.2368580 0.4189176 -0.5654046 0.5717986
C hour12 0.1739025 0.3855400 0.4510623 0.6519446
C hour13 0.1750183 0.3875151 0.4516425 0.6515265
C hour14 0.3370850 0.4201652 0.8022676 0.4223982
C hour15 0.1315039 0.3923755 0.3351482 0.7375133
C hour16 0.0809160 0.4134694 0.1957000 0.8448450
C hour17 0.1459642 0.4111131 0.3550465 0.7225548
C hour18 -0.3088176 0.4141024 -0.7457517 0.4558174
C hour19 0.2246624 0.4036445 0.5565847 0.5778112
C hour20 0.2964526 0.4234660 0.7000622 0.4838884
C product_classFashion accessories:pay_methodCredit card 1.0148245 0.7965850 1.2739688 0.2026745
C product_classFood and beverages:pay_methodCredit card 0.2753167 0.7682468 0.3583702 0.7200663
C product_classHealth and beauty:pay_methodCredit card 1.8845297 0.7843329 2.4027167 0.0162738
C product_classHome and lifestyle:pay_methodCredit card 1.7588866 0.7890985 2.2289824 0.0258151
C product_classSports and travel:pay_methodCredit card 1.6970046 0.7735810 2.1936998 0.0282570
C product_classFashion accessories:pay_methodEwallet 0.2511211 0.7436477 0.3376882 0.7355981
C product_classFood and beverages:pay_methodEwallet 0.1895658 0.7175306 0.2641920 0.7916320
C product_classHealth and beauty:pay_methodEwallet 0.4636049 0.7537622 0.6150546 0.5385187
C product_classHome and lifestyle:pay_methodEwallet 0.7819534 0.7280234 1.0740773 0.2827881
C product_classSports and travel:pay_methodEwallet 0.5345029 0.7740678 0.6905117 0.4898725
model_performance(multi_mod_full) %>% 
  kable()
## Can't calculate log-loss.
AIC AICc BIC R2 R2_adjusted RMSE Sigma
1843.07 1862.081 2227.105 0.043475 0.0423356 0.4604025 1.530294

This gives us a huge output. Remember, we are really running two regression models at the same time, comparing Store B to Store A and Store C to Store A. So, for example, if you purchase a home/lifestyle good, we would predict your log odds of shopping at Store B to be 1.00 compared to Store A, holding all other variables constant.

We will certainly not be using this model as our final model. There are way too many variables and too information to consider. Moreover, it doesn’t look like all of these variables will have equal use in predicting the odds of shopping at a certain store. As with logistic regression, I will not be relying solely on p-values to determine which variables to keep. I am going to be looking at them to get an idea of how influential a variable might be, but I will also be thinking about which variables might have practical effect on the results.

Looking through, I see the following:

  • Store B vs. Store A
    • Gender could play a role here
    • Some of the product classes have some influence but not all (this variable makes sense to include, however)
    • Some of the quantity variables seem to have an effect but I’m not sure it’s enough to merit keeping the variable
    • I am on the fence about payment method. I really want it to be useful but I’m not too sure it is.
    • For the most part, hour doesn’t say much compared to Store A. However, we do have a highly significant p-value here which makes me think that hour plays some role.
    • It looks like we may actually have an interaction between product class and payment method. A lot of these variables have small p-values and larger estimates.
  • Store C vs. Store A
    • Gender plays a role here
    • Same situation with product class. This variable makes logical sense to keep in the model but some of the categories just don’t show a big difference compared to Store A
    • For the most part, quantity plays little role in predicting the log odds of being in Store C vs. Store A. However, we do see that if you are buying 5 of an item, your predicted log odds decreases by 0.672.
    • Same situation with payment method.
    • Hour could say something, but not a lot here.
    • Similar situation with the interaction variable.

You may have noticed that a lot of our observations from looking at plots above have appeared in the results here. This is why we run regression. Sure, we can visualize trends but it’s nice to have a statistical test to tell us how big of an effect the things we have observed are having on the outcome.

I now want to build a reduced model by removing some of the variables. I think we can remove customer_type as there didn’t seem to be much of a difference between stores. I also want to remove qty. After thinking about it, it probably doesn’t make sense to predict which store someone will shop at based on a variable that is determined while they are shopping. I also am taking out the rating and log_total variables. We saw earlier that ratings and totals were across the board for all stores and now we see that it makes a marginal difference in the log odds. I will be keeping the interaction in the model and therefore should also keep the two variables that make up that interaction: product class and pay method.

multi_mod_reduced <- multi_spec %>%
  fit(store ~ gender + product_class + pay_method + hour + product_class*pay_method, data = train_subset) %>% 
  repair_call(data = train_subset) # this is necessary because we are not using cross-validation

tidy(multi_mod_reduced) %>% 
  kable() %>% 
  kable_styling(c("striped", "responsive"), fixed_thead = TRUE)
y.level term estimate std.error statistic p.value
B (Intercept) 0.0324376 0.4418813 0.0734079 0.9414815
B genderMale -0.2181609 0.1792318 -1.2171993 0.2235284
B product_classFashion accessories 0.4593263 0.5475887 0.8388162 0.4015724
B product_classFood and beverages -0.8620287 0.5859534 -1.4711556 0.1412490
B product_classHealth and beauty -0.7111274 0.5530655 -1.2857923 0.1985156
B product_classHome and lifestyle -0.8941556 0.5392373 -1.6581859 0.0972799
B product_classSports and travel 0.0567547 0.4868340 0.1165792 0.9071936
B pay_methodCredit card -0.3808298 0.5140045 -0.7409076 0.4587495
B pay_methodEwallet -0.4487375 0.5224286 -0.8589451 0.3903708
B hour11 0.3106743 0.3936037 0.7893075 0.4299323
B hour12 -0.1583406 0.4081229 -0.3879728 0.6980361
B hour13 0.2742805 0.3868920 0.7089330 0.4783660
B hour14 0.5601527 0.4138768 1.3534284 0.1759188
B hour15 0.2418397 0.3940304 0.6137589 0.5393746
B hour16 -0.1627922 0.4351202 -0.3741315 0.7083065
B hour17 0.0052286 0.4265025 0.0122591 0.9902189
B hour18 0.2705395 0.3932515 0.6879553 0.4914809
B hour19 0.7615430 0.3852854 1.9765687 0.0480904
B hour20 0.3738848 0.4307943 0.8678964 0.3854511
B product_classFashion accessories:pay_methodCredit card -0.1761060 0.7691999 -0.2289470 0.8189101
B product_classFood and beverages:pay_methodCredit card 1.1505224 0.7718011 1.4906981 0.1360408
B product_classHealth and beauty:pay_methodCredit card 1.5352941 0.7806878 1.9665915 0.0492303
B product_classHome and lifestyle:pay_methodCredit card 1.0415556 0.7813768 1.3329749 0.1825401
B product_classSports and travel:pay_methodCredit card 0.4056389 0.7134346 0.5685720 0.5696466
B product_classFashion accessories:pay_methodEwallet -0.0546555 0.7459788 -0.0732669 0.9415938
B product_classFood and beverages:pay_methodEwallet 1.0396773 0.7937828 1.3097756 0.1902718
B product_classHealth and beauty:pay_methodEwallet 1.1450831 0.7681274 1.4907463 0.1360281
B product_classHome and lifestyle:pay_methodEwallet 0.9440606 0.7445859 1.2679001 0.2048336
B product_classSports and travel:pay_methodEwallet 0.0665056 0.7240331 0.0918544 0.9268137
C (Intercept) 0.5089231 0.4190068 1.2145939 0.2245210
C genderMale -0.2890749 0.1802066 -1.6041308 0.1086852
C product_classFashion accessories -0.0206193 0.5404320 -0.0381534 0.9695654
C product_classFood and beverages 0.0164166 0.4761684 0.0344765 0.9724972
C product_classHealth and beauty -0.5296649 0.5018916 -1.0553372 0.2912711
C product_classHome and lifestyle -0.9393242 0.4982635 -1.8851956 0.0594034
C product_classSports and travel -0.9334757 0.5128043 -1.8203350 0.0687080
C pay_methodCredit card -1.2504315 0.5448502 -2.2950005 0.0217331
C pay_methodEwallet -0.6574219 0.5018185 -1.3100789 0.1901691
C hour11 -0.1735148 0.4136739 -0.4194483 0.6748886
C hour12 0.2236018 0.3804245 0.5877691 0.5566873
C hour13 0.1796662 0.3828059 0.4693403 0.6388264
C hour14 0.4229653 0.4137510 1.0222703 0.3066530
C hour15 0.1710339 0.3867843 0.4421945 0.6583485
C hour16 0.1099455 0.4087028 0.2690108 0.7879214
C hour17 0.1695118 0.4057528 0.4177711 0.6761145
C hour18 -0.2904513 0.4100855 -0.7082701 0.4787776
C hour19 0.2716885 0.3956510 0.6866874 0.4922798
C hour20 0.3381913 0.4182171 0.8086500 0.4187165
C product_classFashion accessories:pay_methodCredit card 1.0770044 0.7893959 1.3643400 0.1724606
C product_classFood and beverages:pay_methodCredit card 0.2050424 0.7584138 0.2703569 0.7868857
C product_classHealth and beauty:pay_methodCredit card 1.8646395 0.7757981 2.4035115 0.0162385
C product_classHome and lifestyle:pay_methodCredit card 1.6587394 0.7787565 2.1299846 0.0331729
C product_classSports and travel:pay_methodCredit card 1.8085889 0.7639346 2.3674656 0.0179104
C product_classFashion accessories:pay_methodEwallet 0.2596970 0.7356182 0.3530323 0.7240642
C product_classFood and beverages:pay_methodEwallet 0.1482658 0.7080370 0.2094040 0.8341328
C product_classHealth and beauty:pay_methodEwallet 0.4181683 0.7464316 0.5602232 0.5753272
C product_classHome and lifestyle:pay_methodEwallet 0.7349214 0.7168877 1.0251555 0.3052898
C product_classSports and travel:pay_methodEwallet 0.5216228 0.7628310 0.6837986 0.4941024
model_performance(multi_mod_reduced) %>% 
  kable()
## Can't calculate log-loss.
## Can't calculate proper scoring rules for ordinal, multinomial or
##   cumulative link models.
AIC AICc BIC R2 R2_adjusted RMSE Sigma
1813.663 1822.912 2085.298 0.0328826 0.0317433 0.4629543 1.513619

This still gives us a large model with many variables. Part of this is due to how many levels the hour variable and the interaction variable contains. I hesitate to remove hour since some of the hours seem to play a role in predicting the log odds of going to a certain store. Before moving forward with this model (which does seem to be a slight improvement from the full model as we have lowered the AIC and BIC), I would like to fit one more model without hour to see what happens.

multi_mod_nohour <- multi_spec %>%
  fit(store ~ gender + product_class + pay_method + product_class*pay_method, data = train_subset) %>% 
  repair_call(data = train_subset) # this is necessary because we are not using cross-validation

tidy(multi_mod_nohour) %>% 
  kable() %>% 
  kable_styling(c("striped", "responsive"), fixed_thead = TRUE)
y.level term estimate std.error statistic p.value
B (Intercept) 0.2785966 0.3604863 0.7728355 0.4396198
B genderMale -0.1954871 0.1758289 -1.1118031 0.2662228
B product_classFashion accessories 0.4572940 0.5408788 0.8454648 0.3978514
B product_classFood and beverages -0.8899751 0.5806029 -1.5328464 0.1253137
B product_classHealth and beauty -0.6739486 0.5483441 -1.2290615 0.2190487
B product_classHome and lifestyle -0.9709108 0.5327123 -1.8225800 0.0683670
B product_classSports and travel -0.0245208 0.4810784 -0.0509705 0.9593491
B pay_methodCredit card -0.4584183 0.5083734 -0.9017353 0.3671975
B pay_methodEwallet -0.4450654 0.5185653 -0.8582630 0.3907472
B product_classFashion accessories:pay_methodCredit card -0.0397798 0.7565471 -0.0525808 0.9580659
B product_classFood and beverages:pay_methodCredit card 1.3344234 0.7633087 1.7482093 0.0804278
B product_classHealth and beauty:pay_methodCredit card 1.5116425 0.7718664 1.9584251 0.0501801
B product_classHome and lifestyle:pay_methodCredit card 1.1559684 0.7708827 1.4995386 0.1337340
B product_classSports and travel:pay_methodCredit card 0.4938071 0.7056887 0.6997520 0.4840822
B product_classFashion accessories:pay_methodEwallet -0.0915018 0.7382811 -0.1239389 0.9013637
B product_classFood and beverages:pay_methodEwallet 1.0967486 0.7882034 1.3914537 0.1640879
B product_classHealth and beauty:pay_methodEwallet 1.0916698 0.7622508 1.4321661 0.1520963
B product_classHome and lifestyle:pay_methodEwallet 1.0178712 0.7368889 1.3813089 0.1671840
B product_classSports and travel:pay_methodEwallet 0.1440879 0.7157792 0.2013021 0.8404623
C (Intercept) 0.6010893 0.3408282 1.7636135 0.0777971
C genderMale -0.2712891 0.1772239 -1.5307703 0.1258262
C product_classFashion accessories 0.0447680 0.5356529 0.0835765 0.9333932
C product_classFood and beverages 0.0491046 0.4715041 0.1041446 0.9170546
C product_classHealth and beauty -0.5121843 0.4974753 -1.0295673 0.3032132
C product_classHome and lifestyle -0.9671051 0.4924182 -1.9639912 0.0495311
C product_classSports and travel -0.9112194 0.5086908 -1.7913033 0.0732446
C pay_methodCredit card -1.2345571 0.5413370 -2.2805701 0.0225739
C pay_methodEwallet -0.6430008 0.4989594 -1.2886837 0.1975081
C product_classFashion accessories:pay_methodCredit card 0.9859724 0.7793577 1.2651090 0.2058322
C product_classFood and beverages:pay_methodCredit card 0.1927078 0.7525470 0.2560741 0.7978936
C product_classHealth and beauty:pay_methodCredit card 1.8507962 0.7688588 2.4071992 0.0160754
C product_classHome and lifestyle:pay_methodCredit card 1.7259267 0.7706736 2.2395041 0.0251231
C product_classSports and travel:pay_methodCredit card 1.8075592 0.7599656 2.3784749 0.0173844
C product_classFashion accessories:pay_methodEwallet 0.2305947 0.7303134 0.3157475 0.7521941
C product_classFood and beverages:pay_methodEwallet 0.1419276 0.7040976 0.2015737 0.8402500
C product_classHealth and beauty:pay_methodEwallet 0.4735933 0.7417994 0.6384385 0.5231883
C product_classHome and lifestyle:pay_methodEwallet 0.7935966 0.7103644 1.1171684 0.2639223
C product_classSports and travel:pay_methodEwallet 0.4951504 0.7572145 0.6539103 0.5131696
model_performance(multi_mod_nohour) %>% 
  kable()
## Can't calculate log-loss.
## Can't calculate proper scoring rules for ordinal, multinomial or
##   cumulative link models.
AIC AICc BIC R2 R2_adjusted RMSE Sigma
1790.054 1793.954 1968.022 0.0235453 0.022406 0.4654612 1.50079

Well…I’m not upset about this one! We’ve lowered the AIC and BIC once again although our RMSE has increased only slightly. The \(R^2\) has gone down but it wasn’t very large at all to begin with. At this point, I’m not really sure which model is “better”. It is very nice to remove hour and lose a large number of levels seen in the model. But there were a few levels of hour that seemed to play a role in the model and I hesitate just dropping it.

After thinking about it, I am going to move forward with both of the models. I will look at our model accuracy on both of the models first and then pick the model with the highest prediction accuracy.

Assessing the Model

Let’s see how we did!

Reduced Model

multi_aug <- augment(multi_mod_reduced, new_data = train_subset)

ggplot(data = multi_aug, aes(x = .pred_class, y = store)) + 
  geom_point() + 
  geom_jitter() +
  labs(x = "Predicted Store", 
       y = "Actual Store ", 
       title = "Store Visited: Predicted vs. Actual",
       subtitle = "Reduced Model | Training Data") +
  my_theme

Well, the reduced model didn’t do terribly but also not the best. We want to see the most dots as possible along the diagonal from the bottom left to the top right. This would mean we predicted a customer would go to a certain store and they actually did. We do see that we correctly predicted a nice handful of observations. That being said, there are quite a few that we did not predict correctly. We have about 30-40 observations in all possible combinations of stores meaning that we made some wrong predictions in every way possible.

Let’s see it in table form.

multi_aug %>% 
  rename("Predicted" = .pred_class,
         "Actual" = store) %>% 
  tbl_2var(Predicted~Actual,
           caption = "Predicted Store Visited vs. Actual Store Visited \n Reduced Model")
Predicted Store Visited vs. Actual Store Visited
Reduced Model
Predicted Missing: 0 | Actual Missing: 0
NAs Removed: No

Actual

Predicted

A

B

C

Total

A

117

85

78

280

B

80

122

72

274

C

75

58

112

245

Total

272

265

262

799

Officially, we correctly predicted 351 out of the 799 observations for a score of 43.93%. Not great but better than 33% which is where I would expect to be if we were just guessing randomly!

We should also check with the testing dataset.

multi_aug_test <- augment(multi_mod_reduced, new_data = test_subset)

ggplot(data = multi_aug_test, aes(x = .pred_class, y = store)) + 
  geom_point() + 
  geom_jitter(width = 0.35, height = 0.35) +
  labs(x = "Predicted Store", 
       y = "Actual Store ", 
       title = "Store Visited: Predicted vs. Actual",
       subtitle = "Reduced Model | Testing Data") +
  my_theme

multi_aug_test %>% 
  rename("Predicted" = .pred_class,
         "Actual" = store) %>% 
  tbl_2var(Predicted~Actual,
           caption = "Predicted Store Visited vs. Actual Store Visited (Testing) \n No Hour Model")
Predicted Store Visited vs. Actual Store Visited (Testing)
No Hour Model
Predicted Missing: 0 | Actual Missing: 0
NAs Removed: No

Actual

Predicted

A

B

C

Total

A

25

26

17

68

B

22

28

23

73

C

21

13

26

60

Total

68

67

66

201

We see similar results. We have correctly predicted 79 out of the 201 observations for a score of 39.3%. This is worse than the training data (which is expected) but is also approaching the 33% realm which is what would expect to get if we just guessed randomly.

The “No Hour” Model

multi_aug_nohour <- augment(multi_mod_nohour, new_data = train_subset)

ggplot(data = multi_aug_nohour, aes(x = .pred_class, y = store)) + 
  geom_point() + 
  geom_jitter() +
  labs(x = "Predicted Store", 
       y = "Actual Store ", 
       title = "Store Visited: Predicted vs. Actual",
       subtitle = "No Hour Model | Training Data") +
  my_theme

Just looking at this plot, I don’t see any striking evidence that we did any better or worse by taking hour out of the model. We still have plenty of observations where we incorrectly predict their store value.

multi_aug_nohour %>% 
  rename("Predicted" = .pred_class,
         "Actual" = store) %>% 
  tbl_2var(Predicted~Actual,
           caption = "Predicted Store Visited vs. Actual Store Visited \n No Hour Model")
Predicted Store Visited vs. Actual Store Visited
No Hour Model
Predicted Missing: 0 | Actual Missing: 0
NAs Removed: No

Actual

Predicted

A

B

C

Total

A

113

82

76

271

B

85

111

83

279

C

74

72

103

249

Total

272

265

262

799

We have less correct predictions here with a score of 327 out of 799 which gives a 40.93% accuracy.

multi_aug_test_nohour <- augment(multi_mod_nohour, new_data = test_subset)

ggplot(data = multi_aug_test_nohour, aes(x = .pred_class, y = store)) + 
  geom_point() + 
  geom_jitter(width = 0.35, height = 0.35) +
  labs(x = "Predicted Store", 
       y = "Actual Store ", 
       title = "Store Visited: Predicted vs. Actual",
       subtitle = "No Hour Model | Testing Data") +
  my_theme

multi_aug_test_nohour %>% 
  rename("Predicted" = .pred_class,
         "Actual" = store) %>% 
  tbl_2var(Predicted~Actual,
           caption = "Predicted Store Visited vs. Actual Store Visited (Testing) \n No Hour Model")
Predicted Store Visited vs. Actual Store Visited (Testing)
No Hour Model
Predicted Missing: 0 | Actual Missing: 0
NAs Removed: No

Actual

Predicted

A

B

C

Total

A

29

27

17

73

B

19

21

25

65

C

20

19

24

63

Total

68

67

66

201

With the testing data, we predict 74/201 correct for a score of 36.82%. Based on these results, I think I am willing to sacrifice the extra variables and keep the model that includes hour. We will continue to move on and check model assumptions with this model.

Checking Assumptions

Even though our model is not performing as well as I’d hoped, it still is good to check our assumptions. It is possible that part of the reason why our model is not satisfactory is due to not meeting some assumptions.

diagnostics <- plot(check_model(multi_mod_reduced, panel = FALSE, residual_type = "normal"))

Variable Types

We meet this since we chose a response variable (store) with more than two levels: A, B, and C.

Alternative Categories / Outcomes are Independent

Rose Werth has a good explanation of what this assumption really means. Essentially “this assumption states that the relative likelihood of being in one category compared to the base category would not change if you added any other categories.” We should be confident that there are no confounding variables out that that might change our outcome if they were considered.

I honestly don’t think we meet this assumption here. For instance, what if I told you the weather on a certain day or hour. Or what if I gave you traffic information about a certain day our hour. Would this influence your choice of store? Maybe, maybe not. But it is possible that knowing this information might change the outcome. For example, what if you were planning on going to Store B but I told you there was a car accident on the way there and you decided to go to Store C instead to avoid the traffic. That’s a change in the outcome.

I think this violation plays a role in our models lack of major success. Because there are so many other variables out there that we could collect and consider, we aren’t able to make an accurate prediction.

No Correlation Between Predictors

As before, let’s look at variance inflation factors to look at multicollinearity.

multi_mod_reduced %>% 
  extract_fit_engine() %>% 
  check_collinearity()
## Model has interaction terms. VIFs might be inflated.
##   You may check multicollinearity among predictors of a model without
##   interaction terms.
## # Check for Multicollinearity
## 
## Low Correlation
## 
##    Term  VIF           VIF 95% CI Increased SE Tolerance Tolerance 95% CI
##  gender 1.37 [    1.27,     1.50]         1.17      0.73     [0.67, 0.78]
## 
## High Correlation
## 
##                      Term      VIF           VIF 95% CI Increased SE Tolerance
##             product_class  1907.92 [ 1678.42,  2168.82]        43.68  5.24e-04
##                pay_method    61.15 [   53.86,    69.45]         7.82      0.02
##                      hour    20.28 [   17.91,    23.00]         4.50      0.05
##  product_class:pay_method 1.53e+05 [1.34e+05, 1.74e+05]       390.70  6.55e-06
##  Tolerance 95% CI
##      [0.00, 0.00]
##      [0.01, 0.02]
##      [0.04, 0.06]
##      [0.00, 0.00]

This is not good at all, but remember that we have an interaction term in our model so we do expect some correlation there. We should check VIFs without any interactions.

multi_spec %>%
  fit(store ~ gender + product_class + pay_method + hour, data = train_subset) %>% 
  repair_call(data = train_subset) %>% 
  extract_fit_engine() %>% 
  check_collinearity()
## # Check for Multicollinearity
## 
## Low Correlation
## 
##           Term  VIF     VIF 95% CI Increased SE Tolerance Tolerance 95% CI
##         gender 1.37 [ 1.27,  1.50]         1.17      0.73     [0.67, 0.79]
##  product_class 4.46 [ 3.97,  5.02]         2.11      0.22     [0.20, 0.25]
##     pay_method 1.81 [ 1.65,  2.00]         1.34      0.55     [0.50, 0.61]
## 
## High Correlation
## 
##  Term   VIF     VIF 95% CI Increased SE Tolerance Tolerance 95% CI
##  hour 18.33 [16.14, 20.84]         4.28      0.05     [0.05, 0.06]

We have evidence for low correlation between predictors. gender, product_class, and pay_method all have VIF values of less than 5. product_class is getting close to showing evidence of correlation with the other predictors but not bad.

What strikes me more is the massive VIF for hour. At 18.33, it is clearly correlated with other predictors. At this point, it should be dropped or combined into another variable. I’m not sure we could get a better model with the data we have, but curiosity strikes me to test something. I am going to create a new variable that merges some of the hours together and then I’ll run a new model on that. I won’t include the interaction for demonstration purposes.

train_subset <- train_subset %>% 
  mutate(time_of_day = case_when(hour %in% c(10, 11, 12, 13, 14) ~ "Midday",
                                 hour %in% c(15, 16, 17, 18) ~ "Afternoon",
                                 TRUE ~ "Evening"))

multi_mod_reduced2 <- multi_spec %>%
  fit(store ~  gender + product_class + pay_method + time_of_day, data = train_subset) %>% 
  repair_call(data = train_subset) # this is necessary because we are not using cross-validation

tidy(multi_mod_reduced2) %>% 
  kable() %>% 
  kable_styling(c("striped", "responsive"), fixed_thead = TRUE)
y.level term estimate std.error statistic p.value
B (Intercept) -0.2183208 0.2842962 -0.7679344 0.4425261
B genderMale -0.1910915 0.1758398 -1.0867362 0.2771534
B product_classFashion accessories 0.3385894 0.2997552 1.1295533 0.2586645
B product_classFood and beverages -0.0727605 0.3032538 -0.2399326 0.8103825
B product_classHealth and beauty 0.1715636 0.3087674 0.5556403 0.5784568
B product_classHome and lifestyle -0.2571213 0.3039655 -0.8458898 0.3976142
B product_classSports and travel 0.1813320 0.2933302 0.6181838 0.5364542
B pay_methodCredit card 0.2052460 0.2175893 0.9432723 0.3455416
B pay_methodEwallet 0.0230733 0.2140427 0.1077976 0.9141563
B time_of_dayEvening 0.5343003 0.2550518 2.0948697 0.0361826
B time_of_dayMidday 0.1138453 0.1966973 0.5787843 0.5627347
C (Intercept) 0.1442794 0.2765776 0.5216598 0.6019072
C genderMale -0.2736713 0.1761162 -1.5539250 0.1202023
C product_classFashion accessories 0.3474356 0.2981897 1.1651496 0.2439585
C product_classFood and beverages 0.1151221 0.2945517 0.3908384 0.6959167
C product_classHealth and beauty 0.1815095 0.3070232 0.5911916 0.5543921
C product_classHome and lifestyle -0.2175378 0.2996763 -0.7259091 0.4678945
C product_classSports and travel -0.2023096 0.3051752 -0.6629293 0.5073758
C pay_methodCredit card -0.1645312 0.2156844 -0.7628331 0.4455629
C pay_methodEwallet -0.3188419 0.2110679 -1.5106129 0.1308871
C time_of_dayEvening 0.2880832 0.2605779 1.1055548 0.2689192
C time_of_dayMidday 0.0551912 0.1950934 0.2828963 0.7772563
multi_aug2 <- augment(multi_mod_reduced2, new_data = train_subset)

ggplot(data = multi_aug2, aes(x = .pred_class, y = store)) + 
  geom_point() + 
  geom_jitter() +
  labs(x = "Predicted Store", 
       y = "Actual Store ", 
       title = "Store Visited: Predicted vs. Actual",
       subtitle = "Training Data: Reduced Model 2") +
  my_theme

multi_mod_reduced2 %>% 
  extract_fit_engine() %>% 
  check_collinearity()
## # Check for Multicollinearity
## 
## Low Correlation
## 
##           Term  VIF   VIF 95% CI Increased SE Tolerance Tolerance 95% CI
##         gender 1.35 [1.25, 1.49]         1.16      0.74     [0.67, 0.80]
##  product_class 4.26 [3.79, 4.80]         2.06      0.23     [0.21, 0.26]
##     pay_method 1.79 [1.63, 1.98]         1.34      0.56     [0.50, 0.61]
##    time_of_day 1.91 [1.74, 2.12]         1.38      0.52     [0.47, 0.58]

The results don’t look much better. However, we have corrected the massively large VIF value! I would trust this second reduced model more than the original reduced model.

Linearity Between Log Odds and Predictors

We can check this by plotting each predictor against the log odds. I am going to skip this assumption since I am not confident with the model and since I will not be using the model for any inference.

Independent Errors

We may have clusters in the data depending on how the data was collected and on whom it was collected. It is very possible that we have multiple individuals represented here in this dataset. We may also have clusters of different ethnicities or other variables. I am not confident we meet this assumption either due to the variables we have versus those we do not have.

Normal Errors

diagnostics[[3]] +
  my_theme

Ideally, we’d like to see our errors normally distributed as well. According to this Q-Q Plot, we do not meet this assumption which questions our model results further.

Wrap Up

I think we are seeing a pattern over these guides of not quite being able to predict our outcomes as well as we’d like. Here, we had a few points of interest but it wasn’t enough to give us great prediction accuracy Even though our models are not “successful” per se, I would claim that we still are extracting valuable information about our data. Just knowing that this collection of variables does not provide enough information to accurately predict the store a customer visits is useful. We now know that we should 1) try a different outcome variable, 2) use different predictors, and/or 3) collect new data to help supplement what we have here. We also found some interesting patterns when looking at visualizations and the regression output.

The moral of the story is don’t fear when your results aren’t significant or aren’t what you’d hoped. Think about what you were able to gain from the process. A model that doesn’t work well is still progress and can tell you a lot about your data. Why didn’t the model work? What could you do to change the model or update the data? Every pattern, trend, failure, and success allows us to learn something and make changes and suggestions for the future.

References

Dykes, Bradford. n.d. STA 631.” GitHub. Accessed April 18, 2024. https://github.com/gvsu-sta631.
James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2013. An Introduction to Statistical Learning: With Applications in R. 1st ed. 2013, Corr. 7th printing 2017 edition. New York: Springer.
Lüdecke, Daniel, Mattan S. Ben-Shachar, Indrajeet Patil, Philip Waggoner, and Dominique Makowski. 2021. performance: An R Package for Assessment, Comparison and Testing of Statistical Models.” Journal of Open Source Software 6 (60): 3139. https://doi.org/10.21105/joss.03139.
Werth, Rose. n.d. 15 Multinomial Logit Regression (R) Categorical Regression in Stata and R. Accessed April 18, 2024. https://bookdown.org/sarahwerth2024/RegressionLabsBook/.