Welcome

Hey there! This is the sixth guide in my Introduction to Modeling with R series. In the previous guide, we looked at multinomial regression, which involved a multi-level response variable. This page will focus on Poisson regression, a generalized linear model. Although I cannot cover all of the details and finer use cases of regression, we will explore some of the key ideas to focus on when creating and analyzing a regression model. We’ll start by importing our data and identifying variables to use in the model, then move to creating, interpreting, and testing the model. As before, we’ll use train/test validation and we’ll also look at a few candidate models.

Once again, I will start by importing our dataset as well as the packages we will be using. As mentioned in the previous guide, I prefer to use the tidyverse family of packages. We’ll also be using the tidymodels collection of packages to set up the models and perform our train/test validation.

The code below imports our packages and the data we cleaned in the previous guide. It also splits the data into a train and a test set. We will train our models using the training set and will test their performance using the test set. We do this to simulate how the models will perform in the “real world”.

I will be setting the seed for the random number generator so we get the same results each time a new R session is started and the data is split. You may or may not wish to do this in your own work. I want to make sure we get a roughly equal proportion of each store in the split sets, so I will take a stratified sample when making the split (strata = store).

library(tidyverse)
library(tidymodels)
library(poissonreg)
library(GGally)
library(performance)
library(knitr)
library(kableExtra)

retail <- read_csv(here::here("data/retail_clean.csv"))

set.seed(52319)
retail_split <- initial_split(retail, prop = .8, strata = store)

train <- training(retail_split)
test <- testing(retail_split)

Alright! Now that we have that set, we’ll use the train data for all the model creation and will only use the test data when we are analyzing our models.

Generalized Linear Models

As you move forward in modeling, you may come across the term “generalized linear models”. This concept refers to the wide variety of models we can create with, according to Penn State University (“6.1 - Introduction to GLMs STAT 504” n.d.),

  1. A random component (how the response variable is distributed)
  2. A systematic component (linear relationships between predictors and some form of the outcome)
  3. A link function (the connection between 1 and 2)

All of the models we have covered so far are specific cases of a generalized linear model. SLR and MLR, where the response variable is continuous, are the simplest cases. The link function there is the identity function (we model the expected value, or mean, of the response directly) and we assume the response variable comes from a normal distribution. We have also seen logistic regression and multinomial regression for when our response is categorical with two or more levels, where we assume the response comes from a binomial or multinomial distribution, respectively. These two use the log odds as the link function, which is why we have been predicting the log odds of an outcome rather than the outcome itself.

Generalized linear models also use maximum likelihood estimation to calculate the estimates (as opposed to the least squares method traditionally used in SLR and MLR). Here, we choose the parameter values that make the data we already have the most likely. There are many other kinds of generalized linear models; as long as we meet the three criteria above, we can create a model. What if we want to assume our response variable comes from a Poisson distribution? A negative binomial? A chi-squared? All we need to complete the model is a link function. To demonstrate another example of a generalized linear model, we will explore Poisson regression in this guide.
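
To make the three components concrete, here is a minimal sketch using base R’s glm() function (we will use tidymodels for the actual modeling below). The family argument names the random component, the link argument names the link function, and the formula is the systematic component. The choice of qty and unit_price here is arbitrary and just for illustration, and glm_sketch is a throwaway name.

# A minimal GLM sketch with base R's glm(): family = poisson() is the random
# component, link = "log" is the link function, and the formula is the
# systematic component. Swapping in gaussian(link = "identity") or
# binomial(link = "logit") would give linear or logistic regression instead.
glm_sketch <- glm(qty ~ unit_price,
                  data = train,
                  family = poisson(link = "log"))

summary(glm_sketch)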

Poisson Regression

As indicated above, we are interested in Poisson regression if we want to assume our response variable follows a Poisson distribution. (“Poisson” should be pronounced “pwah-soh” because it’s a French name, but most people pronounce it “poy-sahn”.) A Poisson distribution is commonly used for variables that are counts, as it models the probability that an event occurs a given number of times in a fixed period of time. In other words, we are counting occurrences of something and want to know how many will occur.

The formula for Poisson regression is simpler than logistic and multinomial regression:

\[ \log{\lambda} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots \]

Instead of predicting the log odds of an occurrence, we are predicting the log of the average number of times (\(\lambda\)) some event occurs. Yes, we are predicting the average number of occurrences. This can make model interpretation a little tricky, but we’ll work through it below once we get model output.
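
If it helps, exponentiating both sides shows why the coefficients end up being interpreted multiplicatively later on:

\[ \lambda = e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots} = e^{\beta_0} \cdot e^{\beta_1 x_1} \cdot e^{\beta_2 x_2} \cdots \]

So increasing \(x_1\) by one unit multiplies the predicted mean count \(\lambda\) by \(e^{\beta_1}\), holding the other predictors constant.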

Requirements

The assumptions for Poisson regression are similar to logistic and multinomial regression.

  • The response variable must be a count variable taking non-negative integer values. The predictors can be continuous or categorical.
  • Since we are assuming the response variable comes from a Poisson distribution, we are assuming the mean and variance of the response variable are the same.
  • The predictor variables must have a linear relationship with the log of the expected response (in practice, we will check this against the log of the observed counts).
  • The specific observations should be independent.
  • Ideally, the predictor variables should not be highly correlated.

Selecting Variables

We only have one variable that is a count variable and that is qty, or the number of a specific item purchased. That makes choosing the response variable easy!

As for the predictors, I once again am going to collect all the variables I think might play a role in predicting how many of a certain item a customer will buy.

  • store
  • customer_type
  • gender
  • product_class
  • pay_method
  • unit_price

I am also going to create a new variable that we used in the previous guide: time_of_day, which will take on the values “Midday”, “Afternoon”, or “Evening”.

train_subset <- train %>% 
  mutate(hour = hour(time),
         time_of_day = case_when(hour %in% c(10, 11, 12, 13, 14) ~ "Midday",
                                 hour %in% c(15, 16, 17, 18) ~ "Afternoon",
                                 TRUE ~ "Evening")) %>% 
  select(qty, store, customer_type, gender, product_class, pay_method, unit_price, time_of_day)

test_subset <- test %>% 
  mutate(hour = hour(time),
         time_of_day = case_when(hour %in% c(10, 11, 12, 13, 14) ~ "Midday",
                                 hour %in% c(15, 16, 17, 18) ~ "Afternoon",
                                 TRUE ~ "Evening")) %>% 
  select(qty, store, customer_type, gender, product_class, pay_method, unit_price, time_of_day)

Exploring the Variables

It’s usually a good idea to take a look at our variables and how they interact before we make a model. So let’s do that now!

train_subset %>%
  ggplot(aes(x = qty, fill = store)) +
  geom_bar() +
  scale_fill_brewer(palette = "Dark2") +
  labs(x = "Quantity Sold in One Transaction",
       y = "Count",
       title = "Distribution of Quantities Sold vs. Store Visited",
       fill = "Store") +
  my_theme

This variable might not have much of an effect on quantity sold. For the most part, each store’s share of a given quantity is roughly proportional to its share of overall sales. Some exceptions are

  • Fewer purchases of 5 and 6 items in Store C
  • More purchases of 7 and 10 items in Store C
  • More purchases of 5 items in Store A

train_subset %>%
  ggplot(aes(x = qty, fill = customer_type)) +
  geom_bar() +
  scale_fill_brewer(palette = "Dark2") +
  labs(x = "Quantity Sold in One Transaction",
       y = "Count",
       title = "Distribution of Quantities Sold vs. Member Status",
       fill = "Member Status") +
  my_theme

There doesn’t appear to be a huge difference in quantities sold between member statuses. It does seem like more members buy 10 of an item and more non-members buy 8 or 9 of an item, but these differences are not extraordinary.

train_subset %>%
  ggplot(aes(x = qty, fill = gender)) +
  geom_bar() +
  scale_fill_brewer(palette = "Dark2") +
  labs(x = "Quantity Sold in One Transaction",
       y = "Count",
       title = "Distribution of Quantities Sold by Gender",
       fill = "Gender") +
  my_theme

We could end up seeing an effect here. Just by looking, I see that males more often buy quantities of 1, 4, and 7, while females more often buy 5, 6, 8, and 9 items at a time.

train_subset %>%
  ggplot(aes(x = qty, fill = product_class)) +
  geom_bar() +
  scale_fill_brewer(palette = "Dark2") +
  labs(x = "Quantity Sold in One Transaction",
       y = "Count",
       title = "Distribution of Quantities Sold vs. Product Class",
       fill = "Product Class") +
  my_theme

There is a lot going on here, so it is hard to tell what role product class might end up playing in the model. I do notice that if someone is going to buy a fashion accessory, they tend to buy 1 or 2 of them, while if someone is going to buy a food or beverage, they have a higher tendency to buy 3 or 5 of them. However, none of these observations stand out as a solid pattern.

train_subset %>%
  ggplot(aes(x = qty, fill = pay_method)) +
  geom_bar() +
  scale_fill_brewer(palette = "Dark2") +
  labs(x = "Quantity Sold in One Transaction",
       y = "Count",
       title = "Distribution of Quantities Sold vs. Payment Method") +
  my_theme

Once again, I see a few things that may have an effect, but nothing that stands out.

train_subset %>%
  ggplot(aes(x = unit_price, fill = factor(qty))) +
  geom_boxplot() +
  scale_fill_brewer(palette = "Paired") +
  labs(x = "Unit Price",
       y = "Quantity Sold",
       title = "Distribution of Quantities Sold vs. Unit Price",
       fill = "Quantity") +
  my_theme

It looks like there won’t be as much of an effect here as I was originally expecting. I thought that as unit price increased, it would be less likely that someone would buy more of that item. From this plot (and the plots above) it looks like this company sells a good number of items in quantities of 10, but there doesn’t seem to be a pattern between the price of one item and the quantity sold. The range of unit prices varies from about $10 to $100 regardless of the quantity sold.

train_subset %>%
  ggplot(aes(x = qty, fill = time_of_day)) +
  geom_bar() +
  scale_fill_brewer(palette = "Dark2") +
  labs(x = "Quantity Sold in One Transaction",
       y = "Count",
       title = "Distribution of Quantities Sold by Time Period",
       fill = "Time of Day") +
  my_theme

Overall, most goods are sold midday and the fewest in the evening. But we do see that quantities of 1, 3, and 10 are more likely to be sold in the afternoon.

I’ll also take a look at all of the variables together just for good measure.

train_subset %>% 
  ggpairs() +
  my_theme

Plot matrix of all variables plotted against each other.

I don’t see any need for interaction variables or variable transformations.

Building the Model

Let’s try to build a model with all of the above variables. I do anticipate having to remove some variables because they will be of little use in predicting the number of goods sold. I have a good idea of which variables these will be, but I would like to see what the p-values look like to help guide my decision making.

poi_spec <- poisson_reg()

poi_mod_full <- poi_spec %>%
  fit(qty ~ store + customer_type + gender + product_class + pay_method + unit_price + time_of_day, data = train_subset) %>% 
  repair_call(data = train_subset) # this is necessary because we are not using cross-validation

tidy(poi_mod_full, exp = TRUE) %>% 
  mutate(p.value = format.pval(p.value, digits = 2)) %>% 
  kable() %>% 
  kable_styling(c("striped", "responsive"))
term estimate std.error statistic p.value
(Intercept) 5.9031001 0.0624539 28.4286060 < 2e-16
storeB 1.0142098 0.0373039 0.3782379 0.70525
storeC 1.0391716 0.0372186 1.0323833 0.30189
customer_typeNormal 1.0031840 0.0304003 0.1045685 0.91672
genderMale 0.9371310 0.0306640 -2.1175365 0.03421
product_classFashion accessories 0.8367606 0.0514901 -3.4611963 0.00054
product_classFood and beverages 0.8812484 0.0516387 -2.4480801 0.01436
product_classHealth and beauty 0.9409788 0.0521437 -1.1666740 0.24334
product_classHome and lifestyle 0.9307007 0.0518888 -1.3840682 0.16634
product_classSports and travel 0.9365300 0.0510866 -1.2835805 0.19929
pay_methodCredit card 1.0100251 0.0374698 0.2662202 0.79007
pay_methodEwallet 1.0066169 0.0370091 0.1782019 0.85856
unit_price 0.9996562 0.0005745 -0.5984694 0.54953
time_of_dayEvening 1.0655001 0.0441046 1.4384955 0.15029
time_of_dayMidday 1.0542168 0.0344765 1.5314229 0.12566

Alright, here is our model output! Not all of our p-values are significant (as expected), but let’s dig into what the estimates mean before deciding what to drop. The interpretation of this model can be a little strange. First, you may have noticed that I used exp = TRUE in the code. This exponentiates the values in the estimate column to help with interpretation. Remember, we are predicting the log of the mean number of goods sold, so exponentiating the estimates lets us interpret them in terms of the mean number of goods sold. However, the results don’t give us the expected change in raw counts; they give a relative risk, or percent change, where the baseline is 1. If a parameter’s estimate is 1 (which is essentially what we have for unit_price), we would predict the mean number of goods sold not to change at all with the price of one item, holding all other variables constant.

Here are some interpretations from our data. Holding all other variables constant,

  • We would predict the mean number of goods purchased by a male customer to be \((1 - 0.937) \times 100 = 6.3\)% lower than for a female customer.
  • We would predict the mean number of goods purchased by a customer buying a food or beverage product to be \((1 - 0.881) \times 100 = 11.9\)% lower than for a customer buying an electronic accessory (the baseline category).
  • We would predict the mean number of goods purchased in the evening to be about 6.6% higher than in the afternoon (the baseline time of day). These effects combine multiplicatively, as shown in the worked example below.
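
Because the model is multiplicative on the mean scale, the exponentiated estimates simply multiply. For example, using the full-model output above, a male customer buying food and beverages in the evening is predicted to purchase

\[ 0.937 \times 0.881 \times 1.066 \approx 0.88 \]

times as many items as a female customer buying electronic accessories in the afternoon, or roughly 12% fewer, holding the remaining predictors constant.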

We can use this information, as well as the p-values, to derive a reduced model. I want to avoid the variables whose estimates are at or very close to 1, as this means they have little effect on the predicted response. Looking at the output, I’d say we keep gender, product_class, and time_of_day. I will say that it’s interesting that unit_price doesn’t seem to play a role here, as I would expect the price of an item to have some say in how many of that item are purchased.

poi_mod_reduced <- poi_spec %>%
  fit(qty ~ gender + product_class + time_of_day, data = train_subset) %>% 
  repair_call(data = train_subset) # this is necessary because we are not using cross-validation

tidy(poi_mod_reduced, exp = TRUE) %>% 
  mutate(p.value = format.pval(p.value, digits = 2)) %>% 
  kable() %>% 
  kable_styling(c("striped", "responsive"))
term estimate std.error statistic p.value
(Intercept) 5.9418974 0.0440945 40.413864 < 2e-16
genderMale 0.9345872 0.0305029 -2.217834 0.02657
product_classFashion accessories 0.8381205 0.0512081 -3.448544 0.00056
product_classFood and beverages 0.8812673 0.0514381 -2.457211 0.01400
product_classHealth and beauty 0.9425170 0.0519899 -1.138708 0.25482
product_classHome and lifestyle 0.9290891 0.0516756 -1.423314 0.15465
product_classSports and travel 0.9336724 0.0507896 -1.351253 0.17661
time_of_dayEvening 1.0669125 0.0439675 1.473112 0.14072
time_of_dayMidday 1.0543886 0.0344618 1.536805 0.12434

I like this model better, but let’s move forward with looking at model performance before we make any final decisions.
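
To make the relative-risk interpretation concrete, we can also ask the reduced model for the predicted mean quantity for a single hypothetical transaction. This is just a sketch: new_obs is a made-up observation, and I’m assuming parsnip’s default numeric prediction for poisson_reg() comes back on the count (response) scale.

# Predicted mean quantity for a hypothetical transaction: a male customer
# buying food and beverages in the evening. This should roughly match
# multiplying the exponentiated estimates by hand:
# 5.94 * 0.935 * 0.881 * 1.067 is about 5.2 items.
new_obs <- tibble(gender = "Male",
                  product_class = "Food and beverages",
                  time_of_day = "Evening")

predict(poi_mod_reduced, new_data = new_obs)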

Model Performance

model_performance(poi_mod_full) %>% 
  kable()
AIC AICc BIC R2_Nagelkerke RMSE Sigma Score_log Score_spherical
4079.757 4080.371 4150.008 0.0344889 2.883914 1 -2.534266 0.0310775
model_performance(poi_mod_reduced) %>% 
  kable()
AIC AICc BIC R2_Nagelkerke RMSE Sigma Score_log Score_spherical
4069.234 4069.463 4111.385 0.0322932 2.885788 1 -2.53519 0.0310738

There isn’t much that differs between the two models. Following the pattern we’ve seen before, the reduced model slightly lowers the AIC and BIC, while the \(R^2\) and RMSE values have hardly changed. To me, this is evidence to prefer the model with fewer variables, since we get almost the same performance as the full model.
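
If you would rather see the two models side by side, the performance package also provides compare_performance(), which takes several models at once and lines their metrics up in one table (a sketch, assuming it accepts our fitted model objects the same way model_performance() does).

# Put both candidate models' metrics into a single comparison table.
compare_performance(poi_mod_full, poi_mod_reduced) %>% 
  kable()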

Checking Assumptions

diagnostics <- plot(check_model(poi_mod_reduced, panel = FALSE, type = "discrete_dots"))
diagnostics[[4]] +
  my_theme

Leverage plot. No leverage value exceeds 0.02 and all points fall between the contour lines of 0.5 and -0.5.

The good news is that we don’t seem to have any influential observations or high leverage points.

The Response Variable

Our response is in fact a count variable. All values of the variable are positive integers. We also use valid predictors as they are all either continuous or categorical.

If our response variable qty truly follows a Poisson distribution in the population, we should expect the mean and the variance to be the same. This may not hold exactly here, since we only have a sample and not the entire population, but we would expect the two values to be close.

mean(train_subset$qty) %>% 
  kable() %>% 
  kable_styling(full_width = FALSE)
x
5.485607
var(train_subset$qty) %>% 
  kable() %>% 
  kable_styling(full_width = FALSE)
x
8.480682

Without running any statistical tests (such as a dispersion test or a goodness-of-fit test), it does look like the variance and the mean are different. I can’t say for certain, but we may be violating this assumption.
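
If we did want a formal test, the performance package we already loaded includes check_overdispersion(). Here is a sketch; I’m pulling out the underlying glm object with parsnip’s extract_fit_engine() in case the parsnip wrapper isn’t accepted directly.

# Formal overdispersion test on the reduced model. A small p-value would
# suggest the variance exceeds the mean, which is what the raw numbers hint at.
poi_mod_reduced %>% 
  extract_fit_engine() %>% 
  check_overdispersion()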

Linearity Between Log Response and Predictors

Since we assume a linear relationship between the predictors and the log of the response variable, we should make sure we see linearity there.

train_subset %>% 
  mutate(log_qty = log(qty)) %>% 
  select(-qty) %>% 
  ggpairs() +
  my_theme

First, we see that the log of the response variable has a wonky, left-skewed distribution, which carries over into the panels that involve it. But I don’t see evidence of any non-linear relationships here, except perhaps with unit_price.

Independence

We would like to see our observations be independent. We might also violate this assumption since we cannot confirm that a customer isn’t in the dataset multiple times. It may also be that the items one customer purchases have some dependency on what the previous customer purchased (for instance, if I see someone buy strawberry yogurt and I decide I want that as well).

Correlation Between Predictors

Ideally, our predictors should not be highly correlated. We can use a diagnostic plot to check this.

diagnostics[[5]] +
  my_theme

No VIF values are higher than 2.

Our VIFs are very low so I see no reason for concern here.
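
If you prefer numbers to a plot, check_collinearity() from the performance package returns the VIFs as a table (again a sketch, assuming it accepts our fitted model the same way the other performance functions have).

# Numeric VIFs for the reduced model, complementing the diagnostic plot above.
check_collinearity(poi_mod_reduced) %>% 
  kable()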

Wrap Up

Poisson regression can be a very useful tool if you are looking to predict counts of something or if you want to predict how often an event occurs in a specific time frame. This kind of problem doesn’t arise frequently, but it is useful to know that it exists. Of course, you could use another form of generalized linear model if you have reason to suspect that your response variable follows a different distribution. As we saw, the main drawback to using Poisson regression is the interpretation, as we have to remember that the exponentiated coefficients are relative risks rather than actual counts.

This guide wraps up all of the modeling that the series will cover. We will go over bootstrapping next, which is a form of resampling that uses a model but is not a model in and of itself. I feel like it might be helpful to close with a little summary of the models we have covered and when to use them.

Model Name Type of Response Type of Predictors Number of Predictors
Simple Linear Regression Continuous Continuous 1
Multiple Linear Regression Continuous Continuous, Categorical 2+
Simple Logistic Regression Binary Continuous 1
Multiple Logistic Regression Binary Continuous, Categorical 2+
Multinomial Regression Categorical w/ 3+ Levels Continuous, Categorical 2+
Poisson Regression Discrete Count Continuous, Categorical 2+
Generalized Linear Model Follows a Distribution Continuous, Categorical 2+

References

“6.1 - Introduction to GLMs STAT 504.” n.d. Accessed April 18, 2024. https://online.stat.psu.edu/stat504/lesson/6/6.1.
Dykes, Bradford. n.d. “STA 631.” GitHub. Accessed April 18, 2024. https://github.com/gvsu-sta631.
Jabeen, Hafsa. 2019. “Learn to Use Poisson Regression in R.” Dataquest. https://www.dataquest.io/blog/tutorial-poisson-regression-in-r/.
James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2013. An Introduction to Statistical Learning: With Applications in R. 1st ed. 2013, Corr. 7th printing 2017 edition. New York: Springer.
Roback, Paul, and Julie Legler. n.d. “Chapter 4 Poisson Regression.” Beyond Multiple Linear Regression. Accessed April 18, 2024. https://bookdown.org/roback/bookdown-BeyondMLR/ch-poissonreg.html#sec-PoisInference.
“Poisson Regression / Regression of Counts: Definition.” 2016. Statistics How To. https://www.statisticshowto.com/poisson-regression/.