Hey there! This is the sixth guide in my Introduction to Modeling with R series. In the previous guide, we looked at multinomial regression, which involved a multi-level response variable. This page will focus on Poisson regression, another generalized linear model. Although I cannot cover all of the details and finer use cases of regression, we will explore some of the key ideas to focus on when creating and analyzing a regression model. We’ll start by importing our data and identifying variables to use in the model, then move to creating, interpreting, and testing the model. As before, we’ll use train/test validation, and we’ll also look at a few candidate models.
I will again start by importing our dataset as well as the packages we will be using. As mentioned in the previous guide, I prefer to use the tidyverse family of packages. We’ll also be using the tidymodels collection of packages to set up the models and perform our train/test validation.
The code below imports our packages and the data we cleaned in the previous guide. It also splits the data into a train and a test set. We will train our models using the training set and test their performance using the test set. We do this to simulate how the models will perform in the “real world”.
I will be setting the seed for the random number generator so we can get the same results each time I start a new R session and split the data. You may or may not wish to do this in your own work. I want to make sure we get a roughly equal share of each store in the split sets, so I will take a stratified sample to make the split (strata = store).
library(tidyverse)
library(tidymodels)
library(poissonreg)
library(GGally)
library(performance)
library(knitr)
library(kableExtra)

# Read in the data cleaned in the previous guide
retail <- read_csv(here::here("data/retail_clean.csv"))

# Set the seed so the split is reproducible, then take an 80/20
# train/test split stratified by store
set.seed(52319)
retail_split <- initial_split(retail, prop = .8, strata = store)
train <- training(retail_split)
test <- testing(retail_split)
Alright! Now that we have that set, we’ll use the train data for all of the model creation and will only use the test data when we are analyzing our models.
As you move forward in modeling, you may come across the term “generalized linear models”. This term refers to the wide variety of models we can create that, according to Penn State University (“6.1 - Introduction to GLMs STAT 504” n.d.), all share three components: a random component (the distribution we assume the response variable follows), a systematic component (the linear combination of the predictors), and a link function that connects the two.
All of the models we have covered so far are specific cases of a generalized linear model. SLR and MLR, where the response variable is continuous, are the simplest cases: the link function is just the identity, so we model the expected value (mean) of the response directly, and we assume the response variable comes from a normal distribution. We have also seen logistic regression and multinomial regression for when our response is categorical with two or more levels, where we assume the response comes from a binomial or multinomial distribution, respectively. These two use the log odds as the link function, which is why we have been predicting the log odds of an outcome rather than the outcome itself.
Generalized linear models also use something called maximum likelihood estimation to calculate the coefficient estimates (as opposed to the least squares method traditionally used in SLR and MLR). Here, we choose values for the model parameters that make the data we already have as likely as possible. There are many other kinds of generalized linear models: as long as we specify the three components above, we can create a model. What if we want to assume our response variable comes from a Poisson distribution? A negative binomial? A chi-squared? All we need to complete the model is a link function. To demonstrate another example of a generalized linear model, we will explore Poisson regression in this guide.
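To make this concrete, here is a minimal sketch (not part of the tidymodels workflow we will build below) showing that, in base R, switching between these models is mostly a matter of choosing a different family and link in glm(). The qty ~ unit_price formula is just a placeholder for illustration, not a model we will actually analyze.
# The same glm() call fits different generalized linear models depending on the
# assumed response distribution (family) and its link function.
glm(qty ~ unit_price, data = train, family = gaussian(link = "identity"))  # like SLR
glm(qty ~ unit_price, data = train, family = poisson(link = "log"))        # Poisson regression
# With a two-level 0/1 response, family = binomial(link = "logit") would give
# logistic regression.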
As indicated above, we are interested in Poisson regression if we want to assume our response variable follows a Poisson distribution. (“Poisson” should be pronounced “pwah-soh” because it’s a French name, but most people pronounce it “poy-sahn”.) A Poisson distribution is commonly used for variables that are counts, as it gives the probability that an event happens a certain number of times in a fixed period. In other words, we are counting occurrences of something and want to know how many will occur.
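As a quick illustration of the distribution itself, base R’s dpois() gives the probability of seeing a specific count. The average rate of 5 below is arbitrary and just for demonstration.
# If events happen at an average rate of 5 per period, the probability of
# observing exactly 3 events in a period is about 0.14
dpois(3, lambda = 5)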
The formula for Poisson regression is simpler than that of logistic and multinomial regression:
\[ \log{(\lambda)} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots \] Instead of predicting the log odds of an occurrence, we are predicting the log of the average number of times (\(\lambda\)) some event occurs. Yes, we are predicting the average number of occurrences. This can make model interpretation a little tricky, but we’ll work through it below once we get model output.
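As a preview of that interpretation, using a made-up coefficient rather than one from our data: if \(\beta_1 = 0.2\), then increasing \(x_1\) by one unit adds 0.2 to \(\log{(\lambda)}\), so the predicted average count is multiplied by
\[ e^{0.2} \approx 1.22, \] which is roughly a 22% increase in the expected number of occurrences.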
The assumptions for Poisson regression are similar to those for logistic and multinomial regression.
We only have one variable that is a count variable, and that is qty, or the number of a specific item purchased. That makes choosing the response variable easy!
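If you want to confirm that qty behaves like a count, a quick tabulation (just a sketch; the exact numbers will depend on your split) shows the distinct values it takes.
# qty should contain only small positive integers (counts of items purchased)
train %>%
  count(qty)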
As for the predictors, I once again am going to collect all the variables I think might play a role in predicting how many of a certain item a customer will buy.
store
customer_type
gender
product_class
pay_method
unit_price
I am also going to create a new variable that we used in the previous guide: time_of_day, which will take on the values “Midday”, “Afternoon”, or “Evening”.
# Add the hour and time_of_day variables, then keep only the columns we will model with
train_subset <- train %>%
  mutate(hour = hour(time),
         time_of_day = case_when(hour %in% c(10, 11, 12, 13, 14) ~ "Midday",
                                 hour %in% c(15, 16, 17, 18) ~ "Afternoon",
                                 TRUE ~ "Evening")) %>%
  select(qty, store, customer_type, gender, product_class, pay_method, unit_price, time_of_day)

# Apply the same transformation to the test set
test_subset <- test %>%
  mutate(hour = hour(time),
         time_of_day = case_when(hour %in% c(10, 11, 12, 13, 14) ~ "Midday",
                                 hour %in% c(15, 16, 17, 18) ~ "Afternoon",
                                 TRUE ~ "Evening")) %>%
  select(qty, store, customer_type, gender, product_class, pay_method, unit_price, time_of_day)
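Before plotting, it can be worth a quick sanity check that the new variable was created the way we intended. A simple count (shown as a sketch; the exact numbers will depend on the split) does the trick.
# How many training transactions fall into each time of day?
train_subset %>%
  count(time_of_day)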
It’s usually a good idea to take a look at our variables and how they interact before we make a model. So let’s do that now!
train_subset %>%
  ggplot(aes(x = qty, fill = store)) +
  geom_bar() +
  scale_fill_brewer(palette = "Dark2") +
  labs(x = "Quantity Sold in One Transaction",
       y = "Count",
       title = "Distribution of Quantities Sold vs. Store Visited",
       fill = "Store") +
  my_theme
This variable might not have much of an effect on quantity sold. For the most part, each store’s share of transactions at a given quantity is roughly proportional to the total number of transactions at that quantity. There are a few exceptions, but nothing that stands out dramatically.
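If we want to check that “roughly proportional” claim with numbers instead of eyeballing the stacked bars, a quick table like this sketch computes each store’s share within every quantity.
# For each quantity, what proportion of transactions came from each store?
train_subset %>%
  count(qty, store) %>%
  group_by(qty) %>%
  mutate(prop = n / sum(n)) %>%
  ungroup()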
train_subset %>%
  ggplot(aes(x = qty, fill = customer_type)) +
  geom_bar() +
  scale_fill_brewer(palette = "Dark2") +
  labs(x = "Quantity Sold in One Transaction",
       y = "Count",
       title = "Distribution of Quantities Sold vs. Member Status",
       fill = "Member Status") +
  my_theme
There doesn’t appear to be a huge difference in quantities sold between member statuses. It does seem like more members buy 10 of an item and more non-members buy 8 or 9 of an item, but these differences are not extraordinary.
train_subset %>%
  ggplot(aes(x = qty, fill = gender)) +
  geom_bar() +
  scale_fill_brewer(palette = "Dark2") +
  labs(x = "Quantity Sold in One Transaction",
       y = "Count",
       title = "Distribution of Quantities Sold by Gender",
       fill = "Gender") +
  my_theme
We could end up seeing an effect here. Just by looking, I see that more males buy quantities of 1, 4, and 7, while more females buy 5, 6, 8, and 9 items at a time.
train_subset %>%
  ggplot(aes(x = qty, fill = product_class)) +
  geom_bar() +
  scale_fill_brewer(palette = "Dark2") +
  labs(x = "Quantity Sold in One Transaction",
       y = "Count",
       title = "Distribution of Quantities Sold vs. Product Class",
       fill = "Product Class") +
  my_theme
There is a lot going on here, so it is hard to tell what role product class might end up playing in the model. I do notice that if someone is going to buy a fashion accessory, they seem to have a tendency to buy only 1 or 2 of them, whereas someone buying a food or beverage has a higher tendency to buy 3 or 5 of them. None of these observations stands out as a solid pattern, though.
train_subset %>%
  ggplot(aes(x = qty, fill = pay_method)) +
  geom_bar() +
  scale_fill_brewer(palette = "Dark2") +
  labs(x = "Quantity Sold in One Transaction",
       y = "Count",
       title = "Distribution of Quantities Sold vs. Payment Method",
       fill = "Payment Method") +
  my_theme