Hey there! This is the sixth guide in my Introduction to Modeling with R series. In the previous guide, we looked at multinomial regression, which involved a multi-level response variable. This page will focus on Poisson regression, a generalized linear model. Although I cannot cover all of the details and fine use cases of regression, we will explore some of the key ideas to focus on when creating and analyzing a regression model. We’ll start by importing our data and identifying variables to use in the model, then move to creating, interpreting, and testing the model. As before, we’ll use train/test validation and we’ll also look at a few candidate models.

I again will start by importing our dataset as well as the packages we will be using. As mentioned in the previous guide, I prefer to use the `tidyverse` family of packages. We’ll also be using the `tidymodels` collection of packages to set up the models and perform our train/test validation.

The code below imports our packages and the data we cleaned in the previous guide. It also splits the data into a train and test set. We will train our models using the training set and test their performance using the test set. We do this to simulate how the model will perform in the “real world”.

I will be setting the seed for the random number generator so we can get the same results each time I start a new R session and split the data. You may or may not wish to do this in your work. I want to make sure we get a roughly equal amount of each store in the split sets, so I will take a stratified sample to make the split (`strata = store`).

```
library(tidyverse)
library(tidymodels)
library(poissonreg)
library(GGally)
library(performance)
library(knitr)
library(kableExtra)
retail <- read_csv(here::here("data/retail_clean.csv"))
set.seed(52319)
retail_split <- initial_split(retail, prop = .8, strata = store)
train <- training(retail_split)
test <- testing(retail_split)
```

Alright! Now that we have that set, we’ll use the `train` data for all the model creation and will only use the `test` data when we are analyzing our models.

As you move forward in modeling, you may come across the term “generalized linear models”. This term refers to the wide variety of models we can build from three components, according to Penn State University (“6.1 - Introduction to GLMs STAT 504” n.d.):

- A random component (how the response variable is distributed)
- A systematic component (linear relationships between predictors and some form of the outcome)
- A link function (the connection between the first two components)

All of the models we have covered so far are specific cases of a generalized linear model. SLR and MLR, where the response variable is continuous, are the simplest cases. The link function there is the identity function: we model the expected value (mean) of the response directly, and we assume the response variable comes from a normal distribution. We have also seen logistic regression and multinomial regression for when our response is categorical with two or more levels, where we assume the response comes from a binomial or multinomial distribution, respectively. These two use the log odds as the link function, which is why we have been predicting the log odds of an outcome rather than the outcome itself.

Generalized linear models also use maximum likelihood estimation to calculate the coefficient estimates (as opposed to the least squares method traditionally used in SLR and MLR). Here, we choose the values for the model that make the data we already have the most likely to occur. There are many other kinds of generalized linear models; as long as we meet the three criteria above, we can create a model. What if we want to assume our response variable comes from a Poisson distribution? A negative binomial? A chi-squared? All we need to complete the model is a link function. To demonstrate another example of a generalized linear model, we will explore Poisson regression in this guide.
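
To make the three components concrete, here is a minimal sketch using base R’s `glm()` on made-up data (we will use `tidymodels` for the real models below). The `family` argument names the random component, and the link function completes the connection:

```
# Toy data (made up for illustration): counts that depend on a price
set.seed(1)
toy <- data.frame(unit_price = runif(100, 10, 100))
toy$qty <- rpois(100, lambda = exp(1.5 - 0.01 * toy$unit_price))

# Random component = Poisson, link function = log
fit <- glm(qty ~ unit_price, data = toy, family = poisson(link = "log"))

# Both pieces are recorded in the fitted object
family(fit)$family  # "poisson"
family(fit)$link    # "log"
```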

As indicated above, we are interested in Poisson regression if we want to assume our response variable follows a Poisson distribution. (“Poisson” *should* be pronounced “pwah-soh” because it’s a French name but most people pronounce it “poy-sahn”.) A Poisson distribution is commonly used for count variables, as it gives the probability that a certain event happens a given number of times in a fixed period of time. In other words, we are counting occurrences of something and want to know how many will occur.
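
As a quick illustration (with toy numbers, not our retail data), `dpois()` gives those probabilities directly, and simulated Poisson draws show the distribution’s signature property that the mean equals the variance:

```
# P(X = k) for a Poisson variable with an average of 4 events
lambda <- 4
dpois(0:5, lambda)  # probabilities of seeing 0, 1, ..., 5 events

# Signature property: mean and variance are both equal to lambda
set.seed(6)
x <- rpois(100000, lambda)
mean(x)  # near 4
var(x)   # near 4
```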

The formula for Poisson regression is simpler than logistic and multinomial regression:

\[
\log{(\lambda)} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots
\]

Instead of predicting the log odds of an occurrence, we are predicting the log of the *average* number of times (\(\lambda\)) some event occurs. Yes, we are predicting the average number of occurrences. This can make model interpretation a little tricky, but we’ll work through it below once we get model output.
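
To preview how that interpretation works, here is a tiny worked example with made-up coefficients (not fitted to our data). Exponentiating undoes the log, so `exp()` of a coefficient gives a multiplicative effect on the average count:

```
# Hypothetical fitted model (made-up numbers): log(lambda) = 1.2 + 0.3 * x
b0 <- 1.2
b1 <- 0.3

# Predicted *average count* at x = 2: undo the log with exp()
lambda_hat <- exp(b0 + b1 * 2)  # about 6.05

# A 1-unit increase in x multiplies the average count by exp(b1)
exp(b1)  # about 1.35, i.e. roughly a 35% higher average count
```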

The assumptions for Poisson regression are similar to logistic and multinomial regression.

- The response variable must be a count variable with non-negative integer values. The predictors can be continuous or categorical.
- Since we are assuming the response variable comes from a Poisson distribution, we are assuming the mean and variance of the response variable are the same.
- The predictor variables must have a linear relationship with the log of the response variable.
- The specific observations should be independent.
- Ideally, the predictor variables should not be highly correlated.
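
The mean-equals-variance assumption (equidispersion) is easy to eyeball. Here is a small helper sketched on simulated counts; with the real data you could call `check_dispersion(train$qty)` instead:

```
# Equidispersion check: for a Poisson response, mean and variance match.
# Demonstrated on simulated counts here, not the retail data.
check_dispersion <- function(y) c(mean = mean(y), variance = var(y))

set.seed(42)
check_dispersion(rpois(500, lambda = 5))  # both values near 5
```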

We only have one variable that is a count variable, and that is `qty`, the number of a specific item purchased. That makes choosing the response variable easy!

As for the predictors, I once again am going to collect all the variables I think might play a role in predicting how many of a certain item a customer will buy.

- `store`
- `customer_type`
- `gender`
- `product_class`
- `pay_method`
- `unit_price`

I also am going to create a new variable that we used in the previous guide: `time_of_day`, which will take on the values “Midday”, “Afternoon”, or “Evening”.

```
train_subset <- train %>%
  mutate(hour = hour(time),
         time_of_day = case_when(hour %in% c(10, 11, 12, 13, 14) ~ "Midday",
                                 hour %in% c(15, 16, 17, 18) ~ "Afternoon",
                                 TRUE ~ "Evening")) %>%
  select(qty, store, customer_type, gender, product_class, pay_method, unit_price, time_of_day)

# Apply the same transformation to the test set
test_subset <- test %>%
  mutate(hour = hour(time),
         time_of_day = case_when(hour %in% c(10, 11, 12, 13, 14) ~ "Midday",
                                 hour %in% c(15, 16, 17, 18) ~ "Afternoon",
                                 TRUE ~ "Evening")) %>%
  select(qty, store, customer_type, gender, product_class, pay_method, unit_price, time_of_day)
```

It’s usually a good idea to take a look at our variables and how they interact before we make a model. So let’s do that now!

```
train_subset %>%
  ggplot(aes(x = qty, fill = store)) +
  geom_bar() +
  scale_fill_brewer(palette = "Dark2") +
  labs(x = "Quantity Sold in One Transaction",
       y = "Count",
       title = "Distribution of Quantities Sold vs. Store Visited",
       fill = "Store") +
  my_theme
```

This variable might not have much of an effect on quantity sold. For the most part, each store’s share of a given quantity is roughly proportional to the total number of transactions at that quantity. Some exceptions are:

- Fewer quantities of 5 and 6 in Store C
- More quantities of 7 and 10 in Store C
- More quantities of 5 in Store A

```
train_subset %>%
  ggplot(aes(x = qty, fill = customer_type)) +
  geom_bar() +
  scale_fill_brewer(palette = "Dark2") +
  labs(x = "Quantity Sold in One Transaction",
       y = "Count",
       title = "Distribution of Quantities Sold vs. Member Status",
       fill = "Member Status") +
  my_theme
```

There doesn’t appear to be a huge difference between member status and quantities sold. It does seem like more members buy 10 of an item and more non-members buy 8 or 9 of an item, but these differences are not extraordinary.

```
train_subset %>%
  ggplot(aes(x = qty, fill = gender)) +
  geom_bar() +
  scale_fill_brewer(palette = "Dark2") +
  labs(x = "Quantity Sold in One Transaction",
       y = "Count",
       title = "Distribution of Quantities Sold by Gender",
       fill = "Gender") +
  my_theme
```

We could end up seeing an effect here. Just by looking, I see that more males buy quantities of 1, 4, and 7, while more females buy 5, 6, 8, or 9 items at a time.

```
train_subset %>%
  ggplot(aes(x = qty, fill = product_class)) +
  geom_bar() +
  scale_fill_brewer(palette = "Dark2") +
  labs(x = "Quantity Sold in One Transaction",
       y = "Count",
       title = "Distribution of Quantities Sold vs. Product Class",
       fill = "Product Class") +
  my_theme
```

There is a lot going on here, so it is hard to tell what role product class might end up playing in the model. However, I do notice that if someone is going to buy a fashion accessory, they tend to buy 1 or 2 of them, while someone buying a food or beverage has a higher tendency to buy 3 or 5 of them. Still, none of these observations stand out as a solid pattern.

```
train_subset %>%
  ggplot(aes(x = qty, fill = pay_method)) +
  geom_bar() +
  scale_fill_brewer(palette = "Dark2") +
  labs(x = "Quantity Sold in One Transaction",
       y = "Count",
       title = "Distribution of Quantities Sold vs. Payment Method",
       fill = "Payment Method") +
  my_theme
```