Welcome

Hey there! Welcome to my Modeling Guide with R. Over the course of the next few guides, you’ll get to see a few modeling concepts in action. The dataset used here was downloaded from Kaggle (Paye n.d.). It is unclear whether this dataset was generated randomly or is based on real sales data; either way, it has plenty of variables for us to explore.

This Guide

The guide you are reading dives into a generalized exploratory data analysis (EDA) of the dataset. Before we can jump to modeling, we have to make sure we have a good understanding of what’s going on in the dataset. It’s not a good idea to jump right into inference; if we can’t explain what the variables are trying to say, we won’t be able to provide any insight. Moreover, we may need to do some data cleaning and we also may have questions to bring back to the data provider.

Exploring the Data

Preparing / Cleaning Data

Let’s begin by importing some packages and reading in the dataset. I rely heavily on the tidyverse family of packages and draw on other packages as needed. Here, we will use the lubridate package to work with dates and times, along with knitr and kableExtra to format tables.

library(tidyverse)
library(lubridate)
library(knitr)
library(kableExtra)

retail <- read_csv(here::here("data/retail_raw.csv"))
dim(retail)
## [1] 1000   17
head(retail) %>% 
  kable() %>% 
  kable_styling(c("striped", "responsive")) %>% 
  scroll_box(width = "100%", box_css = "border: 2px solid black; padding: 5px; ") 
Invoice ID | Branch | City | Customer type | Gender | Product line | Unit price | Quantity | Tax 5% | Total | Date | Time | Payment | cogs | gross margin percentage | gross income | Rating
750-67-8428 | A | Yangon | Member | Female | Health and beauty | 74.69 | 7 | 26.1415 | 548.9715 | 1/5/2019 | 13:08:00 | Ewallet | 522.83 | 4.761905 | 26.1415 | 9.1
226-31-3081 | C | Naypyitaw | Normal | Female | Electronic accessories | 15.28 | 5 | 3.8200 | 80.2200 | 3/8/2019 | 10:29:00 | Cash | 76.40 | 4.761905 | 3.8200 | 9.6
631-41-3108 | A | Yangon | Normal | Male | Home and lifestyle | 46.33 | 7 | 16.2155 | 340.5255 | 3/3/2019 | 13:23:00 | Credit card | 324.31 | 4.761905 | 16.2155 | 7.4
123-19-1176 | A | Yangon | Member | Male | Health and beauty | 58.22 | 8 | 23.2880 | 489.0480 | 1/27/2019 | 20:33:00 | Ewallet | 465.76 | 4.761905 | 23.2880 | 8.4
373-73-7910 | A | Yangon | Normal | Male | Sports and travel | 86.31 | 7 | 30.2085 | 634.3785 | 2/8/2019 | 10:37:00 | Ewallet | 604.17 | 4.761905 | 30.2085 | 5.3
699-14-3026 | C | Naypyitaw | Normal | Male | Electronic accessories | 85.39 | 7 | 29.8865 | 627.6165 | 3/25/2019 | 18:30:00 | Ewallet | 597.73 | 4.761905 | 29.8865 | 4.1

It looks like we have a dataset of 1000 observations on 17 variables. This is not a huge number of observations, but it is enough to demonstrate the modeling concepts.

Now that this is done, let’s look at the variable names and the kinds of variables we have in this dataset. For this, we can use the summary() function, which gives us a brief overview of each variable.

summary(retail)
##   Invoice ID           Branch              City           Customer type     
##  Length:1000        Length:1000        Length:1000        Length:1000       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##     Gender          Product line         Unit price       Quantity    
##  Length:1000        Length:1000        Min.   :10.08   Min.   : 1.00  
##  Class :character   Class :character   1st Qu.:32.88   1st Qu.: 3.00  
##  Mode  :character   Mode  :character   Median :55.23   Median : 5.00  
##                                        Mean   :55.67   Mean   : 5.51  
##                                        3rd Qu.:77.94   3rd Qu.: 8.00  
##                                        Max.   :99.96   Max.   :10.00  
##      Tax 5%            Total             Date               Time         
##  Min.   : 0.5085   Min.   :  10.68   Length:1000        Length:1000      
##  1st Qu.: 5.9249   1st Qu.: 124.42   Class :character   Class1:hms       
##  Median :12.0880   Median : 253.85   Mode  :character   Class2:difftime  
##  Mean   :15.3794   Mean   : 322.97                      Mode  :numeric   
##  3rd Qu.:22.4453   3rd Qu.: 471.35                                       
##  Max.   :49.6500   Max.   :1042.65                                       
##    Payment               cogs        gross margin percentage  gross income    
##  Length:1000        Min.   : 10.17   Min.   :4.762           Min.   : 0.5085  
##  Class :character   1st Qu.:118.50   1st Qu.:4.762           1st Qu.: 5.9249  
##  Mode  :character   Median :241.76   Median :4.762           Median :12.0880  
##                     Mean   :307.59   Mean   :4.762           Mean   :15.3794  
##                     3rd Qu.:448.90   3rd Qu.:4.762           3rd Qu.:22.4453  
##                     Max.   :993.00   Max.   :4.762           Max.   :49.6500  
##      Rating      
##  Min.   : 4.000  
##  1st Qu.: 5.500  
##  Median : 7.000  
##  Mean   : 6.973  
##  3rd Qu.: 8.500  
##  Max.   :10.000

Here, I’m just looking at general features. I’m making sure that each of the character variables has length 1000 (which is how many rows we have) and that the numerical summaries have values that make sense (for example, the Rating variable has a max of 10, which is logical). I’m also noting any adjustments I need to make to the data to make it easier to work with. Of course, to remain ethical analysts, we should never “edit” the data itself. Here, I’m talking about editing variable names, ensuring that each variable has the correct type (e.g., the Date variable should be encoded as a date), and checking for input errors.
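Before making any changes, a couple of quick sanity checks are worth running. This is a sketch using the raw column names shown above; if the data are clean, the uniqueness and consistency checks should both come back TRUE.

```r
# Count missing values in each column
colSums(is.na(retail))

# Every invoice ID should be unique (one row per transaction)
n_distinct(retail$`Invoice ID`) == nrow(retail)

# The stored Total should equal unit price x quantity, plus 5% tax
all.equal(retail$Total, retail$`Unit price` * retail$Quantity * 1.05)
```

These checks also double as documentation of what we believe the data mean; if any of them fail, that is a question to bring back to the data provider.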

After looking at the summary, I think I want to take the time to rename the variables. This isn’t always a necessary step, but since I will be using this data over the course of multiple guides I want to make sure I am comfortable with it. The code below:

  1. Renames a handful of the variables using my preferred style: snake_case.
  2. Removes several variables that were miscalculated. The City variable is also removed because the same information is conveyed by the store ID, which helps promote anonymity.
  3. Edits several variables:
     a. Coerces the Date variable into a date object.
     b. Calculates the subtotal of a purchase before tax by dividing the purchase total by 1.05 (to back out the 5% tax).
     c. Calculates the unit price of an item by dividing the subtotal by the quantity.

retail <- retail %>% 
  rename(id = `Invoice ID`,
         store = Branch,
         customer_type = `Customer type`,
         gender = Gender,
         product_class = `Product line`,
         qty = Quantity,
         date = Date,
         time = Time,
         pay_method = Payment,
         rating = Rating,
         total = Total) %>% 
  select(-c(`Tax 5%`, cogs, `gross margin percentage`, `gross income`, `Unit price`, City)) %>% 
  mutate(date = mdy(date),
         subtotal = total / 1.05,
         unit_price = subtotal / qty
         )

head(retail) %>% 
  kable() %>% 
  kable_styling(c("striped", "responsive")) %>% 
  scroll_box(width = "100%", box_css = "border: 2px solid black; padding: 5px; ")
id | store | customer_type | gender | product_class | qty | total | date | time | pay_method | rating | subtotal | unit_price
750-67-8428 | A | Member | Female | Health and beauty | 7 | 548.9715 | 2019-01-05 | 13:08:00 | Ewallet | 9.1 | 522.83 | 74.69
226-31-3081 | C | Normal | Female | Electronic accessories | 5 | 80.2200 | 2019-03-08 | 10:29:00 | Cash | 9.6 | 76.40 | 15.28
631-41-3108 | A | Normal | Male | Home and lifestyle | 7 | 340.5255 | 2019-03-03 | 13:23:00 | Credit card | 7.4 | 324.31 | 46.33
123-19-1176 | A | Member | Male | Health and beauty | 8 | 489.0480 | 2019-01-27 | 20:33:00 | Ewallet | 8.4 | 465.76 | 58.22
373-73-7910 | A | Normal | Male | Sports and travel | 7 | 634.3785 | 2019-02-08 | 10:37:00 | Ewallet | 5.3 | 604.17 | 86.31
699-14-3026 | C | Normal | Male | Electronic accessories | 7 | 627.6165 | 2019-03-25 | 18:30:00 | Ewallet | 4.1 | 597.73 | 85.39
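As a quick check on the subtotal calculation, reversing it should recover the original totals, and the recalculated unit prices should land in the same range the earlier summary() reported for Unit price (roughly 10.08 to 99.96). A sketch:

```r
# Subtotal plus 5% tax should reproduce the transaction total
all.equal(retail$total, retail$subtotal * 1.05)

# The recalculated unit price should match the range of the original Unit price
range(retail$unit_price)
```

If either check disagrees, the tax rate assumption (5%) baked into the cleaning step would need a second look.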

Basic Exploratory Data Analysis

Specific data explorations will occur within the respective guides. However, I do think it is important to look at some of the variables to anticipate any potential problems we might have. It also helps to get a general visual understanding of what we are working with here.

Let’s start by creating some basic one-variable plots! I’ll follow each plot with a few things that I notice and that I may want to pay attention to later.
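The plots below all end with a custom my_theme object that isn’t defined in this excerpt; it was presumably created in an earlier setup chunk. A minimal sketch of what such a theme could look like (the specific settings here are my assumptions, not the guide’s actual theme):

```r
# Hypothetical minimal theme -- the actual my_theme used in this guide may differ
my_theme <- theme_minimal() +
  theme(
    plot.title = element_text(face = "bold"),
    panel.grid.minor = element_blank()
  )
```

Defining the theme once and adding it to each plot keeps the styling consistent across all of the figures.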

retail %>% ggplot(aes(store, label = store)) +
  geom_bar(fill = "#099392") +
  geom_text(aes(label = after_stat(count)), stat = "count", vjust = 2) +
  labs(x = "Store ID",
       y = "Count",
       title = "Counts of Transactions by Store") +
  my_theme

retail %>% ggplot(aes(customer_type, label = customer_type)) +
  geom_bar(fill = "#099392") +
  geom_text(aes(label = after_stat(count)), stat = "count", vjust = 2) +
  labs(x = "Customer Type",
       y = "Count",
       title = "Counts of Each Customer Type") +
  my_theme

retail %>% ggplot(aes(gender, label = gender)) +
  geom_bar(fill = "#099392") +
  geom_text(aes(label = after_stat(count)), stat = "count", vjust = 2) +
  labs(x = "Gender",
       y = "Count",
       title = "Counts of Each Gender",
       subtitle = "Represented by This Dataset") +
  my_theme

retail %>% ggplot(aes(product_class, label = product_class)) +
  geom_bar(fill = "#099392") +
  geom_text(aes(label = after_stat(count)), stat = "count", vjust = 2) +
  labs(x = "Product Class",
       y = "Count",
       title = "Counts of Each Product Class") +
  scale_x_discrete(guide = guide_axis(n.dodge = 2)) +
  my_theme

retail %>% ggplot(aes(qty)) +
  geom_bar(fill = "#099392") +
  geom_text(aes(label = after_stat(count)), stat = "count", vjust = 2) +
  labs(x = "Number of Items",
       title = "Distribution of The Quantity of an Item Sold",
       subtitle = "In a Single Transaction") +
  my_theme

retail %>% ggplot(aes(total)) +
  geom_histogram(fill = "#099392", color = "black", binwidth = 50) +
  labs(x = "Transaction Total",
       y = "Count",
       title = "Distribution of Transaction Totals") +
  my_theme

retail %>% ggplot(aes(pay_method, label = pay_method)) +
  geom_bar(fill = "#099392") +
  geom_text(aes(label = after_stat(count)), stat = "count", vjust = 2) +
  labs(x = "Payment Method",
       y = "Count",
       title = "Counts of Each Payment Method") +
  my_theme

retail %>% ggplot(aes(rating)) +
  geom_histogram(fill = "#099392", color = "black", binwidth = 0.5) +
  labs(x = "Rating",
       y = "Count",
       title = "Distribution of Transaction Rating") +
  scale_x_continuous(breaks = seq(0, 10, 1)) +
  my_theme

retail %>% ggplot(aes(unit_price)) +
  geom_histogram(fill = "#099392", color = "black", binwidth = 5) +
  labs(x = "Price",
       y = "Count",
       title = "Distribution of Unit Price",
       subtitle = "The Price of a Single Item") +
  my_theme

The last step here will be to export our edited dataset so we can import it into other guides later on.

write_csv(retail, here::here("data/retail_clean.csv"))

Wrap Up

That concludes this preliminary guide on preparing our data! Over the course of the next guides, we will explore the data a little more and create some predictive models.

References

Paye, Aung. n.d. “Supermarket Sales.” Accessed April 18, 2024. https://www.kaggle.com/datasets/aungpyaeap/supermarket-sales.