Welcome

Hey there! Welcome to my Modeling Guide with R. Over the course of the next few guides, you’ll get to see a few modeling concepts in action. The dataset used here was downloaded from Kaggle (Paye n.d.). It is unclear whether this dataset was generated randomly or is based on real sales data; either way, it has plenty of variables for us to explore.

This Guide

The guide you are reading dives into a generalized exploratory data analysis (EDA) of the dataset. Before we can jump to modeling, we have to make sure we have a good understanding of what’s going on in the dataset. It’s not a good idea to jump right into inference; if we can’t explain what the variables are trying to say, we won’t be able to provide any insight. Moreover, we may need to do some data cleaning and we also may have questions to bring back to the data provider.

Exploring the Data

Preparing / Cleaning Data

Let’s begin by importing some packages and reading in the dataset. I rely heavily on the tidyverse family of packages and draw on other packages as needed. Here, we will use the lubridate package to work with dates and times, along with knitr and kableExtra to format tables.

library(tidyverse)
library(lubridate)
library(knitr)
library(kableExtra)

retail <- read_csv(here::here("data/retail_raw.csv"))
dim(retail)
## [1] 1000   17
head(retail) %>% 
  kable() %>% 
  kable_styling(c("striped", "responsive")) %>% 
  scroll_box(width = "100%", box_css = "border: 2px solid black; padding: 5px; ") 
Invoice ID | Branch | City | Customer type | Gender | Product line | Unit price | Quantity | Tax 5% | Total | Date | Time | Payment | cogs | gross margin percentage | gross income | Rating
750-67-8428 | A | Yangon | Member | Female | Health and beauty | 74.69 | 7 | 26.1415 | 548.9715 | 1/5/2019 | 13:08:00 | Ewallet | 522.83 | 4.761905 | 26.1415 | 9.1
226-31-3081 | C | Naypyitaw | Normal | Female | Electronic accessories | 15.28 | 5 | 3.8200 | 80.2200 | 3/8/2019 | 10:29:00 | Cash | 76.40 | 4.761905 | 3.8200 | 9.6
631-41-3108 | A | Yangon | Normal | Male | Home and lifestyle | 46.33 | 7 | 16.2155 | 340.5255 | 3/3/2019 | 13:23:00 | Credit card | 324.31 | 4.761905 | 16.2155 | 7.4
123-19-1176 | A | Yangon | Member | Male | Health and beauty | 58.22 | 8 | 23.2880 | 489.0480 | 1/27/2019 | 20:33:00 | Ewallet | 465.76 | 4.761905 | 23.2880 | 8.4
373-73-7910 | A | Yangon | Normal | Male | Sports and travel | 86.31 | 7 | 30.2085 | 634.3785 | 2/8/2019 | 10:37:00 | Ewallet | 604.17 | 4.761905 | 30.2085 | 5.3
699-14-3026 | C | Naypyitaw | Normal | Male | Electronic accessories | 85.39 | 7 | 29.8865 | 627.6165 | 3/25/2019 | 18:30:00 | Ewallet | 597.73 | 4.761905 | 29.8865 | 4.1

It looks like we have a dataset of 1000 observations on 17 variables. This is not a huge number of observations, but it is enough to demonstrate the modeling concepts.

Now that this is done, let’s look at the variable names and the kinds of variables we have in this dataset. For this, we can use the summary() function, which gives us a brief overview of each variable.

summary(retail)
##   Invoice ID           Branch              City           Customer type     
##  Length:1000        Length:1000        Length:1000        Length:1000       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##     Gender          Product line         Unit price       Quantity    
##  Length:1000        Length:1000        Min.   :10.08   Min.   : 1.00  
##  Class :character   Class :character   1st Qu.:32.88   1st Qu.: 3.00  
##  Mode  :character   Mode  :character   Median :55.23   Median : 5.00  
##                                        Mean   :55.67   Mean   : 5.51  
##                                        3rd Qu.:77.94   3rd Qu.: 8.00  
##                                        Max.   :99.96   Max.   :10.00  
##      Tax 5%            Total             Date               Time         
##  Min.   : 0.5085   Min.   :  10.68   Length:1000        Length:1000      
##  1st Qu.: 5.9249   1st Qu.: 124.42   Class :character   Class1:hms       
##  Median :12.0880   Median : 253.85   Mode  :character   Class2:difftime  
##  Mean   :15.3794   Mean   : 322.97                      Mode  :numeric   
##  3rd Qu.:22.4453   3rd Qu.: 471.35                                       
##  Max.   :49.6500   Max.   :1042.65                                       
##    Payment               cogs        gross margin percentage  gross income    
##  Length:1000        Min.   : 10.17   Min.   :4.762           Min.   : 0.5085  
##  Class :character   1st Qu.:118.50   1st Qu.:4.762           1st Qu.: 5.9249  
##  Mode  :character   Median :241.76   Median :4.762           Median :12.0880  
##                     Mean   :307.59   Mean   :4.762           Mean   :15.3794  
##                     3rd Qu.:448.90   3rd Qu.:4.762           3rd Qu.:22.4453  
##                     Max.   :993.00   Max.   :4.762           Max.   :49.6500  
##      Rating      
##  Min.   : 4.000  
##  1st Qu.: 5.500  
##  Median : 7.000  
##  Mean   : 6.973  
##  3rd Qu.: 8.500  
##  Max.   :10.000

Here, I’m just looking at general features. I’m making sure that each of the character variables has length 1000 (which is how many rows we have) and that the numerical summaries have values that make sense (for example, the Rating variable has a max of 10, which is logical). I’m also noting any adjustments I need to make to the data to make it easier to work with. Of course, to remain ethical analysts, we should never “edit” the data itself. Here, I’m talking about editing variable names, ensuring that each variable has the correct type (e.g., the Date variable should be encoded as a date), and checking for input errors.
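Before making any changes, a couple of quick sanity checks are worth running. This is a sketch using the raw column names shown above; if the data are clean, the uniqueness and consistency checks should both come back TRUE.

```r
# Count missing values in each column
colSums(is.na(retail))

# Every invoice ID should be unique (one row per transaction)
n_distinct(retail$`Invoice ID`) == nrow(retail)

# The stored Total should equal unit price x quantity, plus 5% tax
all.equal(retail$Total, retail$`Unit price` * retail$Quantity * 1.05)
```

These checks also double as documentation of what we believe the data mean; if any of them fail, that is a question to bring back to the data provider.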

After looking at the summary, I think I want to take the time to rename the variables. This isn’t always a necessary step, but since I will be using this data over the course of multiple guides I want to make sure I am comfortable with it. The code below:

  1. Renames a handful of the variables using my preferred style: snake_case.
  2. Removes several variables that were miscalculated. The City variable is also removed because the same information is conveyed by the store ID, which helps promote anonymity.
  3. Edits several variables:
     a. Coerces the Date variable into a date object.
     b. Calculates the subtotal of a purchase before tax by dividing the purchase total by 1.05 (to back out the 5% tax).
     c. Calculates the unit price of an item by dividing the subtotal by the quantity.

retail <- retail %>% 
  rename(id = `Invoice ID`,
         store = Branch,
         customer_type = `Customer type`,
         gender = Gender,
         product_class = `Product line`,
         qty = Quantity,
         date = Date,
         time = Time,
         pay_method = Payment,
         rating = Rating,
         total = Total) %>% 
  select(-c(`Tax 5%`, cogs, `gross margin percentage`, `gross income`, `Unit price`, City)) %>% 
  mutate(date = mdy(date),
         subtotal = total / 1.05,
         unit_price = subtotal / qty
         )

head(retail) %>% 
  kable() %>% 
  kable_styling(c("striped", "responsive")) %>% 
  scroll_box(width = "100%", box_css = "border: 2px solid black; padding: 5px; ")
id | store | customer_type | gender | product_class | qty | total | date | time | pay_method | rating | subtotal | unit_price
750-67-8428 | A | Member | Female | Health and beauty | 7 | 548.9715 | 2019-01-05 | 13:08:00 | Ewallet | 9.1 | 522.83 | 74.69
226-31-3081 | C | Normal | Female | Electronic accessories | 5 | 80.2200 | 2019-03-08 | 10:29:00 | Cash | 9.6 | 76.40 | 15.28
631-41-3108 | A | Normal | Male | Home and lifestyle | 7 | 340.5255 | 2019-03-03 | 13:23:00 | Credit card | 7.4 | 324.31 | 46.33
123-19-1176 | A | Member | Male | Health and beauty | 8 | 489.0480 | 2019-01-27 | 20:33:00 | Ewallet | 8.4 | 465.76 | 58.22
373-73-7910 | A | Normal | Male | Sports and travel | 7 | 634.3785 | 2019-02-08 | 10:37:00 | Ewallet | 5.3 | 604.17 | 86.31
699-14-3026 | C | Normal | Male | Electronic accessories | 7 | 627.6165 | 2019-03-25 | 18:30:00 | Ewallet | 4.1 | 597.73 | 85.39
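As a quick check on the subtotal calculation, reversing it should recover the original totals, and the recalculated unit prices should land in the same range the earlier summary() reported for Unit price (roughly 10.08 to 99.96). A sketch:

```r
# Subtotal plus 5% tax should reproduce the transaction total
all.equal(retail$total, retail$subtotal * 1.05)

# The recalculated unit price should match the range of the original Unit price
range(retail$unit_price)
```

If either check disagrees, the tax rate assumption (5%) baked into the cleaning step would need a second look.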

Basic Exploratory Data Analysis

Specific data explorations will occur within the respective guides. However, I do think it is important to look at some of the variables to anticipate any potential problems we might have. It also helps to get a general visual understanding of what we are working with here.

Let’s start by creating some basic one-variable plots! I’ll follow each plot with a few things that I notice and that I may want to pay attention to later.
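The plots below all end with a custom my_theme object that isn’t defined in this excerpt; it was presumably created in an earlier setup chunk. A minimal sketch of what such a theme could look like (the specific settings here are my assumptions, not the guide’s actual theme):

```r
# Hypothetical minimal theme -- the actual my_theme used in this guide may differ
my_theme <- theme_minimal() +
  theme(
    plot.title = element_text(face = "bold"),
    panel.grid.minor = element_blank()
  )
```

Defining the theme once and adding it to each plot keeps the styling consistent across all of the figures.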

retail %>% ggplot(aes(store, label = store)) +
  geom_bar(fill = "#099392") +
  geom_text(aes(label = after_stat(count)), stat = "count", vjust = 2) +
  labs(x = "Store ID",
       y = "Count",
       title = "Counts of Transactions by Store") +
  my_theme

retail %>% ggplot(aes(customer_type, label = customer_type)) +
  geom_bar(fill = "#099392") +
  geom_text(aes(label = after_stat(count)), stat = "count", vjust = 2) +
  labs(x = "Customer Type",
       y = "Count",
       title = "Counts of Each Customer Type") +
  my_theme

retail %>% ggplot(aes(gender, label = gender)) +
  geom_bar(fill = "#099392") +
  geom_text(aes(label = after_stat(count)), stat = "count", vjust = 2) +
  labs(x = "Gender",
       y = "Count",
       title = "Counts of Each Gender",
       subtitle = "Represented by This Dataset") +
  my_theme

retail %>% ggplot(aes(product_class, label = product_class)) +
  geom_bar(fill = "#099392") +
  geom_text(aes(label = after_stat(count)), stat = "count", vjust = 2) +
  labs(x = "Product Class",
       y = "Count",
       title = "Counts of Each Product Class") +
  scale_x_discrete(guide = guide_axis(n.dodge = 2)) +
  my_theme

retail %>% ggplot(aes(qty)) +
  geom_bar(fill = "#099392") +
  geom_text(aes(label = after_stat(count)), stat = "count", vjust = 2) +
  labs(x = "Number of Items",
       title = "Distribution of The Quantity of an Item Sold",
       subtitle = "In a Single Transaction") +
  my_theme

retail %>% ggplot(aes(total)) +
  geom_histogram(fill = "#099392", color = "black", binwidth = 50) +
  labs(x = "Transaction Total",
       y = "Count",
       title = "Distribution of Transaction Totals") +
  my_theme

retail %>% ggplot(aes(pay_method, label = pay_method)) +
  geom_bar(fill = "#099392") +
  geom_text(aes(label = after_stat(count)), stat = "count", vjust = 2) +
  labs(x = "Payment Method",
       y = "Count",
       title = "Counts of Each Payment Method") +
  my_theme

retail %>% ggplot(aes(rating)) +
  geom_histogram(fill = "#099392", color = "black", binwidth = 0.5) +
  labs(x = "Rating",
       y = "Count",
       title = "Distribution of Transaction Rating") +
  scale_x_continuous(breaks = seq(0, 10, 1)) +
  my_theme

retail %>% ggplot(aes(unit_price)) +
  geom_histogram(fill = "#099392", color = "black", binwidth = 5) +
  labs(x = "Price",
       y = "Count",
       title = "Distribution of Unit Price",
       subtitle = "The Price of a Single Item") +
  my_theme

The last step here will be to export our edited dataset so we can import it into other guides later on.

write_csv(retail, here::here("data/retail_clean.csv"))

Wrap Up

That concludes this preliminary guide on preparing our data! Over the course of the next guides, we will explore the data a little more and create some predictive models.

References

Paye, Aung. n.d. “Supermarket Sales.” Accessed April 18, 2024. https://www.kaggle.com/datasets/aungpyaeap/supermarket-sales.