Introduction

The objective of this analysis is to perform cluster analysis (e.g., k-means and/or hierarchical clustering) on a data set containing transactions that occurred between 1-Dec-2010 and 30-Nov-2011 for a UK-based online retail store. The company mainly sells unique all-occasion gifts, and many of its customers are wholesalers. Specifically, the aim is to identify whether the company's customers can be segmented meaningfully by the recency and duration of their interaction with the company, the frequency of their purchases/orders, and the amount they spent (amounts are in pounds sterling). Further segmentation by country would also be very helpful for marketing purposes.

Data

The data set we will be analyzing is called "Online Retail". It is a transactional data set which contains all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retailer. The company mainly sells unique all-occasion gifts, and many of its customers are wholesalers. There are 541,909 observations with 8 variables. The data is sourced from Dr Daqing Chen, Director, Public Analytics group (chend '@' lsbu.ac.uk), School of Engineering, London South Bank University, London SE1 0AA, UK.

Attribute Information:

InvoiceNo: Invoice number. Nominal, a 6-digit integral number uniquely assigned to each transaction. If this code starts with letter ‘c’, it indicates a cancellation.
StockCode: Product (item) code. Nominal, a 5-digit integral number uniquely assigned to each distinct product.
Description: Product (item) name. Nominal.
Quantity: The quantities of each product (item) per transaction. Numeric.
InvoiceDate: Invoice date and time. Numeric, the day and time when each transaction was generated.
UnitPrice: Unit price. Numeric, product price per unit in sterling.
CustomerID: Customer number. Nominal, a 5-digit integral number uniquely assigned to each customer.
Country: Country name. Nominal, the name of the country where each customer resides.

Data Preparation

Before we start the analysis, we first need to clean the data and get it ready.
Import the library and data:

library(readxl)
setwd("C:/Users/crzys/Documents")
eretail = read_excel("Online Retail.xlsx")
dim(eretail)
## [1] 541909      8
names(eretail)
## [1] "InvoiceNo"   "StockCode"   "Description" "Quantity"    "InvoiceDate"
## [6] "UnitPrice"   "CustomerID"  "Country"

Clean up data:

eretail = eretail[eretail$Country != "Unspecified",] # remove 'unspecified' country
eretail = eretail[eretail$Quantity > 0,]             # remove returns/cancellations

IDtab = table(eretail$Country, eretail$CustomerID)   # crosstab country by customer ID
IDtab = apply(IDtab >0, 2, sum)                      # is any customer ID duplicated across countries?
duplicateIDs = names(IDtab[IDtab > 1])               # duplicate IDs to clean up
eretail = eretail[!is.element(eretail$CustomerID, duplicateIDs),]
rm(IDtab)
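
As a quick sanity check (a sketch, not part of the original run), cancellations can also be recognized by the leading letter 'C' in InvoiceNo, per the attribute list; after the Quantity filter above, none should remain:

table(grepl("^[Cc]", eretail$InvoiceNo))  # cancellations start with 'C'; we expect all FALSE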

eretail$InvoiceMth = substr(eretail$InvoiceDate, 1, 7)         # extract month of invoice
eretail = eretail[as.vector(eretail$InvoiceMth) != "2011-12",] # remove December 2011 as it only covers first week

eretail$Amount = eretail$Quantity * eretail$UnitPrice           # compute amount per invoice item

eaggr = aggregate(Amount~Country+CustomerID, data=eretail, sum) # compute aggregate amount spent per customer
row.names(eaggr) = eaggr$CustomerID                             # use CustomerID as row names
eaggr = eaggr[,-2]                                              # drop CustomerID column (now in row names)
eaggr = cbind(eaggr, aggregate(InvoiceMth~CustomerID, data=eretail, min)[,-1]) # 1st month of customer interaction
names(eaggr)[3] = "FirstMth"
eaggr = cbind(eaggr, aggregate(InvoiceMth~CustomerID, data=eretail, max)[,-1]) # last month of cust. interaction
names(eaggr)[4] = "LastMth"

Re-label months and compute duration of customer interaction:

eaggr$FirstMth = factor(eaggr$FirstMth)  # ensure factors (needed in R >= 4.0, where strings are no longer auto-converted)
eaggr$LastMth = factor(eaggr$LastMth)
levels(eaggr$FirstMth) = 1:12            # "2010-12" becomes 1, "2011-01" becomes 2, ..., "2011-11" becomes 12
levels(eaggr$LastMth) = 1:12
eaggr$FirstMth = as.numeric(as.vector(eaggr$FirstMth))
eaggr$LastMth = as.numeric(as.vector(eaggr$LastMth))
eaggr$Months = eaggr$LastMth - eaggr$FirstMth + 1

eaggr = cbind(eaggr, apply(table(eretail$CustomerID, eretail$InvoiceMth), 1, sum)) # line items per customer
names(eaggr)[6] = "Purchases"
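
Note that Purchases, as computed above, counts invoice line items (one row per product per invoice), not distinct orders. If the number of distinct orders were wanted instead, one could count unique invoice numbers per customer (a sketch, not used below):

orders = tapply(eretail$InvoiceNo, eretail$CustomerID, function(x) length(unique(x)))  # distinct invoices per customer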

Useful statistics (which we may or may not decide to use):

eaggr$Amount.per.Purchase = eaggr$Amount / eaggr$Purchases
eaggr$Purchases.per.Month = eaggr$Purchases / eaggr$Months
eaggr$Amount.per.Month = eaggr$Amount / eaggr$Months

eaggr[1:30,]
##              Country   Amount FirstMth LastMth Months Purchases
## 12346 United Kingdom 77183.60        2       2      1         1
## 12347        Iceland  4085.18        1      11     11       171
## 12348        Finland  1797.24        1      10     10        31
## 12349          Italy  1757.55       12      12      1        73
## 12350         Norway   334.40        3       3      1        17
## 12352         Norway  2506.04        3      12     10        85
## 12353        Bahrain    89.00        6       6      1         4
## 12354          Spain  1079.40        5       5      1        58
## 12355        Bahrain   459.40        6       6      1        13
## 12356       Portugal  2811.43        2      12     11        59
## 12357    Switzerland  6207.67       12      12      1       131
## 12358        Austria   484.86        8       8      1        12
## 12359         Cyprus  6372.58        2      11     10       248
## 12360        Austria  2662.06        6      11      6       129
## 12361        Belgium   189.90        3       3      1        10
## 12362        Belgium  4697.19        3      12     10       236
## 12364        Belgium  1002.78        9      11      3        58
## 12365         Cyprus   641.38        3       3      1        22
## 12371    Switzerland  1887.96       11      11      1        63
## 12372        Denmark  1298.04        3      10      8        52
## 12373        Austria   364.60        3       3      1        14
## 12374        Austria   742.93       12      12      1        33
## 12375        Finland   457.50       10      12      3        17
## 12377    Switzerland  1628.12        1       2      2        77
## 12378    Switzerland  4008.62        9       9      1       219
## 12379        Belgium   852.24        7      10      4        40
## 12380        Belgium  2724.81        7      12      6       104
## 12381         Norway  1698.30        9      12      4        80
## 12383        Belgium  1850.56        1       7      7        99
## 12384    Switzerland   585.27        9      12      4        27
##       Amount.per.Purchase Purchases.per.Month Amount.per.Month
## 12346         77183.60000            1.000000       77183.6000
## 12347            23.88994           15.545455         371.3800
## 12348            57.97548            3.100000         179.7240
## 12349            24.07603           73.000000        1757.5500
## 12350            19.67059           17.000000         334.4000
## 12352            29.48282            8.500000         250.6040
## 12353            22.25000            4.000000          89.0000
## 12354            18.61034           58.000000        1079.4000
## 12355            35.33846           13.000000         459.4000
## 12356            47.65136            5.363636         255.5845
## 12357            47.38679          131.000000        6207.6700
## 12358            40.40500           12.000000         484.8600
## 12359            25.69589           24.800000         637.2580
## 12360            20.63612           21.500000         443.6767
## 12361            18.99000           10.000000         189.9000
## 12362            19.90335           23.600000         469.7190
## 12364            17.28931           19.333333         334.2600
## 12365            29.15364           22.000000         641.3800
## 12371            29.96762           63.000000        1887.9600
## 12372            24.96231            6.500000         162.2550
## 12373            26.04286           14.000000         364.6000
## 12374            22.51303           33.000000         742.9300
## 12375            26.91176            5.666667         152.5000
## 12377            21.14442           38.500000         814.0600
## 12378            18.30420          219.000000        4008.6200
## 12379            21.30600           10.000000         213.0600
## 12380            26.20010           17.333333         454.1350
## 12381            21.22875           20.000000         424.5750
## 12383            18.69253           14.142857         264.3657
## 12384            21.67667            6.750000         146.3175

Exploratory Data Analysis

Let us take a deeper dive into our data set.

summary(eaggr)
##    Country              Amount            FirstMth        LastMth     
##  Length:4286        Min.   :     0.0   Min.   : 1.00   Min.   : 1.00  
##  Class :character   1st Qu.:   303.9   1st Qu.: 2.00   1st Qu.: 8.00  
##  Mode  :character   Median :   656.6   Median : 5.00   Median :11.00  
##                     Mean   :  1952.1   Mean   : 5.45   Mean   : 9.47  
##                     3rd Qu.:  1593.3   3rd Qu.: 9.00   3rd Qu.:12.00  
##                     Max.   :268478.0   Max.   :12.00   Max.   :12.00  
##      Months        Purchases       Amount.per.Purchase Purchases.per.Month
##  Min.   : 1.00   Min.   :   1.00   Min.   :    0.00    Min.   :   0.1818  
##  1st Qu.: 1.00   1st Qu.:  17.00   1st Qu.:   12.34    1st Qu.:   6.6667  
##  Median : 4.00   Median :  40.00   Median :   17.69    Median :  13.0000  
##  Mean   : 5.02   Mean   :  88.54   Mean   :   54.56    Mean   :  20.4697  
##  3rd Qu.: 9.00   3rd Qu.:  97.00   3rd Qu.:   24.82    3rd Qu.:  24.0000  
##  Max.   :12.00   Max.   :7376.00   Max.   :77183.60    Max.   :1145.5000  
##  Amount.per.Month 
##  Min.   :    0.0  
##  1st Qu.:  122.0  
##  Median :  221.5  
##  Mean   :  400.3  
##  3rd Qu.:  388.4  
##  Max.   :77183.6
sum(is.na(eaggr))
## [1] 0

We see that there are no NA's. There are, however, some extreme values: the maximum Amount (268,478) is far above the mean (1,952), so a few very large (likely wholesale) customers dominate the upper tail.
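
A quick way to gauge that skew (a sketch):

quantile(eaggr$Amount, c(0.5, 0.9, 0.99, 1))  # the top percentile sits far above the median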

What is the structure of our data?

str(eaggr)
## 'data.frame':    4286 obs. of  9 variables:
##  $ Country            : chr  "United Kingdom" "Iceland" "Finland" "Italy" ...
##  $ Amount             : num  77184 4085 1797 1758 334 ...
##  $ FirstMth           : num  2 1 1 12 3 3 6 5 6 2 ...
##  $ LastMth            : num  2 11 10 12 3 12 6 5 6 12 ...
##  $ Months             : num  1 11 10 1 1 10 1 1 1 11 ...
##  $ Purchases          : int  1 171 31 73 17 85 4 58 13 59 ...
##  $ Amount.per.Purchase: num  77183.6 23.9 58 24.1 19.7 ...
##  $ Purchases.per.Month: num  1 15.5 3.1 73 17 ...
##  $ Amount.per.Month   : num  77184 371 180 1758 334 ...

There are 4286 observations with 9 variables.

Analysis

Let us begin clustering our data. We first want to cluster the customers by recency (the most recent month in which they made a purchase) and by the duration of their interaction with the company.
First we set a seed so that our results are reproducible.

set.seed(1)

We remove the Country variable, as it is not numeric and k-means requires numeric input.

eaggr1 = eaggr[,2:9]

Since there are 12 possible values for the recency variable (eaggr1$LastMth) and we want to segment the customers, we will let the number of clusters be 12.
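
We can confirm the number of distinct months quickly (a sketch):

length(unique(eaggr1$LastMth))  # should be 12 (December 2010 through November 2011)

We will first perform k-means clustering to see how well it groups our data: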

km.out = kmeans(eaggr1, centers = 12, iter.max = 100)  # note: the third argument of kmeans is iter.max

Let us look at the within-cluster sums of squares:

km.out$withinss
##  [1]  188351859          0  268466669 1613198483  207055871  185479600
##  [7] 3075186164  249602666  327124157  431530577 1491863833  128761589
km.out$tot.withinss
## [1] 8166621469

We see that the within-cluster sums of squares are very large. There should be a more effective way to perform customer segmentation.
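
One way to put these numbers in context (a sketch) is the proportion of total variance explained by the clustering, which the kmeans object reports through its betweenss and totss components:

km.out$betweenss / km.out$totss  # closer to 1 means the clusters account for more of the variance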

Here are the average variable values (the centers) for each cluster and the distribution of observations across clusters:

km.out$centers
##         Amount FirstMth   LastMth    Months  Purchases Amount.per.Purchase
## 1    6393.2455 2.027211 11.591837 10.564626  308.43537            60.50109
## 2   77183.6000 2.000000  2.000000  1.000000    1.00000         77183.60000
## 3    3403.5227 3.069307 11.329208  9.259901  204.50248            47.22644
## 4   36739.1270 1.700000 10.800000 10.100000 1294.70000          1913.22265
## 5  126380.1000 1.333333 12.000000 11.666667 2491.33333            94.61492
## 6   11473.2995 1.678571 11.785714 11.107143  368.42857            54.98193
## 7  236546.0500 1.000000 12.000000 12.000000  911.33333           441.39913
## 8     407.6194 6.461538  8.576923  3.115385   32.95551            24.73211
## 9    1554.9430 4.589212 10.577801  6.988589  100.00830            34.37656
## 10  20616.3256 1.520000 11.920000 11.400000  363.12000           121.75286
## 11  59861.4217 1.750000 12.000000 11.250000 1111.00000           199.62687
## 12   6908.1956 9.111111  9.111111  1.000000  169.11111            78.80309
##    Purchases.per.Month Amount.per.Month
## 1             29.11683         637.1908
## 2              1.00000       77183.6000
## 3             25.35381         458.6645
## 4            108.33333        7462.8447
## 5            209.41919       10847.1157
## 6             33.08742        1071.4156
## 7             75.94444       19712.1708
## 8             17.34540         223.2173
## 9             19.95807         377.6030
## 10            31.04715        1818.3964
## 11           156.59407        5801.2836
## 12           169.11111        6908.1956
table(km.out$cluster)
## 
##    1    2    3    4    5    6    7    8    9   10   11   12 
##  147    1  404   10    3   56    3 2652  964   25   12    9

From the centers, cluster 2 contains the single observation with the highest amount per purchase. Cluster 12's nine customers all interacted for only one month, and cluster 7 contains the customers with the largest total amounts.
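
One programmatic way to check such statements (a sketch):

apply(km.out$centers, 2, which.max)  # for each variable, the cluster with the largest center
apply(km.out$centers, 2, which.min)  # for each variable, the cluster with the smallest center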

We can try K from 2 through 20 and plot the average total within-cluster sum of squares for each. We look for an "elbow" point: the K where the curve bends and further increases stop producing meaningful reductions in withinss.

rng<-2:20 #K from 2 to 20
tries <-100 #Run the K Means algorithm 100 times
avg.totw.ss <-integer(length(rng)) #Set up an empty vector to hold all of points
for(v in rng){ # For each value of the range variable
 v.totw.ss <-integer(tries) #Set up an empty vector to hold the 100 tries
 for(i in 1:tries){
 km.temp <-kmeans(eaggr1,centers=v) #Run kmeans
 v.totw.ss[i] <-km.temp$tot.withinss#Store the total withinss
 }
 avg.totw.ss[v-1] <-mean(v.totw.ss) #Average the 100 total withinss
}
plot(rng,avg.totw.ss,type="b", main="Total Within SS by Various K",
 ylab="Average Total Within Sum of Squares",
 xlab="Value of K")

The plot does not show a very strong elbow, but beyond K = 6 there is no substantial further reduction in the total within-cluster sum of squares, so we choose K = 6.

Now let us use k = 6:

km.out = kmeans(eaggr1, centers = 6, iter.max = 100)
km.out$centers
##        Amount FirstMth   LastMth    Months  Purchases Amount.per.Purchase
## 1 236546.0500 1.000000 12.000000 12.000000  911.33333           441.39913
## 2 103505.3000 1.333333 10.333333 10.000000 1434.50000         12982.71401
## 3  14687.7307 1.842105 11.723684 10.881579  361.21053            80.20588
## 4    715.9431 5.957194  9.114057  4.156863   51.13311            27.28889
## 5  46051.0215 1.800000 11.400000 10.600000 1257.35000          1054.94549
## 6   4296.7698 2.858929 11.376786  9.517857  232.84286            51.11608
##   Purchases.per.Month Amount.per.Month
## 1            75.94444       19712.1708
## 2           120.59848       20575.1009
## 3            36.01783        1611.1091
## 4            18.05070         264.3382
## 5           143.40644        6525.9096
## 6            28.24077         576.7304
km.out$tot.withinss
## [1] 24346249058
table(km.out$cluster)
## 
##    1    2    3    4    5    6 
##    3    6   76 3621   20  560

The total within-cluster sum of squares with 6 clusters is larger than with 12, but we are using half as many clusters. From the centers, cluster 1 contains the customers with the largest total amounts, cluster 2 those with the highest amount per purchase, and cluster 4 those with the shortest interactions (lowest average Months).

Now let us try scaling our data prior to running k-means. Scaling will give each variable mean 0 and standard deviation 1, so that no single variable dominates the distance calculations.
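
We can verify the effect of scale() directly (a sketch, output omitted):

apply(scale(eaggr1), 2, mean)  # effectively 0 for every column (up to floating-point error)
apply(scale(eaggr1), 2, sd)    # exactly 1 for every column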

eaggr1.scale = scale(eaggr1)          # standardize: mean 0, sd 1 for every column
rng <- 2:20                           # K from 2 to 20
tries <- 100                          # run the k-means algorithm 100 times for each K
avg.totw.ss <- integer(length(rng))   # empty vector to hold the averages
for (v in rng) {                      # for each value of K
  v.totw.ss <- integer(tries)         # empty vector to hold the 100 tries
  for (i in 1:tries) {
    km.temp <- kmeans(eaggr1.scale, centers = v)  # run k-means on the scaled data
    v.totw.ss[i] <- km.temp$tot.withinss          # store the total withinss
  }
  avg.totw.ss[v - 1] <- mean(v.totw.ss)           # average the 100 total withinss
}
plot(rng, avg.totw.ss, type = "b", main = "Total Within SS by Various K",
     ylab = "Average Total Within Sum of Squares",
     xlab = "Value of K")

There is no clearly defined elbow in this plot either, but looking closely, for any K beyond 11-12 there is no significant drop-off in the total within-cluster sum of squares. Let us use K = 12.

km.out = kmeans(eaggr1.scale, centers = 12, iter.max = 100)
km.out$centers
##         Amount   FirstMth    LastMth     Months    Purchases
## 1  -0.05963992  1.4178847  0.5929924 -0.8329515  0.387615745
## 2  -0.06612478 -0.1342608  0.4697616  0.4821493 -0.066217649
## 3   8.99243989 -0.9158560 -2.3486278 -0.9692142 -0.403292799
## 4   0.14274356 -1.0079439  0.6313987  1.3995384  0.330989137
## 5  -0.11165606 -0.5342571 -1.7748615 -0.8757956 -0.005945682
## 6   2.77361168 -1.0945311  0.7348370  1.5574912  3.165832267
## 7  -0.16665038  1.4085976  0.5705494 -0.8417269 -0.259812066
## 8  -0.17705746  0.4733662 -0.5451149 -0.8478931 -0.301044977
## 9  -0.16531258 -0.5150404 -1.3156239 -0.5410982 -0.264183972
## 10 19.52598348 -0.8716126  0.5333034  1.2005071  3.318333234
## 11  7.66043325 -0.6503959  0.7952971  1.2005071 24.476395326
## 12 -0.19494157 -0.9575838 -2.3559882 -0.9369627 -0.330645165
##    Amount.per.Purchase Purchases.per.Month Amount.per.Month
## 1         -0.038142852          2.89591901       0.44945949
## 2         -0.018656493         -0.29996888      -0.12573919
## 3         64.212428514         -0.60939416      49.15644370
## 4         -0.020929313         -0.17706331      -0.07156315
## 5         -0.034608379          1.36733513       0.23860882
## 6          0.095548365          1.46150668       1.19824814
## 7         -0.021069296          0.03981242      -0.01190738
## 8         -0.009376149         -0.12519763      -0.05094411
## 9         -0.018702535         -0.25056068      -0.10323324
## 10         2.020366081          1.51162426      12.56799817
## 11        -0.034886173         19.42361882       4.76683902
## 12        -0.013319306         -0.17481201      -0.07088783
km.out$tot.withinss
## [1] 5676.597
table(km.out$cluster)
## 
##    1    2    3    4    5    6    7    8    9   10   11   12 
##  115  790    1 1032   80   52  989  463  455    6    4  299

The total within-cluster sum of squares is dramatically smaller than with the unscaled data, though the two values are not directly comparable, since the scaled variables live on much smaller scales. From the centers, cluster 3 contains the single observation with the highest amount per purchase, cluster 10 the observations with the highest total amounts, and cluster 11 those with the largest purchases per month.
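
As a possible next step toward the country-level segmentation mentioned in the introduction, the clusters could be profiled by country (a sketch; the output is a large table, so it is omitted here):

table(eaggr$Country, km.out$cluster)  # how the clusters distribute across countries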

Hierarchical Clustering

Let us explore other methods to see whether we can achieve a better result. We will explore hierarchical clustering, which lets us see the number of clusters at various heights on a dendrogram. As with k-means, we will use the scaled data prepared previously.
There are different types of linkages we can use for hierarchical clustering:
Complete: Maximal intercluster dissimilarity. Compute all pairwise dissimilarities between the observations in cluster A and the observations in cluster B, and record the largest of these dissimilarities.
Single: Minimal intercluster dissimilarity. Compute all pairwise dissimilarities between the observations in cluster A and the observations in cluster B, and record the smallest of these dissimilarities. Single linkage can result in extended, trailing clusters in which single observations are fused one at a time.
Average: Mean intercluster dissimilarity. Compute all pairwise dissimilarities between the observations in cluster A and the observations in cluster B, and record the average of these dissimilarities.
Centroid: Dissimilarity between the centroid for cluster A (a mean vector of length p) and the centroid for cluster B. We will not use this linkage, as centroid linkage can result in undesirable inversions.

In addition, we can compare different dissimilarity measures. Euclidean distance is one; another is correlation-based distance, which considers two observations to be similar if their features are highly correlated, even when the observed values are far apart in terms of Euclidean distance.
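
A tiny illustration of the difference (a sketch with made-up numbers):

x <- c(1, 2, 3)
y <- c(10, 20, 30)    # perfectly correlated with x, but far from it
1 - cor(x, y)         # correlation-based distance: 0
sqrt(sum((x - y)^2))  # Euclidean distance: about 33.7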

We will examine various combinations of these factors.

Let us compare the different types of linkages with Euclidean distance as the dissimilarity measure, omitting centroid linkage as noted above.

hc.complete = hclust(dist(eaggr1.scale), method = "complete")
hc.average = hclust(dist(eaggr1.scale), method = "average")
hc.single = hclust(dist(eaggr1.scale), method = "single")

Let us plot the dendrograms for each method:

plot(hc.complete,main="Complete Linkage", xlab="", sub="", cex=.9)

plot(hc.average, main="Average Linkage", xlab="", sub="", cex=.9)

plot(hc.single, main="Single Linkage", xlab="", sub="", cex=.9)

Cutting at a height of 40 yields 2 clusters under complete linkage but only 1 under average and single linkage. Single linkage in particular tends to yield extended clusters in which single leaves are fused one by one. One benefit of hierarchical clustering over k-means is that the dendrogram lets us visualize the clusterings for every number of clusters at once.

We can see which cluster each observation falls into, for any chosen number of clusters, using the cutree function. We do not print the results here, as they would list an assignment for each of the 4,286 observations; a more compact summary by height is sketched after the commented calls below.

#cutree(hc.complete, 12)
#cutree(hc.average, 12)
#cutree(hc.single, 12)
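
Alternatively (a sketch), cutree accepts a height h instead of a cluster count, and tabulating the result gives cluster sizes rather than per-observation assignments; this is one way to check the height-40 observation above:

table(cutree(hc.complete, h = 40))  # cluster sizes when cutting the complete-linkage tree at height 40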

Now let us see how the dendrograms look when we use correlation-based distance as the dissimilarity measure.

dd = as.dist(1 - cor(t(eaggr1.scale)))  # correlation-based distance between observations
plot(hclust(dd, method="complete"), main="Complete Linkage with Correlation-Based Distance", xlab="", sub="")

plot(hclust(dd, method="average"), main="Average Linkage with Correlation-Based Distance", xlab="", sub="")

plot(hclust(dd, method="single"), main="Single Linkage with Correlation-Based Distance", xlab="", sub="")