The objective of this analysis is to perform cluster analysis (e.g., k-means and/or hierarchical clustering) on a data set of transactions occurring between 1-Dec-2010 and 30-Nov-2011 for a UK-based online retail store. The company mainly sells unique all-occasion gifts, and many of its customers are wholesalers. Specifically, the aim is to identify whether the company's customers can be segmented meaningfully by the recency and duration of their interaction with the company, the frequency of their purchases/orders, and the amount they spent (amounts are in pounds sterling). Further segmentation by country would also be very helpful for marketing purposes.
The data set we will be analyzing is called "Online Retail". It is a transactional data set containing all transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retailer. The company mainly sells unique all-occasion gifts, and many of its customers are wholesalers. There are 541,909 observations with 8 variables. The data is sourced from: Dr Daqing Chen, Director: Public Analytics group, chend '@' lsbu.ac.uk, School of Engineering, London South Bank University, London SE1 0AA, UK.
InvoiceNo: Invoice number. Nominal, a 6-digit integral number uniquely assigned to each transaction. If this code starts with letter ‘c’, it indicates a cancellation.
StockCode: Product (item) code. Nominal, a 5-digit integral number uniquely assigned to each distinct product.
Description: Product (item) name. Nominal.
Quantity: The quantities of each product (item) per transaction. Numeric.
InvoiceDate: Invoice date and time. Numeric, the day and time when each transaction was generated.
UnitPrice: Unit price. Numeric, Product price per unit in sterling.
CustomerID: Customer number. Nominal, a 5-digit integral number uniquely assigned to each customer.
Country: Country name. Nominal, the name of the country where each customer resides.
Before we begin the analysis, we first need to clean the data and get it ready.
Import the library and data:
library(readxl)
setwd("C:/Users/crzys/Documents")
eretail = read_excel("Online Retail.xlsx")
dim(eretail)
## [1] 541909 8
names(eretail)
## [1] "InvoiceNo" "StockCode" "Description" "Quantity" "InvoiceDate"
## [6] "UnitPrice" "CustomerID" "Country"
Clean up data:
eretail = eretail[eretail$Country != "Unspecified",] # remove 'unspecified' country
eretail = eretail[eretail$Quantity > 0,] # remove returns/cancellations
IDtab = table(eretail$Country, eretail$CustomerID) # crosstab country by customer ID
IDtab = apply(IDtab >0, 2, sum) # is any customer ID duplicated across countries?
duplicateIDs = names(IDtab[IDtab > 1]) # duplicate IDs to clean up
eretail = eretail[!is.element(eretail$CustomerID, duplicateIDs),]
rm(IDtab)
eretail$InvoiceMth = substr(eretail$InvoiceDate, 1, 7) # extract month of invoice
eretail = eretail[as.vector(eretail$InvoiceMth) != "2011-12",] # remove December 2011 as it only covers first week
eretail$Amount = eretail$Quantity * eretail$UnitPrice # compute amount per invoice item
eaggr = aggregate(Amount~Country+CustomerID, data=eretail, sum) # compute aggregate amount spent per customer
row.names(eaggr) = eaggr$CustomerID
eaggr = eaggr[,-2]
eaggr = cbind(eaggr, aggregate(InvoiceMth~CustomerID, data=eretail, min)[,-1]) # 1st month of customer interaction
names(eaggr)[3] = "FirstMth"
eaggr = cbind(eaggr, aggregate(InvoiceMth~CustomerID, data=eretail, max)[,-1]) # last month of cust. interaction
names(eaggr)[4] = "LastMth"
Re-label the months (1 = December 2010, ..., 12 = November 2011) and compute the duration of customer interaction:
levels(eaggr$FirstMth) = 1:12
levels(eaggr$LastMth) = 1:12
eaggr$FirstMth = as.numeric(as.vector(eaggr$FirstMth))
eaggr$LastMth = as.numeric(as.vector(eaggr$LastMth))
eaggr$Months = eaggr$LastMth - eaggr$FirstMth + 1
eaggr = cbind(eaggr, apply( table(eretail$CustomerID, eretail$InvoiceMth) , 1, sum ) )
names(eaggr)[6] = "Purchases"
Useful statistics (which we may or may not decide to use):
eaggr$Amount.per.Purchase = eaggr$Amount / eaggr$Purchases
eaggr$Purchases.per.Month = eaggr$Purchases / eaggr$Months
eaggr$Amount.per.Month = eaggr$Amount / eaggr$Months
eaggr[1:30,]
## Country Amount FirstMth LastMth Months Purchases
## 12346 United Kingdom 77183.60 2 2 1 1
## 12347 Iceland 4085.18 1 11 11 171
## 12348 Finland 1797.24 1 10 10 31
## 12349 Italy 1757.55 12 12 1 73
## 12350 Norway 334.40 3 3 1 17
## 12352 Norway 2506.04 3 12 10 85
## 12353 Bahrain 89.00 6 6 1 4
## 12354 Spain 1079.40 5 5 1 58
## 12355 Bahrain 459.40 6 6 1 13
## 12356 Portugal 2811.43 2 12 11 59
## 12357 Switzerland 6207.67 12 12 1 131
## 12358 Austria 484.86 8 8 1 12
## 12359 Cyprus 6372.58 2 11 10 248
## 12360 Austria 2662.06 6 11 6 129
## 12361 Belgium 189.90 3 3 1 10
## 12362 Belgium 4697.19 3 12 10 236
## 12364 Belgium 1002.78 9 11 3 58
## 12365 Cyprus 641.38 3 3 1 22
## 12371 Switzerland 1887.96 11 11 1 63
## 12372 Denmark 1298.04 3 10 8 52
## 12373 Austria 364.60 3 3 1 14
## 12374 Austria 742.93 12 12 1 33
## 12375 Finland 457.50 10 12 3 17
## 12377 Switzerland 1628.12 1 2 2 77
## 12378 Switzerland 4008.62 9 9 1 219
## 12379 Belgium 852.24 7 10 4 40
## 12380 Belgium 2724.81 7 12 6 104
## 12381 Norway 1698.30 9 12 4 80
## 12383 Belgium 1850.56 1 7 7 99
## 12384 Switzerland 585.27 9 12 4 27
## Amount.per.Purchase Purchases.per.Month Amount.per.Month
## 12346 77183.60000 1.000000 77183.6000
## 12347 23.88994 15.545455 371.3800
## 12348 57.97548 3.100000 179.7240
## 12349 24.07603 73.000000 1757.5500
## 12350 19.67059 17.000000 334.4000
## 12352 29.48282 8.500000 250.6040
## 12353 22.25000 4.000000 89.0000
## 12354 18.61034 58.000000 1079.4000
## 12355 35.33846 13.000000 459.4000
## 12356 47.65136 5.363636 255.5845
## 12357 47.38679 131.000000 6207.6700
## 12358 40.40500 12.000000 484.8600
## 12359 25.69589 24.800000 637.2580
## 12360 20.63612 21.500000 443.6767
## 12361 18.99000 10.000000 189.9000
## 12362 19.90335 23.600000 469.7190
## 12364 17.28931 19.333333 334.2600
## 12365 29.15364 22.000000 641.3800
## 12371 29.96762 63.000000 1887.9600
## 12372 24.96231 6.500000 162.2550
## 12373 26.04286 14.000000 364.6000
## 12374 22.51303 33.000000 742.9300
## 12375 26.91176 5.666667 152.5000
## 12377 21.14442 38.500000 814.0600
## 12378 18.30420 219.000000 4008.6200
## 12379 21.30600 10.000000 213.0600
## 12380 26.20010 17.333333 454.1350
## 12381 21.22875 20.000000 424.5750
## 12383 18.69253 14.142857 264.3657
## 12384 21.67667 6.750000 146.3175
Let us take a deeper dive into our dataset.
summary(eaggr)
## Country Amount FirstMth LastMth
## Length:4286 Min. : 0.0 Min. : 1.00 Min. : 1.00
## Class :character 1st Qu.: 303.9 1st Qu.: 2.00 1st Qu.: 8.00
## Mode :character Median : 656.6 Median : 5.00 Median :11.00
## Mean : 1952.1 Mean : 5.45 Mean : 9.47
## 3rd Qu.: 1593.3 3rd Qu.: 9.00 3rd Qu.:12.00
## Max. :268478.0 Max. :12.00 Max. :12.00
## Months Purchases Amount.per.Purchase Purchases.per.Month
## Min. : 1.00 Min. : 1.00 Min. : 0.00 Min. : 0.1818
## 1st Qu.: 1.00 1st Qu.: 17.00 1st Qu.: 12.34 1st Qu.: 6.6667
## Median : 4.00 Median : 40.00 Median : 17.69 Median : 13.0000
## Mean : 5.02 Mean : 88.54 Mean : 54.56 Mean : 20.4697
## 3rd Qu.: 9.00 3rd Qu.: 97.00 3rd Qu.: 24.82 3rd Qu.: 24.0000
## Max. :12.00 Max. :7376.00 Max. :77183.60 Max. :1145.5000
## Amount.per.Month
## Min. : 0.0
## 1st Qu.: 122.0
## Median : 221.5
## Mean : 400.3
## 3rd Qu.: 388.4
## Max. :77183.6
sum(is.na(eaggr))
## [1] 0
The summary shows no NAs. There are, however, some very extreme values: the maximum Amount (268,478) and maximum Purchases (7,376) are orders of magnitude above the medians, so a handful of very large wholesale customers dominate the upper tail.
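To see how extreme these largest customers are, one quick hedged check is to sort the aggregated table by Amount and inspect the top rows:
head(eaggr[order(eaggr$Amount, decreasing = TRUE), c("Country", "Amount", "Purchases")], 5)   # five biggest spenders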
What is the structure of our data?
str(eaggr)
## 'data.frame': 4286 obs. of 9 variables:
## $ Country : chr "United Kingdom" "Iceland" "Finland" "Italy" ...
## $ Amount : num 77184 4085 1797 1758 334 ...
## $ FirstMth : num 2 1 1 12 3 3 6 5 6 2 ...
## $ LastMth : num 2 11 10 12 3 12 6 5 6 12 ...
## $ Months : num 1 11 10 1 1 10 1 1 1 11 ...
## $ Purchases : int 1 171 31 73 17 85 4 58 13 59 ...
## $ Amount.per.Purchase: num 77183.6 23.9 58 24.1 19.7 ...
## $ Purchases.per.Month: num 1 15.5 3.1 73 17 ...
## $ Amount.per.Month : num 77184 371 180 1758 334 ...
There are 4286 observations with 9 variables.
Let us begin clustering our data. We first want to cluster the customers by their recency (the most recent month in which they made a purchase) and the duration of their interaction with the company.
First we will set a seed so that our example will be reproducible.
set.seed(1)
We remove the Country variable because it is not numeric and would cause an error when we perform k-means.
eaggr1 = eaggr[,2:9]
Since the recency variable (eaggr1$LastMth) takes 12 possible values and we want to segment the customers, we will start with 12 clusters. We first perform k-means clustering to see how well it groups our data:
km.out = kmeans(eaggr1, 12, 100)   # 12 clusters, up to 100 iterations (iter.max)
Let us see the within cluster sum of squares:
km.out$withinss
## [1] 188351859 0 268466669 1613198483 207055871 185479600
## [7] 3075186164 249602666 327124157 431530577 1491863833 128761589
km.out$tot.withinss
## [1] 8166621469
We see that the within-cluster sum of squares is very large, which suggests there may be a more effective way to segment the customers.
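One way to put this number in context (a quick hedged diagnostic) is the proportion of total variance captured by the clustering, which the kmeans object reports directly:
km.out$betweenss / km.out$totss   # between-cluster SS as a share of total SS for the 12-cluster fit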
Here is the average variable value for each cluster and distribution of the number of observations in each cluster:
km.out$centers
## Amount FirstMth LastMth Months Purchases Amount.per.Purchase
## 1 6393.2455 2.027211 11.591837 10.564626 308.43537 60.50109
## 2 77183.6000 2.000000 2.000000 1.000000 1.00000 77183.60000
## 3 3403.5227 3.069307 11.329208 9.259901 204.50248 47.22644
## 4 36739.1270 1.700000 10.800000 10.100000 1294.70000 1913.22265
## 5 126380.1000 1.333333 12.000000 11.666667 2491.33333 94.61492
## 6 11473.2995 1.678571 11.785714 11.107143 368.42857 54.98193
## 7 236546.0500 1.000000 12.000000 12.000000 911.33333 441.39913
## 8 407.6194 6.461538 8.576923 3.115385 32.95551 24.73211
## 9 1554.9430 4.589212 10.577801 6.988589 100.00830 34.37656
## 10 20616.3256 1.520000 11.920000 11.400000 363.12000 121.75286
## 11 59861.4217 1.750000 12.000000 11.250000 1111.00000 199.62687
## 12 6908.1956 9.111111 9.111111 1.000000 169.11111 78.80309
## Purchases.per.Month Amount.per.Month
## 1 29.11683 637.1908
## 2 1.00000 77183.6000
## 3 25.35381 458.6645
## 4 108.33333 7462.8447
## 5 209.41919 10847.1157
## 6 33.08742 1071.4156
## 7 75.94444 19712.1708
## 8 17.34540 223.2173
## 9 19.95807 377.6030
## 10 31.04715 1818.3964
## 11 156.59407 5801.2836
## 12 169.11111 6908.1956
table(km.out$cluster)
##
## 1 2 3 4 5 6 7 8 9 10 11 12
## 147 1 404 10 3 56 3 2652 964 25 12 9
From the centers and cluster sizes, cluster 2 contains a single observation: the customer with by far the highest amount per purchase (77,184 spent in a single purchase). Cluster 7 holds the three customers with the highest average total Amount, while cluster 12 groups nine customers whose entire interaction fell within a single month.
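Because the row names of eaggr1 are customer IDs, we can confirm which customer forms that singleton cluster. A quick sketch, with the caveat that the cluster label "2" depends on the random initialization of this particular run:
rownames(eaggr1)[km.out$cluster == 2]   # the single customer assigned to cluster 2 in this run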
We can try K from 2 through 20 and plot the average total within-cluster sum of squares, looking for an "elbow": the point where the curve bends and further increases in K stop producing large reductions in the within-cluster sum of squares.
rng <- 2:20                               # K from 2 to 20
tries <- 100                              # run the k-means algorithm 100 times for each K
avg.totw.ss <- integer(length(rng))       # empty vector to hold the averaged results
for (v in rng) {                          # for each value of K
  v.totw.ss <- integer(tries)             # empty vector to hold the 100 tries
  for (i in 1:tries) {
    km.temp <- kmeans(eaggr1, centers = v)    # run k-means
    v.totw.ss[i] <- km.temp$tot.withinss      # store the total within-cluster SS
  }
  avg.totw.ss[v - 1] <- mean(v.totw.ss)   # average the 100 totals
}
plot(rng,avg.totw.ss,type="b", main="Total Within SS by Various K",
ylab="Average Total Within Sum of Squares",
xlab="Value of K")
This plot does not show a very strong elbow, but after K = 6 there is no substantial further reduction in the total within-cluster sum of squares, so we will use K = 6.
Now let us use k = 6:
km.out = kmeans(eaggr1, 6, 100)
km.out$centers
## Amount FirstMth LastMth Months Purchases Amount.per.Purchase
## 1 236546.0500 1.000000 12.000000 12.000000 911.33333 441.39913
## 2 103505.3000 1.333333 10.333333 10.000000 1434.50000 12982.71401
## 3 14687.7307 1.842105 11.723684 10.881579 361.21053 80.20588
## 4 715.9431 5.957194 9.114057 4.156863 51.13311 27.28889
## 5 46051.0215 1.800000 11.400000 10.600000 1257.35000 1054.94549
## 6 4296.7698 2.858929 11.376786 9.517857 232.84286 51.11608
## Purchases.per.Month Amount.per.Month
## 1 75.94444 19712.1708
## 2 120.59848 20575.1009
## 3 36.01783 1611.1091
## 4 18.05070 264.3382
## 5 143.40644 6525.9096
## 6 28.24077 576.7304
km.out$tot.withinss
## [1] 24346249058
table(km.out$cluster)
##
## 1 2 3 4 5 6
## 3 6 76 3621 20 560
The total within-cluster sum of squares for 6 clusters is larger than with K = 12, but we have half as many clusters. From the centers, cluster 4, by far the largest, holds the typical lower-spend customers with shorter interactions; cluster 1 contains the three customers with the highest total Amount; and cluster 2 has the highest average amount per purchase.
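Since one of our stated goals is segmentation by country, a quick hedged way to profile these six clusters geographically is to cross-tabulate cluster membership with Country (the rows of eaggr are in the same order as eaggr1, so the labels line up):
country.by.cluster = table(eaggr$Country, km.out$cluster)   # countries (rows) by cluster (columns)
country.by.cluster["United Kingdom", ]                      # e.g., how UK customers spread across the clusters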
Now let us try scaling our data prior to running k-means. Scaling will give each variable mean 0 and standard deviation 1, so that variables measured on large scales (such as Amount) do not dominate the distance calculations.
eaggr1.scale = scale(eaggr1)
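# Quick sanity check (a hedged sketch): after scale(), each column should have mean ~0 and sd 1
round(colMeans(eaggr1.scale), 10)   # column means, numerically zero
apply(eaggr1.scale, 2, sd)          # column standard deviations, all 1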
rng <- 2:20                               # K from 2 to 20
tries <- 100                              # run the k-means algorithm 100 times for each K
avg.totw.ss <- integer(length(rng))       # empty vector to hold the averaged results
for (v in rng) {                          # for each value of K
  v.totw.ss <- integer(tries)             # empty vector to hold the 100 tries
  for (i in 1:tries) {
    km.temp <- kmeans(eaggr1.scale, centers = v)   # run k-means on the scaled data
    v.totw.ss[i] <- km.temp$tot.withinss           # store the total within-cluster SS
  }
  avg.totw.ss[v - 1] <- mean(v.totw.ss)   # average the 100 totals
}
plot(rng,avg.totw.ss,type="b", main="Total Within SS by Various K",
ylab="Average Total Within Sum of Squares",
xlab="Value of K")
There is no clearly defined elbow in this plot either, but beyond K = 11 or 12 there is no substantial further drop in the total within-cluster sum of squares. Let us use K = 12.
km.out = kmeans(eaggr1.scale, 12, 100)
km.out$centers
## Amount FirstMth LastMth Months Purchases
## 1 -0.05963992 1.4178847 0.5929924 -0.8329515 0.387615745
## 2 -0.06612478 -0.1342608 0.4697616 0.4821493 -0.066217649
## 3 8.99243989 -0.9158560 -2.3486278 -0.9692142 -0.403292799
## 4 0.14274356 -1.0079439 0.6313987 1.3995384 0.330989137
## 5 -0.11165606 -0.5342571 -1.7748615 -0.8757956 -0.005945682
## 6 2.77361168 -1.0945311 0.7348370 1.5574912 3.165832267
## 7 -0.16665038 1.4085976 0.5705494 -0.8417269 -0.259812066
## 8 -0.17705746 0.4733662 -0.5451149 -0.8478931 -0.301044977
## 9 -0.16531258 -0.5150404 -1.3156239 -0.5410982 -0.264183972
## 10 19.52598348 -0.8716126 0.5333034 1.2005071 3.318333234
## 11 7.66043325 -0.6503959 0.7952971 1.2005071 24.476395326
## 12 -0.19494157 -0.9575838 -2.3559882 -0.9369627 -0.330645165
## Amount.per.Purchase Purchases.per.Month Amount.per.Month
## 1 -0.038142852 2.89591901 0.44945949
## 2 -0.018656493 -0.29996888 -0.12573919
## 3 64.212428514 -0.60939416 49.15644370
## 4 -0.020929313 -0.17706331 -0.07156315
## 5 -0.034608379 1.36733513 0.23860882
## 6 0.095548365 1.46150668 1.19824814
## 7 -0.021069296 0.03981242 -0.01190738
## 8 -0.009376149 -0.12519763 -0.05094411
## 9 -0.018702535 -0.25056068 -0.10323324
## 10 2.020366081 1.51162426 12.56799817
## 11 -0.034886173 19.42361882 4.76683902
## 12 -0.013319306 -0.17481201 -0.07088783
km.out$tot.withinss
## [1] 5676.597
table(km.out$cluster)
##
## 1 2 3 4 5 6 7 8 9 10 11 12
## 115 790 1 1032 80 52 989 463 455 6 4 299
The total within-cluster sum of squares is now much smaller, although the two figures are not directly comparable because scaling changes the units of the variables. From the standardized centers, cluster 3 (a single customer) has by far the highest amount per purchase, cluster 10 has the highest total Amount, and cluster 11 has the highest purchases per month.
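To interpret the standardized centers in the original units, we can un-scale them using the centering and scaling attributes stored by scale(). A sketch, assuming km.out is the scaled 12-cluster fit above:
centers.orig = sweep(km.out$centers, 2, attr(eaggr1.scale, "scaled:scale"), "*")   # undo the division by sd
centers.orig = sweep(centers.orig, 2, attr(eaggr1.scale, "scaled:center"), "+")    # undo the mean-centering
round(centers.orig, 1)                                                             # centers in pounds, months, etc.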
Let us explore other methods to see whether we can achieve a better result. We will try hierarchical clustering, which lets us see the number of clusters obtained at various heights on a dendrogram. As with k-means, we will use the scaled data prepared above.
There are several types of linkage we can use for hierarchical clustering. Complete: maximal intercluster dissimilarity. Compute all pairwise dissimilarities between the observations in cluster A and the observations in cluster B, and record the largest of these dissimilarities.
Single: minimal intercluster dissimilarity. Compute all pairwise dissimilarities between the observations in cluster A and the observations in cluster B, and record the smallest of these dissimilarities. Single linkage can result in extended, trailing clusters in which single observations are fused one at a time.
Average: mean intercluster dissimilarity. Compute all pairwise dissimilarities between the observations in cluster A and the observations in cluster B, and record the average of these dissimilarities.
Centroid: dissimilarity between the centroid for cluster A (a mean vector of length p) and the centroid for cluster B. We will not use this linkage, as centroid linkage can result in undesirable inversions.
In addition, we can compare different dissimilarity measures. Euclidean distance is one option; another is correlation-based distance, which considers two observations to be similar if their features are highly correlated, even though the observed values may be far apart in terms of Euclidean distance.
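A toy illustration of the difference, using two hypothetical observations whose values are proportional to each other:
x = c(1, 2, 3, 4)
y = c(10, 20, 30, 40)
sqrt(sum((x - y)^2))   # Euclidean distance is large
1 - cor(x, y)          # correlation-based distance is 0: the two profiles are perfectly correlated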
We will examine various combinations of these factors.
Let us compare the different linkages with Euclidean distance as the dissimilarity measure (again excluding centroid linkage).
hc.complete=hclust(dist(eaggr1.scale), method="complete")
hc.average=hclust(dist(eaggr1.scale), method="average")
hc.single=hclust(dist(eaggr1.scale), method="single")
Let us plot the dendrograms for each method:
plot(hc.complete,main="Complete Linkage", xlab="", sub="", cex=.9)
plot(hc.average, main="Average Linkage", xlab="", sub="", cex=.9)
plot(hc.single, main="Single Linkage", xlab="", sub="", cex=.9)
Cutting the dendrograms at a height of 40 gives 2 clusters for complete linkage and only 1 for average and single linkage. Single linkage tends to yield extended clusters to which single leaves are fused one by one. One benefit of hierarchical clustering over k-means is that the dendrogram lets us visualize the clusterings obtained at every possible number of clusters at once.
The cutree function tells us which cluster each observation belongs to for a chosen number of clusters. We will not print the full result here, as it would list every one of the 4,286 observations; a more compact summary is sketched below the commented-out calls.
#cutree(hc.complete, 12)
#cutree(hc.average, 12)
#cutree(hc.single, 12)
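A more compact alternative (a sketch) is to tabulate the cluster sizes rather than printing each observation's membership, and to count how many clusters each linkage produces at the height of 40 mentioned above:
table(cutree(hc.complete, k = 12))   # cluster sizes for a 12-cluster cut of the complete-linkage tree
sapply(list(complete = hc.complete, average = hc.average, single = hc.single),
       function(tree) length(unique(cutree(tree, h = 40))))   # number of clusters at height 40 per linkage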
Now let us see how the dendrograms look when we use correlation-based distance as the dissimilarity measure.
dd=as.dist(1-cor(t(eaggr1.scale)))
plot(hclust(dd, method="complete"), main="Complete Linkage with Correlation-Based Distance", xlab="", sub="")
plot(hclust(dd, method="average"), main="Average Linkage with Correlation-Based Distance", xlab="", sub="")
plot(hclust(dd, method="single"), main="Single Linkage with Correlation-Based Distance", xlab="", sub="")