Wood Study Clustering

Data preparation and initial clustering

Libraries used

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt

from numpy import arange

import seaborn as sns
import sklearn
from sklearn import preprocessing
from sklearn.preprocessing import minmax_scale
from sklearn.cluster import KMeans
from mpl_toolkits.mplot3d import Axes3D

import sklearn.metrics as sm
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler
from sklearn import datasets
from sklearn.metrics import confusion_matrix, classification_report

Below is the dataset we will use for clustering the wood study. As before, we use the margin between profit and loss ("Difference") and the percentage of trials on which each participant chose the beneficial cards ("Total-B/D"). We were also able to add an age-category column, since each participant falls into one of two groups, “18-40” and “61-88”. With these features in place we can assign each subject to a cluster based on their metrics.

clustering = pd.read_csv('Data/clustering.csv')  # load the wood-study metrics
cluster = KMeans(n_clusters=4)
clustering.drop(clustering.columns[[0]], axis=1, inplace=True)  # drop the redundant index column
y_predicted = cluster.fit_predict(clustering[["Difference", "Total-B/D"]])
clustering["cluster"] = y_predicted
clustering.head()
Subjects Difference Total-B/D Study AgeProfile cluster
0 Subj_317 -320 56.0 Wood Young 0
1 Subj_318 -1030 63.0 Wood Young 0
2 Subj_319 -1850 59.0 Wood Young 2
3 Subj_320 -775 54.0 Wood Young 0
4 Subj_321 -1600 65.0 Wood Young 2

Here we have the initial clustering based on the profit/loss difference and the percentage of beneficial cards picked. Although the data contains a number of outliers, the four clusters are distinct and well separated. An interesting observation is that the majority of the subjects who favoured B/D gained the most money, but on the other hand they also lost the most.

df1 = clustering[clustering.cluster==0]
df2 = clustering[clustering.cluster==1]
df3 = clustering[clustering.cluster==2]
df4 = clustering[clustering.cluster==3]

plt.scatter(df1.Difference, df1["Total-B/D"], color='green')
plt.scatter(df2.Difference, df2["Total-B/D"], color='blue')
plt.scatter(df3.Difference, df3["Total-B/D"], color='purple')
plt.scatter(df4.Difference, df4["Total-B/D"], color='brown')

plt.xlabel("Difference")
plt.ylabel("Total-B/D")
Text(0, 0.5, 'Total-B/D')
_images/Wood_clustering_7_1.png

Normalization and refined clustering
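The cell that normalizes the data and refits the model did not survive the export, so the snippet below is a minimal sketch of the assumed step: the two numeric features are rescaled to [0, 1] with MinMaxScaler and KMeans is refit as km, which produces the y_predicted labels and the km.cluster_centers_ array used in the rest of this section.

# Assumed normalization step (not present in the exported notebook):
# rescale both features to [0, 1] and refit KMeans on the scaled values.
scaler = MinMaxScaler()
clustering[["Difference", "Total-B/D"]] = scaler.fit_transform(
    clustering[["Difference", "Total-B/D"]]
)
km = KMeans(n_clusters=4)
y_predicted = km.fit_predict(clustering[["Difference", "Total-B/D"]])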

clustering["cluster"] = y_predicted
clustering.head()
Subjects Difference Total-B/D Study AgeProfile cluster
0 Subj_317 0.582222 0.525424 Wood Young 3
1 Subj_318 0.477037 0.644068 Wood Young 3
2 Subj_319 0.355556 0.576271 Wood Young 3
3 Subj_320 0.514815 0.491525 Wood Young 3
4 Subj_321 0.392593 0.677966 Wood Young 2

The array below contains the centroids of our revised clusters.

km.cluster_centers_
array([[0.68977366, 0.74952919],
       [0.84861111, 0.1970339 ],
       [0.32556614, 0.7535109 ],
       [0.48145145, 0.49175447]])
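These centroids are in the normalized [0, 1] space. Assuming the MinMaxScaler sketched above was used, they can be mapped back to the original units for easier interpretation, as in the short sketch below.

# Sketch: convert the normalized centroids back to the original units.
# Assumes `scaler` is the MinMaxScaler fitted on ["Difference", "Total-B/D"] above.
original_centroids = scaler.inverse_transform(km.cluster_centers_)
print(pd.DataFrame(original_centroids, columns=["Difference", "Total-B/D"]))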

Now we have our revised clusters, with centroids added for extra insight. This view makes our earlier observations a little clearer. The subjects who rarely chose B/D lost the most money, as shown below. Also, although the B/D cards were clearly the most beneficial, they too carried a chance of a major loss, which can be seen from the participants in the top left of the cluster plot below.

df1 = clustering[clustering.cluster==0]
df2 = clustering[clustering.cluster==1]
df3 = clustering[clustering.cluster==2]
df4 = clustering[clustering.cluster==3]

plt.scatter(df1.Difference, df1["Total-B/D"], color='green')
plt.scatter(df2.Difference, df2["Total-B/D"], color='blue')
plt.scatter(df3.Difference, df3["Total-B/D"], color='purple')
plt.scatter(df4.Difference, df4["Total-B/D"], color='brown')

plt.scatter(km.cluster_centers_[:,0], km.cluster_centers_[:,1], color="orange", marker="*", label="centroid")

plt.xlabel("Difference")
plt.ylabel("Total-B/D")
plt.legend()
<matplotlib.legend.Legend at 0x24255448f10>
_images/Wood_clustering_15_1.png

Further analysis

Our elbow graph does not have a distinct breaking point. The optimum number of clusters is in the region of 3 to 4.
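The cell that computes the sum of squared errors did not survive the export; the sketch below shows the assumed loop that builds k_rng and sse (using the KMeans inertia_ attribute) for the elbow plot that follows.

# Assumed elbow computation (not present in the exported notebook):
# fit KMeans for a range of k and record the inertia (sum of squared errors).
sse = []
k_rng = range(1, 10)
for k in k_rng:
    km = KMeans(n_clusters=k)
    km.fit(clustering[["Difference", "Total-B/D"]])
    sse.append(km.inertia_)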

plt.xlabel("K")
plt.ylabel("Sum of squared error")
plt.plot(k_rng, sse)
#This is indicating that the optimum number of clusters is 4, but 3 is also a reasonable choice
[<matplotlib.lines.Line2D at 0x2425550a310>]
_images/Wood_clustering_19_1.png

Again the clusters do not overlap and are relatively dense. Unfortunately, the silhouette scores are not closer to 1, which would be ideal.

from sklearn.metrics import silhouette_score
for n in range(2, 9):
    km = KMeans(n_clusters=n)
    km.fit_predict(clustering[["Difference", "Total-B/D"]])
    value = silhouette_score(clustering[["Difference", "Total-B/D"]], km.labels_, metric='euclidean')
    print(' Silhouette Score: %.3f' % value)
 Silhouette Score: 0.332
 Silhouette Score: 0.353
 Silhouette Score: 0.404
 Silhouette Score: 0.376
 Silhouette Score: 0.384
 Silhouette Score: 0.398
 Silhouette Score: 0.419

Lastly, I wanted to briefly touch on the performance of the different age groups. The scatterplot below shows that the older participants were the biggest losers but also the biggest gainers. Some of the older participants were blind to the pattern of the B/D cards, while others were very quick to realise which two cards were benefiting them. The younger group were, in general, closer to the mean of the total study.

sns.scatterplot(data=clustering, x="Difference", y="Total-B/D", hue="AgeProfile")
plt.legend(loc='center left', bbox_to_anchor=(1.0, 0.5))
<matplotlib.legend.Legend at 0x24255970940>
_images/Wood_clustering_23_1.png