Wood Study Clustering

Data preparation and initial clustering

Libraries used

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt

from numpy import arange

import seaborn as sns
import sklearn
from sklearn import preprocessing
from sklearn.preprocessing import minmax_scale
from sklearn.cluster import KMeans
from mpl_toolkits.mplot3d import Axes3D

import sklearn.metrics as sm
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler
from sklearn import datasets
from sklearn.metrics import confusion_matrix, classification_report

Below is the dataset we will use for clustering the wood study. As before, we use the margin between profit and loss ("Difference") and the percentage of trials on which each participant chose the beneficial cards ("Total-B/D"). We were also able to add an age-category column, since each participant falls into one of two groups, “18-40” and “61-88”. With these features in place we can assign each subject to a cluster based on their metrics.

clustering = pd.read_csv('Data/clustering.csv')  # load the wood-study metrics
cluster = KMeans(n_clusters=4)
clustering.drop(clustering.columns[[0]], axis=1, inplace=True)  # drop the redundant index column
y_predicted = cluster.fit_predict(clustering[["Difference", "Total-B/D"]])
clustering["cluster"] = y_predicted
clustering.head()
Subjects Difference Total-B/D Study AgeProfile cluster
0 Subj_317 -320 56.0 Wood Young 0
1 Subj_318 -1030 63.0 Wood Young 0
2 Subj_319 -1850 59.0 Wood Young 2
3 Subj_320 -775 54.0 Wood Young 0
4 Subj_321 -1600 65.0 Wood Young 2

Here we have the initial clustering based on the profit/loss difference and the percentage of beneficial cards picked. Although the data contains a number of outliers, the four clusters are distinct and well separated. An interesting observation is that the majority of the subjects who favoured B/D gained the most money, but on the other hand they also lost the most.

df1 = clustering[clustering.cluster==0]
df2 = clustering[clustering.cluster==1]
df3 = clustering[clustering.cluster==2]
df4 = clustering[clustering.cluster==3]

plt.scatter(df1.Difference, df1["Total-B/D"], color='green')
plt.scatter(df2.Difference, df2["Total-B/D"], color='blue')
plt.scatter(df3.Difference, df3["Total-B/D"], color='purple')
plt.scatter(df4.Difference, df4["Total-B/D"], color='brown')

plt.xlabel("Difference")
plt.ylabel("Total-B/D")
Text(0, 0.5, 'Total-B/D')
_images/Wood_clustering_7_1.png

Normalization and refined clustering
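The cell that normalizes the data and refits the model did not survive the export, so the snippet below is a minimal sketch of the assumed step: the two numeric features are rescaled to [0, 1] with MinMaxScaler and KMeans is refit as km, which produces the y_predicted labels and the km.cluster_centers_ array used in the rest of this section.

# Assumed normalization step (not present in the exported notebook):
# rescale both features to [0, 1] and refit KMeans on the scaled values.
scaler = MinMaxScaler()
clustering[["Difference", "Total-B/D"]] = scaler.fit_transform(
    clustering[["Difference", "Total-B/D"]]
)
km = KMeans(n_clusters=4)
y_predicted = km.fit_predict(clustering[["Difference", "Total-B/D"]])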

clustering["cluster"] = y_predicted
clustering.head()
Subjects Difference Total-B/D Study AgeProfile cluster
0 Subj_317 0.582222 0.525424 Wood Young 3
1 Subj_318 0.477037 0.644068 Wood Young 3
2 Subj_319 0.355556 0.576271 Wood Young 3
3 Subj_320 0.514815 0.491525 Wood Young 3
4 Subj_321 0.392593 0.677966 Wood Young 2

The array below contains the centroids of our revised clusters.

km.cluster_centers_
array([[0.68977366, 0.74952919],
       [0.84861111, 0.1970339 ],
       [0.32556614, 0.7535109 ],
       [0.48145145, 0.49175447]])
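These centroids are in the normalized [0, 1] space. Assuming the MinMaxScaler sketched above was used, they can be mapped back to the original units for easier interpretation, as in the short sketch below.

# Sketch: convert the normalized centroids back to the original units.
# Assumes `scaler` is the MinMaxScaler fitted on ["Difference", "Total-B/D"] above.
original_centroids = scaler.inverse_transform(km.cluster_centers_)
print(pd.DataFrame(original_centroids, columns=["Difference", "Total-B/D"]))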

Now we have our revised clusters, with centroids added for extra insight. This view makes our earlier observations a little clearer. The subjects who rarely chose B/D lost the most money, as shown below. Also, although the B/D cards were clearly the most beneficial, they too carried a chance of a major loss, which can be seen from the participants in the top left of the cluster plot below.

df1 = clustering[clustering.cluster==0]
df2 = clustering[clustering.cluster==1]
df3 = clustering[clustering.cluster==2]
df4 = clustering[clustering.cluster==3]

plt.scatter(df1.Difference, df1["Total-B/D"], color='green')
plt.scatter(df2.Difference, df2["Total-B/D"], color='blue')
plt.scatter(df3.Difference, df3["Total-B/D"], color='purple')
plt.scatter(df4.Difference, df4["Total-B/D"], color='brown')

plt.scatter(km.cluster_centers_[:,0], km.cluster_centers_[:,1], color="orange", marker="*", label="centroid")

plt.xlabel("Difference")
plt.ylabel("Total-B/D")
plt.legend()
<matplotlib.legend.Legend at 0x24255448f10>
_images/Wood_clustering_15_1.png

Further analysis

Our elbow graph does not have a distinct breaking point. The optimum number of clusters is in the region of 3 to 4.
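The cell that computes the sum of squared errors did not survive the export; the sketch below shows the assumed loop that builds k_rng and sse (using the KMeans inertia_ attribute) for the elbow plot that follows.

# Assumed elbow computation (not present in the exported notebook):
# fit KMeans for a range of k and record the inertia (sum of squared errors).
sse = []
k_rng = range(1, 10)
for k in k_rng:
    km = KMeans(n_clusters=k)
    km.fit(clustering[["Difference", "Total-B/D"]])
    sse.append(km.inertia_)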

plt.xlabel("K")
plt.ylabel("Sum of squared error")
plt.plot(k_rng, sse)
#This is indicating that the optimum number of clusters is 4, but 3 is also a reasonable choice
[<matplotlib.lines.Line2D at 0x2425550a310>]
_images/Wood_clustering_19_1.png

Again the clusters do not overlap and are relatively dense. Unfortunately, the silhouette scores are not closer to 1, which would be ideal.

from sklearn.metrics import silhouette_score
for n in range(2, 9):
    km = KMeans(n_clusters=n)
    km.fit_predict(clustering[["Difference", "Total-B/D"]])
    value = silhouette_score(clustering[["Difference", "Total-B/D"]], km.labels_, metric='euclidean')
    print(' Silhouette Score: %.3f' % value)
 Silhouette Score: 0.332
 Silhouette Score: 0.353
 Silhouette Score: 0.404
 Silhouette Score: 0.376
 Silhouette Score: 0.384
 Silhouette Score: 0.398
 Silhouette Score: 0.419

Lastly, I wanted to briefly touch on the performance of the different age groups. The scatterplot below shows that the older participants were the biggest losers but also the biggest gainers. Some of the older participants were blind to the pattern of the B/D cards, while others were very quick to realise which two cards were benefiting them. The younger group were, in general, closer to the mean of the total study.

sns.scatterplot(data=clustering, x="Difference", y="Total-B/D", hue="AgeProfile")
plt.legend(loc='center left', bbox_to_anchor=(1.0, 0.5))
<matplotlib.legend.Legend at 0x24255970940>
_images/Wood_clustering_23_1.png