Whole Dataset Clustering
Data preparation and clustering
Libraries used
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from numpy import arange
import sklearn
from sklearn import preprocessing
from sklearn.preprocessing import minmax_scale
from sklearn.cluster import KMeans
from mpl_toolkits.mplot3d import Axes3D
import sklearn.metrics as sm
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler
from sklearn import datasets
from sklearn.metrics import confusion_matrix, classification_report
For the clustering analysis of the whole dataset, I want to see whether the studies, taken together, show interesting cluster patterns. To do this, I first have to map an integer value to each study so that clustering can take place. I focus on the profit/loss margin of each subject (Margin, the sum of Total W and Total L) and on the total number of zeros each subject received (count_zeros), which tells us how often the subject did not lose money. The lower this figure, the more regularly the subject was losing money.
|   | index | Total W | Total L | Study | Margin | count_zeros | cluster |
|---|---|---|---|---|---|---|---|
| 0 | Subj_1 | 5800 | -4650 | Fridberg | 1150 | 80 | 1 |
| 1 | Subj_2 | 7250 | -7925 | Fridberg | -675 | 71 | 3 |
| 2 | Subj_3 | 7100 | -7850 | Fridberg | -750 | 76 | 3 |
| 3 | Subj_4 | 7000 | -7525 | Fridberg | -525 | 76 | 3 |
| 4 | Subj_5 | 6450 | -6350 | Fridberg | 100 | 76 | 1 |
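The derived columns can be reproduced with a few lines of pandas. This is only a minimal sketch: the five rows are copied from the table above, while the `study_code` column, the use of `pd.factorize` for the integer mapping, and the tiny `trial_losses` frame used to illustrate the zero count are my assumptions and are not taken from the original notebook.

```python
import pandas as pd

# First five rows, copied from the table above.
clustering_sample = pd.DataFrame({
    "index": ["Subj_1", "Subj_2", "Subj_3", "Subj_4", "Subj_5"],
    "Total W": [5800, 7250, 7100, 7000, 6450],
    "Total L": [-4650, -7925, -7850, -7525, -6350],
    "Study": ["Fridberg"] * 5,
})

# Margin is the winnings plus the (negative) losses.
clustering_sample["Margin"] = (
    clustering_sample["Total W"] + clustering_sample["Total L"]
)

# Assumption: one way to map each study name to an integer code.
codes, study_names = pd.factorize(clustering_sample["Study"])
clustering_sample["study_code"] = codes

# Hypothetical per-trial losses for two made-up subjects, only to show
# how a zero count per subject could be computed.
trial_losses = pd.DataFrame(
    [[0, -50, 0, 0], [-250, 0, -50, 0]], index=["A", "B"]
)
zeros_per_subject = (trial_losses == 0).sum(axis=1)

print(clustering_sample[["index", "Margin", "study_code"]])
print(zeros_per_subject)
```

The computed Margin values match the Margin column in the table, which confirms that the margin is simply Total W plus Total L.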
These are the results of the clustering, based on the number of zeros each subject received and their profit/loss margin.
df1 = clustering[clustering.cluster==0]
df2 = clustering[clustering.cluster==1]
df3 = clustering[clustering.cluster==2]
df4 = clustering[clustering.cluster==3]
plt.scatter(df1.Margin, df1.count_zeros, color='green')
plt.scatter(df2.Margin, df2.count_zeros, color='red')
plt.scatter(df3.Margin, df3.count_zeros, color='black')
plt.scatter(df4.Margin, df4.count_zeros, color='blue')
plt.xlabel("Margin")
plt.ylabel("count_zeros")
Normalization and refined clustering
Below we can see the dataframe after normalization. The aim of normalization is to bring the values of the chosen columns onto a common scale. In machine learning, normalization can speed up training and make weights easier to initialise. For k-means in particular it matters because the algorithm relies on Euclidean distance: in the rows shown above, Margin is in the hundreds to thousands while count_zeros is in the tens, so without rescaling Margin would dominate the distance. The main reason I am using normalization here is to investigate whether it changes the clustering dramatically or whether the results stay very similar. I am also curious to see whether one variable is steering the result.
clustering[['Margin','count_zeros']] = minmax_scale(clustering[['Margin','count_zeros']])
km = KMeans(n_clusters=4)
y_predicted = km.fit_predict(clustering[["Margin", "count_zeros"]])
clustering["cluster"] = y_predicted
clustering.head()
|   | index | Total W | Total L | Study | Margin | count_zeros | cluster |
|---|---|---|---|---|---|---|---|
| 0 | Subj_1 | 5800 | -4650 | Fridberg | 0.675000 | 0.404494 | 0 |
| 1 | Subj_2 | 7250 | -7925 | Fridberg | 0.446875 | 0.303371 | 1 |
| 2 | Subj_3 | 7100 | -7850 | Fridberg | 0.437500 | 0.359551 | 1 |
| 3 | Subj_4 | 7000 | -7525 | Fridberg | 0.465625 | 0.359551 | 1 |
| 4 | Subj_5 | 6450 | -6350 | Fridberg | 0.543750 | 0.359551 | 1 |
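To make the scaling step concrete, here is a small sketch of what `minmax_scale` does to the Margin column: each value x becomes (x - min) / (max - min). The five Margin values come from the unscaled table earlier in this section. The two extra values, -4250 and 3750, are an assumption on my part: they are the minimum and maximum implied by comparing the raw and scaled tables, not figures stated in the notebook.

```python
import numpy as np
from sklearn.preprocessing import minmax_scale

# Five raw Margin values from the first table, followed by the
# minimum and maximum inferred from the scaled table (assumption).
margin = np.array([1150.0, -675.0, -750.0, -525.0, 100.0, -4250.0, 3750.0])

# Manual min-max scaling.
manual = (margin - margin.min()) / (margin.max() - margin.min())

# The same scaling through scikit-learn (expects a 2D column).
scaled = minmax_scale(margin.reshape(-1, 1)).ravel()

print(np.round(scaled[:5], 6))
```

With those inferred extremes, the first five scaled values agree with the Margin column of the normalized table above.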
km.cluster_centers_
array([[0.69183361, 0.31542525],
[0.50055668, 0.34221899],
[0.53452744, 0.79761579],
[0.31934307, 0.33429017]])
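The centroids above are expressed on the normalized 0 to 1 scale, which makes them hard to read as money or as a number of zeros. One possible way to translate them back, sketched below, is to scale with a `MinMaxScaler` object instead of the `minmax_scale` function and then call `inverse_transform` on the centroids. The data here is randomly generated stand-in data, and `n_init` and `random_state` are added by me for reproducibility; none of this is from the original notebook.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)

# Hypothetical stand-in for the Margin and count_zeros columns.
X = np.column_stack([
    rng.normal(0, 1000, 200),    # made-up margins
    rng.integers(40, 100, 200),  # made-up zero counts
]).astype(float)

# Keep the fitted scaler so the transformation can be undone later.
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

km_demo = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X_scaled)

# Centroids converted back to the original units.
centers_original = scaler.inverse_transform(km_demo.cluster_centers_)
print(centers_original)
```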
The graph below differs from the original in two main ways. As mentioned already, the data is now normalized, which should improve the overall clustering of the dataset. I have also added the centroids of the clusters to the graph, which gives us some extra information.
df1 = clustering[clustering.cluster==0]
df2 = clustering[clustering.cluster==1]
df3 = clustering[clustering.cluster==2]
df4 = clustering[clustering.cluster==3]
plt.scatter(df1.Margin, df1.count_zeros, color='green')
plt.scatter(df2.Margin, df2.count_zeros, color='red')
plt.scatter(df3.Margin, df3.count_zeros, color='black')
plt.scatter(df4.Margin, df4.count_zeros, color='blue')
plt.scatter(km.cluster_centers_[:,0], km.cluster_centers_[:,1], color="orange", marker="*", label="centroid")
plt.xlabel("Margin")
plt.ylabel("count_zeros")
plt.legend()
Further analysis
k_rng = range(1,10)
sse = []
for k in k_rng:
    km = KMeans(n_clusters=k)
    km.fit(clustering[["Margin", "count_zeros"]])
    sse.append(km.inertia_)
Below is a diagram of the elbow method, which helps us find the optimum number of clusters in the dataset. It is one of the most popular methods for choosing k in k-means clustering. From the graph below you can see that the elbow starts to bend at 3, indicating that the optimum number of clusters would be three, so the results may improve if the clustering is rerun with three clusters instead of four.
plt.xlabel("K")
plt.ylabel("Sum of squared error")
plt.plot(k_rng, sse)
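Reading the elbow off a plot is a judgment call, so it can help to back it up with a number. The sketch below uses a simple heuristic that I am assuming here, not something from the original analysis: take the k where the discrete second difference of the SSE curve is largest, i.e. where the curve bends most sharply. It runs on synthetic blobs rather than the study data, and `n_init` and `random_state` are set only to keep the run reproducible.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Hypothetical stand-in data: three well-separated blobs in 2D.
X_demo, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 0], [5, 9]],
    cluster_std=0.5,
    random_state=0,
)

k_values = list(range(1, 10))
sse_demo = []
for k in k_values:
    model = KMeans(n_clusters=k, n_init=10, random_state=0)
    model.fit(X_demo)
    sse_demo.append(model.inertia_)

sse_demo = np.array(sse_demo)

# Second difference of the SSE curve, defined for k = 2 .. 8.
second_diff = sse_demo[:-2] - 2 * sse_demo[1:-1] + sse_demo[2:]
elbow_k = k_values[1:-1][int(np.argmax(second_diff))]
print("Estimated elbow at k =", elbow_k)
```

On real data the bend is usually less clean than on these blobs, so the heuristic is best treated as a sanity check on the visual reading rather than a replacement for it.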
Below are the silhouette scores for 2 to 8 clusters. The silhouette score is a metric used to assess the quality of a clustering, and it ranges from -1 to 1. The closer the score is to 1, the denser the clusters are and the further apart they are from each other. A score near 0 indicates overlapping clusters, and a negative score suggests that points have been assigned to the wrong cluster. None of the average scores below is negative, which is good, but apart from the first (0.577, for two clusters) they are well short of 1, so we do not seem to have dense, well-separated clusters in our case.
from sklearn.metrics import silhouette_score
for n in range(2, 9):
    km = KMeans(n_clusters=n)
    km.fit_predict(clustering[["Margin", "count_zeros"]])
    value = silhouette_score(clustering[["Margin", "count_zeros"]], km.labels_, metric='euclidean')
    print(' Silhouette Score: %.3f' % value)
Silhouette Score: 0.577
Silhouette Score: 0.428
Silhouette Score: 0.356
Silhouette Score: 0.375
Silhouette Score: 0.372
Silhouette Score: 0.338
Silhouette Score: 0.350
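The scores above are printed without the number of clusters they belong to, so they have to be matched to k by position (the first line is k=2, the last is k=8). A small variation, sketched here on synthetic blobs rather than the study data, stores each score against its k and picks the best one. The stand-in data, `n_init` and `random_state` are my assumptions for a reproducible illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Hypothetical stand-in data: three well-separated blobs in 2D.
X_sil, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 0], [5, 9]],
    cluster_std=0.5,
    random_state=0,
)

scores = {}
for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_sil)
    scores[k] = silhouette_score(X_sil, labels, metric="euclidean")
    print(f"k={k}  silhouette score: {scores[k]:.3f}")

# The k with the highest average silhouette score.
best_k = max(scores, key=scores.get)
print("Best k by silhouette score:", best_k)
```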
Here we have an informative scatterplot of the participants, coloured by study. You can clearly see that the subjects from both Steingrover and Wetzels more often than not did not lose money and still gained a respectable amount of money. Another interesting observation is that some of the participants who did not receive many zeros, and so were losing money regularly, still gained a large amount of money in some cases. This suggests that these participants found the more beneficial cards but also incurred the losses that come with those cards.
sns.scatterplot(data=clustering, x="Margin", y="count_zeros", hue="Study")
plt.legend(loc='center left', bbox_to_anchor=(1.0, 0.5))
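The visual comparison between studies can also be summarised numerically with a `groupby` on the Study column. The sketch below is illustrative only: the Fridberg rows are the five unscaled rows shown at the start of this section, while `Study_B` and its three rows are made-up placeholders so that there is a second group to compare against.

```python
import pandas as pd

summary_input = pd.DataFrame({
    # Five real rows from the first table, then three hypothetical rows.
    "Study": ["Fridberg"] * 5 + ["Study_B"] * 3,
    "Margin": [1150, -675, -750, -525, 100, 900, 1200, 400],
    "count_zeros": [80, 71, 76, 76, 76, 88, 90, 85],
})

# Mean margin and mean zero count per study.
per_study = summary_input.groupby("Study")[["Margin", "count_zeros"]].mean()
print(per_study)
```

Running the same `groupby` on the full `clustering` dataframe before normalization would give the averages in their original units for every study.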