Data Preparation and Analysis

Introduction to Data

The data set is divided into three groups by number of trials: 95-trial, 100-trial and 150-trial. There are three separate CSV files per group. Taking the 95-trial group as an example, one CSV file records the participants' choices, one records the participants' losses and one records the participants' winnings.

The data was not gathered from one study but from 10 separate studies. To handle this, we are given a fourth CSV file which maps each participant to the study they took part in.

The studies differ in many ways, from the size of the trials to the age demographics of the participants.
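
The file layout described above could be loaded along the following lines. This is a minimal sketch, not the loading code used for this analysis: the helper `load_trial`, the file-name pattern (`choice_95.csv`, `wi_95.csv`, `lo_95.csv`, `index_95.csv`) and the assumption that each file holds one row per participant with the subject label in the first column are all hypothetical.

```python
from pathlib import Path

import pandas as pd


def load_trial(data_dir, n_trials):
    """Load choices, winnings, losses and the study index for one trial length.

    The file-name pattern is hypothetical; adjust it to the real files.
    """
    data_dir = Path(data_dir)
    # Assumed layout: one row per participant, one column per trial,
    # with the subject label (e.g. Subj_1) in the first column.
    choice = pd.read_csv(data_dir / f"choice_{n_trials}.csv", index_col=0)
    win = pd.read_csv(data_dir / f"wi_{n_trials}.csv", index_col=0)
    loss = pd.read_csv(data_dir / f"lo_{n_trials}.csv", index_col=0)
    # Fourth file: maps each participant to the study they took part in.
    index = pd.read_csv(data_dir / f"index_{n_trials}.csv")
    return choice, win, loss, index
```

Under these assumptions, `choice95, win95, loss95, index95 = load_trial("Data", 95)` would produce frames with the names used in the cells that follow.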

Libraries used

import numpy as np
import pandas as pd
from scipy.stats import norm
import seaborn as sns

import matplotlib.pyplot as plt

from numpy import arange

import sklearn
from sklearn.cluster import KMeans
from mpl_toolkits.mplot3d import Axes3D

import sklearn.metrics as sm
from sklearn.decomposition import PCA
from sklearn import datasets
from sklearn.metrics import confusion_matrix, classification_report
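# One summary frame per trial length. win95, loss95, etc. hold the per-trial
# winnings and losses and are assumed to be loaded already.
df95 = pd.DataFrame()
df100 = pd.DataFrame()
df150 = pd.DataFrame()

# Total winnings and total losses per participant (row sums across trials)
df95["Total W"] = win95.sum(axis=1)
df95["Total L"] = loss95.sum(axis=1)

df100["Total W"] = win100.sum(axis=1)
df100["Total L"] = loss100.sum(axis=1)

df150["Total W"] = win150.sum(axis=1)
df150["Total L"] = loss150.sum(axis=1)

# Move the subject labels out of the index into a regular "index" column
df95.reset_index(inplace=True)
df100.reset_index(inplace=True)
df150.reset_index(inplace=True)

# Attach the study each participant took part in
df95["Study"] = index95["Study"].values
df100["Study"] = index100["Study"].values
df150["Study"] = index150["Study"].values

# Losses are stored as negative numbers, so the net margin is a plain sum
df95["Margin"] = df95["Total W"] + df95["Total L"]
df100["Margin"] = df100["Total W"] + df100["Total L"]
df150["Margin"] = df150["Total W"] + df150["Total L"]

# Zero counts per participant; zeros95, zeros100 and zeros150 are not defined in this section
df95["count_zeros"] = zeros95["zeros"].values
df100["count_zeros"] = zeros100["zeros"].values
df150["count_zeros"] = zeros150["zeros"].values

df95.size + df100.size + df150.size #2468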

# DataFrame.append was removed in pandas 2.0; pd.concat stacks the three frames the same way
final = pd.concat([df95, df100, df150])
final.size #2468
final.head()
    index  Total W  Total L     Study  Margin  count_zeros
0  Subj_1     5800    -4650  Fridberg    1150           80
1  Subj_2     7250    -7925  Fridberg    -675           71
2  Subj_3     7100    -7850  Fridberg    -750           76
3  Subj_4     7000    -7525  Fridberg    -525           76
4  Subj_5     6450    -6350  Fridberg     100           76
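
Because losses are stored as negative numbers, the margin is simply the sum of the two totals; for Subj_1 above, 5800 + (-4650) = 1150. The sketch below reproduces the summary columns on made-up toy data. How `count_zeros` was actually derived is not shown in this section, so the last step (counting trials with a loss of exactly zero) is an assumption.

```python
import pandas as pd

# Toy per-trial data for two participants over four trials (made-up numbers).
win = pd.DataFrame(
    [[100, 100, 50, 50], [50, 50, 100, 100]],
    index=["Subj_1", "Subj_2"],
)
loss = pd.DataFrame(
    [[0, -250, 0, 0], [0, 0, -50, -1250]],
    index=["Subj_1", "Subj_2"],
)

summary = pd.DataFrame({
    "Total W": win.sum(axis=1),   # row sum across trials
    "Total L": loss.sum(axis=1),  # negative or zero
})
# Losses are negative, so the net result is a plain sum.
summary["Margin"] = summary["Total W"] + summary["Total L"]
# Assumption: count_zeros is the number of trials on which nothing was lost.
summary["count_zeros"] = (loss == 0).sum(axis=1)
print(summary)
```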

Data visualisation

sns.scatterplot(data=final, x="Total W", y="Total L", hue="Study")
plt.legend(bbox_to_anchor=(1.02, 1), loc='upper left', borderaxespad=0)
_images/Data Preparation and Analysis_12_1.png
sns.barplot(x="Study", y="Margin", data=final)
plt.xticks(rotation=45)
(array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]),
 [Text(0, 0, 'Fridberg'),
  Text(1, 0, 'Horstmann'),
  Text(2, 0, 'Kjome'),
  Text(3, 0, 'Maia'),
  Text(4, 0, 'SteingroverInPrep'),
  Text(5, 0, 'Premkumar'),
  Text(6, 0, 'Wood'),
  Text(7, 0, 'Worthy'),
  Text(8, 0, 'Steingroever2011'),
  Text(9, 0, 'Wetzels')])
_images/Data Preparation and Analysis_13_2.png
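
By default, the height of each bar in a seaborn bar plot is the mean of the values in that category. The same numbers can be read off directly with a group-by. The helper `margin_by_study` below is a hypothetical name, and the toy frame only stands in for `final`.

```python
import pandas as pd


def margin_by_study(frame):
    """Return mean, median and participant count of Margin for each study."""
    return (
        frame.groupby("Study")["Margin"]
        .agg(["mean", "median", "count"])
        .sort_values("mean")  # studies with the largest losses first
    )


# Made-up stand-in for the `final` frame.
toy = pd.DataFrame({
    "Study": ["A", "A", "B", "B", "B"],
    "Margin": [100, 300, -500, -100, 0],
})
print(margin_by_study(toy))
```

Calling `margin_by_study(final)` would list the studies from the lowest to the highest mean margin, which makes a loss-making study easy to spot.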

The scatter matrix below plots every pair of variables against each other. I hope the relationships between the variables will provide some interesting insights.

pd.plotting.scatter_matrix(final[["Total W", "Total L", "Margin", "count_zeros"]], figsize=(12.5,12.5), hist_kwds=dict(bins=35))
plt.show()
_images/Data Preparation and Analysis_15_0.png
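
A correlation table is a compact numeric companion to a scatter matrix. The following is a minimal sketch; `correlation_table` is a hypothetical helper, and Pearson correlation is my choice here rather than something the analysis above specifies. Note that `Margin` is the sum of `Total W` and `Total L`, so its correlation with those two columns partly reflects how it was built.

```python
import pandas as pd

SUMMARY_COLUMNS = ["Total W", "Total L", "Margin", "count_zeros"]


def correlation_table(frame, columns=SUMMARY_COLUMNS):
    """Pearson correlations between the summary columns, rounded for reading."""
    return frame[columns].corr().round(2)


# Made-up stand-in for the `final` frame.
toy = pd.DataFrame({
    "Total W": [1, 2, 3, 4],
    "Total L": [-1, -3, -2, -5],
    "count_zeros": [4, 3, 2, 1],
})
toy["Margin"] = toy["Total W"] + toy["Total L"]
print(correlation_table(toy))
```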

Observations and Wood study visualisations

From the data analysis above, I can see one study in particular whose margins were surprising. In both graphs, the Wood study shows participants making considerable losses. On inspection, this study was run on two different groups of people: the first 90 participants were between the ages of 18 and 40, and the remaining 62 participants were between the ages of 61 and 88. My proposal is to look at the difference between the two age groups and see whether the younger participants were quicker to identify the beneficial cards.

I will use the subject dataframe to cluster only the Wood study. Because this study was run on two separate groups of different ages, it will hopefully provide interesting results.

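subject = pd.DataFrame(columns=["Subjects"])
subject["Subjects"] = win100.index
# Net result per participant; losses are stored as negative numbers
subject["Difference"] = win100["Total"].values + loss100["Total"].values
# Divide by the 100 trials and scale to a percentage
subject["Total-B/D"] = choice_new["Total-B/D"].values/100 * 100
subject["Study"] = index100["Study"].values
# Copy so that adding a column does not write into a slice of the full frame
subject = subject[subject.Study == "Wood"].copy()
# The first 90 Wood participants are the younger group, the rest the older group
subject["AgeProfile"] = np.where(np.arange(len(subject)) < 90, "Young", "Old")
print("Subject dataframe")
subject.head(10)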
Subject dataframe
     Subjects  Difference  Total-B/D  Study  AgeProfile
316  Subj_317        -320       56.0   Wood       Young
317  Subj_318       -1030       63.0   Wood       Young
318  Subj_319       -1850       59.0   Wood       Young
319  Subj_320        -775       54.0   Wood       Young
320  Subj_321       -1600       65.0   Wood       Young
321  Subj_322        -550       52.0   Wood       Young
322  Subj_323       -2210       63.0   Wood       Young
323  Subj_324        -450       53.0   Wood       Young
324  Subj_325         590       60.0   Wood       Young
325  Subj_326        -380       66.0   Wood       Young
# Reuse the axes returned by seaborn instead of calling plt.axes() again
ax = sns.barplot(x="Subjects", y="Difference", data=subject, hue="AgeProfile")
ax.legend(bbox_to_anchor=(1.02, 1), loc='upper left', borderaxespad=0)
# One bar per participant makes the x-axis labels unreadable, so hide that axis
ax.get_xaxis().set_visible(False)
plt.show()
plt.close()
_images/Data Preparation and Analysis_26_1.png
pd.plotting.scatter_matrix(subject[["Difference", "Total-B/D"]], figsize=(12.5,12.5), hist_kwds=dict(bins=35))
plt.show()
_images/Data Preparation and Analysis_27_0.png

The dataset has a larger representation of younger people. Using the dataframe above, I will inspect the difference between the younger and older groups, both in profit margins and in how quickly each group realised that some cards are more beneficial than others. I use several analysis techniques, including scatter graphs and k-means clustering, to evaluate this hypothesis.
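
The comparison described above could be sketched as follows. This is an illustration under stated assumptions, not the clustering actually run on the Wood data: the helper names are hypothetical, `n_clusters=2` is chosen only to mirror the two age groups, and standardising the features with `StandardScaler` is my addition so that `Difference` (in the thousands) does not dominate `Total-B/D` (a percentage).

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

FEATURES = ["Difference", "Total-B/D"]


def compare_age_groups(subject):
    """Mean, standard deviation and count of each feature per age profile."""
    return subject.groupby("AgeProfile")[FEATURES].agg(["mean", "std", "count"])


def cluster_subjects(subject, n_clusters=2, seed=0):
    """Assign each participant to a k-means cluster on standardised features.

    n_clusters=2 is an assumption that mirrors the two age groups.
    """
    scaled = StandardScaler().fit_transform(subject[FEATURES])
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    labels = model.fit_predict(scaled)
    return pd.Series(labels, index=subject.index, name="Cluster")
```

With the `subject` frame from above, `pd.crosstab(subject["AgeProfile"], cluster_subjects(subject))` would show how far the clusters line up with the two age groups. If the clusters cut across both groups, age alone does not explain the pattern.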

# This is the dataset that we will use for clustering the Wood study
subject.to_csv("Data/clustering.csv")

# This is the dataset that we will use for clustering the whole set of studies
final.to_csv("Data/whole_clustering.csv")