This notebook explores a data set about abalones, which are essentially sea snails. Specifically, the data set describes several physical characteristics of abalones, and then gives a number of rings found in the shell. Counting rings is time-consuming and difficult, providing a motive to build a machine-learning algorithm capable of predicting the number of rings from other physical characteristics.
In this notebook, we'll explore the abalone shell data, examine distributions, and figure out first steps toward modeling the data and making predictions.
Start by importing libraries:
%matplotlib inline
# ml
from sklearn import linear_model, svm
# numbers
import numpy as np
import pandas as pd
# stats
import statsmodels.api as sm
import scipy.stats as stats
# plots
import matplotlib.pyplot as plt
import seaborn as sns
# utils
import os, re
Now grab the data from the abalone/
directory:
# List data files
os.listdir('abalone/')
# x data labels
xnlabs = ['Sex']
xqlabs = ['Length','Diameter','Height','Whole weight','Shucked weight','Viscera weight','Shell weight']
xlabs = xnlabs + xqlabs
# y data labels
ylabs = ['Rings']
# data
df = pd.read_csv('abalone/Dataset.data', header=None, sep=' ', names=xlabs+ylabs)
train_df = df[:3000]
test_df = df[3000:]
# Czech it out
df.info()
Start exploring the data by plotting some histograms of different quantities. Once we do that, we'll get a better idea of how they are distributed amongst the population of abalones sampled for this data set.
fig = plt.figure(figsize=(14,4))
ax1 = fig.add_subplot(131)
ax2 = fig.add_subplot(132)
ax3 = fig.add_subplot(133)
# Plot weight information
sns.distplot(df['Whole weight'], label='Whole', ax=ax1)
sns.distplot(df['Shucked weight'], label='Shucked', ax=ax1)
sns.distplot(df['Viscera weight'], label='Viscera', ax=ax1)
sns.distplot(df['Shell weight'], label='Shell', ax=ax1)
ax1.legend()
# Plot growth information
sns.distplot(df['Length'], label='Length', ax=ax2)
sns.distplot(df['Diameter'], label='Diameter', ax=ax2)
sns.distplot(df['Height'], label='Height', ax=ax2)
ax2.legend()
# Plot system response
sns.distplot(df['Rings'], bins=18, ax=ax3)
It is worth noting that we also have a "Sex" column that indicates whether the abalone is a male, a female, or an infant. We would expect infants to be significantly different from adults, so we'll split the data into the infant and adult groups, and look at histograms for each.
inf_df = df[df['Sex']=='I']
ni_df = df[df['Sex']<>'I']
fig = plt.figure(figsize=(14,10))
ax1 = fig.add_subplot(231)
ax2 = fig.add_subplot(232)
ax3 = fig.add_subplot(233)
ax4 = fig.add_subplot(234)
ax5 = fig.add_subplot(235)
ax6 = fig.add_subplot(236)
# Non infant weight info
sns.distplot(ni_df['Whole weight'], label='Whole', ax=ax1)
sns.distplot(ni_df['Shucked weight'], label='Shucked', ax=ax1)
sns.distplot(ni_df['Viscera weight'], label='Viscera', ax=ax1)
sns.distplot(ni_df['Shell weight'], label='Shell', ax=ax1)
ax1.legend()
ax1.set_title('Non-Infant Weights')
# Infant weight info
sns.distplot(inf_df['Whole weight'], label='Whole', ax=ax4)
sns.distplot(inf_df['Shucked weight'], label='Shucked', ax=ax4)
sns.distplot(inf_df['Viscera weight'], label='Viscera', ax=ax4)
sns.distplot(inf_df['Shell weight'], label='Shell', ax=ax4)
ax4.set_title('Infant Weights')
# Non-infant growth
sns.distplot(ni_df['Length'], label='Length', ax=ax2)
sns.distplot(ni_df['Diameter'], label='Diameter', ax=ax2)
sns.distplot(ni_df['Height'], label='Height', ax=ax2)
ax2.set_title('Non-Infant Growth')
ax2.legend()
# Infant growth
sns.distplot(inf_df['Length'], label='Length', ax=ax5)
sns.distplot(inf_df['Diameter'], label='Diameter', ax=ax5)
sns.distplot(inf_df['Height'], label='Height', ax=ax5)
ax5.set_title('Infant Growth')
# Plot system response
sns.distplot(ni_df['Rings'], bins=18, ax=ax3)
ax3.set_title('Non-Infant Rings')
sns.distplot(inf_df['Rings'], bins=18, ax=ax6)
ax6.set_title('Infant Rings')
ax1.set_xlabel('')
ax2.set_xlabel('')
ax3.set_xlabel('Rings')
ax4.set_xlabel('')
ax5.set_xlabel('')
ax6.set_xlabel('Rings')
Here are a few pieces of information we can glean from these distributions.
Weight distributions:
Length, height, and diameter distribution:
Infant vs. adult:
We can also explore variable transforms to see if there is a more appropriate way to look at the input parameters. Here, we try taking the log or the exponential of a quantity:
fig = plt.figure(figsize=(14,4))
ax1 = fig.add_subplot(131)
ax2 = fig.add_subplot(132)
ax3 = fig.add_subplot(133)
sns.distplot(ni_df['Length'].apply(lambda x : np.log10(x)), label='Length', ax=ax1)
sns.distplot(ni_df['Diameter'].apply(lambda x : np.log10(x)), label='Diameter', ax=ax1)
ax1.set_title('Log Length and Log Diam')
sns.distplot(ni_df['Length'], label='Length', ax=ax2)
sns.distplot(ni_df['Diameter'], label='Diameter', ax=ax2)
ax2.set_title('Length and Diam')
sns.distplot(ni_df['Length'].apply(lambda x : np.exp(x)), label='Length', ax=ax3)
sns.distplot(ni_df['Diameter'].apply(lambda x : np.exp(x)), label='Diameter', ax=ax3)
ax3.set_title('Exp Length and Exp Diam')
ax1.legend(loc='best')
plt.show()
None of these distributions indicate that any particular transform would work better than the original quantity, so we'll leave height and diameter alone for now.
If we look at how different variables correlate with one another, it may be possible to find a useful combination of variables.
For example, suppose we want to correlate the measured dimensions (height, length, and diameter) to the weight of the organism. Because height, length, and diameter all have units of length, the product of all three of these would give a quantity with units of length cubed, or volume. This approximate volume of the orgnaism should correlate with the weight of the organism - the slope of volume (x) vs. weight (y) would have a slope equal to the density.
Therefore, for an abalone, the volume is proportional to the length times diameter times height,
$$ V \sim L \times D \times H $$Looking at abalone shells, they are shaped like an oval bowl, with the height indicating the depth of the bowl. The shell's circular area can be computed from the area of an ellipse, which is $\pi a b$, where $a$ and $b$ are the major and minor axes. In the case of an abalone the major axis is proportional to the length, while the minor axis is proportional to the diameter.
ni_df = ni_df[ni_df['Height']>0]
inf_df = inf_df[inf_df['Height']>0]
ni_df.loc[:,'Volume'] = ni_df['Length'].values*ni_df['Diameter'].values*ni_df['Height'].values
inf_df.loc[:,'Volume'] = inf_df['Length'].values*inf_df['Diameter'].values*inf_df['Height'].values
fig = plt.figure(figsize=(14,4))
ax1 = fig.add_subplot(131)
ax2 = fig.add_subplot(132)
ax3 = fig.add_subplot(133)
ax1.plot(ni_df['Volume'], ni_df['Viscera weight'],'b*', color=sns.xkcd_rgb['dusty red'])
ax2.plot(ni_df['Volume'], ni_df['Shell weight'],'g*', color=sns.xkcd_rgb['cornflower'])
ax3.plot(ni_df['Volume'], ni_df['Shucked weight'],'*', color=sns.xkcd_rgb['dusty purple'])
ax1.set_xlabel('Volume')
ax2.set_xlabel('Volume')
ax3.set_xlabel('Volume')
ax1.set_ylabel('Viscera weight')
ax2.set_ylabel('Shell weight')
ax3.set_ylabel('Shucked weight')
ax2.set_title('Adult Abalones')
fig = plt.figure(figsize=(14,4))
ax1 = fig.add_subplot(131)
ax2 = fig.add_subplot(132)
ax3 = fig.add_subplot(133)
ax1.plot(inf_df['Volume'], inf_df['Viscera weight'],'b*', color=sns.xkcd_rgb['denim blue'])
ax2.plot(inf_df['Volume'], inf_df['Shell weight'],'g*', color=sns.xkcd_rgb['dusty green'])
ax3.plot(inf_df['Volume'], inf_df['Shucked weight'],'*', color=sns.xkcd_rgb['mauve'])
ax1.set_xlabel('Volume')
ax2.set_xlabel('Volume')
ax3.set_xlabel('Volume')
ax1.set_ylabel('Viscera weight')
ax2.set_ylabel('Shell weight')
ax3.set_ylabel('Shucked weight')
ax2.set_title('Infant Abalones')
From the plots above, it looks like adult abalones have a strong correlation between the "volume" and the viscera weight. Infant abalones, on the other hand, have a strong correlation between the volume and the shell weight.
The plots above can help to identify outliers - abalones whose volume is abnormally high for their viscera or shell weights, or vice-versa. (It appears some abalones have extremely small volumes.)
If we regress the volume versus viscera weight, we'll get a line. If we then compute the residuals from this linear model, we can identify the major outliers, and use those to prune problematic points.
plt.plot(ni_df['Volume'], ni_df['Shell weight'],'b*', color=sns.xkcd_rgb['dusty red'])
plt.xlabel('Volume')
plt.ylabel('Shell weight')
plt.title('Adult Abalones')
plt.show()
X = ni_df['Volume']
Y = ni_df['Shell weight']
lin = sm.OLS(Y,X).fit()
lin.summary()
Yhat = lin.predict(ni_df['Volume'])
ni_df.loc[:,'Predicted shell weight'] = Yhat
ni_df.loc[:,'Residual'] = Y - Yhat
ni_df.loc[:,'MSE'] = (Y - Yhat)**2
plt.plot(ni_df['Volume'], ni_df['Shell weight'],'*', color=sns.xkcd_rgb['dusty red'])
plt.plot(ni_df['Volume'], Yhat,'-', color=sns.xkcd_rgb['carmine'])
plt.show()
MSE = ni_df['MSE']
fig = plt.figure(figsize=(4,4))
ax = fig.add_subplot(111)
stats.probplot(MSE, dist='norm', plot=ax)
plt.show()
ni_df.sort_values('MSE',ascending=False).head()
thresh = 0.5
new_ni_df = ni_df[ni_df['MSE'] < thresh]
records_removed = len(ni_df) - len(new_ni_df)
print "Number of data points removed: %d"%(records_removed)
print "%0.1f%% of data was removed"%((float(records_removed)/len(df))*100)
We can also do the same thing with data about infant abalones - removing outliers.
plt.plot(inf_df['Volume'], inf_df['Shell weight'],'b*', color=sns.xkcd_rgb['dusty blue'])
plt.xlabel('Volume')
plt.ylabel('Shell weight')
plt.title('Infant Abalones')
plt.show()
X = inf_df['Volume']
Y = inf_df['Shell weight']
lin2 = sm.OLS(Y,X).fit()
lin2.summary()
Yhat = lin2.predict(inf_df['Volume'])
inf_df.loc[:,'Predicted shell weight'] = Yhat
inf_df.loc[:,'Residual'] = Y - Yhat
inf_df.loc[:,'MSE'] = (Y - Yhat)**2
plt.plot(inf_df['Volume'], inf_df['Shell weight'],'*', color=sns.xkcd_rgb['dusty blue'])
plt.plot(inf_df['Volume'], Yhat,'-', color=sns.xkcd_rgb['light blue'])
plt.show()
MSE = inf_df['MSE']
fig = plt.figure(figsize=(4,4))
ax = fig.add_subplot(111)
stats.probplot(MSE, dist='norm', plot=ax)
plt.show()
inf_df.sort_values('MSE',ascending=False).head()
thresh = 0.05
new_inf_df = inf_df[inf_df['MSE'] < thresh]
records_removed = len(inf_df) - len(new_inf_df)
print "Number of data points removed: %d"%(records_removed)
print "%0.2f%% of data was removed"%((float(records_removed)/len(df))*100)
It is interesting to note that the linear regression of "volume" versus predicted shell weight has nearly the same slope between the infants and adults - 6.87 for infants, 6.69 for adults.
While the quantile plot shows that the trendline is not really linear, the fact that both infant and adult abalones had similar slopes is an indication that the general upward trend holds true across abalones of different ages, and the rate of growth does not slow down once the abalones reach adulthood.
We can wrap all of this functionality (loading the data, fitting volume versus weight, and removing outliers) in a function that can be used in other notebooks.
def abalone_load(data_file, infant=False):
# x data labels
xnlabs = ['Sex']
xqlabs = ['Length','Diameter','Height','Whole weight','Shucked weight','Viscera weight','Shell weight']
xlabs = xnlabs + xqlabs
# y data labels
ylabs = ['Rings']
# data
df = pd.read_csv(data_file, header=None, sep=' ', names=xlabs+ylabs)
if(infant):
new_df = df[ df['Sex']=='I' ]
else:
new_df = df[ df['Sex']<>'I' ]
return new_df
def infant_abalone_load(data_file):
return abalone_load(data_file,True)
def adult_abalone_load(data_file):
return abalone_load(data_file,False)
def abalone_removeoutliers(df,mse_tol,verbose=False):
df.loc[:,'Volume'] = df['Length'].values*df['Diameter'].values*df['Height'].values
X = df['Volume']
Y = df['Shell weight']
lin = sm.OLS(Y,X).fit()
Yhat = lin.predict(df['Volume'])
df.loc[:,'Predicted shell weight'] = Yhat
df.loc[:,'Residual'] = Y - Yhat
df.loc[:,'MSE'] = (Y - Yhat)**2
MSE = df['MSE']
thresh = 0.5
new_df = df[df['MSE'] < thresh]
records_removed = len(df) - len(new_df)
if(verbose):
print "Number of data points removed: %d"%(records_removed)
print "%0.2f%% of data was removed"%((float(records_removed)/len(df))*100)
del df['Predicted shell weight']
del df['Residual']
del df['MSE']
return df
def infant_abalone_removeoutliers(df):
return abalone_removeoutliers(df,0.5)
def adult_abalone_removeoutliers(df):
return abalone_removeoutliers(df,1.0)
my_inf_df = infant_abalone_removeoutliers(infant_abalone_load('abalone/Dataset.data'))