In a prior notebook, we performed a linear analysis of a data set consisting of physical measurements of abalones (sea snails). We saw in that notebook that linear models do not capture higher-order effects and the interactions of variables.
In this notebook we'll use models with higher-order interaction terms and explore different variable transformations. There are two principal ways to do this: the first is to add, by hand, any variable transformations we think might be important, and then analyze the results to determine whether the interaction is, in fact, significant. The second is to use different kernels to perform support vector regression, which (depending on the kernel) can automatically determine which of the (potentially thousands of) variable transformations works best.
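To make the first approach concrete, here is a minimal sketch of adding a single hand-picked interaction term and checking its p-value. The helper name and the choice of columns are illustrative; it assumes a dataframe like the one loaded below, with 'Length', 'Diameter', and 'Rings' columns.
# Sketch of the "by hand" approach: add one interaction column and
# inspect its p-value in an ordinary least squares fit.
import statsmodels.api as sm

def fit_with_interaction(df):
    X = df[['Length', 'Diameter']].copy()
    # Add the interaction term explicitly as a new column
    X['Length x Diameter'] = df['Length'] * df['Diameter']
    X = sm.add_constant(X)
    model = sm.OLS(df['Rings'], X).fit()
    # The p-value of the interaction term indicates whether it is significant
    return model.pvalues['Length x Diameter'], model

# Example (once the data is loaded below): p, model = fit_with_interaction(adt_df)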
The snippets of code below will load and transform data, as well as remove outliers whose physical dimensions are not sensible.
%matplotlib inline
# numbers
import numpy as np
import pandas as pd
# stats
import statsmodels.api as sm
import scipy.stats as stats
# plots
import matplotlib.pyplot as plt
import seaborn as sns
# utils
import os, re
# Copy-and-paste from prior notebook
def abalone_load(data_file, infant=False):
    # x data labels
    xnlabs = ['Sex']
    xqlabs = ['Length','Diameter','Height','Whole weight','Shucked weight','Viscera weight','Shell weight']
    xlabs = xnlabs + xqlabs
    # y data labels
    ylabs = ['Rings']
    # data
    df = pd.read_csv(data_file, header=None, sep=' ', names=xlabs+ylabs)
    if(infant):
        new_df = df[ df['Sex']=='I' ]
    else:
        new_df = df[ df['Sex']!='I' ]
    return new_df

def infant_abalone_load(data_file):
    return abalone_load(data_file,True)

def adult_abalone_load(data_file):
    return abalone_load(data_file,False)

def abalone_removeoutliers(df,mse_tol,verbose=False):
    df.loc[:,'Volume'] = df['Length'].values*df['Diameter'].values*df['Height'].values
    X = df['Volume']
    Y = df['Shell weight']
    lin = sm.OLS(Y,X).fit()
    Yhat = lin.predict(df['Volume'])
    df.loc[:,'Predicted shell weight'] = Yhat
    df.loc[:,'Residual'] = Y - Yhat
    df.loc[:,'MSE'] = (Y - Yhat)**2
    original_length = len(df)
    # Drop points whose squared shell-weight residual exceeds the tolerance,
    # plus points with physically implausible heights
    df = df[ df['MSE'] < mse_tol ]
    df = df[ df['Height'] < 0.30 ]
    new_length = len(df)
    records_removed = original_length - new_length
    if(verbose):
        print("Number of data points removed: %d"%(records_removed))
        print("%0.2f%% of data was removed"%((float(records_removed)/original_length)*100))
    del df['Predicted shell weight']
    del df['Residual']
    del df['MSE']
    del df['Volume']
    return df

def infant_abalone_removeoutliers(df):
    return abalone_removeoutliers(df,0.5)

def adult_abalone_removeoutliers(df):
    return abalone_removeoutliers(df,1.0)
# x data labels
xnlabs = ['Sex']
xqlabs = ['Length','Diameter','Height','Whole weight','Shucked weight','Viscera weight','Shell weight']
xlabs = xnlabs + xqlabs
# y data labels
ylabs = ['Rings']
inf_df = infant_abalone_removeoutliers(infant_abalone_load('abalone/Dataset.data'))
adt_df = adult_abalone_removeoutliers(adult_abalone_load('abalone/Dataset.data'))
adt_df.head()
fig = plt.figure(figsize=(10,4))
(ax1,ax2) = (fig.add_subplot(121), fig.add_subplot(122))
sns.distplot(adt_df['Rings'], bins=10, ax=ax1)
sns.distplot(adt_df['Rings'].map(lambda x : np.log(x)), bins=10, ax=ax2)
ax1.set_title('Histogram: Rings')
ax1.set_xlabel('No. Rings')
ax1.set_ylabel('Frequency')
ax2.set_title('Histogram: log(Rings)')
ax2.set_xlabel('log(No. Rings)')
ax2.set_ylabel('Frequency')
plt.show()
We covered support vector regression earlier - this is a regression technique that uses kernel functions (which implicitly define inner products in a transformed feature space) to find an efficient way of combining variables; the possible combinations of variables are determined by the choice of kernel.
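As a small illustration of the kernel idea (with made-up numbers, not tied to the abalone data), a degree-2 polynomial kernel on 2-D inputs equals an ordinary dot product of explicitly expanded features - so the kernel lets the model use those expanded features without ever constructing them:
# Kernel trick demo: (1 + x.z)^2 equals the dot product of degree-2 expanded features
import numpy as np

x = np.array([0.5, 0.4])
z = np.array([0.3, 0.1])

def phi(v):
    # Explicit degree-2 feature map for a 2-D vector
    return np.array([1.0,
                     np.sqrt(2)*v[0], np.sqrt(2)*v[1],
                     v[0]**2, np.sqrt(2)*v[0]*v[1], v[1]**2])

kernel_value = (1.0 + np.dot(x, z))**2    # kernel evaluation
explicit_value = np.dot(phi(x), phi(z))   # dot product of expanded features
print(kernel_value, explicit_value)       # the two agree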
Let's review again how we construct a linear support vector regression model, and how it performs:
from sklearn.svm import SVR
The C parameter is the error penalty term: the larger C is, the more heavily training errors are penalized, and the longer the model typically takes to train.
# ~5 seconds
svr_lin = SVR(kernel='linear', C=1e3)
svr_lin = svr_lin.fit( adt_df[xqlabs], adt_df['Rings'].values )
y = adt_df['Rings'].values
yhat_lin = svr_lin.predict(adt_df[xqlabs])
svr_lin_resid = yhat_lin - y
sns.distplot(svr_lin_resid, color=sns.xkcd_rgb['brick'], label='Linear SVR')
plt.title('Linear SVR Residuals')
plt.legend()
plt.show()
The support vector regression algorithm with a linear kernel results in a model with large residuals that are not normally distributed - meaning our predictions will be off by a substantial amount. This is what motivated the need for higher-order models.
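One quick way to put a number on the "not normally distributed" claim is a normality test on the linear SVR residuals; this is a sketch using scipy's D'Agostino-Pearson test (already imported above as stats), where a very small p-value is evidence against normality.
# Normality check on the linear SVR residuals
stat, pval = stats.normaltest(svr_lin_resid)
print("normaltest statistic: %0.2f, p-value: %0.3g" % (stat, pval))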
To introduce a more complex kernel, we have a couple of choices of pre-configured kernels:

- 'linear' - what we were using before; the response is linear with respect to each variable
- 'poly' - a kernel including polynomial interaction effects
- 'rbf' - radial basis function (a decaying squared exponential)
- 'sigmoid' - an S-curve function fit

We'll start with the polynomial kernel, and go from there.
We can pass 'poly' to the kernel argument to perform SVR with a polynomial kernel, which will include interaction effects between variables. We can start with degree 2 (it turns out the fit gets worse as the degree increases) - when we do that, here are the residuals we get:
# ~15 seconds
svr_poly = SVR(kernel='poly', degree=2, C=1e4)
svr_poly = svr_poly.fit( adt_df[xqlabs], adt_df['Rings'].values )
y = adt_df['Rings'].values
yhat_poly = svr_poly.predict(adt_df[xqlabs])
svr_poly_resid = yhat_poly - y
sns.distplot(svr_lin_resid, color=sns.xkcd_rgb['brick'], label='Linear SVR')
sns.distplot(svr_poly_resid, color=sns.xkcd_rgb['cornflower'], label='Poly SVR')
plt.title('Linear vs. Polynomial SVR Residuals')
plt.legend()
plt.show()
This result is a bit disappointing - adding interaction terms to our SVR model still yields no noticeable improvement.
print(svr_lin.score( adt_df[xqlabs], adt_df['Rings'].values ))
print(svr_poly.score( adt_df[xqlabs], adt_df['Rings'].values ))
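We claimed above that the polynomial fit gets worse as the degree increases; a quick sketch like the following sweeps the degree and prints the training-set score (timings will vary, and higher degrees can take several minutes):
# Sweep the polynomial degree to check that degree 2 is the best of the bunch
for degree in (2, 3, 4):
    svr_d = SVR(kernel='poly', degree=degree, C=1e4)
    svr_d = svr_d.fit( adt_df[xqlabs], adt_df['Rings'].values )
    print("degree %d: score %0.4f" % (degree, svr_d.score( adt_df[xqlabs], adt_df['Rings'].values )))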
Next we'll try support vector regression with a radial basis function as our kernel. The radial basis function is a squared exponential, $K(x, x') = \exp(-\gamma \lVert x - x' \rVert^2)$.
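To see the formula in action, here is a small sketch (with arbitrary example vectors) comparing a manual evaluation of the RBF kernel against scikit-learn's rbf_kernel helper:
# RBF kernel: manual evaluation vs. scikit-learn
from sklearn.metrics.pairwise import rbf_kernel

xa = np.array([[0.5, 0.4, 0.1]])
xb = np.array([[0.3, 0.1, 0.2]])
gamma = 0.5

manual = np.exp(-gamma * np.sum((xa - xb)**2))
library = rbf_kernel(xa, xb, gamma=gamma)[0, 0]
print(manual, library)  # the two agree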
# ~10 seconds
svr_rbf = SVR(kernel='rbf', C=1e4)
svr_rbf = svr_rbf.fit( adt_df[xqlabs], adt_df['Rings'].values )
print(svr_rbf.score( adt_df[xqlabs], adt_df['Rings'].values ))
#svr_rbf2 = SVR(kernel='rbf', C=1e6)
#svr_rbf2 = svr_rbf2.fit( adt_df[xqlabs], adt_df['Rings'].values )
#print(svr_rbf2.score( adt_df[xqlabs], adt_df['Rings'].values ))
# The score that results is 0.4562
y = adt_df['Rings'].values
yhat_rbf = svr_rbf.predict(adt_df[xqlabs])
svr_rbf_resid = yhat_rbf - y
sns.distplot(svr_rbf_resid, color=sns.xkcd_rgb['periwinkle'], label='RBF SVR')
sns.distplot(svr_lin_resid, color=sns.xkcd_rgb['brick'], label='Linear SVR')
plt.title('Linear vs. RBF SVR Residuals')
plt.legend()
plt.show()
The scores for these support vector regression models can be improved by scaling the input variables.
from sklearn import preprocessing
scaler = preprocessing.StandardScaler().fit(adt_df[xqlabs])
X = scaler.transform(adt_df[xqlabs])
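As a quick sanity check on the scaling, each column of the transformed matrix should now have approximately zero mean and unit standard deviation:
# Verify the standardization: column means ~0, column standard deviations ~1
print(X.mean(axis=0).round(3))
print(X.std(axis=0).round(3))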
# This takes 3 minutes
svr_rbf3 = SVR(kernel='rbf', C=1e4)
svr_rbf3 = svr_rbf3.fit( X, adt_df['Rings'].values )
print(svr_rbf3.score( X, adt_df['Rings'].values ))
The $R^2$ statistic is a measure of how much of the variance in the data is explained by the support vector regression model.
A value of 0.589 means that about 41% of the variance from the mean is still not explained (or fit) by the model. This is a substantial improvement, however, over a value of less than 0.4, which we saw with linear models.
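The score reported above is $R^2 = 1 - SS_{res}/SS_{tot}$; as a sketch, we can compute it by hand from the scaled-input RBF model's predictions and confirm it matches the library's score:
# Manual R^2 computation for the scaled-input RBF SVR
y_true = adt_df['Rings'].values
y_pred = svr_rbf3.predict(X)
ss_res = np.sum((y_true - y_pred)**2)
ss_tot = np.sum((y_true - y_true.mean())**2)
print(1.0 - ss_res/ss_tot)   # should match svr_rbf3.score(X, y_true)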
y = adt_df['Rings'].values
yhat_rbf3 = svr_rbf3.predict(X)
svr_rbf3_resid = yhat_rbf3 - y
sns.distplot(svr_lin_resid, color=sns.xkcd_rgb['brick'], label='Linear SVR')
sns.distplot(svr_rbf3_resid, color=sns.xkcd_rgb['periwinkle'], label='Scaled RBF SVR')
plt.title('Linear vs. Scaled RBF SVR Residuals')
plt.xlim([-10,10])
plt.xlabel('Residual')
plt.ylabel('Frequency')
plt.legend()
plt.show()
The residuals are more tightly clustered around 0, so the scaling helped - but the tails of the distribution are still significant. If we look at a quantile plot of the model residuals, we can see that the residuals deviate substantially from a normal distribution in the tails:
fig = plt.figure(figsize=(4,4))
ax1 = fig.add_subplot(111)
stats.probplot(svr_rbf3_resid, dist='norm', plot=ax1)
ax1.set_title('SVR with Scaled Inputs: Residual Quantiles')
plt.show()
fig = plt.figure(figsize=(14,8))
ax1,ax2,ax3 = fig.add_subplot(231), fig.add_subplot(232), fig.add_subplot(233)
ax4,ax5,ax6 = fig.add_subplot(234), fig.add_subplot(235), fig.add_subplot(236)
ax1.plot(y,svr_rbf3_resid,'*')
ax1.set_xlabel('System Response')
ax2.plot(yhat_rbf3,svr_rbf3_resid,'*')
ax2.set_xlabel('Model Prediction of Response')
ax2.set_ylabel('Residual (Normalized RBF SVR)')
ax4.plot(adt_df['Length'],svr_rbf3_resid,'*')
ax4.set_xlabel('Length')
ax5.plot(adt_df['Height'],svr_rbf3_resid,'*')
ax5.set_xlabel('Height')
ax6.plot(adt_df['Diameter'],svr_rbf3_resid,'*')
ax6.set_xlabel('Diameter')
ax1.set_ylabel('Residual')
ax2.set_ylabel('Residual')
ax4.set_ylabel('Residual')
ax5.set_ylabel('Residual')
ax6.set_ylabel('Residual')
plt.show()
From the plot of system response $y$ versus residuals $\hat{y} - y$, we can see a pretty clear trend: the model residuals still have a functional dependence on the response value. Residuals that grow more negative as the system response gets larger mean that we are systematically overpredicting the age of younger abalones and underpredicting the age of older abalones. There is also a slight functional dependence of the residual on length, height, and diameter - as these quantities get larger, the residual also gets larger.
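To quantify the trend in the first panel, a simple sketch is to regress the residuals on the system response; a clearly negative slope confirms the systematic over- and under-prediction described above.
# Fit residual vs. response; a negative slope indicates the systematic bias
trend = sm.OLS(svr_rbf3_resid, sm.add_constant(y)).fit()
print(trend.params)   # intercept and slope of residual vs. response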
From this, we can conclude that support vector regression, which explores different variable interactions, together with scaling of variables, leads to a better fit - but still doesn't capture all of the higher-order effects. This information can help guide choices of next steps.
Methods for improving fit: