Introduction

The objective of this analysis is to compare the performance of four selection methods: best subsets, ridge regression, lasso regression, and manual selection. We will first perform some exploratory data analysis, starting with summary statistics and graphics, then investigate whether there appear to be any preliminary relationships between per capita crime (the response) and any of the other variables (the predictors) in the data. We will look into questions such as: Do any of the Boston suburbs appear to have particularly high crime rates? Does the variation in any of the predictors show any particular pattern?

In addition, we will build a linear regression model to predict per capita crime using all of the predictors, then use manual variable selection to pick a subset of variables.

Lastly, we will build models using automated model selection, such as best subset selection, the lasso, ridge regression, principal components regression, and partial least squares.

Data

The data source we will be using is the Boston data set, which comes from the MASS library. It has 506 rows and 14 columns. The data set contains a variety of statistics on housing values and other information about Boston suburbs, collected by the U.S. Census Service concerning housing in the area of Boston, Mass. It was obtained from Harrison, D. and Rubinfeld, D.L., 'Hedonic prices and the demand for clean air', J. Environ. Economics & Management, vol. 5, 81-102, 1978.

The following is a list of the 14 variables (or "features") and a description of each, relevant to the Boston suburbs that were observed:

crim: per capita crime rate by town.

zn: proportion of residential land zoned for lots over 25,000 sq.ft.

indus: proportion of non-retail business acres per town.

chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).

nox: nitrogen oxides concentration (parts per 10 million).

rm: average number of rooms per dwelling.

age: proportion of owner-occupied units built prior to 1940.

dis: weighted mean of distances to five Boston employment centres.

rad: index of accessibility to radial highways.

tax: full-value property-tax rate per $10,000.

ptratio: pupil-teacher ratio by town.

black: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town.

lstat: lower status of the population (percent).

medv: median value of owner-occupied homes in $1000s.

Exploratory Data Analysis

First, we need to explore and check the data to make sure there are no inconsistencies that would throw our analysis off. When we perform exploratory data analysis, we investigate any preliminary relationships between the variables. In addition, we will look for missing values and outliers/unusual values, and check for multicollinearity/high correlation among the predictors.

Let us import the Boston data from the MASS library and see a summary and structure of the data.

library(MASS)
summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

There are 506 observations with 14 variables: 12 numeric and 2 integer. chas is a dummy variable. There do not appear to be any odd observations, such as percentages above 100.

Let us check for missing values:

sum(is.na(Boston))
## [1] 0

There are no missing values.

Let us see if any of the Boston suburbs appear to have particularly high crime rates:

plot(Boston$crim)

head(sort(Boston$crim, decreasing = TRUE))
## [1] 88.9762 73.5341 67.9208 51.1358 45.7461 41.5292

The plot shows a sharp jump in crime rate around observation index 400. A few suburbs in particular have very high crime rates: 88.9762, 73.5341, 67.9208, 51.1358, 45.7461, and 41.5292.
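To pin down which rows these are, we can flag the suburbs above an ad-hoc cutoff; a small sketch (the cutoff of 25 is an arbitrary choice for illustration):

```r
library(MASS)

# Index and inspect suburbs whose per capita crime rate exceeds an
# arbitrary illustrative cutoff of 25
high.crime <- which(Boston$crim > 25)
Boston[high.crime, "crim"]
```

The returned indices cluster in the same region of the data set where the index plot shows the jump.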

Now to check for highly correlated predictors:

pairs(Boston)

cor(Boston)
##                crim          zn       indus         chas         nox
## crim     1.00000000 -0.20046922  0.40658341 -0.055891582  0.42097171
## zn      -0.20046922  1.00000000 -0.53382819 -0.042696719 -0.51660371
## indus    0.40658341 -0.53382819  1.00000000  0.062938027  0.76365145
## chas    -0.05589158 -0.04269672  0.06293803  1.000000000  0.09120281
## nox      0.42097171 -0.51660371  0.76365145  0.091202807  1.00000000
## rm      -0.21924670  0.31199059 -0.39167585  0.091251225 -0.30218819
## age      0.35273425 -0.56953734  0.64477851  0.086517774  0.73147010
## dis     -0.37967009  0.66440822 -0.70802699 -0.099175780 -0.76923011
## rad      0.62550515 -0.31194783  0.59512927 -0.007368241  0.61144056
## tax      0.58276431 -0.31456332  0.72076018 -0.035586518  0.66802320
## ptratio  0.28994558 -0.39167855  0.38324756 -0.121515174  0.18893268
## black   -0.38506394  0.17552032 -0.35697654  0.048788485 -0.38005064
## lstat    0.45562148 -0.41299457  0.60379972 -0.053929298  0.59087892
## medv    -0.38830461  0.36044534 -0.48372516  0.175260177 -0.42732077
##                  rm         age         dis          rad         tax
## crim    -0.21924670  0.35273425 -0.37967009  0.625505145  0.58276431
## zn       0.31199059 -0.56953734  0.66440822 -0.311947826 -0.31456332
## indus   -0.39167585  0.64477851 -0.70802699  0.595129275  0.72076018
## chas     0.09125123  0.08651777 -0.09917578 -0.007368241 -0.03558652
## nox     -0.30218819  0.73147010 -0.76923011  0.611440563  0.66802320
## rm       1.00000000 -0.24026493  0.20524621 -0.209846668 -0.29204783
## age     -0.24026493  1.00000000 -0.74788054  0.456022452  0.50645559
## dis      0.20524621 -0.74788054  1.00000000 -0.494587930 -0.53443158
## rad     -0.20984667  0.45602245 -0.49458793  1.000000000  0.91022819
## tax     -0.29204783  0.50645559 -0.53443158  0.910228189  1.00000000
## ptratio -0.35550149  0.26151501 -0.23247054  0.464741179  0.46085304
## black    0.12806864 -0.27353398  0.29151167 -0.444412816 -0.44180801
## lstat   -0.61380827  0.60233853 -0.49699583  0.488676335  0.54399341
## medv     0.69535995 -0.37695457  0.24992873 -0.381626231 -0.46853593
##            ptratio       black      lstat       medv
## crim     0.2899456 -0.38506394  0.4556215 -0.3883046
## zn      -0.3916785  0.17552032 -0.4129946  0.3604453
## indus    0.3832476 -0.35697654  0.6037997 -0.4837252
## chas    -0.1215152  0.04878848 -0.0539293  0.1752602
## nox      0.1889327 -0.38005064  0.5908789 -0.4273208
## rm      -0.3555015  0.12806864 -0.6138083  0.6953599
## age      0.2615150 -0.27353398  0.6023385 -0.3769546
## dis     -0.2324705  0.29151167 -0.4969958  0.2499287
## rad      0.4647412 -0.44441282  0.4886763 -0.3816262
## tax      0.4608530 -0.44180801  0.5439934 -0.4685359
## ptratio  1.0000000 -0.17738330  0.3740443 -0.5077867
## black   -0.1773833  1.00000000 -0.3660869  0.3334608
## lstat    0.3740443 -0.36608690  1.0000000 -0.7376627
## medv    -0.5077867  0.33346082 -0.7376627  1.0000000

From the pairs plot, and more specifically the correlation matrix, we see that the following pairs have a strong (absolute correlation > 0.7) positive or negative linear relationship:
nox vs indus, dis vs indus, tax vs indus, age vs nox, dis vs nox, age vs dis, tax vs rad, and lstat vs medv. For our response variable crim, none of the predictors crosses that threshold, though rad (0.63) and tax (0.58) show moderate positive correlations.
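Rather than scanning the matrix by eye, the strong pairs can be pulled out programmatically; a sketch using the same 0.7 cutoff:

```r
library(MASS)

# Keep each variable pair only once by blanking the upper triangle and diagonal
r <- cor(Boston)
r[upper.tri(r, diag = TRUE)] <- NA

# Report every pair with absolute correlation above 0.7
strong <- which(abs(r) > 0.7, arr.ind = TRUE)
data.frame(var1 = rownames(r)[strong[, 1]],
           var2 = colnames(r)[strong[, 2]],
           cor  = round(r[strong], 3))
```

This recovers exactly the eight pairs listed above.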

Now let us plot the variable pairs that appear correlated to get a better sense of these relationships.

plot(Boston$nox,Boston$indus)

plot(Boston$indus, Boston$dis)

plot(Boston$tax,Boston$indus)

plot(Boston$age, Boston$nox)

plot(Boston$nox,Boston$dis)

plot(Boston$age, Boston$dis)

plot(Boston$tax, Boston$rad)

plot(Boston$lstat,Boston$medv)

The plots give additional evidence of some degree of correlation among these predictors; some relationships are linear, but most are curvilinear. This means we might need to transform some of our predictors and the response variable; we will do so later. Multicollinearity might present a problem for our prediction results, but we will let the automated selection methods choose which variables to use in the model.
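One common way to quantify that multicollinearity is through variance inflation factors; a sketch, assuming the car package is available:

```r
library(MASS)
library(car)  # assumed installed; provides vif()

# VIF values above roughly 5-10 are often taken to signal problematic collinearity
vif(lm(crim ~ ., data = Boston))
```

Given the 0.91 correlation between rad and tax, we would expect those two predictors to carry the largest VIFs.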

Distribution and Normality of Variables

The distribution plots for all the variables:

par(mfrow = c(4, 4))
cols <- colnames(Boston)

for (i in seq_along(cols)) {
  hist(Boston[, i], main = cols[i], probability = TRUE, col = "gray", border = "white")
  lines(density(Boston[, i]), col = "red")
}

The predictors crim, zn, chas, nox, dis, lstat, and medv appear right-skewed (chas, as a 0/1 dummy variable, is simply concentrated at 0). rm is the only one with a roughly bell-shaped distribution.
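A log transform is one common remedy for right skew; as a quick sketch, compare crim before and after transforming (crim is strictly positive, so the log is always defined):

```r
library(MASS)

# Side-by-side histograms of crim on the raw and log scales
par(mfrow = c(1, 2))
hist(Boston$crim, main = "crim", col = "gray", border = "white")
hist(log(Boston$crim), main = "log(crim)", col = "gray", border = "white")
```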

Let us see if our variables are normally distributed:

for (i in 1:ncol(Boston)) {
  print(shapiro.test(Boston[, i]))
}
## 
##  Shapiro-Wilk normality test
## 
## data:  Boston[, i]
## W = 0.44996, p-value < 2.2e-16
## 
## 
##  Shapiro-Wilk normality test
## 
## data:  Boston[, i]
## W = 0.55595, p-value < 2.2e-16
## 
## 
##  Shapiro-Wilk normality test
## 
## data:  Boston[, i]
## W = 0.89979, p-value < 2.2e-16
## 
## 
##  Shapiro-Wilk normality test
## 
## data:  Boston[, i]
## W = 0.27476, p-value < 2.2e-16
## 
## 
##  Shapiro-Wilk normality test
## 
## data:  Boston[, i]
## W = 0.93564, p-value = 5.776e-14
## 
## 
##  Shapiro-Wilk normality test
## 
## data:  Boston[, i]
## W = 0.96087, p-value = 2.412e-10
## 
## 
##  Shapiro-Wilk normality test
## 
## data:  Boston[, i]
## W = 0.89201, p-value < 2.2e-16
## 
## 
##  Shapiro-Wilk normality test
## 
## data:  Boston[, i]
## W = 0.90323, p-value < 2.2e-16
## 
## 
##  Shapiro-Wilk normality test
## 
## data:  Boston[, i]
## W = 0.67964, p-value < 2.2e-16
## 
## 
##  Shapiro-Wilk normality test
## 
## data:  Boston[, i]
## W = 0.81524, p-value < 2.2e-16
## 
## 
##  Shapiro-Wilk normality test
## 
## data:  Boston[, i]
## W = 0.9036, p-value < 2.2e-16
## 
## 
##  Shapiro-Wilk normality test
## 
## data:  Boston[, i]
## W = 0.47682, p-value < 2.2e-16
## 
## 
##  Shapiro-Wilk normality test
## 
## data:  Boston[, i]
## W = 0.93691, p-value = 8.287e-14
## 
## 
##  Shapiro-Wilk normality test
## 
## data:  Boston[, i]
## W = 0.91718, p-value = 4.941e-16

None of the variables appears to come from a normal distribution: every p-value is below the 0.05 significance level, so we reject the null hypothesis of normality in each Shapiro-Wilk test.
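The fourteen blocks of output above can be condensed into a single vector of p-values:

```r
library(MASS)

# Apply the Shapiro-Wilk test column by column and collect the p-values
pvals <- sapply(Boston, function(x) shapiro.test(x)$p.value)
round(pvals, 4)

all(pvals < 0.05)  # TRUE: every variable fails the normality test
```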

Analysis

Linear Regression and Manual Variable Selection

First off, we will fit a multiple linear regression using all the variables and see what the result is:

lm.fit.all = lm(crim~., data = Boston)
summary(lm.fit.all)
## 
## Call:
## lm(formula = crim ~ ., data = Boston)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -9.924 -2.120 -0.353  1.019 75.051 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  17.033228   7.234903   2.354 0.018949 *  
## zn            0.044855   0.018734   2.394 0.017025 *  
## indus        -0.063855   0.083407  -0.766 0.444294    
## chas         -0.749134   1.180147  -0.635 0.525867    
## nox         -10.313535   5.275536  -1.955 0.051152 .  
## rm            0.430131   0.612830   0.702 0.483089    
## age           0.001452   0.017925   0.081 0.935488    
## dis          -0.987176   0.281817  -3.503 0.000502 ***
## rad           0.588209   0.088049   6.680 6.46e-11 ***
## tax          -0.003780   0.005156  -0.733 0.463793    
## ptratio      -0.271081   0.186450  -1.454 0.146611    
## black        -0.007538   0.003673  -2.052 0.040702 *  
## lstat         0.126211   0.075725   1.667 0.096208 .  
## medv         -0.198887   0.060516  -3.287 0.001087 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.439 on 492 degrees of freedom
## Multiple R-squared:  0.454,  Adjusted R-squared:  0.4396 
## F-statistic: 31.47 on 13 and 492 DF,  p-value: < 2.2e-16

The resulting linear regression equation is:
crim = 17.033228 + 0.044855*zn - 0.063855*indus - 0.749134*chas - 10.313535*nox + 0.430131*rm + 0.001452*age - 0.987176*dis + 0.588209*rad - 0.003780*tax - 0.271081*ptratio - 0.007538*black + 0.126211*lstat - 0.198887*medv

We see that only 5 of the 13 predictors are statistically significant at a significance level of 0.05: zn, dis, rad, black, and medv (nox and lstat are marginal at the 0.1 level). In addition, our F-statistic has a p-value of < 2.2e-16, so we reject the null hypothesis and conclude that at least one of our coefficients is statistically different from zero. The RSE is 6.439 and the adjusted R-squared is 0.4396, so our model does not do a great job of explaining crime rate.

plot(lm.fit.all)
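As a first manual-selection step (a sketch, not the final model), we could refit using only the predictors that were significant at the 0.05 level and compare adjusted R-squared against the full model:

```r
library(MASS)

# Hypothetical reduced model keeping only the 0.05-significant predictors
lm.fit.subset <- lm(crim ~ zn + dis + rad + black + medv, data = Boston)
summary(lm.fit.subset)$adj.r.squared  # compare against 0.4396 for the full model
```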