The objective of this analysis is to explore the Weekly dataset found in the ISLR library, which contains 1089 weekly returns over 21 years, from 1990 to 2010. We will first produce some numerical and graphical summaries. Then we will use the full data set to perform a logistic regression with Direction as the response and the five lag variables plus Volume as predictors, see whether any variables are significant, and compute the confusion matrix and overall fraction of correct predictions. Next we will fit the logistic regression model using a training data period from 1990 to 2008, with Lag2 as the only predictor, and repeat that step for LDA, QDA, and KNN with K = 1. Finally, we will experiment with various combinations of parameters and determine which method performs best.

The Weekly dataset is found in the ISLR library and gives weekly percentage returns for the S&P 500 stock index between 1990 and 2010. Format of the data:

Year: The year that the observation was recorded

Lag1: Percentage return for previous week

Lag2: Percentage return for 2 weeks previous

Lag3: Percentage return for 3 weeks previous

Lag4: Percentage return for 4 weeks previous

Lag5: Percentage return for 5 weeks previous

Volume: Volume of shares traded (average number of daily shares traded in billions)

Today: Percentage return for this week

Direction: A factor with levels Down and Up indicating whether the market had a positive or negative return on a given week

Source:

Raw values of the S&P 500 were obtained from Yahoo Finance and then converted to percentages and lagged.

References:

James, G., Witten, D., Hastie, T., and Tibshirani, R. (2013). An Introduction to Statistical Learning with Applications in R. Springer.

First we will take a look at our data and see if there are any preliminary trends or patterns we can find by producing graphical and numerical summaries.

```
library(ISLR)
attach(Weekly)
summary(Weekly)
```

```
##       Year           Lag1               Lag2               Lag3         
##  Min.   :1990   Min.   :-18.1950   Min.   :-18.1950   Min.   :-18.1950  
##  1st Qu.:1995   1st Qu.: -1.1540   1st Qu.: -1.1540   1st Qu.: -1.1580  
##  Median :2000   Median :  0.2410   Median :  0.2410   Median :  0.2410  
##  Mean   :2000   Mean   :  0.1506   Mean   :  0.1511   Mean   :  0.1472  
##  3rd Qu.:2005   3rd Qu.:  1.4050   3rd Qu.:  1.4090   3rd Qu.:  1.4090  
##  Max.   :2010   Max.   : 12.0260   Max.   : 12.0260   Max.   : 12.0260  
##       Lag4               Lag5              Volume       
##  Min.   :-18.1950   Min.   :-18.1950   Min.   :0.08747  
##  1st Qu.: -1.1580   1st Qu.: -1.1660   1st Qu.:0.33202  
##  Median :  0.2380   Median :  0.2340   Median :1.00268  
##  Mean   :  0.1458   Mean   :  0.1399   Mean   :1.57462  
##  3rd Qu.:  1.4090   3rd Qu.:  1.4050   3rd Qu.:2.05373  
##  Max.   : 12.0260   Max.   : 12.0260   Max.   :9.32821  
##      Today           Direction 
##  Min.   :-18.1950   Down:484   
##  1st Qu.: -1.1540   Up  :605   
##  Median :  0.2410              
##  Mean   :  0.1499              
##  3rd Qu.:  1.4050              
##  Max.   : 12.0260              
```

Nothing looks out of the ordinary in the general summary of the data set.

Next, a scatterplot matrix of each variable against the others:

`pairs(Weekly)`

There are no clear relationships or patterns between Direction and the other variables. This is not surprising: if there were any clear relationships, everyone would be making money from the stock market. The only strong relationship appears to be between Year and Volume; as the year increases, so does volume.

Correlation Matrix:

`cor(Weekly[, -9])`

```
##              Year         Lag1        Lag2        Lag3         Lag4
## Year    1.00000000 -0.032289274 -0.03339001 -0.03000649 -0.031127923
## Lag1   -0.03228927  1.000000000 -0.07485305  0.05863568 -0.071273876
## Lag2   -0.03339001 -0.074853051  1.00000000 -0.07572091  0.058381535
## Lag3   -0.03000649  0.058635682 -0.07572091  1.00000000 -0.075395865
## Lag4   -0.03112792 -0.071273876  0.05838153 -0.07539587  1.000000000
## Lag5   -0.03051910 -0.008183096 -0.07249948  0.06065717 -0.075675027
## Volume  0.84194162 -0.064951313 -0.08551314 -0.06928771 -0.061074617
## Today  -0.03245989 -0.075031842  0.05916672 -0.07124364 -0.007825873
##                Lag5      Volume        Today
## Year   -0.030519101  0.84194162 -0.032459894
## Lag1   -0.008183096 -0.06495131 -0.075031842
## Lag2   -0.072499482 -0.08551314  0.059166717
## Lag3    0.060657175 -0.06928771 -0.071243639
## Lag4   -0.075675027 -0.06107462 -0.007825873
## Lag5    1.000000000 -0.05851741  0.011012698
## Volume -0.058517414  1.00000000 -0.033077783
## Today   0.011012698 -0.03307778  1.000000000
```

The correlation matrix shows the same thing as the plots: there are no strong relationships besides Year and Volume, which have a correlation of 0.84. Correlations between the lag variables and today's returns are close to zero, and no other patterns are discernible.

We will now perform a logistic regression on the full dataset using Direction as the response and the five lag variables and Volume as predictors.

```
glm.fit = glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume, data = Weekly, family = binomial)
summary(glm.fit)
```

```
##
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 +
## Volume, family = binomial, data = Weekly)
##
## Deviance Residuals:
##     Min      1Q  Median      3Q     Max  
## -1.6949 -1.2565  0.9913  1.0849  1.4579  
##
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  0.26686    0.08593   3.106   0.0019 **
## Lag1        -0.04127    0.02641  -1.563   0.1181   
## Lag2         0.05844    0.02686   2.175   0.0296 * 
## Lag3        -0.01606    0.02666  -0.602   0.5469   
## Lag4        -0.02779    0.02646  -1.050   0.2937   
## Lag5        -0.01447    0.02638  -0.549   0.5833   
## Volume      -0.02274    0.03690  -0.616   0.5377   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 1496.2  on 1088  degrees of freedom
## Residual deviance: 1486.4  on 1082  degrees of freedom
## AIC: 1500.4
##
## Number of Fisher Scoring iterations: 4
```

Out of all the predictors, the only one that is statistically significant at a significance level of 0.05 is Lag2, which has a p-value of 0.0296.

Confusion Matrix:

```
glm.probs = predict(glm.fit, type = "response")
glm.pred = rep("Down", length(glm.probs))
glm.pred[glm.probs > 0.5] = "Up"
table(glm.pred, Weekly$Direction)
```

```
##
## glm.pred Down Up
## Down 54 48
## Up 430 557
```

From the confusion matrix we calculate the fraction of correct predictions on the training data as (54 + 557)/1089 = 56.11%, so the training error rate is 1 - 0.5611 = 0.4389 = 43.89%; because it is a training error rate, it is likely optimistic compared to a test error rate. The sensitivity of this model for predicting an Up week is 557/(48 + 557) = 92.07%: when the market goes up, the model predicts it 92.07% of the time. The specificity is 54/(430 + 54) = 11.16%, which is not great; in weeks when the market drops, the model is correct only 11.16% of the time.
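These rates follow directly from the confusion counts. A minimal sketch in base R, with the counts typed in from the table above (the variable names are just for illustration):

```
# Confusion counts from the table above: rows = predicted, columns = actual
cm <- matrix(c(54, 430, 48, 557), nrow = 2,
             dimnames = list(pred = c("Down", "Up"), actual = c("Down", "Up")))

accuracy    <- sum(diag(cm)) / sum(cm)                 # (54 + 557) / 1089
sensitivity <- cm["Up", "Up"] / sum(cm[, "Up"])        # 557 / (48 + 557)
specificity <- cm["Down", "Down"] / sum(cm[, "Down"])  # 54 / (430 + 54)
round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity), 4)
# accuracy 0.5611, sensitivity 0.9207, specificity 0.1116
```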

Now we will fit the logistic regression model using a training data period from 1990 to 2008, with Lag2 as the only predictor.

Split our data:

```
train = (Year < 2009)
Weekly.0910 = Weekly[!train, ]
Direction.0910 = Direction[!train]
```

Fit the model and create the confusion matrix:

```
glm.fit = glm(Direction ~ Lag2, data = Weekly, family = binomial, subset = train)
glm.probs = predict(glm.fit, Weekly.0910, type = "response")
glm.pred = rep("Down", length(glm.probs))
glm.pred[glm.probs > 0.5] = "Up"
table(glm.pred, Direction.0910)
```

```
## Direction.0910
## glm.pred Down Up
## Down 9 5
## Up 34 56
```

From the confusion matrix we calculate the fraction of correct predictions on the held-out test data (2009–2010) as (9 + 56)/104 = 62.5%, so the test error rate is 1 - 0.625 = 0.375 = 37.5%. The sensitivity of this model for predicting an Up week is 56/(5 + 56) = 91.80%: when the market goes up, the model predicts it 91.80% of the time. The specificity is 9/(34 + 9) = 20.93%, which is not great; in weeks when the market drops it is correct only 20.93% of the time.
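Since we will repeat this calculation for each model, a small helper function (the name `conf_stats` is ours, not from ISLR) computes all three rates from a predicted-vs-actual table; here it is applied to the test-set counts above:

```
# Hypothetical helper: accuracy, sensitivity (Up recall), specificity
# (Down recall) from a table with rows = predicted, columns = actual.
conf_stats <- function(tab) {
  c(accuracy    = sum(diag(tab)) / sum(tab),
    sensitivity = tab["Up", "Up"] / sum(tab[, "Up"]),
    specificity = tab["Down", "Down"] / sum(tab[, "Down"]))
}

# Counts from the test-set confusion matrix above
tab <- matrix(c(9, 34, 5, 56), nrow = 2,
              dimnames = list(pred = c("Down", "Up"), actual = c("Down", "Up")))
round(conf_stats(tab), 4)  # 0.6250, 0.9180, 0.2093
```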

Now we will repeat the previous part using LDA:

```
library(MASS)
lda.fit = lda(Direction ~ Lag2, data = Weekly, subset = train)
lda.pred = predict(lda.fit, Weekly.0910)
table(lda.pred$class, Direction.0910)
```

```
## Direction.0910
## Down Up
## Down 9 5
## Up 34 56
```

The confusion matrix is identical to the one from the previous logistic regression model, so the error rate, sensitivity, and specificity are the same; LDA offers no improvement over the logistic model here.

Repeat the same procedure, but using QDA:

```
qda.fit = qda(Direction ~ Lag2, data = Weekly, subset = train)
qda.class = predict(qda.fit, Weekly.0910)$class
table(qda.class, Direction.0910)
```

```
## Direction.0910
## qda.class Down Up
## Down 0 0
## Up 43 61
```

We see that the model predicted Up for every week. The fraction of correct predictions is 61/104 = 58.65%, so the error rate is 43/104 = 41.35%. Sensitivity = 61/61 = 100%: every week the market went up was predicted correctly. Specificity = 0/43 = 0%: no week the market went down was predicted correctly.

Lastly, we will fit KNN with K = 1 and see how well it performs:

```
library(class)
train.X = as.matrix(Lag2[train])
test.X = as.matrix(Lag2[!train])
train.Direction = Direction[train]
set.seed(1)
knn.pred = knn(train.X, test.X, train.Direction, k = 1)
table(knn.pred, Direction.0910)
```

```
## Direction.0910
## knn.pred Down Up
## Down 21 30
## Up 22 31
```

The fraction of correct predictions is (21 + 31)/104 = 50%, so the error rate is also 50%, which is no better than guessing. Sensitivity = 31/61 = 50.82%; specificity = 21/43 = 48.84%. Overall this model performed worse than the others.

Out of logistic regression (full data), logistic regression (Lag2 only), LDA, QDA, and KNN with K = 1, the best performing methods were the logistic model based on only the Lag2 predictor and LDA. They had identical performance on the held-out data: correct prediction rates of 62.5% and error rates of 37.5%, with sensitivity 91.80% and specificity 20.93%. If, however, you wanted a model that never misses an Up week, you would go with QDA, which predicted every Up week correctly, although it was wrong on every Down week.
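The comparison on the held-out data can be tabulated directly from the confusion counts reported above; a sketch (the list `models` and its names are ours for illustration):

```
# Test-set confusion counts from the tables above, in the order:
# pred Down/actual Down, pred Up/actual Down, pred Down/actual Up, pred Up/actual Up
models <- list(
  logistic_lag2 = c(9, 34, 5, 56),
  lda           = c(9, 34, 5, 56),
  qda           = c(0, 43, 0, 61),
  knn_k1        = c(21, 22, 30, 31)
)
correct <- sapply(models, function(x) (x[1] + x[4]) / sum(x))
round(correct, 4)  # logistic_lag2 and lda tie at 0.625
```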

Logistic regression with an interaction between Lag1 and Lag2:

```
glm.fit = glm(Direction ~ Lag2:Lag1, data = Weekly, family = binomial, subset = train)
glm.probs = predict(glm.fit, Weekly.0910, type = "response")
glm.pred = rep("Down", length(glm.probs))
glm.pred[glm.probs > 0.5] = "Up"
Direction.0910 = Direction[!train]
table(glm.pred, Direction.0910)
```

```
## Direction.0910
## glm.pred Down Up
## Down 1 1
## Up 42 60
```

Correct rate = 61/104 = 58.65%; error rate = 1 - 61/104 = 41.35%; sensitivity = 60/61 = 98.36%; specificity = 1/43 = 2.33%.

LDA with interaction between Lag1 and Lag2:

```
lda.fit = lda(Direction ~ Lag2:Lag1, data = Weekly, subset = train)
lda.pred = predict(lda.fit, Weekly.0910)
table(lda.pred$class, Direction.0910)
```

```
## Direction.0910
## Down Up
## Down 9 5
## Up 34 56
```

Correct rate = 65/104 = 62.5%; error rate = 1 - 65/104 = 37.5%; sensitivity = 56/61 = 91.80%; specificity = 9/43 = 20.93%.

QDA with sqrt(abs(Lag2)):

```
qda.fit = qda(Direction ~ Lag2 + sqrt(abs(Lag2)), data = Weekly, subset = train)
qda.class = predict(qda.fit, Weekly.0910)$class
table(qda.class, Direction.0910)
```

```
## Direction.0910
## qda.class Down Up
## Down 12 13
## Up 31 48
```

Correct rate = 60/104 = 57.69%; error rate = 1 - 60/104 = 42.31%; sensitivity = 48/61 = 78.69%; specificity = 12/43 = 27.91%.

KNN with K = 10:

```
knn.pred = knn(train.X, test.X, train.Direction, k = 10)
table(knn.pred, Direction.0910)
```

```
## Direction.0910
## knn.pred Down Up
## Down 17 18
## Up 26 43
```

Correct rate = 60/104 = 57.69%; error rate = 1 - 60/104 = 42.31%; sensitivity = 43/61 = 70.49%; specificity = 17/43 = 39.53%.

KNN with K = 100:

```
knn.pred = knn(train.X, test.X, train.Direction, k = 100)
table(knn.pred, Direction.0910)
```

```
## Direction.0910
## knn.pred Down Up
## Down 9 12
## Up 34 49
```

Correct rate = 58/104 = 55.77%; error rate = 1 - 58/104 = 44.23%; sensitivity = 49/61 = 80.33%; specificity = 9/43 = 20.93%.

After experimenting with various methods and parameters, the best performing methods based on the test error rate were the logistic model built with only Lag2, the original LDA, and the new LDA with an interaction between Lag1 and Lag2; all three had the same error rate of 37.5%. All in all, the model you select depends on what you are trying to accomplish. For example, if you want a model that never misses a week when the market is up, you would go with the original QDA, at the cost of being wrong on every down week.