Introduction

The objective of this analysis is to explore the Weekly dataset found in the ISLR library. It contains 1089 weekly returns covering 21 years, from 1990 to 2010. We will first produce some numerical and graphical summaries. We will then use the full data set to perform a logistic regression with Direction as the response and the five lag variables plus Volume as predictors, and check whether any variables are significant. Next we will compute the confusion matrix and overall fraction of correct predictions, then fit the logistic regression model using a training data period from 1990 to 2008, with Lag2 as the only predictor. We will repeat that step for LDA, QDA, and KNN with K = 1, experiment with various combinations of parameters, and finally determine which method performs best.

Data

The Weekly dataset is found in the ISLR library and gives weekly percentage returns for the S&P 500 stock index between 1990 and 2010. Format of the data:
Year: The year that the observation was recorded

Lag1: Percentage return for previous week

Lag2: Percentage return for 2 weeks previous

Lag3: Percentage return for 3 weeks previous

Lag4: Percentage return for 4 weeks previous

Lag5: Percentage return for 5 weeks previous

Volume: Volume of shares traded (average number of daily shares traded in billions)

Today: Percentage return for this week

Direction: A factor with levels Down and Up indicating whether the market had a negative or positive return on a given week

Source:
Raw values of the S&P 500 were obtained from Yahoo Finance and then converted to percentages and lagged.

References:
James, G., Witten, D., Hastie, T., and Tibshirani, R. (2013). An Introduction to Statistical Learning with Applications in R.

Analysis

Numerical and Graphical Summaries

First we will take a look at our data and see if there are any preliminary trends or patterns we can find by producing graphical and numerical summaries.

library(ISLR)
attach(Weekly)
summary(Weekly)
##       Year           Lag1               Lag2               Lag3         
##  Min.   :1990   Min.   :-18.1950   Min.   :-18.1950   Min.   :-18.1950  
##  1st Qu.:1995   1st Qu.: -1.1540   1st Qu.: -1.1540   1st Qu.: -1.1580  
##  Median :2000   Median :  0.2410   Median :  0.2410   Median :  0.2410  
##  Mean   :2000   Mean   :  0.1506   Mean   :  0.1511   Mean   :  0.1472  
##  3rd Qu.:2005   3rd Qu.:  1.4050   3rd Qu.:  1.4090   3rd Qu.:  1.4090  
##  Max.   :2010   Max.   : 12.0260   Max.   : 12.0260   Max.   : 12.0260  
##       Lag4               Lag5              Volume       
##  Min.   :-18.1950   Min.   :-18.1950   Min.   :0.08747  
##  1st Qu.: -1.1580   1st Qu.: -1.1660   1st Qu.:0.33202  
##  Median :  0.2380   Median :  0.2340   Median :1.00268  
##  Mean   :  0.1458   Mean   :  0.1399   Mean   :1.57462  
##  3rd Qu.:  1.4090   3rd Qu.:  1.4050   3rd Qu.:2.05373  
##  Max.   : 12.0260   Max.   : 12.0260   Max.   :9.32821  
##      Today          Direction 
##  Min.   :-18.1950   Down:484  
##  1st Qu.: -1.1540   Up  :605  
##  Median :  0.2410             
##  Mean   :  0.1499             
##  3rd Qu.:  1.4050             
##  Max.   : 12.0260

Nothing out of the ordinary when looking at a general summary of the data set.

Matrix plot of each variable against the others:

pairs(Weekly)

There are no clear relationships or patterns between Direction and the other variables. This is not surprising; if there were any obvious relationships, everyone would be making money from the stock market. The only strong relationship appears to be between Year and Volume: as the year increases, so does volume.

Correlation Matrix:

cor(Weekly[, -9])
##               Year         Lag1        Lag2        Lag3         Lag4
## Year    1.00000000 -0.032289274 -0.03339001 -0.03000649 -0.031127923
## Lag1   -0.03228927  1.000000000 -0.07485305  0.05863568 -0.071273876
## Lag2   -0.03339001 -0.074853051  1.00000000 -0.07572091  0.058381535
## Lag3   -0.03000649  0.058635682 -0.07572091  1.00000000 -0.075395865
## Lag4   -0.03112792 -0.071273876  0.05838153 -0.07539587  1.000000000
## Lag5   -0.03051910 -0.008183096 -0.07249948  0.06065717 -0.075675027
## Volume  0.84194162 -0.064951313 -0.08551314 -0.06928771 -0.061074617
## Today  -0.03245989 -0.075031842  0.05916672 -0.07124364 -0.007825873
##                Lag5      Volume        Today
## Year   -0.030519101  0.84194162 -0.032459894
## Lag1   -0.008183096 -0.06495131 -0.075031842
## Lag2   -0.072499482 -0.08551314  0.059166717
## Lag3    0.060657175 -0.06928771 -0.071243639
## Lag4   -0.075675027 -0.06107462 -0.007825873
## Lag5    1.000000000 -0.05851741  0.011012698
## Volume -0.058517414  1.00000000 -0.033077783
## Today   0.011012698 -0.03307778  1.000000000

We see the same thing as in the plots: there are no relationships besides Year and Volume, which have a correlation of 0.84. Correlations between the lag variables and today's returns are close to zero, and no other patterns are discernible.
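As a quick check, we can scan the correlation matrix programmatically for large off-diagonal entries rather than reading them by eye (a small sketch, assuming the ISLR package is installed and the 0.5 cutoff is an arbitrary choice):

```r
# Sketch: confirm that Year-Volume is the only large off-diagonal correlation.
library(ISLR)
cm <- cor(Weekly[, -9])   # drop the Direction factor (column 9)
diag(cm) <- 0             # zero out the trivial self-correlations
max(abs(cm))              # largest remaining correlation (Year vs. Volume)
which(abs(cm) > 0.5, arr.ind = TRUE)   # 0.5 is an arbitrary cutoff
```

Only the Year/Volume pair survives the cutoff, matching what the pairs plot showed.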

Logistic Regression on Full Data

We will now perform a logistic regression on the full dataset using Direction as the response and the five lag variables and Volume as predictors.

glm.fit = glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume, data = Weekly, family = binomial)
summary(glm.fit)
## 
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + 
##     Volume, family = binomial, data = Weekly)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6949  -1.2565   0.9913   1.0849   1.4579  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  0.26686    0.08593   3.106   0.0019 **
## Lag1        -0.04127    0.02641  -1.563   0.1181   
## Lag2         0.05844    0.02686   2.175   0.0296 * 
## Lag3        -0.01606    0.02666  -0.602   0.5469   
## Lag4        -0.02779    0.02646  -1.050   0.2937   
## Lag5        -0.01447    0.02638  -0.549   0.5833   
## Volume      -0.02274    0.03690  -0.616   0.5377   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1496.2  on 1088  degrees of freedom
## Residual deviance: 1486.4  on 1082  degrees of freedom
## AIC: 1500.4
## 
## Number of Fisher Scoring iterations: 4

Out of all the predictors, the only one that is statistically significant at a significance level of 0.05 is Lag2, which has a p-value of 0.0296.

Confusion Matrix:

glm.probs = predict(glm.fit, type = "response")
glm.pred = rep("Down", length(glm.probs))
glm.pred[glm.probs > 0.5] = "Up"
table(glm.pred, Weekly$Direction)
##         
## glm.pred Down  Up
##     Down   54  48
##     Up    430 557

From the confusion matrix we calculate the percentage of correct predictions on the training data as (54+557)/1089 = 56.11%. The corresponding training error rate, 1 - 0.5611 = 43.89%, is likely overly optimistic compared to a test error rate. The sensitivity of this model for predicting an up week is 557/(557+48) = 92.07%; when the market goes up, the model predicts it 92.07% of the time. The specificity is 54/(430+54) = 11.16%, which is not great: in weeks when the market drops, the model correctly predicts it only 11.16% of the time.
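These rates can also be computed directly from the confusion matrix instead of by hand. The sketch below hardcodes the counts from the table above (rows are predictions, columns are actual Direction):

```r
# Confusion matrix values copied from the table above.
conf <- matrix(c(54, 430, 48, 557), nrow = 2,
               dimnames = list(Predicted = c("Down", "Up"),
                               Actual    = c("Down", "Up")))
accuracy    <- sum(diag(conf)) / sum(conf)                  # (54 + 557) / 1089
sensitivity <- conf["Up", "Up"] / sum(conf[, "Up"])         # 557 / (48 + 557)
specificity <- conf["Down", "Down"] / sum(conf[, "Down"])   # 54 / (430 + 54)
round(c(accuracy, sensitivity, specificity), 4)             # 0.5611 0.9207 0.1116
```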

Logistic Regression with Lag2 Only

Now we will fit the logistic regression model using a training data period from 1990 to 2008, with Lag2 as the only predictor.

Split our data:

train = (Year < 2009)
Weekly.0910 = Weekly[!train, ]
Direction.0910 = Direction[!train]

Fit the model and create the confusion matrix:

glm.fit = glm(Direction ~ Lag2, data = Weekly, family = binomial, subset = train)
glm.probs = predict(glm.fit, Weekly.0910, type = "response")
glm.pred = rep("Down", length(glm.probs))
glm.pred[glm.probs > 0.5] = "Up"
table(glm.pred, Direction.0910)
##         Direction.0910
## glm.pred Down Up
##     Down    9  5
##     Up     34 56

From the confusion matrix we calculate the percentage of correct predictions on the held-out 2009-2010 data as (9+56)/104 = 62.5%, so the test error rate is 1 - 0.625 = 37.5%. The sensitivity of this model for predicting an up week is 56/(5+56) = 91.80%; when the market goes up, the model predicts it 91.80% of the time. The specificity is 9/(34+9) = 20.93%, which is not great: in weeks when the market drops, the model correctly predicts it only 20.93% of the time.

LDA

Now we will repeat the previous part using LDA:

library(MASS)
lda.fit = lda(Direction ~ Lag2, data = Weekly, subset = train)
lda.pred = predict(lda.fit, Weekly.0910)
table(lda.pred$class, Direction.0910)
##       Direction.0910
##        Down Up
##   Down    9  5
##   Up     34 56

The confusion matrix is identical to the one from the logistic regression model, so the error rate, sensitivity, and specificity are the same: LDA offers no improvement here over the logistic model.

QDA

Repeat same procedure but using QDA:

qda.fit = qda(Direction ~ Lag2, data = Weekly, subset = train)
qda.class = predict(qda.fit, Weekly.0910)$class
table(qda.class, Direction.0910)
##          Direction.0910
## qda.class Down Up
##      Down    0  0
##      Up     43 61

We see that the model always predicted Up and never predicted Down. The correct prediction rate is 61/104 = 58.65%, giving an error rate of 41.35%. Sensitivity = 61/61 = 100%: the model classified every up week correctly. Specificity = 0/43 = 0%: it did not get a single down week right.

KNN

Lastly, we will fit KNN with K = 1 and see how well it performs:

library(class)
train.X = as.matrix(Lag2[train])
test.X = as.matrix(Lag2[!train])
train.Direction = Direction[train]
set.seed(1)
knn.pred = knn(train.X, test.X, train.Direction, k = 1)
table(knn.pred, Direction.0910)
##         Direction.0910
## knn.pred Down Up
##     Down   21 30
##     Up     22 31

The correct prediction rate is (21+31)/104 = 50%, which is no better than guessing, so the error rate is also 50%. Sensitivity = 31/61 = 50.82%. Specificity = 21/43 = 48.84%. This model performed worse overall than the others.

Model Performance and Comparison

Out of logistic regression (full data), logistic regression (Lag2 only), LDA, QDA, and KNN with K = 1, the best performing methods were the logistic model based only on the Lag2 predictor and LDA. They had identical performance, with correct prediction rates of 62.5% and error rates of 37.5%; their sensitivity and specificity were 91.80% and 20.93%, respectively. However, if you wanted a model that always made the correct prediction when the market was up, you would go with QDA: it classified every up week correctly, although it was incorrect on every down week.
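To make comparisons like this less error-prone, the rate calculations can be wrapped in a small helper and applied to any of the 2x2 tables above. A minimal sketch, with the logistic/LDA test matrix hardcoded from the tables shown earlier:

```r
# Helper: summarize a 2x2 confusion matrix (rows = predicted Down/Up,
# cols = actual Down/Up) into the three rates quoted in the text.
rates <- function(tab) {
  c(accuracy    = sum(diag(tab)) / sum(tab),
    sensitivity = tab["Up", "Up"] / sum(tab[, "Up"]),
    specificity = tab["Down", "Down"] / sum(tab[, "Down"]))
}

# The logistic (Lag2 only) / LDA test confusion matrix from above.
lda.tab <- matrix(c(9, 34, 5, 56), nrow = 2,
                  dimnames = list(c("Down", "Up"), c("Down", "Up")))
round(rates(lda.tab), 4)   # 0.6250 0.9180 0.2093
```

The same `rates()` call applied to the QDA or KNN tables reproduces the other figures quoted above.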

Experiment with different combinations

Logistic Regression with interaction between Lag1 and Lag2:

glm.fit = glm(Direction ~ Lag2:Lag1, data = Weekly, family = binomial, subset = train)
glm.probs = predict(glm.fit, Weekly.0910, type = "response")
glm.pred = rep("Down", length(glm.probs))
glm.pred[glm.probs > 0.5] = "Up"
Direction.0910 = Direction[!train]
table(glm.pred, Direction.0910)
##         Direction.0910
## glm.pred Down Up
##     Down    1  1
##     Up     42 60

Correct Rate = 61/104 = 58.65%
Error Rate = 1 - (61/104) = 41.35%
Sensitivity = 60/61 = 98.36%
Specificity = 1/43 = 2.33%

LDA with interaction between Lag1 and Lag2:

lda.fit = lda(Direction ~ Lag2:Lag1, data = Weekly, subset = train)
lda.pred = predict(lda.fit, Weekly.0910)
table(lda.pred$class, Direction.0910)
##       Direction.0910
##        Down Up
##   Down    9  5
##   Up     34 56

Correct Rate = 65/104 = 62.5%
Error Rate = 1 - (65/104) = 37.5%
Sensitivity = 56/61 = 91.80%
Specificity = 9/43 = 20.93%

QDA with sqrt(abs(Lag2)):

qda.fit = qda(Direction ~ Lag2 + sqrt(abs(Lag2)), data = Weekly, subset = train)
qda.class = predict(qda.fit, Weekly.0910)$class
table(qda.class, Direction.0910)
##          Direction.0910
## qda.class Down Up
##      Down   12 13
##      Up     31 48

Correct Rate = 60/104 = 57.69%
Error Rate = 1 - (60/104) = 42.31%
Sensitivity = 48/61 = 78.69%
Specificity = 12/43 = 27.91%

KNN with K = 10:

knn.pred = knn(train.X, test.X, train.Direction, k = 10)
table(knn.pred, Direction.0910)
##         Direction.0910
## knn.pred Down Up
##     Down   17 18
##     Up     26 43

Correct Rate = 60/104 = 57.69%
Error Rate = 1 - (60/104) = 42.31%
Sensitivity = 43/61 = 70.49%
Specificity = 17/43 = 39.53%

KNN with K = 100:

knn.pred = knn(train.X, test.X, train.Direction, k = 100)
table(knn.pred, Direction.0910)
##         Direction.0910
## knn.pred Down Up
##     Down    9 12
##     Up     34 49

Correct Rate = 58/104 = 55.77%
Error Rate = 1 - (58/104) = 44.23%
Sensitivity = 49/61 = 80.33%
Specificity = 9/43 = 20.93%
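Rather than trying K values one at a time, the sweep can be automated. A sketch, assuming the ISLR and class packages are available (it rebuilds the same 1990-2008 / 2009-2010 split used above; the particular K grid is an arbitrary choice):

```r
# Sketch: sweep several values of K and compare test accuracy.
library(ISLR)
library(class)
train   <- Weekly$Year < 2009
train.X <- as.matrix(Weekly$Lag2[train])
test.X  <- as.matrix(Weekly$Lag2[!train])
set.seed(1)   # knn() breaks ties at random, so fix the seed
for (k in c(1, 5, 10, 50, 100)) {
  pred <- knn(train.X, test.X, Weekly$Direction[train], k = k)
  cat("K =", k, "accuracy =",
      round(mean(pred == Weekly$Direction[!train]), 4), "\n")
}
```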

Conclusion and Summary

After experimenting with various methods and parameters, the best performing methods based on the error rate were the logistic model built with only Lag2, the original LDA, and the new LDA with an interaction between Lag1 and Lag2; all three had the same error rate of 37.5%. Ultimately, the model you select depends on what you are trying to accomplish. For example, if you want a model that never misses a week when the market is up, you would go with QDA.