Introduction

For this analysis, we will implement a multi-part simulation study to evaluate the performance of a Monte Carlo cross-validation method for model selection that is very similar to “leave-k-out” cross validation. The “Monte Carlo” cross-validation method that we will be assessing consists of splitting the data into two parts. For this analysis, we’ll use one third for training and two thirds for testing.

Data

We will not be using a supplied data set but instead will be generating our own data through R. We will do this by creating data pairs (xi,yi) where i = 1,2,…,n and yi = 2 + 2x + 2x^2i + error term. Also where xi = 5i/n which will be equally spaced on the interval [0,5] and the error term will come from the standard normal distribution N(0,1).

Analysis

Part 1

Let us begin initializing the neccesary steps we need to begin this analysis:
We will first import the ‘Matrix’ library

library(Matrix)

Now we will begin to generate our previously mentioned data points by first creating mulitple copies of the vector y and storing it in a list.

generate.y = function(copies, X, b) { 
  lapply(1:copies, function(i) X%*%b + rnorm(nrow(X),0,1))
} #generate.y

Now create the least-squares estimate of beta (vector) and store it in a list.

beta.hat = function(sam=c(1:nrow(X)), A, X, y.vec) { 
  b = vector("list", length(A)) # initialize list of same length as length of A
  for (i in 1:length(A)) { # using subset of X corresponding to columns in A[[i]] and observations in 'sam'
    b[[i]] = solve( t(X[sam, A[[i]]]) %*% X[sam, A[[i]]] ) %*% t(X[sam, A[[i]]]) %*% y.vec 
  }
  return(b)
} # beta.hat

We will initialize the list of lists

get.ASPE = function(sam=c(1:nrow(X)), A, X, z, w) {
  PE = vector("list", length(z)) # initialize list of lists
  for (k in 1:length(z)) {
    PE[[k]] = vector("list", length(A))
  }
  for (i in 1:length(A)) {
    if (length(X[-sam, A[[i]]]) > 0) {
      for (k in 1:length(z)) {
        PE[[k]][[i]] = (1/length(z[[k]])) * sum( ( z[[k]] - X[-sam, A[[i]]] %*% w[[k]][[i]] )^2)
      }
    }
  }
  return(PE)
} # get.ASPE

Now it is time to split our data by using 1/3 of it for training and 2/3 for testing. Also we will simulate the variability of the optimal model for the selected split, we will generate multiple “copies” or instances of the response vector y. Each copy will result in a new least-squares estimate (from the training set) and a new average squared prediction error (from the testing set). For each copy of y, a (possibly) different optimal model will be selected, based solely on random variation in the data generating mechanism. We can then assess the performance of the procedure for a single fixed split by looking at how many times each of the 17 models indexed by matrix A was selected as optimal.

optimal.models = function(A, X, y) {

  # Split the data using 1/3 for training and 2/3 for testing
  sam = sample(c(1:nrow(X)), round(nrow(X)/3, 0))
  ym = vector("list", length(y))
  yl = vector("list", length(y))
  for (i in 1:length(y)) {
    ym[[i]] = y[[i]][sam]
    yl[[i]] = y[[i]][-sam]
  }
 
  # Compute least-squares estimate using only the training set
  b = vector("list", length(ym))
  for (i in 1:length(ym)) {
    b[[i]] = beta.hat(sam, A, X, ym[[i]]) 
  }

  # Evaluating average squared prediction error using the testing part of the data
  ASPE = get.ASPE(sam, A, X, yl, b)

  # Pick out the model with the smallest average squared prediction error
  optimal.val = sapply(lapply(ASPE, function(x) sort(unlist(x))), function(x) x[1])
  optimal.mod = vector("character",length(optimal.val))
  for (i in 1:length(optimal.val)) {
    tmp = names(A[abs(unlist(ASPE[[i]]) - optimal.val[i]) < 0.0001]) # equal to smallest ASPE to within 0.0001 precision (can be modified)
    if (length(tmp) > 1) { # in case of (near) ties, break tie at random
      optimal.mod[i] = sample(tmp, 1)
    } else{
      optimal.mod[i] = tmp
    }
  }

  # tabulate the findings, including any models for which the frequency of selection is zero
  freq = vector("integer", length(A))
  names(freq) = names(A)
  freq[names(table(optimal.mod))] = table(optimal.mod)
 
  # return the frequency table as well as the subset of observations selected for the training set
  return(list(freq=freq, train=sort(sam))) 
} # optimal.models

We will now create the collection of models.

obs = 100

## Create collection of models
x1 = rep(1,obs)
x2 = 5*(1:obs) / obs
x3 = x2*sqrt(x2)
x4 = x2^2
x5 = log(x2)
x6 = x2*x5
X = cbind(x1,x2,x3,x4,x5,x6)
dimnames(X)[[1]] = c(1:obs)

rm(x1,x2,x3,x4,x5,x6,obs) # clean up: variables only needed to build the input matrix X

A = list(c(1,2,3,5,6), c(1,2,4,5,6), 
         c(1,2,5,6), c(1,3,5,6), c(1,4,5,6),c(2,3,5,6), c(2,4,5,6), 
         c(1,2,3), c(1,2,4), c(1,2,5), c(1,2,6), c(4,5,6),
         c(1,2), c(1,3), c(1,5), c(4,5), c(2,4))
names(A) = c("(1,2,3,5,6)", "(1,2,4,5,6)", 
             "(1,2,5,6)", "(1,3,5,6)", "(1,4,5,6)", "(2,3,5,6)", "(2,4,5,6)", 
             "(1,2,3)", "(1,2,4)", "(1,2,5)", "(1,2,6)", "(4,5,6)", 
             "(1,2)", "(1,3)", "(1,5)", "(4,5)", "(2,4)") 

Then implement the simulations with our data and models. The first simulation will be a distribution of optimal models for fixed split over different instances of y. Then we will run the second simulation that will be a distribution of optimal models for single instance of y over multiple splits.

## Implement simulations ##

## Simulation 1: Distribution of optimal models for fixed split over different instances of y
copies = 1000

set.seed(17)
ylist = generate.y(copies, X[, A[[9]]], rep(2,3))

set.seed(11)
seed.optimal = optimal.models(A, X, ylist)
seed.optimal
## $freq
## (1,2,3,5,6) (1,2,4,5,6)   (1,2,5,6)   (1,3,5,6)   (1,4,5,6)   (2,3,5,6) 
##          77          63           7          45         100         111 
##   (2,4,5,6)     (1,2,3)     (1,2,4)     (1,2,5)     (1,2,6)     (4,5,6) 
##         171          34         389           0           0           0 
##       (1,2)       (1,3)       (1,5)       (4,5)       (2,4) 
##           0           0           0           0           3 
## 
## $train
##  [1]   1   2   5   7   9  10  12  13  16  17  23  27  28  29  30  36  37
## [18]  39  40  41  49  51  54  64  71  75  78  80  82  91  93  97 100

Visualization of the optimal models that were chosen for simulation 1:

plot(seed.optimal$freq, type="h", main=paste("Proportion of times each model is selected as the optimal model\n for an arbitrary but fixed split over", copies, "instances of y"), xlab="", ylab="", bty="n", axes=F)
axis(1, at=1:length(A), labels=names(A), las=2)
points(1:length(A), seed.optimal$freq, pch=20)
text(x=1:length(A), y=seed.optimal$freq+copies/100, labels=round(seed.optimal$freq/copies, 2), cex=0.7)

Simulation #2

## Simulation 2: Distribution of optimal models for single instance of y over multiple splits
splits = 1000

set.seed(17)
ylist = generate.y(1, X[, A[[9]]], rep(2,3))

set.seed(11)
iter.optimal = vector("list", splits)
for (iter in 1:splits) {
 iter.optimal[[iter]] = optimal.models(A, X, ylist)
}

We need to sum across the iterations to get a frequency of each model.

## Sum across iterations to get frequency of selection of each model
iter.freq = iter.optimal[[1]]$freq
for (iter in 2:splits) {
 iter.freq = iter.freq + iter.optimal[[iter]]$freq
}

iter.freq
## (1,2,3,5,6) (1,2,4,5,6)   (1,2,5,6)   (1,3,5,6)   (1,4,5,6)   (2,3,5,6) 
##         173         104          24           1         172         196 
##   (2,4,5,6)     (1,2,3)     (1,2,4)     (1,2,5)     (1,2,6)     (4,5,6) 
##           4           0         325           0           0           0 
##       (1,2)       (1,3)       (1,5)       (4,5)       (2,4) 
##           0           0           0           0           1
Let us create and examine a visualization of the optimal models that were chosen for simulation 2:  
plot(iter.freq, type="h", main=paste("Proportion of times each model is selected as the optimal model\n for a single instance of y over", splits, "splits"), xlab="", ylab="", bty="n", axes=F)
axis(1, at=1:length(A), labels=names(A), las=2)
points(1:length(A), iter.freq, pch=20)
text(x=1:length(A), y=iter.freq+splits/100, labels=round(iter.freq/splits, 2), cex=0.7)

Summary of Findings

Now that we have built and ran the two Monte-Carlo simulations let us take a look at what the results were. Keep in mind that the test model is (1,2,4), so the frequency of the optimal model should be heavily favored towards this model for both simulations. We will start by looking at Simulation #1, which was the distribution of optimal models for fixed split over different instances of y. With a 1000 different copies of y, we see that when we plot the frequency of the optimal model, model (1,2,4) has the highest frequency at 0.39. This means that 39% of the iterations pick that model as being the optimal and most accurate. The second best model was (2,3,5,6) which had a frequency of 0.17 or 17%.

plot(seed.optimal$freq, type="h", main=paste("Proportion of times each model is selected as the optimal model\n for an arbitrary but fixed split over", copies, "instances of y"), xlab="", ylab="", bty="n", axes=F)
axis(1, at=1:length(A), labels=names(A), las=2)
points(1:length(A), seed.optimal$freq, pch=20)
text(x=1:length(A), y=seed.optimal$freq+copies/100, labels=round(seed.optimal$freq/copies, 2), cex=0.7)

Now how about Simuliation #2, which was a distribution of optimal models for single instance of y over multiple splits. Here we set the amount of splits to 1000. After running the simulation the result was that the frequency of the optimal model, model (1,2,4) has the highest frequency at 0.32. This means that 32% of the iterations pick that model as being the optimal and most accurate. The second best model was (2,3,5,6) which had a frequency of 0.2 or 20%.

plot(iter.freq, type="h", main=paste("Proportion of times each model is selected as the optimal model\n for a single instance of y over", splits, "splits"), xlab="", ylab="", bty="n", axes=F)
axis(1, at=1:length(A), labels=names(A), las=2)
points(1:length(A), iter.freq, pch=20)
text(x=1:length(A), y=iter.freq+splits/100, labels=round(iter.freq/splits, 2), cex=0.7)

The simulations pick the ‘true’ model at a rate of 39% and 32% for simulations 1 and 2 respectively. On this metric, simulation #1 performed better since it was able to choose the ‘true’ model more often. The reason for this could be due to the fact that simulation #2 utilized more splits, which could lead to a similar situation as ‘Leave-1-out-Cross Validation’. What I mean by this is that in LOOCV the resulting models have a high variance since the testing set can vary so much. Depending on the testing set your model would fit well or terribly. While in simulation #1 we are sampling 1000 copies and then splitting the test and training up at the same proportion each time. This leads to a more balanced and accurate result.

Part 2

Now we will see the performance of Mallows’ Cp.

Let us start coding the method, by first initializing the Cp function.

get.Cp = function(A, X, z, w) {
  Cp = vector("list", length(z)) # initialize list of lists
  for (k in 1:length(z)) {
    Cp[[k]] = vector("list", length(A))
  }
 
  for (i in 1:length(A)) {
    for (k in 1:length(z)) {
      # get estimate of variance from largest model that (you hope) contains the true model
      sigma2 = summary(lm(z[[k]] ~ -1 + X[,1] + X[,2] + X[,3] + X[,4] + X[,5] + X[,6]))$sigma^2
      # compute Cp using formula (6.2) in ISLR Section 6.1.3
      Cp[[k]][[i]] = (1/nrow(X))*(sum((z[[k]] - X[, A[[i]]] %*% w[[k]][[i]])^2) + 2*length(A[[i]])*sigma2)
    }
  }
  return(Cp)
} # get.Cp

Now to create the optimal models function:

optimal.models.Cp = function(A, X, y) {
 
  # Compute least-squares estimates using full data
  b = vector("list", length(y))
  for (i in 1:length(y)) {
    b[[i]] = beta.hat(A=A, X=X, y.vec=y[[i]]) 
  }
 
  # Evaluate Mallows Cp
  Cp = get.Cp(A, X, y, b)
 
  # Pick out the model with the smallest value
  optimal.val = sapply(lapply(Cp, function(x) sort(unlist(x))), function(x) x[1])
  optimal.mod = vector("character",length(optimal.val))
  for (i in 1:length(optimal.val)) {
    tmp = names(A[abs(unlist(Cp[[i]]) - optimal.val[i]) < 0.0001]) # equal to smallest Cp to within 0.0001 precision (can be modified)
    if (length(tmp) > 1) { # in case of (near) ties, break tie at random
      optimal.mod[i] = sample(tmp, 1)
    } else{
      optimal.mod[i] = tmp
    }
  }
 
  # tabulate the findings, including any models for which the frequency of selection is zero
  freq = vector("integer", length(A))
  names(freq) = names(A)
  freq[names(table(optimal.mod))] = table(optimal.mod)
 
  # return the frequency table as well as the subset of observations selected for the training set
  return(list(freq=freq)) 
} # optimal.models.Cp

Now we will call the mallows’ cp function based on 500 copies instead of 1000 due to resource contraint. I went back and changed the amount of copies in the previous block of code where we implemented the simulations for Monte-Carlo.

copies = 500

set.seed(17)
ylist = generate.y(copies, X[, A[[9]]], rep(2,3))

set.seed(11)
seed.optimal = optimal.models(A, X, ylist)
seed.optimal
## $freq
## (1,2,3,5,6) (1,2,4,5,6)   (1,2,5,6)   (1,3,5,6)   (1,4,5,6)   (2,3,5,6) 
##          39          36           4          24          47          55 
##   (2,4,5,6)     (1,2,3)     (1,2,4)     (1,2,5)     (1,2,6)     (4,5,6) 
##          81          16         197           0           0           0 
##       (1,2)       (1,3)       (1,5)       (4,5)       (2,4) 
##           0           0           0           0           1 
## 
## $train
##  [1]   1   2   5   7   9  10  12  13  16  17  23  27  28  29  30  36  37
## [18]  39  40  41  49  51  54  64  71  75  78  80  82  91  93  97 100
Cp.optimal = optimal.models.Cp(A, X, ylist)
Cp.optimal
## $freq
## (1,2,3,5,6) (1,2,4,5,6)   (1,2,5,6)   (1,3,5,6)   (1,4,5,6)   (2,3,5,6) 
##          15          17           1           0          28          41 
##   (2,4,5,6)     (1,2,3)     (1,2,4)     (1,2,5)     (1,2,6)     (4,5,6) 
##          55           2         341           0           0           0 
##       (1,2)       (1,3)       (1,5)       (4,5)       (2,4) 
##           0           0           0           0           0

We see that the (1,2,4) is picked the most at 341 times out of 500, giving us a rate of 0.68 or 68%, which is immensely better than the Monte-Carlo Simulations.

Part 3

Now suppose we had ‘x2 = 1(1:obs) / obs’ instead of ’x2 = 5(1:obs) / obs’. We will now re-run the two simulations from Part 1 and Mallows’ Cp from Part 2.

First let us replace the ‘x2’:

obs = 100

## Create collection of models
x1 = rep(1,obs)
x2 = 1*(1:obs) / obs
x3 = x2*sqrt(x2)
x4 = x2^2
x5 = log(x2)
x6 = x2*x5
X = cbind(x1,x2,x3,x4,x5,x6)
dimnames(X)[[1]] = c(1:obs)

rm(x1,x2,x3,x4,x5,x6,obs) # clean up: variables only needed to build the input matrix X

A = list(c(1,2,3,5,6), c(1,2,4,5,6), 
         c(1,2,5,6), c(1,3,5,6), c(1,4,5,6),c(2,3,5,6), c(2,4,5,6), 
         c(1,2,3), c(1,2,4), c(1,2,5), c(1,2,6), c(4,5,6),
         c(1,2), c(1,3), c(1,5), c(4,5), c(2,4))
names(A) = c("(1,2,3,5,6)", "(1,2,4,5,6)", 
             "(1,2,5,6)", "(1,3,5,6)", "(1,4,5,6)", "(2,3,5,6)", "(2,4,5,6)", 
             "(1,2,3)", "(1,2,4)", "(1,2,5)", "(1,2,6)", "(4,5,6)", 
             "(1,2)", "(1,3)", "(1,5)", "(4,5)", "(2,4)") 

Now we will re-run the two Monte-Carlo Simulations first:

## Simulation 1: Distribution of optimal models for fixed split over different instances of y
copies = 1000

set.seed(17)
ylist = generate.y(copies, X[, A[[9]]], rep(2,3))

set.seed(11)
seed.optimal = optimal.models(A, X, ylist)
seed.optimal
## $freq
## (1,2,3,5,6) (1,2,4,5,6)   (1,2,5,6)   (1,3,5,6)   (1,4,5,6)   (2,3,5,6) 
##          36          48          30          14          66          22 
##   (2,4,5,6)     (1,2,3)     (1,2,4)     (1,2,5)     (1,2,6)     (4,5,6) 
##          47          11          35          76          47          97 
##       (1,2)       (1,3)       (1,5)       (4,5)       (2,4) 
##         178         272          15           1           5 
## 
## $train
##  [1]   1   2   5   7   9  10  12  13  16  17  23  27  28  29  30  36  37
## [18]  39  40  41  49  51  54  64  71  75  78  80  82  91  93  97 100

Simulation #1 Plot:

plot(seed.optimal$freq, type="h", main=paste("Proportion of times each model is selected as the optimal model\n for an arbitrary but fixed split over", copies, "instances of y"), xlab="", ylab="", bty="n", axes=F)
axis(1, at=1:length(A), labels=names(A), las=2)
points(1:length(A), seed.optimal$freq, pch=20)
text(x=1:length(A), y=seed.optimal$freq+copies/100, labels=round(seed.optimal$freq/copies, 2), cex=0.7)

From first look at the plot, we see that the results are drastically different, as the most selected model is (1,3) and the ‘true’ model is only picked 0.04 of the time.

Simulation #2:

## Simulation 2: Distribution of optimal models for single instance of y over multiple splits
splits = 1000

set.seed(17)
ylist = generate.y(1, X[, A[[9]]], rep(2,3))

set.seed(11)
iter.optimal = vector("list", splits)
for (iter in 1:splits) {
 iter.optimal[[iter]] = optimal.models(A, X, ylist)
}

## Sum across iterations to get frequency of selection of each model
iter.freq = iter.optimal[[1]]$freq
for (iter in 2:splits) {
 iter.freq = iter.freq + iter.optimal[[iter]]$freq
}

iter.freq
## (1,2,3,5,6) (1,2,4,5,6)   (1,2,5,6)   (1,3,5,6)   (1,4,5,6)   (2,3,5,6) 
##          54          96          81          75         217           4 
##   (2,4,5,6)     (1,2,3)     (1,2,4)     (1,2,5)     (1,2,6)     (4,5,6) 
##           2          11          49           4           5           1 
##       (1,2)       (1,3)       (1,5)       (4,5)       (2,4) 
##          77         321           3           0           0

Simulation #2 Plot:

plot(iter.freq, type="h", main=paste("Proportion of times each model is selected as the optimal model\n for a single instance of y over", splits, "splits"), xlab="", ylab="", bty="n", axes=F)
axis(1, at=1:length(A), labels=names(A), las=2)
points(1:length(A), iter.freq, pch=20)
text(x=1:length(A), y=iter.freq+splits/100, labels=round(iter.freq/splits, 2), cex=0.7)

Looking at the plot of Simulation #2 we see the optimal model that is selected the most is (1,3) at 0.32 or 32% of the time.

Mallows’ Cp:

copies = 500

set.seed(17)
ylist = generate.y(copies, X[, A[[9]]], rep(2,3))

set.seed(11)
seed.optimal = optimal.models(A, X, ylist)
seed.optimal
## $freq
## (1,2,3,5,6) (1,2,4,5,6)   (1,2,5,6)   (1,3,5,6)   (1,4,5,6)   (2,3,5,6) 
##          20          21          17           6          34          10 
##   (2,4,5,6)     (1,2,3)     (1,2,4)     (1,2,5)     (1,2,6)     (4,5,6) 
##          22           7          15          39          27          50 
##       (1,2)       (1,3)       (1,5)       (4,5)       (2,4) 
##          91         130           8           1           2 
## 
## $train
##  [1]   1   2   5   7   9  10  12  13  16  17  23  27  28  29  30  36  37
## [18]  39  40  41  49  51  54  64  71  75  78  80  82  91  93  97 100

From the Mallows’ Cp output and frequency of models selected, we see that the most frequently selected model is (1,3) and not the ‘true’ model.

Summary

Let us call the orignal x2 scenario, Scenario #1, and the scenario with the new x2, Scenario #2.

Scenario #1, Original x2

In Scenario #1, the optimal model selected and it’s frequency for the three methods are:

obs = 100

## Create collection of models
x1 = rep(1,obs)
x2 = 5*(1:obs) / obs
x3 = x2*sqrt(x2)
x4 = x2^2
x5 = log(x2)
x6 = x2*x5
X = cbind(x1,x2,x3,x4,x5,x6)
dimnames(X)[[1]] = c(1:obs)

rm(x1,x2,x3,x4,x5,x6,obs) # clean up: variables only needed to build the input matrix X

A = list(c(1,2,3,5,6), c(1,2,4,5,6), 
         c(1,2,5,6), c(1,3,5,6), c(1,4,5,6),c(2,3,5,6), c(2,4,5,6), 
         c(1,2,3), c(1,2,4), c(1,2,5), c(1,2,6), c(4,5,6),
         c(1,2), c(1,3), c(1,5), c(4,5), c(2,4))
names(A) = c("(1,2,3,5,6)", "(1,2,4,5,6)", 
             "(1,2,5,6)", "(1,3,5,6)", "(1,4,5,6)", "(2,3,5,6)", "(2,4,5,6)", 
             "(1,2,3)", "(1,2,4)", "(1,2,5)", "(1,2,6)", "(4,5,6)", 
             "(1,2)", "(1,3)", "(1,5)", "(4,5)", "(2,4)") 
copies = 1000

set.seed(17)
ylist = generate.y(copies, X[, A[[9]]], rep(2,3))

set.seed(11)
seed.optimal = optimal.models(A, X, ylist)
seed.optimal
## $freq
## (1,2,3,5,6) (1,2,4,5,6)   (1,2,5,6)   (1,3,5,6)   (1,4,5,6)   (2,3,5,6) 
##          77          63           7          45         100         111 
##   (2,4,5,6)     (1,2,3)     (1,2,4)     (1,2,5)     (1,2,6)     (4,5,6) 
##         171          34         389           0           0           0 
##       (1,2)       (1,3)       (1,5)       (4,5)       (2,4) 
##           0           0           0           0           3 
## 
## $train
##  [1]   1   2   5   7   9  10  12  13  16  17  23  27  28  29  30  36  37
## [18]  39  40  41  49  51  54  64  71  75  78  80  82  91  93  97 100
plot(seed.optimal$freq, type="h", main=paste("Proportion of times each model is selected as the optimal model\n for an arbitrary but fixed split over", copies, "instances of y"), xlab="", ylab="", bty="n", axes=F)
axis(1, at=1:length(A), labels=names(A), las=2)
points(1:length(A), seed.optimal$freq, pch=20)
text(x=1:length(A), y=seed.optimal$freq+copies/100, labels=round(seed.optimal$freq/copies, 2), cex=0.7)

For Monte-Carlo Simulation #1, the optimal model that was picked the most was the ‘true’ model (1,2,4) at a frequency of 0.39.

Scenario #1, Monte-Carlo Simulation #2:

splits = 1000

set.seed(17)
ylist = generate.y(1, X[, A[[9]]], rep(2,3))

set.seed(11)
iter.optimal = vector("list", splits)
for (iter in 1:splits) {
 iter.optimal[[iter]] = optimal.models(A, X, ylist)
}
## Sum across iterations to get frequency of selection of each model
iter.freq = iter.optimal[[1]]$freq
for (iter in 2:splits) {
 iter.freq = iter.freq + iter.optimal[[iter]]$freq
}

iter.freq
## (1,2,3,5,6) (1,2,4,5,6)   (1,2,5,6)   (1,3,5,6)   (1,4,5,6)   (2,3,5,6) 
##         173         104          24           1         172         196 
##   (2,4,5,6)     (1,2,3)     (1,2,4)     (1,2,5)     (1,2,6)     (4,5,6) 
##           4           0         325           0           0           0 
##       (1,2)       (1,3)       (1,5)       (4,5)       (2,4) 
##           0           0           0           0           1
plot(iter.freq, type="h", main=paste("Proportion of times each model is selected as the optimal model\n for a single instance of y over", splits, "splits"), xlab="", ylab="", bty="n", axes=F)
axis(1, at=1:length(A), labels=names(A), las=2)
points(1:length(A), iter.freq, pch=20)
text(x=1:length(A), y=iter.freq+splits/100, labels=round(iter.freq/splits, 2), cex=0.7)

For Monte-Carlo Simulation #2, the optimal model that was picked the most was the ‘true’ model (1,2,4) at a frequency of 0.32.

Scenario #1, Mallows’ Cp:
First, change the amount of copies

copies = 500

set.seed(17)
ylist = generate.y(copies, X[, A[[9]]], rep(2,3))

set.seed(11)
seed.optimal = optimal.models(A, X, ylist)
seed.optimal
## $freq
## (1,2,3,5,6) (1,2,4,5,6)   (1,2,5,6)   (1,3,5,6)   (1,4,5,6)   (2,3,5,6) 
##          39          36           4          24          47          55 
##   (2,4,5,6)     (1,2,3)     (1,2,4)     (1,2,5)     (1,2,6)     (4,5,6) 
##          81          16         197           0           0           0 
##       (1,2)       (1,3)       (1,5)       (4,5)       (2,4) 
##           0           0           0           0           1 
## 
## $train
##  [1]   1   2   5   7   9  10  12  13  16  17  23  27  28  29  30  36  37
## [18]  39  40  41  49  51  54  64  71  75  78  80  82  91  93  97 100
Cp.optimal = optimal.models.Cp(A, X, ylist)
Cp.optimal
## $freq
## (1,2,3,5,6) (1,2,4,5,6)   (1,2,5,6)   (1,3,5,6)   (1,4,5,6)   (2,3,5,6) 
##          15          17           1           0          28          41 
##   (2,4,5,6)     (1,2,3)     (1,2,4)     (1,2,5)     (1,2,6)     (4,5,6) 
##          55           2         341           0           0           0 
##       (1,2)       (1,3)       (1,5)       (4,5)       (2,4) 
##           0           0           0           0           0

For Mallows’ Cp, the optimal model that was picked the most was the ‘true’ model (1,2,4) at a frequency of 0.39.

Scenario #2, New x2:

Scenario #2, Monte-Carlo Simulation #1:
Initializations:

obs = 100

## Create collection of models
x1 = rep(1,obs)
x2 = 1*(1:obs) / obs
x3 = x2*sqrt(x2)
x4 = x2^2
x5 = log(x2)
x6 = x2*x5
X = cbind(x1,x2,x3,x4,x5,x6)
dimnames(X)[[1]] = c(1:obs)

rm(x1,x2,x3,x4,x5,x6,obs) # clean up: variables only needed to build the input matrix X

A = list(c(1,2,3,5,6), c(1,2,4,5,6), 
         c(1,2,5,6), c(1,3,5,6), c(1,4,5,6),c(2,3,5,6), c(2,4,5,6), 
         c(1,2,3), c(1,2,4), c(1,2,5), c(1,2,6), c(4,5,6),
         c(1,2), c(1,3), c(1,5), c(4,5), c(2,4))
names(A) = c("(1,2,3,5,6)", "(1,2,4,5,6)", 
             "(1,2,5,6)", "(1,3,5,6)", "(1,4,5,6)", "(2,3,5,6)", "(2,4,5,6)", 
             "(1,2,3)", "(1,2,4)", "(1,2,5)", "(1,2,6)", "(4,5,6)", 
             "(1,2)", "(1,3)", "(1,5)", "(4,5)", "(2,4)") 
copies = 1000

set.seed(17)
ylist = generate.y(copies, X[, A[[9]]], rep(2,3))

set.seed(11)
seed.optimal = optimal.models(A, X, ylist)
seed.optimal
## $freq
## (1,2,3,5,6) (1,2,4,5,6)   (1,2,5,6)   (1,3,5,6)   (1,4,5,6)   (2,3,5,6) 
##          36          48          30          14          66          22 
##   (2,4,5,6)     (1,2,3)     (1,2,4)     (1,2,5)     (1,2,6)     (4,5,6) 
##          47          11          35          76          47          97 
##       (1,2)       (1,3)       (1,5)       (4,5)       (2,4) 
##         178         272          15           1           5 
## 
## $train
##  [1]   1   2   5   7   9  10  12  13  16  17  23  27  28  29  30  36  37
## [18]  39  40  41  49  51  54  64  71  75  78  80  82  91  93  97 100
plot(seed.optimal$freq, type="h", main=paste("Proportion of times each model is selected as the optimal model\n for an arbitrary but fixed split over", copies, "instances of y"), xlab="", ylab="", bty="n", axes=F)
axis(1, at=1:length(A), labels=names(A), las=2)
points(1:length(A), seed.optimal$freq, pch=20)
text(x=1:length(A), y=seed.optimal$freq+copies/100, labels=round(seed.optimal$freq/copies, 2), cex=0.7)

For Monte-Carlo Simulation #1, the most selected model is (1,3) at 0.27 and the ‘true’ model is only picked 0.04 of the time.

Monte-Carlo Simulation #2:
Initializations

splits = 1000

set.seed(17)
ylist = generate.y(1, X[, A[[9]]], rep(2,3))

set.seed(11)
iter.optimal = vector("list", splits)
for (iter in 1:splits) {
 iter.optimal[[iter]] = optimal.models(A, X, ylist)
}
## Sum across iterations to get frequency of selection of each model
iter.freq = iter.optimal[[1]]$freq
for (iter in 2:splits) {
 iter.freq = iter.freq + iter.optimal[[iter]]$freq
}

iter.freq
## (1,2,3,5,6) (1,2,4,5,6)   (1,2,5,6)   (1,3,5,6)   (1,4,5,6)   (2,3,5,6) 
##          54          96          81          75         217           4 
##   (2,4,5,6)     (1,2,3)     (1,2,4)     (1,2,5)     (1,2,6)     (4,5,6) 
##           2          11          49           4           5           1 
##       (1,2)       (1,3)       (1,5)       (4,5)       (2,4) 
##          77         321           3           0           0
plot(iter.freq, type="h", main=paste("Proportion of times each model is selected as the optimal model\n for a single instance of y over", splits, "splits"), xlab="", ylab="", bty="n", axes=F)
axis(1, at=1:length(A), labels=names(A), las=2)
points(1:length(A), iter.freq, pch=20)
text(x=1:length(A), y=iter.freq+splits/100, labels=round(iter.freq/splits, 2), cex=0.7)

For Monte-Carlo Simulation #2, the most selected optimal model is (1,3) at 0.32 or 32% of the time.

Lastly, Mallows’ Cp for Scenario #2 of the new x2:
Initializations

copies = 500

set.seed(17)
ylist = generate.y(copies, X[, A[[9]]], rep(2,3))

set.seed(11)
seed.optimal = optimal.models(A, X, ylist)
seed.optimal
## $freq
## (1,2,3,5,6) (1,2,4,5,6)   (1,2,5,6)   (1,3,5,6)   (1,4,5,6)   (2,3,5,6) 
##          20          21          17           6          34          10 
##   (2,4,5,6)     (1,2,3)     (1,2,4)     (1,2,5)     (1,2,6)     (4,5,6) 
##          22           7          15          39          27          50 
##       (1,2)       (1,3)       (1,5)       (4,5)       (2,4) 
##          91         130           8           1           2 
## 
## $train
##  [1]   1   2   5   7   9  10  12  13  16  17  23  27  28  29  30  36  37
## [18]  39  40  41  49  51  54  64  71  75  78  80  82  91  93  97 100
Cp.optimal = optimal.models.Cp(A, X, ylist)
Cp.optimal
## $freq
## (1,2,3,5,6) (1,2,4,5,6)   (1,2,5,6)   (1,3,5,6)   (1,4,5,6)   (2,3,5,6) 
##           5           8           2           7          14           3 
##   (2,4,5,6)     (1,2,3)     (1,2,4)     (1,2,5)     (1,2,6)     (4,5,6) 
##          11           4          13          23           6          29 
##       (1,2)       (1,3)       (1,5)       (4,5)       (2,4) 
##         111         264           0           0           0

For Mallows’ Cp Scenario #2, the most selected optimal model is (1,3) being selected 264 times out of 500, giving a rate of 0.53.

Conclusion & Summary

For the original x2 of “x2 = 5(1:obs) / obs“, Monte Carlo Simulation #1 resulted in picking the ‘true’ model with a frequency of 0.39 while when using the new X2 of”x2 = 1(1:obs) / obs” resulted in choosing the model (1,3) at a rate of 0.27.
For the original x2, Monte-Carlo Simulation #2 resulted in selected the ‘true’ model (1,2,4) 0.32 of the time. When using the new x2, the most selected model was (1,3) at 0.32.
For Mallows’ Cp with the original x2, the model was the ‘true’ model (1,2,4) at a frequency of 0.39. For the new x2, Mallows’ Cp selected the model (1,3) at a rate of 0.53.

We see that changing x2 resulted in all the methods selecting the ‘true’ model at a drastically lower rate. All the methods instead picked (1,3) at a higher frequency. I believe that the reason for this is due to the fact that when we use ’x2 = 1*(1:obs) / obs’ the methods utilize a smaller range of x-values for the model to pick from, so this would obviously result in a more restricted range for the methods to search for the true model.