Sampling Data for Cross-Validation
Sub-Sampling
One common approach for validating a model is to sub-sample a point data set, build, or "train", the model from the sub-sample, and then "test" the model against the remaining points. The test is completed by computing the residuals between the test data and the model's predictions and comparing these to the residuals from the data used to build the model. Typically the model is built with 70% of the original data, with the remaining 30% used for testing.
The "sample()" function built into R will down-sample a data set with or without replacement. The data can be a vector; to sample the rows of a matrix or data frame, we sample row indexes instead. However, sample() by itself will not create separate "training" and "test" data sets.
NewData=sample(OriginalData,NumberOfSamples)
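As a quick illustration (the vector of values below is made up for the example), sample() draws the requested number of elements, and the "replace" argument controls whether the same element can be drawn more than once:

```r
# Hypothetical example: down-sample 5 values from a vector of 10,
# without replacement (the default)
OriginalData=c(12, 7, 3, 25, 18, 9, 31, 14, 6, 20)
NewData=sample(OriginalData, 5)

# Sampling with replacement allows the same value to be drawn twice
NewDataWithReplacement=sample(OriginalData, 5, replace=TRUE)
```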
To create "training" and "test" data sets, we need to sample the original data set with one set of row indexes for the training data and then take the remaining rows, whose indexes were not used, as the test data set. The function below will do this.
#####################################################################
# Function to sub-sample a data frame into a training and test subset
#
# Inputs:
#   TheData - any data frame
#   TrainingProportion - a value from 0 to 1.0 for the size of the training data
#
# Outputs:
#   R does not return multiple values from a function automatically so
#   this function returns a list with the first item being the training
#   data and the second being the test data.
#
# By: Jim Graham
# February 23rd, 2014
#####################################################################
Subsample=function(TheData,TrainingProportion)
{
	NumRows=nrow(TheData)

	# Find the training sample size
	SampleSize <- floor(TrainingProportion * NumRows)

	# Create a sequence from 1 to the number of rows
	TheSequence=seq_len(NumRows)

	# Sample TheSequence based on the sample size to find row indexes for the training dataset
	TrainingRowIndexes <- sample(TheSequence, size = SampleSize)

	# Sample TheData to find the training dataset
	TrainingData <- TheData[TrainingRowIndexes, ]

	# Get all rows not in the TrainingRowIndexes (note the negative sign)
	TestData <- TheData[-TrainingRowIndexes, ]

	# Return the training and test data in a list
	return(list(TrainingData,TestData))
}
The function above returns a list containing the training and test data sets. The code below calls the "Subsample()" function and then gets the training and test data from the returned list.
# Subsample the data into 70% training and 30% test data sets
Result=Subsample(TheData,0.70)

# Get the training data as the first entry in the list
TrainingData=Result[[1]]

# Get the test data as the second entry in the list
TestData=Result[[2]]
Now you can sub-sample and cross-validate results from your models by creating your model from the training data, creating a prediction for the test data, and comparing the prediction to the test data (i.e., examining the residuals).
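As a minimal sketch of this build/predict/compare cycle, the example below uses a simple linear model on made-up data (your model, data, and column names will differ). The split is done inline so the example runs on its own, but it is the same logic as "Subsample()".

```r
# Made-up data for illustration: y is roughly 2*x plus noise
set.seed(42)
TheData=data.frame(x=runif(100, 0, 10))
TheData$y=2*TheData$x + rnorm(100)

# Split into 70% training and 30% test data (same logic as Subsample())
NumRows=nrow(TheData)
TrainingRowIndexes=sample(seq_len(NumRows), size=floor(0.7*NumRows))
TrainingData=TheData[TrainingRowIndexes, ]
TestData=TheData[-TrainingRowIndexes, ]

# Build the model from the training data
TheModel=lm(y~x, data=TrainingData)

# Predict for the test data and examine the residuals
Predicted=predict(TheModel, TestData)
Residuals=Predicted-TestData$y
summary(Residuals)
sd(Residuals)
```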
Cross-Validation: Repeated Random Sub-Sampling Validation
To reduce the chance that we accidentally draw data that is either very different or very similar between the test and training data sets, we may want to resample the original data set repeatedly, using different subsamples to build and test the model. This is referred to as "cross-validation" and, while there are many approaches, the one shown here is the most common in spatial modeling.
The approach is to wrap the sub-sampling code above in a "while" loop that runs for the number of times you want to repeat building the model. Within the loop, we extract a different training and test split of the data, build the model, test the model, and accumulate the results.
Below is an example for cross-validating a GAM model using a 70%/30% training/test split.
#########################################################
# Code sample to cross-validate our model

# Load the GAMs library
require(mgcv)

# Read the data
TheData = read.csv("C:\\Temp3\\DougFirGAM\\DougFir_HeightGreatThanZero.csv")

# Remove any rows with NA (null) values
TheData=na.omit(TheData)

# Setup the vectors for the statistics of interest
StdDevs=vector()
AICs=vector()

i=0
while (i<10)
{
	# Get the training and test datasets
	Result=Subsample(TheData,0.7)
	TrainingData=Result[[1]]
	TestData=Result[[2]]

	# Create the model from the training data
	attach(TrainingData) # attach so the model terms can be resolved by name
	TheModel=gam(MaxHt~s(Precip)+s(MinTemp)+s(MaxTemp),data=TrainingData,gamma=2)
	detach("TrainingData")

	# Add the AIC for this model to our results
	AICs=c(AICs,AIC(TheModel))

	# Predict values for the test data
	attach(TestData)
	Predicted=predict.gam(TheModel,TestData,type='response')
	detach("TestData")

	# Add metrics on the residuals
	Residuals=Predicted-TestData$MaxHt
	StdDevs=c(StdDevs,sd(Residuals))

	# Increment our counter so we don't keep going forever
	i=i+1
}

# Output stats on our model runs
summary(AICs)
summary(StdDevs)