Sampling Data for Cross-Validation
Sub-Sampling
One common approach for validating a model is to sub-sample a point data set, build, or "train", the model from the sub-sample, and then "test" the model against the remaining points. The test is completed by computing the residuals between the test data and the model's predictions and comparing these to the residuals from the data used to build the model. Typically the model is built with 70% of the original data, with the remaining 30% used for testing.
The "sample()" function built into R will down-sample a data set with or without replacement. The data can be a vector; to sample the rows of a matrix or data frame, we sample row indexes instead. However, sample() by itself will not create separate "training" and "test" data sets.
NewData=sample(OriginalData,NumberOfSamples)
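As a quick illustration (the vector of values below is made up for the example), sample() draws the requested number of elements, and the "replace" argument controls whether the same element can be drawn more than once:

```r
# Hypothetical example: down-sample 5 values from a vector of 10,
# without replacement (the default)
OriginalData=c(12, 7, 3, 25, 18, 9, 31, 14, 6, 20)
NewData=sample(OriginalData, 5)

# Sampling with replacement allows the same value to be drawn twice
NewDataWithReplacement=sample(OriginalData, 5, replace=TRUE)
```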
To create "training" and "test" data sets, we need to sample the original data set with one set of row indexes for the training data and then take the remaining rows, whose indexes were not used, as the test data set. The function below will do this.
#####################################################################
# Function to sub-sample a data frame into a training and test subset
#
# Inputs:
#   TheData - any data frame
#   TrainingProportion - a value from 0 to 1.0 for the size of the training data
#
# Outputs:
#   R does not return multiple values from a function automatically so
#   this function returns a list with the first item being the training
#   data and the second being the test data.
#
# By: Jim Graham
# February 23rd, 2014
#####################################################################
Subsample=function(TheData,TrainingProportion)
{
	NumRows=nrow(TheData)

	# Find the training sample size
	SampleSize <- floor(TrainingProportion * NumRows)

	# Create a sequence from 1 to the number of rows
	TheSequence=seq_len(NumRows)

	# Sample TheSequence based on the sample size to find row indexes for the training dataset
	TrainingRowIndexes <- sample(TheSequence, size = SampleSize)

	# Sample TheData to find the training dataset
	TrainingData <- TheData[TrainingRowIndexes, ]

	# Get all rows not in the TrainingRowIndexes (note the negative sign)
	TestData <- TheData[-TrainingRowIndexes, ]

	# Return the training and test data in a list
	return(list(TrainingData,TestData))
}
The function above returns a list containing the training and test data sets. The code below calls the "Subsample()" function and then gets the training and test data from the returned list.
# Subsample the data into 70% training and 30% test data sets
Result=Subsample(TheData,0.70)

# Get the training data as the first entry in the list
TrainingData=Result[[1]]

# Get the test data as the second entry in the list
TestData=Result[[2]]
Now you can sub-sample and cross-validate results from your models by creating your model from the training data, creating a prediction for the test data, and comparing the prediction to the test data (i.e., examining the residuals).
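As a minimal sketch of this build/predict/compare cycle, the example below uses a simple linear model on made-up data (your model, data, and column names will differ). The split is done inline so the example runs on its own, but it is the same logic as "Subsample()".

```r
# Made-up data for illustration: y is roughly 2*x plus noise
set.seed(42)
TheData=data.frame(x=runif(100, 0, 10))
TheData$y=2*TheData$x + rnorm(100)

# Split into 70% training and 30% test data (same logic as Subsample())
NumRows=nrow(TheData)
TrainingRowIndexes=sample(seq_len(NumRows), size=floor(0.7*NumRows))
TrainingData=TheData[TrainingRowIndexes, ]
TestData=TheData[-TrainingRowIndexes, ]

# Build the model from the training data
TheModel=lm(y~x, data=TrainingData)

# Predict for the test data and examine the residuals
Predicted=predict(TheModel, TestData)
Residuals=Predicted-TestData$y
summary(Residuals)
sd(Residuals)
```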
Cross-Validation: Repeated Random Sub-Sampling Validation
To reduce the chance that we accidentally draw data that is either very different or very similar between the test and training data sets, we may want to resample the original data set repeatedly, using different subsamples to build and test the model. This is referred to as "cross-validation" and, while there are many approaches, the one shown here is the most common in spatial modeling.
The approach is to wrap the sub-sampling code above in a "while" loop that runs for the number of times you want to repeat building the model. Within the loop, we extract a different training and test split of the data, build the model, test the model, and accumulate the results.
Below is an example for cross-validating a GAM model using a 70%/30% training/test split.
#########################################################
# Code sample to cross-validate our model

# Load the GAMs library
require(mgcv)

# Read the data
TheData = read.csv("C:\\Temp3\\DougFirGAM\\DougFir_HeightGreatThanZero.csv")

# Remove any rows with NA (null) values
TheData=na.omit(TheData)

# Setup the vectors for the statistics of interest
StdDevs=vector()
AICs=vector()

i=0
while (i<10)
{
	# Get the training and test datasets
	Result=Subsample(TheData,0.7)
	TrainingData=Result[[1]]
	TestData=Result[[2]]

	# Create the model from the training data
	attach(TrainingData) # attach so the model terms can be resolved by name
	TheModel=gam(MaxHt~s(Precip)+s(MinTemp)+s(MaxTemp),data=TrainingData,gamma=2)
	detach("TrainingData")

	# Add the AIC for this model to our results
	AICs=c(AICs,AIC(TheModel))

	# Predict values for the test data
	attach(TestData)
	Predicted=predict.gam(TheModel,TestData,type='response')
	detach("TestData")

	# Add metrics on the residuals
	Residuals=Predicted-TestData$MaxHt
	StdDevs=c(StdDevs,sd(Residuals))

	# Increment our counter so we don't keep going forever
	i=i+1
}

# Output stats on our model runs
summary(AICs)
summary(StdDevs)