
Lab: Creating Spatial Models with GLMs

This lab will introduce you to modeling presence/absence data with GLMs. This is also the first full-modeling lab. A key element of this lab is examining the response of the data vs. the predicted output of the model and how both relate to the predictor variables.

Note: Remember to include the glm2 library in the code below.
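The library only needs to be installed once, then loaded at the top of each session; a minimal setup sketch:

```r
# Install glm2 from CRAN if it is not already present, then load it.
if (!requireNamespace("glm2", quietly = TRUE)) install.packages("glm2")
library(glm2)
```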

GLMs in R

The Data

The function below will create synthetic presence/absence data for evaluating GLMs. Copy it into R now and run it to define the function.

############################################################################ 
# Creates a data frame with Xs ranging from 1 to the number of entries
# and "Measures" with half set to 0 and half set to 1.  The lower half of
# the X values have Measures set to 0.
# ProportionRandom - amount of uniform randomness to add
############################################################################ 

Categories1D_Random=function(NumEntries=10,ProportionRandom=0.4)
{
  Range=NumEntries*ProportionRandom/2
  Ys=numeric(NumEntries)
  Xs=numeric(NumEntries)
  Measures=numeric(NumEntries)
  for (Index in 1:NumEntries)
  {
    Xs[Index]=Index
    Ys[Index]=Index
    Random=0
    if (ProportionRandom!=0) Random=runif(1,-Range,Range)
    if (Xs[Index]>NumEntries/2+Random) Measures[Index]=1
    else Measures[Index]=0
  }
  data.frame(Ys, Xs, Measures) # return the data frame
}

The code below will create a synthetic data set for a logistic model using the function above. Try it now in R. Note that the data contains some randomness to make the values overlap a bit.

TheData=Categories1D_Random(100) # create a set of binary data 

plot(TheData$Xs,TheData$Measures) # plot the data

Fitting the Model

The code below will fit the model and plot the results.

TheModel=glm2(Measures~Xs,data=TheData,family=binomial) # logit (binary)
summary(TheModel)

The output is similar to a linear model with the exception of a few new items.

Call:
glm2(formula = Measures ~ Xs, data = TheData, family = binomial)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-2.19011  -0.18968   0.02669   0.18966   2.25542  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -7.86152    1.78795  -4.397 1.10e-05 ***
Xs           0.15882    0.03521   4.511 6.45e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 138.589  on 99  degrees of freedom
Residual deviance:  41.339  on 98  degrees of freedom
AIC: 45.339

Number of Fisher Scoring iterations: 7 

Note that the AIC appears at the bottom. The AIC and the model itself are the two key pieces of information for model selection. The "Null deviance" is effectively the "-2ln(L)" portion of the AIC for a null (intercept-only) model, and the "Residual deviance" is the same calculation for the fitted model (i.e. the residual deviance should be less than the null deviance). "Fisher Scoring" is the iterative method used to find the maximum likelihood and is not critical for model selection.
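These relationships can be checked directly in R. The sketch below refits a model like the one above (using the base glm() function so the snippet stands alone; glm2() reports the same quantities) and confirms that, for 0/1 binomial data, the AIC equals the residual deviance plus twice the number of coefficients:

```r
# Refit a logistic model on synthetic data like the lab's.
set.seed(42)
Xs <- 1:100
Measures <- as.integer(Xs > 50 + runif(100, -20, 20))
TheModel <- glm(Measures ~ Xs, family = binomial)

k <- length(coef(TheModel))   # number of fitted coefficients (2 here)
TheModel$null.deviance        # deviance of the intercept-only model
TheModel$deviance             # residual deviance: should be smaller
TheModel$deviance + 2 * k     # equals AIC(TheModel) for 0/1 binomial data
```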

The standard error of each coefficient indicates how precisely that coefficient has been estimated (http://tinyurl.com/j5ccvgo). The z value is the coefficient estimate divided by its standard error; as a rule of thumb, if |z| is large (>2), the coefficient matters (http://tinyurl.com/hxexbtg). The Pr(>|z|) column is the corresponding p-value.
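This can be verified from the coefficient table itself; the sketch below (refitting a similar model with base glm() so it stands alone) reproduces the z value column:

```r
# Fit a logistic model on synthetic data like the lab's.
set.seed(1)
Xs <- 1:100
Measures <- as.integer(Xs > 50 + runif(100, -20, 20))
TheModel <- glm(Measures ~ Xs, family = binomial)

CoefTable <- coef(summary(TheModel))                 # the Coefficients block
CoefTable[, "Estimate"] / CoefTable[, "Std. Error"]  # matches CoefTable[, "z value"]
```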

Note that the plots provided are described below, however, these plots are intended for linear models and can be misleading for GLMs (http://stats.stackexchange.com/questions/121490/interpretation-of-plotglm-model).

plot(TheModel)

The "residuals vs. fitted" plot shows the residuals on the y-axis and the fitted values (the model's predicted values) on the x-axis. Another way to think of this is that the x-axis shows the predicted value along the trend line and the y-axis shows the difference between the sample value and this predicted value (i.e. the residual). See the PennState Regression Methods web page for more information.

The provided Normal Q-Q plot shows whether the residuals are normally distributed and whether they are grouping. If there are distinct groups, you may need to break the data into multiple groups for modeling (https://r-forge.r-project.org/scm/viewvc.php/*checkout*/pkg/BinomTools/inst/ResidualsGLM.pdf?revision=6&root=binomtools&pathrev=6).

We can obtain the residuals and create a histogram of them as below:

residuals(TheModel)
hist(residuals(TheModel))

We can view the fitted values and plot them with:

fitted(TheModel)
plot(fitted(TheModel))

Prediction

Note that there are two functions that will create a prediction for a GLM: "predict()" and "fitted()". By default, predict() returns values on the link scale (log-odds for a binomial model), so those values should not be compared directly with the original data; passing type="response" applies the inverse link and returns values on the scale of the response. fitted() always returns response-scale values, but only for the original data the model was fit to. If you want to predict against a new set of predictors (to make a map, for instance), you'll need to use predict() with newdata and type="response".

Xs=as.vector(array(1:100)) # create a vector with the full range of x values
NewData=data.frame(Xs) # put this into a dataframe for prediction (names must match the model)

ThePrediction=predict(TheModel,newdata=NewData,type="response") # create the prediction 

plot(TheData$Xs,TheData$Measures) # plot the original points
points(Xs,ThePrediction,col=2,pch="+") # plot the prediction
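The relationship between predict() and fitted() can be checked directly. By default predict() returns the linear predictor (link scale); type="response" applies the inverse link, which for a binomial model is the inverse logit (plogis() in R). A self-contained sketch:

```r
# Refit a model like the one above so this snippet stands alone.
set.seed(7)
Xs <- 1:100
Measures <- as.integer(Xs > 50 + runif(100, -20, 20))
TheModel <- glm(Measures ~ Xs, family = binomial)

LinkScale <- predict(TheModel)                     # log-odds (linear predictor)
RespScale <- predict(TheModel, type = "response")  # probabilities
all.equal(RespScale, fitted(TheModel))             # TRUE: identical on the training data
all.equal(RespScale, plogis(LinkScale))            # TRUE: inverse logit of the link scale
```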

Note that glm2() is much pickier about its inputs than most R functions. You may receive errors for having too few points or too many, and it may also produce warnings or errors if it cannot converge on a reasonable result.

Issues with GLM

The function used above to create the synthetic data has two parameters. The first indicates the number of data points to generate and the second the amount of randomness or "noise" to put into the data. Try some different values for these parameters, including 0 for the randomness. Note that the GLM function cannot always create a valid model.
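In particular, with ProportionRandom set to 0 the two classes are perfectly separated, and the maximum likelihood estimates are not finite. Base glm() (and glm2()) will warn that fitted probabilities of 0 or 1 occurred, and the coefficient estimates and their standard errors blow up. A sketch of what to expect:

```r
# Perfectly separated data: Measures is 0 for Xs <= 50 and 1 above.
Xs <- 1:100
Measures <- as.integer(Xs > 50)

# Expect warnings such as "fitted probabilities numerically 0 or 1 occurred";
# the slope estimate and its standard error become huge.
SeparatedModel <- glm(Measures ~ Xs, family = binomial)
coef(summary(SeparatedModel))
```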

Other "Link" Functions

Below are some of the other distribution families, and their default link functions, supported by glm2(). Try them with your data and compare the AIC values and the response curves produced.

TheModel=glm2(Measures~Xs,data=TheData,family=gaussian) # identity (linear)
TheModel=glm2(Measures~Xs,data=TheData,family=poisson) # log (counts)
TheModel=glm2(Measures~Xs,data=TheData,family=binomial) # logit (binary)
TheModel=glm2(Measures~Xs,data=TheData,family=Gamma) # inverse (distances)

Note that the AIC values are not comparable across the methods for GLM. They are only valid for comparison within one method and with the same data set. We'll chat more on this later.
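For example, two binomial models fit to the same data can be compared by AIC, such as a straight-line logit versus a quadratic one (a sketch using base glm(); glm2() works the same way):

```r
# Fit two binomial models to the same synthetic data and compare AIC.
set.seed(3)
Xs <- 1:100
Measures <- as.integer(Xs > 50 + runif(100, -20, 20))
TheData <- data.frame(Xs, Measures)

Linear    <- glm(Measures ~ Xs,          data = TheData, family = binomial)
Quadratic <- glm(Measures ~ poly(Xs, 2), data = TheData, family = binomial)
AIC(Linear, Quadratic)  # same family, same data: the lower AIC is preferred
```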

Example

The code below is an example of creating models for the presence of Doug-Fir trees in California. The Doug-Fir data set can be converted to "Points with Binary Values" as explained in the Converting Data Types web page, and predictor values can then be added from rasters for the study area. The code creates a predicted output for Doug-Fir based on the new presence-absence value. Note that these are examples and are not going to work exactly as you may desire; you'll need to modify them for your own covariates or your own response data.

#########################################################
# Read and clean up the data
library(glm2)

# read the data into an R dataframe
TheData <- read.csv("C:/Temp/DougFir_CA2.csv")

# Remove any NA (null) values, the data should already be loaded
TheData=na.omit(TheData)

# plot the covariate against the response
plot(TheData$AnnualPrecip,TheData$PresAbs)

# create the model
TheModel=glm2(PresAbs~AnnualPrecip,data=TheData,family=binomial) # logit (binary)
summary(TheModel)

# create a new data frame for the prediction
AnnualPrecip=as.vector(array(1:2000)) # create a vector with the full range of x values
NewData=data.frame(AnnualPrecip) # put this into a dataframe for prediction (names must match the model)

# create the prediction vector
ThePrediction=predict(TheModel,newdata=NewData,type="response") # create the prediction 

# plot the original points and the predicted points
plot(TheData$AnnualPrecip,TheData$PresAbs) # plot the original points
points(AnnualPrecip,ThePrediction,col=2,pch="+") # plot the prediction

# predict based on the original model values (i.e. the entire data set)
ThePrediction=predict(TheModel,type="response") # create the prediction 

# add the prediction to the existing dataframe
FinalData=data.frame(TheData,ThePrediction)
plot(FinalData$AnnualPrecip,FinalData$ThePrediction)

# Save the prediction to a new file
write.csv(FinalData, "C:/Temp/ModelOutput.csv") 

 

Assignment

Now, investigate using GLMs for a data set of your choosing. You can use binary data, another type of data, or several different types of data. A zip file of data for different species is available on the main 470/570 schedule web page, but remember that these data only contain plots that have the species present. Make sure you try different combinations of predictors. You can also try different equations within the GLM, as in a linear model, if appropriate. Remember to check that the chosen family (and its link function) makes sense for your response data, and remember that you can transform your predictors to make the model simpler.
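As a sketch of a predictor transformation, the models below compare raw and log-transformed precipitation. The column names (PresAbs, AnnualPrecip) mirror the Doug-Fir example above, but the data here are synthetic stand-ins; substitute your own data set:

```r
# Synthetic stand-in for a presence/absence data set with a precipitation covariate.
set.seed(11)
AnnualPrecip <- runif(200, 100, 2000)
PresAbs <- rbinom(200, 1, plogis(-6 + log(AnnualPrecip)))  # synthetic response
TheData <- data.frame(AnnualPrecip, PresAbs)

Raw    <- glm(PresAbs ~ AnnualPrecip,      data = TheData, family = binomial)
Logged <- glm(PresAbs ~ log(AnnualPrecip), data = TheData, family = binomial)
AIC(Raw, Logged)  # compare within the same family on the same data
```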

See Canvas for additional details.

Extracting Data From FIA (optional)

Below is the SQL query used to count the number of redwood trees in the plots. This only returned the plots that contained redwoods, so the result was then joined back to all the plots in ArcMap.

select min(plot.cn) as plot_cn,min(plot.plot) as plot_plot, -- code number for each plot
	min(lat) as lat,min(lon) as lon,min(elev) as elev,
	min(measyear) as year,min(measmon) as month,min(measday) as day,
	min(spcd) as spcd,
	count(tree.spcd) as tree_spcd,
	min(dia) as min_dia,min(ht) as min_ht, -- min diameter, min height
	max(dia) as max_dia,max(ht) as max_ht, -- max diameter, max height
	avg(cast(dia as double precision )) as mean_dia,avg(cast(ht as double precision)) as mean_ht -- mean diameter, mean height
from plot -- get records from the plot table (one record per plot)
inner join tree on cast(tree.plt_cn as bigint)=plot.cn -- Join to the trees table so we can count the number of trees
where plot.countycd= '23' -- FIPS code for Humboldt County
	and plot.statecd='6'  -- FIPS Code for California as there are 3 Humboldt Counties in the USA
	and tree.spcd='211' -- Species code for coastal redwoods
group by plot.cn -- group the trees together into a count of the number of trees

Additional Resources

Introducing R from Princeton

cv.glm, Cross-validation for Generalized Linear Models (uses k-fold)

R Examples Repository - Cross-validation and bootstrap prediction error for linear and generalized linear models

Old

There are some excellent labs available for R from MSU. Go through Lab 4 up to "Examples with the Use of R". The data for this lab is available in Lab 1; it is in an "R" file but needs to be loaded as a table as shown in the lab, and there are some data quality issues that need to be addressed (I've made notes in a custom copy of the lab). It is also worthwhile to go through Lab 1 as practice.

 

© Copyright 2018 HSU - All rights reserved.