UPCH geospatial analysis workshop

Species distribution modeling tutorial

This github page includes a tutorial of how to model the distribution of Ateles chamek (the Peruvian spider monkey) throughout the Amazon Basin using MAPBIOMAS land-use/land-cover data and CHELSA climate data.

Species presence and background data
Environmental data - MAPBIOMAS LULC & CHELSA Climate
Machine-learning based SDMs - Random Forest

1. Species presence and background data 🐒

For this model we are using Ateles chamek (the Peruvian spider moneky) as a focal species and all other terrestrial mammals as background species. The background species help us to understand the difference between the focal species and the average landscape over which mammals are sampled (thus accounting for sampling bias in the occurrence points). Point data was downloaded from GBIF using the "0_download_gbif_points.R" code. To save time we will jump in using the already downloaded dataset as referenced below, but the code will be here for your future reference.

Figure 1. Distribution of points for the focal species (A. chamek) (blue) and background species (grey) thinned to 1km grid cells across the Amazon Basin. The number of background points was further reduced by using a background point probability mask and sampling 3 * the no. of occurrence points. The code used to create the probability surface and sample the background points is in the "0_download_gbif_points.R" code. An example of using a background mask to sample background points can be found in Moyes et al. 2016.

Step 1. Download occurrence dataset from the data folder: a_chamek_ter_mammals_amazon_thinned_Oct22.csv

Uploading point data to GEE

Once you have your occurrence data downloaded, upload it to GEE as a csv. Make sure the dataframe has numerical longitude (x) and latitude (y) coordinates (in decimal degrees) as separate columns. Make sure there is a row identifier to match the points to bind multiple datasets after downloading geospatial data.

Step 2. Upload occurrence dataset as a GEE feature collection. Make sure to specify which are the "lat" and "lon" columns so GEE knows to take them in as numerics.

Figure 2. Navigation for uploading csv as a feature collection.

2. Environmental data 🌳

Species distribution models model the probability a species occurs in pixel x given the environmental conditions (covariates) in pixel x. Here, we will download land-use / land-cover and climate data to use as environmental covariates in our model. For the purposes of this tutorial, we will run the initial model with a small number of covariates.

Downloading MAPBIOMAS data 🌎

Figure 3. The MAPBIOMAS Panamazonia platform.

Step 3. Explore the MAPBIOMAS website

Step 4. Identify LULC categories / datasets of interest using the MAPBIOMAS legend (make sure it is the correct legend for the MAPBIOMAS assett)

Step 5. Explore MAPBIOMAS GEE code. MAPBIOMAS has pre-written code for most of the functions you need for SDMs.

Step 6. We will build our species distribution model at a 1km^2 resolution. MAPBIOMAS is available at a finer scale resolution (30m) so it is possible to create the model at a finer spatial resolution. Note: the resolution of all covariates should be equal to the resolution of the coarsest variable (given that the precision of your occurrence point is as or more precise than that resolution). The data we fed into GEE are points. We will create a 1km buffer around each point (step 6) and then calculate area of each land class per grid cell (step 7-10). A GEE script for steps 6-10 can be found here.

//call csv of points (make sure it has lat, long, & a row identifier)
var amazon_mammals = ee.FeatureCollection('users/cglidden/a_chamek_ter_mammals_amazon_thinned_Oct22'); // replace "cglidden" with your username

/////functions to create buffer zone around each occurence point
function bufferPoints(radius, bounds) {
  return function(pt) {
    pt = ee.Feature(pt);
    return bounds ? pt.buffer(radius).bounds() : pt.buffer(radius);
  };
}

/////Implement function to make each point is a 1km^2 polygon
var pointBuffers = amazon_mammals.map(bufferPoints(500, true)); //true = square pixel


// Paint FeatureCollection to GEE map for visualization.
var fcVis = pointBuffers.draw({color: '800080', pointRadius: 10});
Map.setCenter(-69.60, -12.39, 10) //coordinates & degree to zoom in
Map.addLayer(fcVis);

Figure 4. Part of the feature collection (1km square buffers around each point) visualized using the Map.addLayer(fcVis) function from the code above.

Step 7. We will now calculate area per each feature (point + buffer) per year in the study period (2001-2020) in the feature collection. A. chamek typically live in lowland forests but are listed as endangered due to habitat loss. For our initial model we will include lulc variables related to forest cover and farming. The MAPBIOMAS code first sets the calculations up by setting a number of parameters / variables including, among other parameters / variables, setting the identifying property of each point as 'attribute', defining territories (the features), the image (MAPBIOMAS collection), classIds (the land-class you want to calculate area for), the name of the export file, and the file to export to. You can follow along with the GEE code below or in the file on Caroline' GEE code editor, linked here and above. The code below also includes code to print the MAPBIOMAS image to view the structure and to map the first year of MAPBIOMAS data (the first band of the image).

Figure 5. The print of the MAPBIOMAS image. You will see that each year of LULC data is a band in the image. This is important infomation to know when it comes to deciding how to extract the data you want.

// Asset = mapbiomas collection that you would like to use

var asset = 'projects/mapbiomas-raisg/public/collection3/mapbiomas_raisg_panamazonia_collection3_integration_v2';

// Numeric attribute to index the point
var attribute = "row_code";

// Output csv name
var outputName = 'a_chamek_ter_mammals_lulc_Oct2022';

// Change the scale if you need.
var scale = 30;

// Define a list of years to export
var years = ['2001','2002','2003','2004','2005','2006','2007','2008','2009',
              '2010', '2011', '2012', '2013', '2014', '2015', '2016',
              '2017', '2018', '2019', '2020'];

// A list of class ids you are interested
var classIds = [
 // 1, //Bosque
    3, // Formação Florestal
//  4, // Formação Savânica
//  5, // Mangue
//  6, // Bosque inundable
//  10, // Formación Natural no Forestal
//  11, // Formación Natural no Forestal Inundable
//  12, // Formación Campestre o herbazal
//  29, // Afloramento Rochoso
//  13, // Outra Formação não Florestal
    14, //Uso Agropecuario
//  22, // Área não Vegetada
//  24, // Infraestrutura Urbana
//  30, // Mineração
//  25, // Outra Área não Vegetada
//  26, // Cuerpo de agua
//  33, // Rio, Lago e Oceano
//  34, //Glaciar
//  27, //No observado
];    
    

// Define a Google Drive output folder 
//if this folder does not already exist, this code will create the folder in your google drive
var driverFolder = 'GEEexports';

/**
 * 
 */
/**
 * 
 */
// Territory
var territory = ee.FeatureCollection(pointBuffers);

// LULC mapbiomas image - 36 bands, one image per band, band named "classification_year"
var mapbiomas = ee.Image(asset).selfMask();
print(mapbiomas) print to view image collection

var singleBandVis = { //set mapping parameters
  min: 1, //min classId
  max: 35, //max classId
  //define color ramp that will be stretched from 1 to 35
  palette: ['08A11A', '71FDD9', 'FAD7A0', 'EBB3FC', '3498DB', '1F618D']};
Map.setCenter(-69.60, -12.39, 4) //coordinates & degree to zoom in
Map.addLayer(mapbiomas.select("classification_1985"), singleBandVis, 'MAPBIOMAS')

// Image area in km2, one band named "area"
var pixelArea = ee.Image.pixelArea().divide(1000000);

// Geometry to export
var geometry = mapbiomas.geometry();

Figure 6. A map of the MAPBIOMAS data from 1985 (produced from the code above).

Step 8. Now the MAPBIOMAS code defines functions that will be used to calculate area. This code is complex and hard to understand, so don't feel overwhelemed if it does not make sense the first time we go through it. The second function takes in a band of a MAPBIOMAS image, a feature, and the feature geometry and returns area per land-class per feature (data point) as a table. The second feature initially produces a complex object because it converts the feature to an image, and then maps over the MAPBIOMAS image and the feature image to produce the sum of class pixels per feature. The first function is used within the second function to convert that complex object into a table that we can export.

/**
 * Convert a complex ob to feature collection
 * @param obj 
 */
var convert2table = function (obj) {

    obj = ee.Dictionary(obj); //create dictionary 

    var territory = obj.get('territory'); //extract a property from a feature

    var classesAndAreas = ee.List(obj.get('groups')); //extract a property from a feature and create a list

    var tableRows = classesAndAreas.map( //apply to every image in image collection
        function (classAndArea) {
            classAndArea = ee.Dictionary(classAndArea);

            var classId = classAndArea.get('class'); //extract class
            var area = classAndArea.get('sum'); //extract sum of pixels

            var tableColumns = ee.Feature(null)
                .set(attribute, territory) //set row_code
                .set('class', classId) //set id to class column
                .set('area', area); //set area to area column

            return tableColumns; //return table as funtion output
        }
    );

    return ee.FeatureCollection(ee.List(tableRows));
};

/**
 * Calculate area crossing a cover map (deforestation, mapbiomas)
 * and a region map (states, biomes, municipalites)
 * @param image 
 * @param territory 
 * @param geometry
 */
 
 
 
 //the following code creates a function that takes in a band of a MAPBIOMAS image, a feature, and the feature geometry and returns areas as a table
var calculateArea = function (image, territory, geometry) {

    //create reducer to get sum per class per territory
    var reducer = ee.Reducer.sum().group(1, 'class').group(1, 'territory');

  // now create a territories object that for each 1km pixel, add info from each feature, add the image, and then sums per group combination (feature x land-class)
  //each pixel using the reducer above
    var territoriesData = pixelArea.addBands(territory).addBands(image)
        .reduceRegion({
            reducer: reducer,
            geometry: geometry,
            scale: scale,
            maxPixels: 1e12
        });

    territoriesData = ee.List(territoriesData.get('groups')); //list for each group

    var areas = territoriesData.map(convert2table); //use code above to turn territories object into a table, obj = territoriesData

    areas = ee.FeatureCollection(areas).flatten(); //flattens collection of collections

    return areas;
};

Step 9. Now we apply the above functions to the MAPBIOMAS image and our feature collection, and map over the MAPBIOMAS image so that we create a table for each year (i.e., band of the MAPBIOMAS image that we want, defined in the year list in the code above).

//for each year, calculate area using functions above
var areas = years.map(
    function (year) {
        var image = mapbiomas.select('classification_' + year); //select the band for a year

        var areas = territory.map( //for each feature (territory) run the calculateArea function
            function (feature) {
                return calculateArea( //function from above where you need to define (image, feature, geometry)
                  //image: the MAPBIOMAS band defined above, where land classes that are not defined in class id list are remapped to 0
                    image.remap(classIds, classIds, 0),
                    //feature: the feature converted to an image with row_code as a property
                    ee.Image().int64().paint({
                        'featureCollection': ee.FeatureCollection(feature),
                        'color': attribute
                    }),
                    feature.geometry() //geometry: feature geometry
                );
            }
        );

        areas = areas.flatten(); //flatten collection

        // set additional properties
        areas = areas.map(
            function (feature) {
                return feature.set('year', year);
            }
        );

        return areas;
    }
);

areas = ee.FeatureCollection(areas).flatten(); //flatten collection of collections

Step 10. Now export the new feature collection (a table with area per land-class per data point) to the Google Drive folder that you named in the code above.

Export.table.toDrive({
    collection: areas,
    description: outputName,
    folder: driverFolder,
    fileNamePrefix: outputName,
    fileFormat: 'CSV'
});

Downloading historical climate data 🌡️

Step 11. Species distributions are usually, at least in part, dictated by interations between land cover and climate. We will use the CHELSA dataset to add climatologies to our model. The CHELSA climatologies are averages from 1985-2010 a 1km^2 resolution, which is why we scaled the MAPBIOMAS data up to 1km. We will use the same feature collection from steps 6-10. A GEE script can be found here.

/////CHELSA climatologies
var bio13_precip_wettest_month = ee.Image('users/cglidden/CHELSA_bio13_1981-2010_V21').select('b1')
            .resample("bilinear")
            .reproject("EPSG:4326", null, 1000) //make sure variable is at 1km res
            .reduce(ee.Reducer.mean()) //get mean climatology per 1km pixel
            .rename('bio13_precip_wettest_month'); //rename band - this will be the column name in your csv

var cmi_min = ee.Image('users/cglidden/CHELSA_1981_2010/CHELSA_cmi_min_1981-2010_V21').select('b1')
            .resample("bilinear")
            .reproject("EPSG:4326", null, 1000)
            .reduce(ee.Reducer.mean())
            .rename('cmi_min');

///compile image to export both variables in one csv
var final_image = bio13_precip_wettest_month.addBands([cmi_min])

  
////export table to drive
Export.table.toDrive({
  collection: final_image.reduceRegions({
    collection: pointBuffers,
    reducer: ee.Reducer.mean(), //now get mean climatology per pointBuffer and export to table
    scale: 1000, //1km resolution
    tileScale: 16
    }),
  description: 'a_chamek_ter_mammals_climate_Oct2022',
  folder: 'GEEexports',
  fileFormat: 'csv',
});

Step 12. Skip actually running the GEE code for now and download the "a_chamek_ter_mammals_lulc_Oct22.csv" file and the "a_chamek_ter_mammals_climate_Oct22.csv" file pre-downloaded data from the data folder.

Clean covariate data 🧹

R code for the following sections can be found in the "1_cleaning_data.R"

Step 13. Using the data downloaded in step 12 and the code below, we will relabel MAPBIOMAS classes to make it easier to view results. We'll then aggregate LULC data by taking the mean LULC from 2001-2020. Since A. chamek is affected by deforestation we will also look at difference in forest cover between 2001-2020. We will then merge the lulc data with the occurrence data and climate data. The climate data is exported from GEE in an immediately useable form. Note: Given the pace of LULC change, this is a really coarse way of aggregating the data and we likely loose a lot of signal.

#download the github folder and set it to your working directory
rm(list=ls())
setwd("~/Desktop/UPCH-species-distribution-tutorial-main/")

#load libraries
library(tidyr); library(dplyr)

#-----------------------------------#
# read in datasets                  #
#-----------------------------------#

mapbiomas <- read.csv("data/a_chamek_ter_mammals_lulc_Oct2022.csv")

climate <- read.csv("data/a_chamek_ter_mammals_climate_Oct2022.csv")
#subset climate data to variables of interest (not necessary but makes the final dataset cleaner
climate <- climate[  c("row_code", "bio13_precip_wettest_month", "cmi_min")]

amazon_basin_pnts <-  read.csv("data/a_chamek_ter_mammals_amazon_thinned_Oct22.csv")

#---------------------------------------#
#update label for MAPBIOMAS classes     #
#---------------------------------------#

#you can look at the classes included in the data using:
unique(mapbiomas$class)

#relabel each class to make it easier to see results
mapbiomas$class[mapbiomas$class == 3] <- "forest_formation"
mapbiomas$class[mapbiomas$class == 14] <- "farming"

#any lulc that was not defined by classIDs in you MAPBIOMAS code will be a zero here, remove these rows
mapbiomas <- mapbiomas[mapbiomas$class != 0, ]

#----------------------------------------------------------#
#go from wide to long so each class is a unique column     #
#----------------------------------------------------------#

mapbiomas_wide <- mapbiomas[2:5] %>% 
  pivot_wider(names_from = class, 
              values_from = area,
              values_fn = list(area = sum),
              values_fill = list(area = 0)) 


#---------------------------------------------------------#
# summarize average area per class per point across years #
# and difference in area over the study period            #
#---------------------------------------------------------#

mapbiomas_mean_diff <- mapbiomas_wide %>%
                  group_by(row_code) %>%
                  summarise(
                            #mean per class
                            mean_forest = mean(forest_formation),
                            mean_farming = mean(farming),
                            
                            #difference over study period
                            diff_forest_formation = forest_formation[match(2020, year)] - forest_formation[match(2001, year)])

#save dataframe
write.csv(mapbiomas_mean_diff, "data/a_chamek_ter_mammals_lulc_cleaned_Oct2022.csv")

#--------------------------------------------------------------------------------------------------------#
# merge data frames with response (presence / absence labels), climate covariates, and lulc covariates   #
#--------------------------------------------------------------------------------------------------------#

covariates <- left_join(climate, mapbiomas_mean_diff, by = "row_code")
data0 <- left_join(amazon_basin_pnts, covariates, by = "row_code")

3. Multicollinearity analysis 🤨 📈

Step 14. Using the code below, we will now clean our covariate data a bit more by removing highly colinear variables. While machine learning can handle multicolinearity when making predictions, removing colinear variables can still be helpful for model interpretation. The correlation value depends on your questions and dataset but we will use a 0.7 correlation cutoff in the code below. We will use a pair-wise analysis but another option is a variable inflation analysis (or you can use both).

#load libraries
library(PerformanceAnalytics)

# correlation in absolute terms
corr <- abs(cor(data0[7:ncol(data0)])) 

#correlation plot with points on lower left of matrix, correlation coeffecients on upper right, and distributions on the diagonal
chart.Correlation(data0[7:ncol(data0)], 
                  histogram = TRUE, method = "pearson")

#nothing with correlation > 0.7 so we keep all variabes for the analysis

Figure 7. Point plot of pairwise realtionships, covariate distribution, and Pearson's correlation coeffecients of variables.

Challenge Questions

Try using GEE to add additional climate or LULC covariates to the final analysis dataset.

4. Spatial cross-validation ✅

Step 15. Next we will split our data into 3 folds (3 subsets) for 3-fold cross validation. It is important to test the perfomance of your model using a hold-out test set. This allows you to evaluate if your model is predicting generazliable patterns, or if it is only learning the traing data (and thus "overfitting"). One way to test out-of-sample model performance is using k-fold cross-validation. K-fold cross-validation splits the data into k folds, it then trains and tests the model k times (where, for each iteration, one fold is a hold out fold and the remaning folds are used for training the model). K-fold cross-validation helps to test model performance across different subsets of data where the subsets are sampled without replacement. For many applications of species distribution modeling, it is ideal to use spatial cross-validation where folds are separated in space so to avoid issues of autocorrelation that arise from test and training points being very close to each other. See figure 5. for a visual explanation. Here we will use the R package spatialsample. Methods for splitting folds can be dependent on your data and study questions. View the blockCV paper (Valavi et al. 2021) and the spatialsample rpackage to learn of different ways to split data. Below we will use block clustering because it is quick to implement.

Figure 8. Visualization of spatial partitioning versus random test versus train set. Figure from towards data science "Spatial CV Using Sickit-learn".

library(spatialsample); library(sf)

#--------------------------------------------#
# get  fold  id  by  block  clustering       #
#--------------------------------------------#

#convert  analysis_data  to  a  spatial  object
data0_sf  <-  st_as_sf(x  =  data0,
                       coords  =  c("lon", "lat"),
                       crs  =  "+proj=longlat +datum=WGS84 +ellps=WGS84")

#identify  groups  of  3  clusters  using  the  spatialsample  package
set.seed(99)  #set  seed  to  get  same  split  each  time
clusters  <-  spatial_block_cv(data0_sf, 
                               method = "random", n = 30, #method for how blocks are oriented in space & number of blocks
                               relevant_only = TRUE,  v = 3)  #k-means  clustering  to  identify  cross-validation  folds  (3  is  too  few  to  be  robust  but  using  here  to  save  time)

#for  loop  to  create  a  dataframe  that  assigns  a  fold  number  to  each  data  point
splits_df  <-  c()
for(i  in  1:3){
  new_df  <-  assessment(clusters$splits[[i]])  #extract  points  in  fold  number  i
  new_df$fold  <-  i
  new_df  <-  new_df[  c("row_code", "fold")]
  splits_df  <-  rbind(splits_df, new_df)  #bind  all  points  x  fold  id  together
}

splits_df  <-  st_drop_geometry(splits_df)  #drop  geometry

#final  data  -  merge  cluster  id  to  final  dataset  for  analysis
analysis_data  <-  merge(data0, splits_df, by  =  "row_code")

#sanity  check:  check  how  many  data  points  are  in  each  fold (make sure folds have adequate data in them)
table(analysis_data$fold)


#write  dataframe  to  save  for  later
write.csv(analysis_data, "data/a_chamek_ter_mammals_finalData_Oct22.csv")

5. Choice of statistical model -- Random Forest 💻

There are multiple statistical models (e.g., generalized additive models) and machine learning models (e.g., maxent, random forest, boosted regression tree) that can be used for SDMs. You can read more about different SDM modeling methods in Valavi et al. 2022. Here, we will use a random forest model as they typically have high accuracy (as they deal with non-linearity and higher-order interactions well) and are fast to implement. Random forest is a classification algorithm that takes the average of multiple decision trees. You can learn more about random forests by watching this video.

Step 16. Now we train the model on each set of k-1 folds and test it on the holdout fold. For each iteration, we tune the randomForest model to optimize model performance. Depending on your dataset and the parameter values you search over, the tuning step can also be used to prevent over-fitting. There are different methods for tuning a machine-learning model. Below we use a hypergrid search and select the final parameters based on the combination that yields the best model performance. If you had trouble running Step 10. you can download the "a_chamek_ter_mammals_finalData_Oct22.csv" to use for the next few steps. R code for the following sections can be found in the "2_random_forest_sdm.R"

library(ranger)

#------------------------------------#
#tune, train, model                  #
#------------------------------------#

#first  reduce  data  down  to  covariates  of  interest  (or  you  could  specify  it  in  the  formula  below)

analysis_data_v2  <-  analysis_data[  c("presence", "fold", "bio13_precip_wettest_month", "cmi_min",
                                        "mean_forest",  "mean_farming", "diff_forest_formation")]

#for many ML algorithms  you should make sure your response variables are not grouped in the data-frame (i.e.,  all 1s and 0s next to each other)
analysis_data_v2 <- analysis_data_v2[sample(1:nrow(analysis_data_v2)), ]

#create  empty  dataframe  to  for  loop  to  store  results    one  row  for  each  fold
rf_performance  <-  data.frame(model  =  rep("RF", 3),
                               fold_id  =  1:3,
                               auc  =  rep(NA, 3),
                               presence  =  rep(NA, 3),    #number  of  presence  points  in  the  fold
                               background  =  rep(NA, 3))  #number  of  bkg  points  in  the  fold

#create  empty  dataframe  to  store  parameters  used  to  train  each  model
hypergrid_final  <-  data.frame(mtry  =  rep(NA,3),    #the  number  of  variables  to  randomly  sample  as  candidates  at  each  split
                                                            node_size    =  rep(NA, 3),    #minimum  number  of  samples  within  the  terminal  nodes
                                                            sampe_size  =  rep(NA, 3))    #the  number  of  samples  to  train  on

    
    
for(i  in  1:3){  #  run  one  iteration  per  fold
    
    train  <-  analysis_data_v2[analysis_data_v2$fold !=  i, ];  train  <-  train[-2]
    test  <-  analysis_data_v2[analysis_data_v2$fold ==  i, ];  test  <-  test[-2]
    
    #remove  any  rows  with  NAs  bc  RF  can't  handle  missing  data
    train_complete  <-  train[complete.cases(train), ]
    test_complete  <-  test[complete.cases(test), ]
    
    
    #-----------------------------------------------#
    #define  the  grid  to  search  over            #
    #-----------------------------------------------#
    #  the  function  below  creates  a  grid  with  all  combinations  of  parameters
    
    hyper_grid  <-  expand.grid(
        mtry =  seq(1, 3, by  =  1),    #the  number  of  variables  to  randomly  sample  as  candidates  at  each  split
        node_size =  seq(1,4, by  =  1),    #shallow trees
        sampe_size  =  c(.6, .70, .80),    #the  number  of  samples  to  train  on
        OOB_RMSE =  0
    )
    
    #tune  model
    for(j  in  1:nrow(hyper_grid)){
        
        #  train  model
        model  <-  ranger(
            formula  =  presence  ~  .,    
            data  =  train_complete,    
            num.trees  =  2000,  
            mtry  =  hyper_grid$mtry[j],
            min.node.size  =  hyper_grid$node_size[j],  
            sample.fraction  =  hyper_grid$sampe_size[j], 
            probability  =  TRUE, 
            replace = TRUE,
            splitrule = "hellinger",
            seed  =  123
            )
        
        #  add  OOB  error  to  grid
        hyper_grid$OOB_RMSE[j]  <-  sqrt(model$prediction.error)
    }
    
    #arrange  the  hypergrid  so  the  lowest  out-of-bag  error  (best  performing  set  of  parameters)  is  in  the  first  row
    hyper_grid2  <-  hyper_grid  %>%  
        dplyr::arrange(OOB_RMSE)
    
    #train  model
    train_model  <-  ranger(
        formula  =  presence  ~  .,    
        data  =  train_complete,   
        #use  the  first  row  of  the  grid  as  model  parameters
        num.trees  =  2000,  
        mtry  =  hyper_grid2$mtry[1], 
        min.node.size  =  hyper_grid2$node_size[1], 
        sample.fraction  =  hyper_grid2$sampe_size[1],
        probability  =  TRUE, 
        replace = TRUE,
        splitrule = "hellinger",
        seed  =  123)
    
    #save  model  performance  results
    pred0  <-  predict(train_model, data=test_complete);  pred  <-  pred0$predictions[,1]
    auc  <-  pROC::roc(response=test_complete[,"presence"], predictor=pred, levels=c(0, 1), auc  =  TRUE)
    rf_performance[i, "auc"]  <-  auc$auc
    rf_performance[i, "presence"]  <-  nrow(subset(test, presence  ==  1))
    rf_performance[i, "background"]  <-  nrow(subset(test, presence  ==  0))
    
    #save hypergrid results to use for final model
    hypergrid_final[i, "mtry"]  <-  hyper_grid2$mtry[1]
    hypergrid_final[i, "node_size"]  <-  hyper_grid2$node_size[1]
    hypergrid_final[i, "sampe_size"]  <-  hyper_grid2$sampe_size[1]

}

#mean auc = 0.71
mean(rf_performance$auc)

Training of final model 👩‍💻

Step 17. Now train the final model that will be used for interpretation and prediction. Below, I used the average hyperpameters from the tuning steps.

#------------------------------------------------------------#
# train  final  model                                        #
#------------------------------------------------------------#

final_model  <-  ranger(
    formula  =  presence  ~  .,    
    data  =  analysis_data_v2[complete.cases(analysis_data_v2), -2],    #complete case dataset without fold column
    #parameters used here are the averages from hypergrid_final
    num.trees  =  2000 ,
    mtry  =  2,  
    min.node.size  =  1,  
    sample.fraction  =  0.6,
    probability  =  TRUE,  
    replace = TRUE,
    splitrule = "hellinger",
    importance  =  'permutation',    #specify  this  to  get  variable  importance  in  the  next  step
    seed  =  123)

#check in-sample auc
pred0  <-  predict(final_model, data=analysis_data_v2[complete.cases(analysis_data_v2), -2]);  pred  <-  pred0$predictions[,1]
auc  <-  pROC::roc(response=analysis_data_v2[complete.cases(analysis_data_v2), -2][,"presence"], predictor=pred, levels=c(0, 1), auc  =  TRUE)
auc$auc

6. Model interpretation and evaluation 📊 📉

Variable importance

Step 18. There are many ways to calculate variable importance. Here, we will use an intuitive and model agnostic measure of variable importance. In sum, we will calculate change in model performance when a focal variable is randomly permuted, which will tell us the degree to which the variable contributes to model performance and thus accuracy of model predictions.

#------------------------------------------------------------#
# variable  importance                                       #
#------------------------------------------------------------#

#extract  model  results  to  get  permutation  importance
permutation_importance  <-  data.frame(variable  =  rownames(as.data.frame(final_model$variable.importance)),  
                                                                          importance  =  as.vector(final_model$variable.importance))
    
#plot  importance
ggplot(permutation_importance, aes(x  =  variable, y  =  importance))  +
    geom_bar(stat="identity")  +
    ggtitle("permutation  importance")  +
    ylab("importance (change in model error)") + 
    coord_flip()  +
    theme_classic()

Figure 9. Permutation variable importance for each covariate in the model. The y-axis is the change in model error when the variable is permuted.

Partial dependence plots

Step 19. Now you will visualize your model results using partial dependence plots. Partial dependence plots (PDPs) depict the relationship between the probability of species occurrence and the variable of interest across a range of values for that variable. At each value of the variable, the model is evaluated for all values of the other covariates. The final model output is the average predicted probability across all combinations of the other covariates.

#------------------------------------------------------------#
# pdps                                                       #
#------------------------------------------------------------#

#try plotting a PDP for just one variable
pdp::partial(final_model, pred.var  =  "mean_forest", prob  =  TRUE, plot = TRUE, train  =  analysis_data_v2[complete.cases(analysis_data_v2), -2])
#train = data without NAs & without "fold" column

###now run a for loop to get plotting data for all variables in the model (or in the 'var_names' list
#list of covariate names to generate pdps for and loop through
var_names  <-  names(analysis_data_v2[complete.cases(analysis_data_v2),  -c(1, 2)]) #analysis dataset exlucing 'presence' and 'fold' column

#dataframe  to  make  partial  dependence  plots
pd_df  =  data.frame(matrix(vector(), 0, 3, dimnames=list(c(), c('variable', 'value', 'yhat'))),  
                     row.names  =  NULL, stringsAsFactors=F)

#loop  through  each  variable
for  (j  in  1:length(var_names))  { 
    
    output  <-  as.data.frame(pdp::partial(final_model, pred.var  =  var_names[j], prob  =  TRUE, train  = analysis_data_v2[complete.cases(analysis_data_v2), -2]))
    
    loop_df  <-  data.frame(variable = rep(var_names[j], length(output[[1]])),
                            value  =  output[[1]],
                            yhat  =  output[[2]])
    
    pd_df  <-  rbind(pd_df, loop_df)  
}


#plot  pdps  for  each  variable
ggplot(pd_df, aes(x  =  value, y=  yhat))  +
    geom_smooth()  +
    ylab("probability")  +
    facet_wrap(~variable, scales  =  "free")  +
    theme_bw(base_size = 14)

Figure 10. Partial dependence plots for each covariate in the model.

Model predictions

Step 20. Using the code below, you will use the final model to map the distribution of A. chamek across Madre de Dios, Peru (a section of the Amazon Basin). Although our model was trained across the Amazon Basin, we will only plot predictions for Madre de Dios here to decrease computational time for tutorial purposes. Once you validate your model and are happy with the biological interpration gained from variable importance and partial dependence plots, you can use the model to map the distribution of the species. To do this, you create a grid of your area of interest (the area that the occurrence points were distributed over) and determine the environmental covariates in each grid cell. You then use your model to predict the probability of vector occurrence within each grid cell and display this data using a map.

#------------------------------------------------------------#
#prediction  map                                             #                                              
#------------------------------------------------------------#

#path for rasters of each covariate in Madre de Dios (we will just plot MDD for now to reduce computational time and file size)
env_data <- list.files(path="env_data", pattern="tif", all.files=FALSE, full.names=TRUE,recursive=TRUE)
e <- raster::stack(env_data)

prediction_df <- as.data.frame(rasterToPoints(e)) ##this gets raster value for every grid cell of MDD

#reduce dataset to complete cases
prediction_df_complete <- prediction_df[complete.cases(prediction_df), ]

#predict probability of species occurrence each 1km grid cell of the area of interest
predictions <- predict(final_model,
                       data=prediction_df_complete)

prediction_df_full <- cbind(prediction_df_complete, as.data.frame(predictions$predictions)[,2])
names(prediction_df_full)[ncol(prediction_df_full)] <- "probability"

#reduce dataset to only the long(x), lat (y), and variable of interest (probability)
rf_tiff_df <- prediction_df_full[, c("x", "y", "probability")]

rf_sdm_raster <- rasterFromXYZ(rf_tiff_df)

#save raster
outfile <- writeRaster(rf_sdm_raster, filename='final_figures/rf_sdm_example_predictions.tif', format="GTiff",options=c("INTERLEAVE=BAND","COMPRESS=LZW"), overwrite=TRUE)

Figure 11. Model predictions of probability of species occurrence throughout Madre De Dios, Peru.

Challenge questions

How would you add uncertainty to variable importance, functional relationships (pdps), or prediction maps?
How does your model perfromance and interpretation change when you add / remove covariates?
Pick a new focal species to create a SDM. You will have to use the "notThinned" datasets to select a focal species and create background points.

Name		Name	Last commit message	Last commit date
Latest commit History 186 Commits
Amazon_Basin		Amazon_Basin
Madre_de_Dios		Madre_de_Dios
R_code		R_code
data		data
env_data		env_data
final_figures		final_figures
old_data		old_data
.DS_Store		.DS_Store
.Rhistory		.Rhistory
.gitattributes		.gitattributes
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

UPCH geospatial analysis workshop

Species distribution modeling tutorial

1. Species presence and background data 🐒

Uploading point data to GEE

2. Environmental data 🌳

Downloading MAPBIOMAS data 🌎

Downloading historical climate data 🌡️

Clean covariate data 🧹

3. Multicollinearity analysis 🤨 📈

Challenge Questions

4. Spatial cross-validation ✅

5. Choice of statistical model -- Random Forest 💻

Training of final model 👩‍💻

6. Model interpretation and evaluation 📊 📉

Variable importance

Partial dependence plots

Model predictions

Challenge questions

About

Uh oh!

Releases

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

UPCH geospatial analysis workshop

Species distribution modeling tutorial

1. Species presence and background data 🐒

Uploading point data to GEE

2. Environmental data 🌳

Downloading MAPBIOMAS data 🌎

Downloading historical climate data 🌡️

Clean covariate data 🧹

3. Multicollinearity analysis 🤨 📈

Challenge Questions

4. Spatial cross-validation ✅

5. Choice of statistical model -- Random Forest 💻

Training of final model 👩‍💻

6. Model interpretation and evaluation 📊 📉

Variable importance

Partial dependence plots

Model predictions

Challenge questions

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages