Redundancy analysis (RDA) is a dimension reduction statistical approach for multivariate data sets. Dimension reduction techniques are often used on large, noisy, data sets to strip away any unnecessary components, leaving only those part that best capture any explanatory/predictor power. Even if you have not heard of RDA, it is very possible that you have encounter one of the main linear techniques for dimensionality reduction; principle component analysis (PCA).
Some background reading relevant to what we will be doing:
- Talbot et al. 2016
- Phillips et al. 2006
- Rollins et al. 2015
- Prentis et al. 2008
- Colautti and Lau 2015
PCA is used to discern underlying structures (i.e. principle components) that capture variance in a data set. For more information on the fundamentals of PCA I would recommend visiting Principal Component Analysis 4 Dummies: Eigenvectors, Eigenvalues and Dimension Reduction.
RDA is similar to PCA, however it incorporates the uses of response and explanatory variables, whereas analysis such as PCA doesn’t incorporate the explanatory variables.
RDA is traditionally used on response variables such as:
- SNP (single nucleotide polymorphism) genetic data: see example
- Ecological community data: see example1 and example 2
For this example, I have used the vignettes available on the above two links, as well as additional notes from Natalie Hofmeister and myself.
Below I apply the approach to a multivariate phenotypic dataset. The aim is to see if the environmental data associated with the areas from which the wild-caught toads were collected carry any explanatory power for various phenotypic measures. I.e. we want to explore the link between environment and phenotype. As listed above, RDA can help you pick out environmental correlations in genotype data, and could conceivably be used to link genotype measures to that of phenotype.
Use of RDA for the below analysis can be either quantitative or qualitative. The direct RDA outputs may be useful for you, or they may simply serve as an exploratory tool to direct further analysis.
The following data is one generated from my honours research. It contains phenotype variables for a number of cane toads sourced from one of three states: QLD, NT, or WA.
Let’s import the data and have a quick look at it
setwd("C:/Users/z5188231/Desktop/Coding/Files/EnvironmentalComputing")
ToadData <- read.csv("CaneToad_Phenotypes.csv",stringsAsFactors=TRUE,sep=",")
str(ToadData)## 'data.frame': 121 obs. of 19 variables:
## $ Exercise : int 1 0 1 1 1 1 1 1 1 0 ...
## $ Diet : int 0 1 1 1 1 0 1 0 0 0 ...
## $ Source.Location : Factor w/ 3 levels "NT","QLD","WA": 1 1 1 1 1 1 1 1 1 1 ...
## $ Clutch : Factor w/ 8 levels "Innisfail 1",..: 4 4 4 5 5 5 5 5 5 5 ...
## $ Individual.number : int 442 445 453 402 403 404 405 406 407 409 ...
## $ Body.Length : num 28.7 25.4 29.3 28.9 32 32.1 31.9 54.3 24.1 37.5 ...
## $ Leg.length..mm. : num 0.376 0.343 0.334 0.36 0.347 ...
## $ Arm.length..mm. : num 0.254 0.22 0.242 0.235 0.234 ...
## $ Head.Width..mm. : num 0.345 0.331 0.341 0.356 0.338 ...
## $ Heart.Mass..g. : num 0.000627 0.000276 0.000614 0.000519 0.000812 ...
## $ Liver.Mass..g. : num 0.003066 0.000906 0.002628 0.001419 0.002875 ...
## $ Struggle..number.of.kicks. : int 8 3 9 3 0 2 3 3 2 4 ...
## $ Righting.time..s. : int 1 1 1 1 3 1 3 1 1 1 ...
## $ Distance.traveled.in.1.min..cm. : int 573 496 721 775 534 890 365 1140 450 512 ...
## $ Spontaneous.Acitivty...95.quantile..cm.: num 14.4 10.3 31.1 62.6 32.8 ...
## $ exhaustion.score : num 4.33 4.96 3.89 4.42 4.45 ...
## $ Stamina.score : num -0.263 0.301 -0.316 -0.132 -0.135 ...
## $ Recovery.score : num 0.1553 -0.2401 0.1488 -0.0193 0.2098 ...
## $ ln.RadioTelemetry.distance : num NA NA NA NA 6.8 NA 7.1 10.2 NA NA ...We see there is a total of 121 toads in this data set. The first 5 columns of these data denote information about the treatment levels (exercise/diet), source location, clutch, and and individual ID for each toad. The remaining 14 columns the contain phenotypic data.
First we check the data for any missingness, and make sure our matrix does not have any NA’s.
sum(is.na(ToadData)) # we see there is 91 missing data## [1] 91colnames(ToadData)[colSums(is.na(ToadData)) > 0] #explore data to identify where they are located## [1] "ln.RadioTelemetry.distance"ToadData <- ToadData[c(1:18)] #remove offending data
sum(is.na(ToadData)) #0 missing data## [1] 0In the above chunk we discovered 91 pieces of missing information. This is because some variables were not measured across all toads. It appears all the NA’s were in the final variable column - ln.RadioTelemetry.distance. So to prepare our data for the RDA, we need to remove this offending variable. Make sure to find the most appropriate way to resolve this should your data set have missing values. You may be able to remove rows, remove columns, generate an average etc. but you don’t want to reduce your data set drastically or remove too many variables/individuals.
Next we retain only the phenotype data columns, as these are the ones we are interested on conducting the RDA on.
ToadVars <- ToadData[c(6:18)]
str(ToadVars)## 'data.frame': 121 obs. of 13 variables:
## $ Body.Length : num 28.7 25.4 29.3 28.9 32 32.1 31.9 54.3 24.1 37.5 ...
## $ Leg.length..mm. : num 0.376 0.343 0.334 0.36 0.347 ...
## $ Arm.length..mm. : num 0.254 0.22 0.242 0.235 0.234 ...
## $ Head.Width..mm. : num 0.345 0.331 0.341 0.356 0.338 ...
## $ Heart.Mass..g. : num 0.000627 0.000276 0.000614 0.000519 0.000812 ...
## $ Liver.Mass..g. : num 0.003066 0.000906 0.002628 0.001419 0.002875 ...
## $ Struggle..number.of.kicks. : int 8 3 9 3 0 2 3 3 2 4 ...
## $ Righting.time..s. : int 1 1 1 1 3 1 3 1 1 1 ...
## $ Distance.traveled.in.1.min..cm. : int 573 496 721 775 534 890 365 1140 450 512 ...
## $ Spontaneous.Acitivty...95.quantile..cm.: num 14.4 10.3 31.1 62.6 32.8 ...
## $ exhaustion.score : num 4.33 4.96 3.89 4.42 4.45 ...
## $ Stamina.score : num -0.263 0.301 -0.316 -0.132 -0.135 ...
## $ Recovery.score : num 0.1553 -0.2401 0.1488 -0.0193 0.2098 ...We have now finished preparing the phenotype data (the response variables) for our RDA analysis.
Some points to consider:
- Make sure that your response variables are independent. For instance, with this toad data set, I originally had measurements of Body Length, Arm Length, Leg Length etc. These measures would presumably be largely correlated with one another.
- Solution: To deal with this, either omit redundant response data, or compute it in such a way that it represents a more independent aspect of the phenology (in this case I calculated limb/head lengths as a ratio of total body length).
- The data should be normally distributed.
- Solution: Transform variables that do not comply. The transformation type needed depends on the data, so make sure to do your research. A good place to start is the Handbook of Biological Statistics.
The first step here is to grab global climate data from WorldClim using the raster package.
library(raster)
climdata <- getData('worldclim',download=TRUE,var='bio',res=5)This contains 19 different climate variables, measured globally. From this we want to select only the locations that are relevant for our study. Your metadata file should be just two columns: latitude and longitude for each individual (121), written in the format as below (and in the same order as your response data set).
setwd("C:/Users/z5188231/Desktop/Coding/Files/EnvironmentalComputing")
popcoords <- read.csv("metadata.csv",stringsAsFactors=TRUE,sep=",")
head(popcoords)## long lat
## 1 131.3253 -12.5648
## 2 131.3253 -12.5648
## 3 131.3253 -12.5648
## 4 131.3253 -12.5648
## 5 131.3253 -12.5648
## 6 131.3253 -12.5648str(popcoords)## 'data.frame': 121 obs. of 2 variables:
## $ long: num 131 131 131 131 131 ...
## $ lat : num -12.6 -12.6 -12.6 -12.6 -12.6 ...How we have all the sample sites, we can pull out the climate data from global dataset for each individual coordinate, and label them in a human readable format (variable names obtained from the WorldClim website.
points <- SpatialPoints(popcoords, proj4string=climdata@crs)
values <- extract(climdata,points)
envdata <- cbind.data.frame(popcoords,values)
str(envdata)## 'data.frame': 121 obs. of 21 variables:
## $ long : num 131 131 131 131 131 ...
## $ lat : num -12.6 -12.6 -12.6 -12.6 -12.6 ...
## $ bio1 : num 273 273 273 273 273 273 273 273 273 273 ...
## $ bio2 : num 111 111 111 111 111 111 111 111 111 111 ...
## $ bio3 : num 59 59 59 59 59 59 59 59 59 59 ...
## $ bio4 : num 1851 1851 1851 1851 1851 ...
## $ bio5 : num 352 352 352 352 352 352 352 352 352 352 ...
## $ bio6 : num 164 164 164 164 164 164 164 164 164 164 ...
## $ bio7 : num 188 188 188 188 188 188 188 188 188 188 ...
## $ bio8 : num 281 281 281 281 281 281 281 281 281 281 ...
## $ bio9 : num 244 244 244 244 244 244 244 244 244 244 ...
## $ bio10: num 292 292 292 292 292 292 292 292 292 292 ...
## $ bio11: num 244 244 244 244 244 244 244 244 244 244 ...
## $ bio12: num 1395 1395 1395 1395 1395 ...
## $ bio13: num 334 334 334 334 334 334 334 334 334 334 ...
## $ bio14: num 1 1 1 1 1 1 1 1 1 1 ...
## $ bio15: num 103 103 103 103 103 103 103 103 103 103 ...
## $ bio16: num 887 887 887 887 887 887 887 887 887 887 ...
## $ bio17: num 4 4 4 4 4 4 4 4 4 4 ...
## $ bio18: num 391 391 391 391 391 391 391 391 391 391 ...
## $ bio19: num 4 4 4 4 4 4 4 4 4 4 ...colnames(envdata) <- c("long","lat",
"AnnualMeanTemp","MeanDiurnalRange",
"Isothermality","TempSeasonality",
"MaxTempofWarmestMonth","MinTempofColdestMonth",
"TempAnnualRange","MeanTempofWettestQuarter",
"MeanTempofDriestQuarter","MeanTempofWarmestQuarter",
"MeanTempofColdestQuarter","AnnualPrecipitation",
"PrecipitationofWettestMonth","PrecipitationofDriestMonth",
"PrecipitationSeasonality","PrecipitationofWettestQuarter",
"PrecipitationofDriestQuarter","PrecipitationofWarmestQuarter",
"PrecipitationofColdestQuarter")
str(envdata)## 'data.frame': 121 obs. of 21 variables:
## $ long : num 131 131 131 131 131 ...
## $ lat : num -12.6 -12.6 -12.6 -12.6 -12.6 ...
## $ AnnualMeanTemp : num 273 273 273 273 273 273 273 273 273 273 ...
## $ MeanDiurnalRange : num 111 111 111 111 111 111 111 111 111 111 ...
## $ Isothermality : num 59 59 59 59 59 59 59 59 59 59 ...
## $ TempSeasonality : num 1851 1851 1851 1851 1851 ...
## $ MaxTempofWarmestMonth : num 352 352 352 352 352 352 352 352 352 352 ...
## $ MinTempofColdestMonth : num 164 164 164 164 164 164 164 164 164 164 ...
## $ TempAnnualRange : num 188 188 188 188 188 188 188 188 188 188 ...
## $ MeanTempofWettestQuarter : num 281 281 281 281 281 281 281 281 281 281 ...
## $ MeanTempofDriestQuarter : num 244 244 244 244 244 244 244 244 244 244 ...
## $ MeanTempofWarmestQuarter : num 292 292 292 292 292 292 292 292 292 292 ...
## $ MeanTempofColdestQuarter : num 244 244 244 244 244 244 244 244 244 244 ...
## $ AnnualPrecipitation : num 1395 1395 1395 1395 1395 ...
## $ PrecipitationofWettestMonth : num 334 334 334 334 334 334 334 334 334 334 ...
## $ PrecipitationofDriestMonth : num 1 1 1 1 1 1 1 1 1 1 ...
## $ PrecipitationSeasonality : num 103 103 103 103 103 103 103 103 103 103 ...
## $ PrecipitationofWettestQuarter: num 887 887 887 887 887 887 887 887 887 887 ...
## $ PrecipitationofDriestQuarter : num 4 4 4 4 4 4 4 4 4 4 ...
## $ PrecipitationofWarmestQuarter: num 391 391 391 391 391 391 391 391 391 391 ...
## $ PrecipitationofColdestQuarter: num 4 4 4 4 4 4 4 4 4 4 ...library(vegan)
ToadVars.rda <- rda(ToadVars ~ AnnualMeanTemp + TempSeasonality + MinTempofColdestMonth + AnnualPrecipitation, data=VIFpredictions, scale=T)
ToadVars.rda## Call: rda(formula = ToadVars ~ AnnualMeanTemp + TempSeasonality +
## MinTempofColdestMonth + AnnualPrecipitation, data =
## VIFpredictions, scale = T)
##
## Inertia Proportion Rank
## Total 13.00000 1.00000
## Constrained 0.89205 0.06862 4
## Unconstrained 12.10795 0.93138 13
## Inertia is correlations
##
## Eigenvalues for constrained axes:
## RDA1 RDA2 RDA3 RDA4
## 0.5232 0.1982 0.1340 0.0366
##
## Eigenvalues for unconstrained axes:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10
## 2.7133 1.9616 1.3353 1.1553 0.9783 0.9078 0.7673 0.7193 0.5474 0.3590
## PC11 PC12 PC13
## 0.2879 0.2192 0.1561We can now check how much variation is explained by out RDA.
RsquareAdj(ToadVars.rda)## $r.squared
## [1] 0.06861938
##
## $adj.r.squared
## [1] 0.036502813.6% is not large, but then consider that phenotype is the end result of a large number of factors (genetics, epigenetics etc…), of which environmental effects are only one! And our environmental data set does not even use all environment data.
screeplot(ToadVars.rda) #to determine how many RDA axes there are / are important if there are many
signif.full <- anova.cca(ToadVars.rda, parallel=getOption("mc.cores"))
signif.full # test for model significance## Permutation test for rda under reduced model
## Permutation: free
## Number of permutations: 999
##
## Model: rda(formula = ToadVars ~ AnnualMeanTemp + TempSeasonality + MinTempofColdestMonth + AnnualPrecipitation, data = VIFpredictions, scale = T)
## Df Variance F Pr(>F)
## Model 4 0.8921 2.1366 0.001 ***
## Residual 116 12.1079
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1vif.cca(ToadVars.rda) # HERE IT IS! Get variance inflation factors for each predictor (refer to step 2.2, method three) ## AnnualMeanTemp TempSeasonality MinTempofColdestMonth
## 46.660123 2.297345 17.553396
## AnnualPrecipitation
## 19.625697The last quality check repots back some high VIF values. This is not entirely unexpected as our VIFpredictions dataset had high collinearity for AnnualMeanTemp. If we were to redo this with PCApredictions we may have lower numbers, indicating it would be better to use the variables generated by the PCA.
We will proceed with graphing our RDA.
plot(ToadVars.rda, scaling=3)
RDA may be interpreted similarly to PCA, with each ordination axis capturing some of the data variance. The blue arrows indicate the magnitude and direction of the environmental predictors. The arrangement of the circles (the toads), the red crosses (our phenotypic measurements) and the environmental predictors against the different axis informs us what is loaded on each of said axis.
Now we can format it to make it more intuitive.
location <- c((rep("NT",30)),(rep("QLD",57)),(rep("WA",34)))
VIFpredictions <- cbind(VIFpredictions,location) str(VIFpredictions)
## 'data.frame': 121 obs. of 5 variables:
## $ AnnualMeanTemp : num 273 273 273 273 273 273 273 273 273 273 ...
## $ TempSeasonality : num 1851 1851 1851 1851 1851 ...
## $ MinTempofColdestMonth: num 164 164 164 164 164 164 164 164 164 164 ...
## $ AnnualPrecipitation : num 1395 1395 1395 1395 1395 ...
## $ location : Factor w/ 3 levels "NT","QLD","WA": 1 1 1 1 1 1 1 1 1 1 ...levels(VIFpredictions$location) <- c("NT","QLD","WA")
eco <- VIFpredictions$location
bg <- c("#2ACBB3","#602ACB","#A2C61C")
plot(ToadVars.rda, type="n", scaling=3)
points(ToadVars.rda, display="species", pch=20, cex=0.7, col="gray32", scaling=3) # the phenotypes
points(ToadVars.rda, display="sites", pch=21, cex=1.3, col="gray32", scaling=3, bg=bg[eco]) # the toads
text(ToadVars.rda, scaling=3, display="bp", col="#0868ac", cex=1) # the environmental predictors
legend("bottomright", legend=levels(eco), bty="n", col="gray32", pch=21, cex=1, pt.bg=bg)
Trying to label the phenotypic variables
plot(ToadVars.rda, type="n", scaling=3, choices=c(1,3))
points(ToadVars.rda, display="species", pch=20, cex=0.7, col="gray32", scaling=3) # the phenotypes
#text (ToadVars.rda, display="species", cex=0.65, pos=3,col="red")
points(ToadVars.rda, display="species", pch=21, cex=1.3, col="gray32", scaling=3) # the toads
points(ToadVars.rda, display="sites", pch=21, cex=1.3, col="gray32", scaling=3, bg=bg[eco]) # the toads
text(ToadVars.rda, scaling=3, display="bp", col="#0868ac", cex=1) # the environmental predictors
legend("bottomright", legend=levels(eco), bty="n", col="gray32", pch=21, cex=1, pt.bg=bg)
load.rda <- scores(ToadVars.rda, choices=c(1:3), display="species")
outliers <- function(x,z){
lims <- mean(x) + c(-1, 1) * z * sd(x) # find loadings +/-z sd from mean loading
x[x < lims[1] | x > lims[2]] # locus names in these tails
}
# check for normal distribution
hist(load.rda[,1], main="Loadings on RDA1")
hist(load.rda[,2], main="Loadings on RDA2")
hist(load.rda[,3], main="Loadings on RDA3")
# get number of candidates > 1.5 SDs from mean loading
cand1 <- outliers(load.rda[,1],1.5)
cand2 <- outliers(load.rda[,2],1.5)
cand3 <- outliers(load.rda[,2],1.5)
cand1## Leg.length..mm. Recovery.score
## -0.7828865 0.3592628cand2## Distance.traveled.in.1.min..cm. exhaustion.score
## 0.3529830 -0.4115757cand3## Distance.traveled.in.1.min..cm. exhaustion.score
## 0.3529830 -0.4115757








