# (PART) Part two – Soil Sampling Design {-}
# Determining the minimum sampling size
Several strategies exist for designing soil sampling, including regular, random, and stratified sampling. Each strategy comes with its own set of advantages and limitations, which must be carefully considered before commencing a soil sampling campaign. Regular sampling, also called grid sampling, is straightforward and ensures uniform coverage, making it suitable for spatial analysis and detecting trends. However, it may introduce bias and miss small–scale variability. Generally, random sampling may require a larger number of samples to accurately capture soil variability compared to stratified sampling, which is more targeted. Nonetheless, from a statistical standpoint, random sampling is often preferred. It effectively minimizes selection bias by giving every part of the study area an equal chance of being selected. This approach yields a sample that is truly representative of the entire population, leading to more accurate, broadly applicable conclusions. Random sampling also supports valid statistical inferences, ensures reliability of results, and simplifies the estimation of errors, thereby facilitating a broad spectrum of statistical analyses.
The determination of both the number and locations of soil samples is an important element in the success of any sampling campaign. The chosen strategy directly influences the representativeness and accuracy of the soil data collected, which in turn impacts the quality of the conclusions drawn from the study.
In this manual, we use data for the Nghe An province of Vietnam, as stored in the Google Earth Engine repository of the FAO-GSP (digital-soil-mapping-gsp-fao). We want to determine the minimum number of soil samples that must be collected to capture at least 95% of the variability within the environmental covariates. The procedure starts by randomly distributing a small number of samples in the area, extracting the values of the spatial covariates at those locations, and comparing them with the values representing the whole diversity of the area at pixel scale. The comparisons are made using the `'Kullback–Leibler divergence (KL)'` – a measure of how much the probability distribution of the covariate information in the sample differs from that of the population, i.e. the covariate space of the whole area. We also calculate the `'% of representativeness'` – the percentage of the variability of covariate information in the complete area that is captured by the variability of covariate information in the sample dataset.
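For intuition, the KL divergence can be computed on any pair of binned count vectors with `entropy::KL.empirical()`, the same function used in the workflow below. The following is a minimal sketch with hypothetical counts only, not part of the analysis:
```{r kl-sketch-04, eval=FALSE, include=TRUE}
# Sketch with hypothetical counts: KL divergence between two binned distributions
library(entropy)
pop.counts    <- c(120, 340, 510, 290, 80) # covariate bins, whole area (toy values)
sample.counts <- c(4, 11, 18, 9, 3)        # same bins, candidate sample (toy values)
KL.empirical(pop.counts, sample.counts)    # 0 would mean identical distributions
```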
The initial section of the script deals with the set-up options of the methodology. We load the R packages, define the working directory, load the covariate data, and store it as a `SpatRaster` object. We also define variables that control several aspects of the analyses, such as the aggregation factor for the covariates (optional), the creation of a raster stack object (required by the `clhs` function), the initial and final number of samples in the trials, the increment step between trials, and the number of iterations within each trial.
```{r setup_wd_03, eval=FALSE, include=FALSE, echo=FALSE}
# List of required packages
packages <- c("sp","terra","raster","sf","clhs", "sgsR","entropy", "tripack",
"manipulate","dplyr","plotly","synoptReg")
lapply(packages, require, character.only = TRUE) # Load packages
rm(packages) # Remove object to save memory space
# Set working directory to source file location
setwd(dirname(rstudioapi::getActiveDocumentContext()$path))
#setwd("../") # Move wd down to main folder
```
```{r user-variables-04, eval=TRUE, include=TRUE}
# Path to rasters
raster.path <- "data/rasters"
# Path to shapes
shp.path <- "data/shapes"
# Path to results
results.path <- "data/results/"
# Path to additional data
other.path <- "data/other/"
# Aggregation factor for up-scaling raster covariates (optional)
agg.factor <- 5
```
As in the previous section, the covariates are PCA-transformed to avoid collinearity in the data, and the principal component rasters that jointly represent at least 99% of the information are retained for the analyses.
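For intuition on the 99% cutoff, the sketch below applies the same rule to an ordinary matrix with `prcomp()`; the data are simulated and the object names are illustrative only (the chunk that follows applies the same idea to the raster stack via `raster_pca`):
```{r pca-cutoff-sketch-04, eval=FALSE, include=TRUE}
# Toy sketch of the cumulative-variance cutoff used below (simulated data)
set.seed(1)
m <- matrix(rnorm(500), ncol = 5)
pca.toy <- prcomp(m, scale. = TRUE)
cum.var <- summary(pca.toy)$importance[3, ] # row 3 = cumulative proportion
which(cum.var > 0.99)[1]                    # first PC where 99% is exceeded
```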
```{r load_data_03, eval=TRUE, include=TRUE, warning=FALSE}
## Load raster covariate data
# Read Spatial data covariates as rasters with terra
cov.dat <- list.files(raster.path, pattern = "tif$", recursive = TRUE, full.names = TRUE)
cov.dat <- terra::rast(cov.dat) # SpatRaster from terra
# Aggregate stack to simplify data rasters for calculations
cov.dat <- aggregate(cov.dat, fact=agg.factor, fun="mean")
# Load shape of district
nghe <- sf::st_read(file.path(shp.path, "Nghe_An.shp"), quiet = TRUE)
# Crop covariates on administrative boundary
cov.dat <- crop(cov.dat, nghe, mask = TRUE)
# Simplify raster information with PCA
pca <- raster_pca(cov.dat)
# Get SpatRaster layers
cov.dat <- pca$PCA
# Create a raster stack to be used as input in the clhs::clhs function
cov.dat.ras <- raster::stack(cov.dat)
# Subset rasters to the PCs covering 99% of the cumulative variance
# (row 3 of summaryPCA holds the cumulative proportion of variance)
cov.dat <- pca$PCA[[1:first(which(pca$summaryPCA[3, ] > 0.99))]]
cov.dat.ras <- cov.dat.ras[[1:first(which(pca$summaryPCA[3, ] > 0.99))]]
# Convert covariate rasters to a data frame
cov.dat.df <- as.data.frame(cov.dat)
```
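As a quick check (assuming the `pca` and `cov.dat` objects created above), the retained variance can be inspected directly:
```{r pca-check-04, eval=FALSE, include=TRUE}
# Quick check of the PCA results (assumes pca and cov.dat from the chunk above)
pca$summaryPCA   # rows: standard deviation, proportion, cumulative proportion
nlyr(cov.dat)    # number of principal component layers retained
```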
Fig. \@ref(fig:fig-5) shows the covariates retained for the analyses.
```{r fig-5, eval=TRUE, include=TRUE, fig.cap="Plot of the covariates"}
## Plot covariates
plot(cov.dat)
```
```{r variable_inicialization_03, eval=TRUE, include=TRUE}
## Define the number of samples to be tested in a loop (from initial to final) and the step of the sequence
initial.n <- 50 # Initial sample size to test
final.n <- 300  # Final sample size to test
by.n <- 10      # Increment of sample size between trials
iters <- 10     # Number of iterations for each sample size
```
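These settings imply the following candidate sample sizes, each evaluated `r iters` times in the loop below (shown here only as a check):
```{r trial-sizes-04, eval=FALSE, include=TRUE}
# Candidate sample sizes evaluated in the loop (50, 60, ..., 300)
seq(initial.n, final.n, by = by.n)
```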
The second section performs the divergence and representativeness analyses of the sampling scheme.
The analyses are run in a loop with a growing number of samples at each trial, and empty vectors are first defined to store the output of each loop. For each trial of sample size `'N'`, soil samples are placed at the locations where the amount of information in the covariates is maximized, according to the conditioned Latin hypercube sampling method in the `'clhs'` package [@Roudier2011]. For each sample size, `r iters` replicates are calculated to determine the inter-iteration variability of the KL divergence and the representativeness; the final result for each sample size is the mean over these iterations. The minimum sample size selected corresponds to the size that accounts for at least 95% of the variability of information in the covariates within the area, and the proposed optimal sampling scheme is the replicate at the minimum sample size with the highest representativeness.
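Before running the full loop, it may help to look at a single cLHS draw. This sketch (with an arbitrary size and a reduced number of optimization iterations, assuming `cov.dat.ras` from above) shows the structure of the object from which the loop extracts `sampled_data`:
```{r clhs-sketch-04, eval=FALSE, include=TRUE}
# Sketch of a single cLHS draw (arbitrary size; assumes cov.dat.ras from above)
one.draw <- clhs(cov.dat.ras, size = 50, iter = 1000,
                 progress = FALSE, simple = FALSE)
class(one.draw$sampled_data)      # spatial points at the selected locations
head(one.draw$sampled_data@data)  # covariate values at those points
```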
```{r sampling-loop-04, include=TRUE}
# Define empty vectors to store results
number_of_samples <- c()
prop_explained <- c()
klo_samples <- c()
samples_storage <- list()
for (trial in seq(initial.n, final.n, by = by.n)) {
for (iteration in 1:iters) {
# Generate stratified clhs samples
p.dat_I <- clhs(cov.dat.ras,
size = trial, iter = 10000,
progress = FALSE, simple = FALSE)
# Get covariate values for each point
p.dat_I <- p.dat_I$sampled_data
# Get the covariate values at points as dataframe and delete NAs
p.dat_I.df <- as.data.frame(p.dat_I@data) %>%
na.omit()
# Store samples as list with point coordinates
samples_storage[[paste0("N", trial, "_", iteration)]] <- p.dat_I
## Comparison of population and sample distributions - Kullback-Leibler (KL) divergence
# Number of bins used to discretize the range of each covariate
nb <- 25
# Matrix of break points defining equal-width bins for each covariate
q.mat <- matrix(NA, nrow = (nb + 1), ncol = nlyr(cov.dat))
for (i in 1:nlyr(cov.dat)) {
  ran1 <- minmax(cov.dat[[i]])[2] - minmax(cov.dat[[i]])[1]
  step1 <- ran1 / nb
  q.mat[, i] <- seq(minmax(cov.dat[[i]])[1],
                    to = minmax(cov.dat[[i]])[2],
                    by = step1)
}
# q.mat
# Hypercube of covariates in study area
# Initialize the covariate matrix
cov.mat <- matrix(1, nrow = nb, ncol = ncol(q.mat))
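# Counts start at 1 (a Laplace-style offset) so that no bin is empty and the KL terms remain finite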
cov.dat.mx <- as.matrix(cov.dat.df)
for (i in 1:nrow(cov.dat.mx)) {
for (j in 1:ncol(cov.dat.mx)) {
dd <- cov.dat.mx[[i, j]]
if (!is.na(dd)) {
for (k in 1:nb) {
kl <- q.mat[k, j]
ku <- q.mat[k + 1, j]
if (dd >= kl && dd <= ku) {
cov.mat[k, j] <- cov.mat[k, j] + 1
}
}
}
}
}
# Compare whole study area covariate space with the selected sample
# Sample data hypercube (the same as for the raster data but on the sample data)
h.mat <- matrix(1, nrow = nb, ncol = ncol(q.mat))
for (i in 1:nrow(p.dat_I.df)) {
for (j in 1:ncol(p.dat_I.df)) {
dd <- p.dat_I.df[i, j]
if (!is.na(dd)) {
for (k in 1:nb) {
kl <- q.mat[k, j]
ku <- q.mat[k + 1, j]
if (dd >= kl && dd <= ku) {
h.mat[k, j] <- h.mat[k, j] + 1
}
}
}
}
}
## Compute Kullback-Leibler (KL) divergence between population and sample hypercubes
kl.index <- c()
for (i in 1:ncol(cov.dat.df)) {
  kl <- KL.empirical(c(cov.mat[, i]), c(h.mat[, i]))
  kl.index <- c(kl.index, kl)
}
# Mean KL divergence across covariates for this iteration
klo <- mean(kl.index)
## Calculate the proportion of "env. variables" in the covariate spectra that fall within the convex hull of variables in the "environmental sample space"
# Principal component of the data sample
pca.s = prcomp(p.dat_I.df, scale = TRUE, center = TRUE)
scores_pca1 = as.data.frame(pca.s$x)
# Delaunay triangulation of the first two PCs of the sample
rand.tr <- tri.mesh(scores_pca1[, 1], scores_pca1[, 2], "remove")
# Convex hull of the sample scores
rand.ch <- convex.hull(rand.tr, plot.it = FALSE)
# Save the convex hull vertices
pr_poly <- cbind(x = c(rand.ch$x), y = c(rand.ch$y))
# Project the study area population onto the sample principal components
PCA_projection <- predict(pca.s, cov.dat.df)
# PC scores of the projected population
newScores <- cbind(x = PCA_projection[, 1], y = PCA_projection[, 2])
# Check which population points fall within the sample convex hull
pip <- point.in.polygon(newScores[, 2], newScores[, 1],
                        pr_poly[, 2], pr_poly[, 1],
                        mode.checked = FALSE)
newScores <- data.frame(cbind(newScores, pip))
klo_samples <- c(klo_samples, klo)
prop_explained <- c(prop_explained, sum(newScores$pip) / nrow(newScores) * 100)
number_of_samples <- c(number_of_samples, trial)
# Optional: report progress for each trial and iteration
# print(paste("N samples =", trial, "out of", final.n,
#             "; iteration =", iteration, "; KL =", klo,
#             "; Proportion =", sum(newScores$pip) / nrow(newScores) * 100))
}
}
```
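As a side note, the nested binning loops above are easy to follow but slow for large rasters. A nearly equivalent vectorized count can be obtained with `findInterval()`; this sketch reuses `q.mat`, `nb`, and `p.dat_I.df` from the loop, and differs only in that values lying exactly on an interior break are counted once rather than twice:
```{r binning-sketch-04, eval=FALSE, include=TRUE}
# Sketch: vectorized bin counts for the sample data (assumes q.mat, nb, p.dat_I.df)
h.fast <- sapply(seq_len(ncol(q.mat)), function(j) {
  bins <- findInterval(p.dat_I.df[, j], q.mat[, j], rightmost.closed = TRUE)
  tabulate(bins, nbins = nb) + 1  # +1 mirrors the offset used in h.mat
})
```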
Figure \@ref(fig:fig-7) shows the distribution of covariates in the sample space, and Figure \@ref(fig:fig-7a) shows the variability in the estimates of KL divergence and representativeness percentage across the `r iters` iterations within each sample size.
```{r fig-7, eval=TRUE, include=TRUE, fig.cap="Distribution of covariates in the sample space"}
# Plot the polygon and all points to be checked
plot(newScores[,1:2], xlab="PCA 1", ylab="PCA 2",
xlim=c(min(newScores[,1:2], na.rm = T), max(newScores[,1:2], na.rm = T)),
ylim=c(min(newScores[,1:2], na.rm = T), max(newScores[,1:2], na.rm = T)),
col='black',
     main='Environmental space of the study area over the convex hull of the sample')
polygon(pr_poly,col='#99999990')
# Plot the population points that fall outside the sample convex hull
points(newScores[which(newScores$pip==0),1:2], col='red', pch=12, cex =1)
```
```{r results_03, eval=TRUE, include=FALSE}
## Merge data from number of samples, KL divergence and % representativeness
results <- data.frame(number_of_samples,klo_samples,prop_explained)
names(results)<-c("N","KL","Perc")
# Calculate mean results by N size
mean_result <- results %>%
group_by(N) %>%
summarize_all(mean)
mean_result
```
```{r fig-7a, fig.cap="Boxplot of the dispersion in KL divergence and % representativeness across the iteration trials for each sample size", fig.width=8, fig.height=5, eval=TRUE, include=TRUE}
## Plot dispersion on KL and % by N
par(mar = c(5, 4, 5, 5))
boxplot(Perc ~ N, data=results, col = rgb(1, 0.1, 0, alpha = 0.5),ylab = "%")
mtext("KL divergence",side=4,line=3)
# Add new plot
par(new = TRUE,mar=c(5, 4, 5, 5))
# Box plot
boxplot(KL ~ N, data=results, axes = FALSE,outline = FALSE,
col = rgb(0, 0.8, 1, alpha = 0.5), ylab = "")
axis(4, at=seq(0.02, 0.36, by=.06), label=seq(0.02, 0.36, by=.06), las=3)
# Draw legend
par(xpd=TRUE)
legend("top", inset=c(1,-.15) ,c("% coincidence", "KL divergence"), horiz=T,cex=.9,
box.lty=0,fill = c(rgb(1, 0.1, 0, alpha = 0.5), rgb(0, 0.8, 1, alpha = 0.5)))
```
```{r margins, eval=TRUE, include=FALSE}
par(xpd=FALSE)
```
```{r results-table-03, eval=TRUE, include=FALSE}
# knitr::kable(
# mean_result, booktabs = TRUE,
# caption = 'Mean results for each trial of sample size at increasing steps of 10 samples.')
```
We fit a negative exponential decay model to the normalized mean KL divergence as a function of sample size, use it to determine the minimum sample size, and plot the evaluation results.
```{r minimum_03, eval=TRUE, include=FALSE}
# Fit an exponential decay model to the mean KL divergence
x <- mean_result$N
# Normalize the KL divergence to the 0-1 range
y <- (mean_result$KL - min(mean_result$KL)) / (max(mean_result$KL) - min(mean_result$KL))
# Starting values for the decay parameters
k  <- 2
b0 <- 0.01
b1 <- 0.01
# Fit the model y = k * exp(-b1 * x) + b0
fit1 <- nls(y ~ k * exp(-b1 * x) + b0,
            start = list(k = k, b0 = b0, b1 = b1),
            control = list(maxiter = 500), trace = TRUE)
summary(fit1)
# Plot the fit
xx <- seq(1, final.n, 1)
plot(x, y)
lines(xx, predict(fit1, list(x = xx)))
# Predict the fitted decay over the whole range of sample sizes
jj <- predict(fit1, list(x = xx))
normalized <- 1 - (jj - min(jj)) / (max(jj) - min(jj))
# Minimum sample size accounting for 95% of the cumulative decay in KL divergence
minimum_n <- length(which(normalized < 0.95)) + 1
```
Figure \@ref(fig:fig-8) shows the cumulative distribution function (cdf) of the decay in KL divergence, together with the fitted KL divergence itself, for growing sample sizes. The cdf (a proxy for representativeness) increases with sample size, while the KL divergence decreases, as expected. The red dot identifies the trial with the minimum sample size for the area in relation to the covariates analysed.
```{r fig-8, fig.cap="KL Divergence and Proportion of Representativeness as function of sample size",fig.width=8, fig.height=5, eval=TRUE, include=TRUE}
## Plot cdf and minimum sampling point
x <- xx
y <- normalized
mydata <- data.frame(x,y)
opti <- mydata[mydata$x==minimum_n,]
plot_ly(mydata,
x = ~x,
y = ~normalized,
mode = "lines+markers",
type = "scatter",
name = "CDF (1–KL divergence)") %>%
add_trace(x = ~x,
y = ~jj,
mode = "lines+markers",
type = "scatter",
yaxis = "y2",
name = "KL divergence") %>%
add_trace(x = ~opti$x,
y = ~opti$y,
yaxis = "y",
mode = "markers",
name = "Minimum N",
marker = list(size = 8, color = '#d62728',line = list(color = 'black', width = 1))) %>%
layout(xaxis = list(title = "N",
showgrid = T,
dtick = 50,
tickfont = list(size = 11)),
yaxis = list(title = "1–KL divergence (% CDF)", showgrid = F ),
yaxis2 = list(title = "KL divergence",
overlaying = "y", side = "right"),
legend = list(orientation = "h", y = 1.2, x = 0.1,
traceorder = "normal"),
margin = list(t = 50, b = 50, r = 100, l = 80),
hovermode = 'x') %>%
config(displayModeBar = FALSE)
```
According to Figure \@ref(fig:fig-8), the minimum sampling size for the area, which captures at least 95% of the environmental variability of the covariates, is N = `r minimum_n`.
Finally, we determine the optimal distribution of samples over the study area according to these results, taking into account the minimum sample size and the increment interval between trials. The result is shown in Figure \@ref(fig:fig-9).
```{r fig-9, fig.cap="Covariates and optimal distribution of samples", eval=TRUE, include=TRUE}
## Select the iteration(s) whose N is closest to the minimum sample size
optimal_iteration <- results[which(abs(results$N - minimum_n) ==
                                     min(abs(results$N - minimum_n))), ] %>%
  mutate(IDX = 1:n()) %>%           # iteration index within this sample size
  filter(Perc == max(Perc))         # keep the most representative iteration
# Retrieve the sample points of the best iteration
N_final <- samples_storage[paste0("N", optimal_iteration$N, "_", optimal_iteration$IDX)][[1]]
plot(cov.dat[[1]])
points(N_final)
```
In summary, we use the variability within the covariate data to ascertain the minimum number of samples required to capture at least 95% of that variability. Our approach assesses the similarity between the sample space and the population space (the study area) through the Kullback–Leibler (KL) divergence and the percentage of representativeness at successive increments of sample size. These results are then used to fit a model of the expected representativeness as a function of sample size, which guides the choice of the minimum sample size necessary to represent at least 95% of the environmental diversity within the area.