We demonstrate the necessity of the spatial explicit
discretization (discretization with spatial soft constraints) method
using the Depression.csv
data from the gdverse
package. In this analysis, we select Depression_prevalence
as the dependent variable Y
and PopulationDensity
as the independent variable X. X is randomly shuffled, and the
q-value is calculated using both discretization methods that do and do
not account for spatial soft constraints, with the process repeated 3000
times. We selected hierarchical clustering with spatial soft constraints
and hierarchical clustering without spatial soft constraints method to
represent explicit and implicit spatial discretization, respectively. At
the same time, natural breaks method and robust discretization are used
as the control in this experiment. The hierarchical clustering with
spatial soft constraints method, hierarchical clustering without spatial
soft constraints method and the natural breaks method were implemented
using the sdsfun
package, and the robust discretization is
implemented by gdverse
package. In subsequent analyses, all
discretization methods applied will discretize the dataset into five
strata for the calculation of the q-values.
dt = system.file('extdata/Depression.csv',package = 'gdverse') |>
readr::read_csv() |>
sf::st_as_sf(coords = c('X','Y'), crs = 4326) |>
dplyr::select(y = Depression_prevelence,
x = PopulationDensity)
## Rows: 1072 Columns: 13
## ── Column specification ───────────────────────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (13): X, Y, Depression_prevelence, PopulationDensity, Population65, NoHealthInsurance, Neig...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
sdsfun::moran_test(dt)
## *** global moran test
Variable | MoranI | EI | VarI | zI | pI |
---|---|---|---|---|---|
y | 0.339557*** | -0.0009337 | 0.0003192 | 19.06 | 2.892e-81 |
x | 0.365364*** | -0.0009337 | 0.0003192 | 20.5 | 1.052e-93 |
geom = sf::st_geometry(dt)
y = dt$y
x = dt$x
x_naturaldisc = sdsfun::discretize_vector(x,5,method = "natural")
x_hclustdisc = sdsfun::hclustgeo_disc(dplyr::select(dt,x),5,alpha = 0)
x_robustdisc = gdverse::robust_disc(y ~ .,
data = tibble::tibble(y = y,x = x),
discnum = 5, cores = 1) |>
dplyr::pull(1) |>
as.factor() |>
as.integer()
qv_naturaldisc = gdverse::factor_detector(y,x_naturaldisc)[[1]]
qv_naturaldisc
## [1] 0.07985365
qv_hclustdisc = gdverse::factor_detector(y,x_hclustdisc)[[1]]
qv_hclustdisc
## [1] 0.07315131
qv_robustdisc = gdverse::factor_detector(y,x_robustdisc)[[1]]
qv_robustdisc
## [1] 0.1199684
In the repeated experiment, X is shuffled 3000 times, and the global Moran’s I of the original independent variable X serves as the degree of spatial soft constraint (0.365 in this case). Subsequently, the q-values for each simulation round is computed.
mc_simq = \(times = 1000, cores = 6){
doclust = FALSE
if (inherits(cores, "cluster")) {
doclust = TRUE
} else if (cores > 1) {
doclust = TRUE
cores = parallel::makeCluster(cores)
on.exit(parallel::stopCluster(cores), add = TRUE)
}
calcul_q = \(n_sim) {
x_sim = sample(x)
x_sim = sf::st_set_geometry(tibble::tibble(x_sim = x_sim),geom)
g = sdsfun::moran_test(x_sim)
moran = dplyr::pull(g$result,2)
moran_p = dplyr::pull(g$result,6)
x_sed = sdsfun::hclustgeo_disc(x_sim, n = 5, alpha = 0.365)
qv1 = gdverse::factor_detector(y,x_sed)[[1]]
res = tibble::tibble(qv_sed = qv1, moran = moran,
moran_p = moran_p)
return(res)
}
if (doclust) {
parallel::clusterExport(cores,varlist = c("geom","x","y"))
out_g = parallel::parLapply(cores,seq(1,times,by = 1),calcul_q)
out_g = tibble::as_tibble(do.call(rbind, out_g))
} else {
out_g = purrr::map_dfr(seq(1,times,by = 1),calcul_q)
}
return(out_g)
}
qv = mc_simq(times = 3000, cores = 12)
qv
## # A tibble: 3,000 × 3
## qv_sed moran moran_p
## <dbl> <dbl> <dbl>
## 1 0.182 0.0000884 0.477
## 2 0.182 0.00891 0.291
## 3 0.195 -0.0236 0.898
## 4 0.194 -0.0201 0.858
## 5 0.195 -0.00266 0.538
## 6 0.188 -0.00397 0.567
## 7 0.181 -0.0275 0.931
## 8 0.156 0.00970 0.276
## 9 0.167 0.0132 0.215
## 10 0.165 -0.0266 0.925
## # ℹ 2,990 more rows
ggplot2::ggplot(data = qv, ggplot2::aes(x = qv_sed)) +
ggplot2::geom_histogram(ggplot2::aes(y = ggplot2::after_stat(density)),
bins = 50, fill = "skyblue",
color = "black", alpha = 0.7) +
ggplot2::geom_density(color = "red", linewidth = 1.2) +
ggplot2::scale_x_continuous(expand = c(0.01, 0)) +
ggplot2::scale_y_continuous(expand = c(0, 0)) +
ggplot2::labs(x = "Q Value",
y = "Density") +
ggplot2::theme_classic()
We selected simulated q-values that are more consistent with the actual situation, specifically those for which the global Moran’s I of the permuted data is statistically significant (p-value < 0.05) and greater than zero.
qv_sign = dplyr::filter(qv,moran>=0&moran_p<=0.05)
qv_sign
## # A tibble: 163 × 3
## qv_sed moran moran_p
## <dbl> <dbl> <dbl>
## 1 0.173 0.0428 0.00721
## 2 0.174 0.0343 0.0243
## 3 0.179 0.0359 0.0196
## 4 0.203 0.0286 0.0492
## 5 0.205 0.0297 0.0430
## 6 0.168 0.0494 0.00243
## 7 0.156 0.0292 0.0458
## 8 0.197 0.0295 0.0444
## 9 0.182 0.0374 0.0160
## 10 0.161 0.0384 0.0139
## # ℹ 153 more rows
ggplot2::ggplot(data = qv_sign, ggplot2::aes(x = qv_sed)) +
ggplot2::geom_histogram(ggplot2::aes(y = ggplot2::after_stat(density)),
bins = 50, fill = "skyblue",
color = "black", alpha = 0.7) +
ggplot2::geom_density(color = "red", linewidth = 1.2) +
ggplot2::scale_x_continuous(expand = c(0.01, 0)) +
ggplot2::scale_y_continuous(expand = c(0, 0)) +
ggplot2::labs(x = "Q Value",
y = "Density") +
ggplot2::theme_classic()
The q-value derived from classifying into five categories using the natural breaks method is 0.07985, while the q-value from five-category hierarchical clustering without spatial constraints is 0.07315. In comparison, the q-value obtained through robust discretization into five categories is 0.11997. In the Monte Carlo simulation experiment, the q-values range from 0.11659 to 0.22601. Among randomized simulations that best align with actual conditions, the q-value range spans from 0.11952 to 0.21663. These results indicate that spatially implicit discretization tends to underestimate the q-value of the variable under moderate spatial autocorrelation, highlighting the importance of spatially explicit discretization. The results also reveal that robust discretization, which accounts for the relationship between independent and dependent variables, can effectively reduce the q-value estimation bias introduced by spatially implicit discretization methods. However, it still underestimates the q-value by approximately 35.1796%. This highlights the need for a spatially explicit discretization approach that fully considers the relationship between independent and dependent variables to more accurately capture the impact of spatial dependence on discretization.