$$ q = 1 - \frac{\sum\limits_{h = 1}^{L} N_h \sigma_h^2}{N \sigma^2} = 1 - \frac{\sum\limits_{h = 1}^{L} \sum\limits_{i = 1}^{N_h} \left( y_{hi} - \overline{y}_h \right)^2}{\sum\limits_{i = 1}^{N} \left( y_i - \overline{y} \right)^2} $$
Consider a linear regression model where the response variable y depends on k dummy variables D1, D2, …, Dk. Each dummy variable Dj indicates membership in a particular category or group, taking the value 1 if an observation belongs to category j and 0 otherwise. The model is expressed as:
$$ y_i = \beta_0 + \beta_1 D_{i1} + \beta_2 D_{i2} + \cdots + \beta_k D_{ik} + \epsilon_i $$
where:

- $y_i$ is the response variable for the $i$-th observation,
- $\beta_0$ is the intercept term,
- $\beta_j$ is the coefficient associated with dummy variable $D_j$,
- $D_{ij}$ is the value of dummy variable $D_j$ for the $i$-th observation, which is 1 if the $i$-th observation belongs to category $j$ and 0 otherwise,
- $\epsilon_i$ is the error term.
Each observation belongs to exactly one of the categories represented by the dummy variables: if the $i$-th observation belongs to category $j$, then $D_{ij} = 1$ and all other dummy variables are 0 for that observation.
Thus, the regression equation for observation i belonging to category j simplifies to:
$$ y_i = \beta_0 + \beta_j + \epsilon_i $$
This equation holds for all observations i that belong to category j.
The predicted value $\widehat{y}_j$ for an observation in category $j$ is the expected value of $y_i$ given that $D_{ij} = 1$ and the other dummy variables are 0. Since the error term $\epsilon_i$ has an expected value of 0, the predicted value for category $j$ is:

$$ \widehat{y}_j = \mathbb{E}\left( y_i \mid D_{ij} = 1 \right) = \beta_0 + \beta_j $$
In linear regression, the coefficients β0, β1, …, βk are estimated using the Ordinary Least Squares (OLS) method, which minimizes the sum of squared residuals:
$$ \text{Minimize } \sum_{i=1}^n (y_i - \widehat{y}_i)^2 $$
The OLS solution estimates each $\beta_j$ so that the predicted value $\widehat{y}_j = \beta_0 + \beta_j$ equals the mean of $y_i$ over all observations in category $j$. Specifically, OLS ensures that:
$$ \widehat{y}_j = \frac{1}{N_j} \sum_{i: D_{ij} = 1} y_i = \bar{y}_j $$
where Nj is the number of observations in category j, and ȳj is the mean response for category j.
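This property is easy to verify numerically. The sketch below is an illustration in Python with NumPy (independent of the R packages used later); the three category means are made-up values. Fitting OLS on an intercept plus one dummy per category reproduces each category's sample mean as its fitted value:

```python
import numpy as np

rng = np.random.default_rng(0)

# Three hypothetical categories with different means; each observation
# falls in exactly one category.
labels = rng.integers(0, 3, size=300)
y = np.array([1.0, 3.0, 7.0])[labels] + rng.normal(0, 0.5, size=300)

# Design matrix: intercept plus one dummy per category. This matrix is
# rank-deficient (the dummies sum to the intercept column), so lstsq
# returns the minimum-norm solution -- but the fitted values are still
# the unique projection of y onto the column space.
X = np.column_stack([np.ones_like(y)] +
                    [(labels == j).astype(float) for j in range(3)])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta

# The OLS prediction for every category equals that category's sample mean.
for j in range(3):
    assert np.allclose(y_hat[labels == j], y[labels == j].mean())
```

The same fitted values arise from any full-rank parameterization (for example, dropping one dummy as the reference category); only the individual coefficients change, not the predictions.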
$$ q = 1 - \frac{\sum\limits_{h = 1}^{L} N_h \sigma_h^2}{N \sigma^2} = 1 - \frac{\sum\limits_{h = 1}^{L} \sum\limits_{i = 1}^{N_h} \left( y_{hi} - \overline{y}_h \right)^2}{\sum\limits_{i = 1}^{N} \left( y_i - \overline{y} \right)^2} = 1 - \frac{\sum\limits_{i = 1}^{N} \left( y_i - \widehat{y}_i \right)^2}{\sum\limits_{i = 1}^{N} \left( y_i - \overline{y} \right)^2} = R^2 $$
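The identity can be checked numerically as well. The following sketch (again Python with NumPy, using hypothetical strata rather than any real data) computes the q-statistic from the within-strata and total sums of squares, then computes $R^2$ from the stratum-mean predictions, and confirms the two agree:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical stratification: 500 observations in L = 4 strata.
strata = rng.integers(0, 4, size=500)
y = np.array([2.0, 5.0, 5.5, 9.0])[strata] + rng.normal(0, 1.0, size=500)

# q-statistic: one minus the ratio of within-strata to total sum of
# squares (the N_h and N factors cancel when written with sums of squares).
ssw = sum(((y[strata == h] - y[strata == h].mean()) ** 2).sum()
          for h in range(4))
sst = ((y - y.mean()) ** 2).sum()
q = 1 - ssw / sst

# R^2 of the OLS regression on stratum dummies: the fitted value in each
# stratum is the stratum mean, so the residual sum of squares equals ssw.
y_hat = np.array([y[strata == h].mean() for h in range(4)])[strata]
r2 = 1 - ((y - y_hat) ** 2).sum() / sst

assert np.isclose(q, r2)
```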
library(sf)
## Linking to GEOS 3.12.1, GDAL 3.8.4, PROJ 9.3.1; sf_use_s2() is TRUE
library(sesp)
library(gdverse)
NTDs = sf::st_as_sf(gdverse::NTDs, coords = c('X','Y'))
system.time({
go1 = sesp(incidence ~ ., data = NTDs, discvar = 'none',
model = 'ols', overlay = 'intersection')
})
## user system elapsed
## 0.22 0.03 0.65
system.time({
go2 = gd(incidence ~ ., data = NTDs,
type = c('factor','interaction'))
})
## user system elapsed
## 0.04 0.00 0.08
go1$factor
## # A tibble: 3 × 5
## Variable Qvalue AIC BIC LogLik
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 watershed 0.638 -10.0 -10.0 15.0
## 2 elevation 0.607 1.18 1.18 7.41
## 3 soiltype 0.386 79.7 79.7 -33.8
go2$factor
## # A tibble: 3 × 3
## variable `Q-statistic` `P-value`
## <chr> <dbl> <dbl>
## 1 watershed 0.638 0.000129
## 2 elevation 0.607 0.0434
## 3 soiltype 0.386 0.372
go1$interaction
## # A tibble: 3 × 7
## Variable Interaction Qv1 Qv2 Qv12 Variable1 Variable2
## <chr> <chr> <dbl> <dbl> <dbl> <chr> <chr>
## 1 watershed ∩ elevation Enhance, bi- 0.638 0.607 0.714 watershed elevation
## 2 watershed ∩ soiltype Enhance, bi- 0.638 0.386 0.736 watershed soiltype
## 3 elevation ∩ soiltype Enhance, bi- 0.607 0.386 0.664 elevation soiltype
go2$interaction
## # A tibble: 3 × 6
## variable1 variable2 Interaction `Variable1 Q-statistics` `Variable2 Q-statistics`
## <chr> <chr> <chr> <dbl> <dbl>
## 1 watershed elevation Enhance, bi- 0.638 0.607
## 2 watershed soiltype Enhance, bi- 0.638 0.386
## 3 elevation soiltype Enhance, bi- 0.607 0.386
## # ℹ 1 more variable: `Variable1 and Variable2 interact Q-statistics` <dbl>