$$ q = 1 - \frac{\sum\limits_{h = 1}^{L} N_h \sigma_h^2}{N \sigma^2} = 1 - \frac{\sum\limits_{h = 1}^{L} \sum\limits_{i = 1}^{N_h} \left( y_{hi} - \overline{y}_h \right)^2}{\sum\limits_{i = 1}^{N} \left( y_i - \overline{y} \right)^2} $$
Consider a linear regression model where the response variable y depends on k dummy variables D1, D2, …, Dk. Each dummy variable Dj indicates membership in a particular category or group, taking the value 1 if an observation belongs to category j and 0 otherwise. The model is expressed as:
$$ y_i = \beta_0 + \beta_1 D_{i1} + \beta_2 D_{i2} + \cdots + \beta_k D_{ik} + \epsilon_i $$
where:

- $y_i$ is the response variable for the $i$-th observation,
- $\beta_0$ is the intercept term,
- $\beta_j$ is the coefficient associated with dummy variable $D_j$,
- $D_{ij}$ is the value of dummy variable $D_j$ for the $i$-th observation, which is 1 if the $i$-th observation belongs to category $j$ and 0 otherwise,
- $\epsilon_i$ is the error term.
Each observation belongs to exactly one of the categories represented by the dummy variables: if the $i$-th observation belongs to category $j$, then $D_{ij} = 1$ and all other dummy variables are 0 for that observation.
Thus, the regression equation for observation i belonging to category j simplifies to:
$$ y_i = \beta_0 + \beta_j + \epsilon_i $$
This equation holds for all observations i that belong to category j.
The predicted value $\widehat{y}_j$ for an observation in category $j$ is the expected value of $y_i$ given that $D_{ij} = 1$ and the other dummy variables are 0. Since the error term $\epsilon_i$ has an expected value of 0, the predicted value for category $j$ is:

$$ \widehat{y}_j = \mathbb{E}\left( y_i \mid D_{ij} = 1 \right) = \beta_0 + \beta_j $$
In linear regression, the coefficients β0, β1, …, βk are estimated using the Ordinary Least Squares (OLS) method, which minimizes the sum of squared residuals:
$$ \text{Minimize } \sum_{i=1}^n (y_i - \widehat{y}_i)^2 $$
The OLS solution estimates each $\beta_j$ so that the predicted value $\widehat{y}_j = \beta_0 + \beta_j$ equals the mean of $y_i$ over all observations in category $j$. Specifically, OLS ensures that:
$$ \widehat{y}_j = \frac{1}{N_j} \sum_{i: D_{ij} = 1} y_i = \bar{y}_j $$
where Nj is the number of observations in category j, and ȳj is the mean response for category j.
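This property is easy to verify numerically. The sketch below is an illustration in Python with NumPy (independent of the R packages used later); the three category means are made-up values. Fitting OLS on an intercept plus one dummy per category reproduces each category's sample mean as its fitted value:

```python
import numpy as np

rng = np.random.default_rng(0)

# Three hypothetical categories with different means; each observation
# falls in exactly one category.
labels = rng.integers(0, 3, size=300)
y = np.array([1.0, 3.0, 7.0])[labels] + rng.normal(0, 0.5, size=300)

# Design matrix: intercept plus one dummy per category. This matrix is
# rank-deficient (the dummies sum to the intercept column), so lstsq
# returns the minimum-norm solution -- but the fitted values are still
# the unique projection of y onto the column space.
X = np.column_stack([np.ones_like(y)] +
                    [(labels == j).astype(float) for j in range(3)])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta

# The OLS prediction for every category equals that category's sample mean.
for j in range(3):
    assert np.allclose(y_hat[labels == j], y[labels == j].mean())
```

The same fitted values arise from any full-rank parameterization (for example, dropping one dummy as the reference category); only the individual coefficients change, not the predictions.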
$$ q = 1 - \frac{\sum\limits_{h = 1}^{L} N_h \sigma_h^2}{N \sigma^2} = 1 - \frac{\sum\limits_{h = 1}^{L} \sum\limits_{i = 1}^{N_h} \left( y_{hi} - \overline{y}_h \right)^2}{\sum\limits_{i = 1}^{N} \left( y_i - \overline{y} \right)^2} = 1 - \frac{\sum\limits_{i = 1}^{N} \left( y_i - \widehat{y}_i \right)^2}{\sum\limits_{i = 1}^{N} \left( y_i - \overline{y} \right)^2} = R^2 $$
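The identity can be checked numerically as well. The following sketch (again Python with NumPy, using hypothetical strata rather than any real data) computes the q-statistic from the within-strata and total sums of squares, then computes $R^2$ from the stratum-mean predictions, and confirms the two agree:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical stratification: 500 observations in L = 4 strata.
strata = rng.integers(0, 4, size=500)
y = np.array([2.0, 5.0, 5.5, 9.0])[strata] + rng.normal(0, 1.0, size=500)

# q-statistic: one minus the ratio of within-strata to total sum of
# squares (the N_h and N factors cancel when written with sums of squares).
ssw = sum(((y[strata == h] - y[strata == h].mean()) ** 2).sum()
          for h in range(4))
sst = ((y - y.mean()) ** 2).sum()
q = 1 - ssw / sst

# R^2 of the OLS regression on stratum dummies: the fitted value in each
# stratum is the stratum mean, so the residual sum of squares equals ssw.
y_hat = np.array([y[strata == h].mean() for h in range(4)])[strata]
r2 = 1 - ((y - y_hat) ** 2).sum() / sst

assert np.isclose(q, r2)
```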
library(sf)
## Linking to GEOS 3.12.1, GDAL 3.8.4, PROJ 9.3.1; sf_use_s2() is TRUE
library(sesp)
library(gdverse)
NTDs = sf::st_as_sf(gdverse::NTDs, coords = c('X','Y'))
system.time({
go1 = sesp(incidence ~ ., data = NTDs, discvar = 'none',
model = 'ols', overlay = 'intersection')
})
## user system elapsed
## 0.22 0.03 0.65
system.time({
go2 = gd(incidence ~ ., data = NTDs,
type = c('factor','interaction'))
})
## user system elapsed
## 0.04 0.00 0.08
go1$factor
## # A tibble: 3 × 5
## Variable Qvalue AIC BIC LogLik
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 watershed 0.638 -10.0 -10.0 15.0
## 2 elevation 0.607 1.18 1.18 7.41
## 3 soiltype 0.386 79.7 79.7 -33.8
go2$factor
## # A tibble: 3 × 3
## variable `Q-statistic` `P-value`
## <chr> <dbl> <dbl>
## 1 watershed 0.638 0.000129
## 2 elevation 0.607 0.0434
## 3 soiltype 0.386 0.372
go1$interaction
## # A tibble: 3 × 7
## Variable Interaction Qv1 Qv2 Qv12 Variable1 Variable2
## <chr> <chr> <dbl> <dbl> <dbl> <chr> <chr>
## 1 watershed ∩ elevation Enhance, bi- 0.638 0.607 0.714 watershed elevation
## 2 watershed ∩ soiltype Enhance, bi- 0.638 0.386 0.736 watershed soiltype
## 3 elevation ∩ soiltype Enhance, bi- 0.607 0.386 0.664 elevation soiltype
go2$interaction
## # A tibble: 3 × 6
## variable1 variable2 Interaction `Variable1 Q-statistics` `Variable2 Q-statistics`
## <chr> <chr> <chr> <dbl> <dbl>
## 1 watershed elevation Enhance, bi- 0.638 0.607
## 2 watershed soiltype Enhance, bi- 0.638 0.386
## 3 elevation soiltype Enhance, bi- 0.607 0.386
## # ℹ 1 more variable: `Variable1 and Variable2 interact Q-statistics` <dbl>