Features

Joseph Luchman edited this page Aug 24, 2024 · 3 revisions

By default, domir() will use all the names parsed from the formula or formula_list as separate ‘value generating names’. This means that each name in the list is ascribed its own component of the returned value, and that this component is adjusted for the overlap the name has with other names. In practice, when applied to a predictive model, this means that an independent variable (i.e., a value generating name) is ascribed a component of the fit metric associated with the final model, and that the value ascribed to the independent variable is adjusted for covariance with the other independent variables in the model.

The default method associated with domir() is illustrated in the linear model-based dominance analysis below. This analysis compares, as predictors of sale price, two indexes of house size (First_Flr_SF and Second_Flr_SF), two indexes of the wear and tear of the house (Year_Built and Overall_Cond), and one index noting the design of the house (House_Style).

domir(
  Sale_Price ~ First_Flr_SF + Second_Flr_SF + Year_Built + Overall_Cond + House_Style,
  \(fml) {
    lm(fml, data = modeldata::ames) |>
      performance::r2() |>
      _[[1]]
  }, 
  .cdl = FALSE, .cpt = FALSE
)
## Overall Value:      0.7187646 
## 
## General Dominance Values:
##               General Dominance Standardized Ranks
## First_Flr_SF         0.33707532   0.46896486     1
## Second_Flr_SF        0.07858055   0.10932724     3
## Year_Built           0.17628733   0.24526437     2
## Overall_Cond         0.06582123   0.09157550     4
## House_Style          0.06100013   0.08486803     5

In the above analysis, all five value generating names/independent variables receive their own component of the $R^2$. It is also important to note that this dominance analysis involved the computation of $2^{5} = 32$ calls to the anonymous function that fit an lm() and extracted the $R^2$.
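
The averaging behind general dominance values can be sketched by hand in base R. This is only an illustration under stated assumptions, not domir()'s implementation: it uses the built-in mtcars data in place of the Ames data, three arbitrarily chosen predictors, and a hypothetical helper r2() that treats the sub-model with no names as having an $R^2$ of 0.

```r
# Hypothetical helper: R^2 of an lm() with the given predictors.
# Assumption: the sub-model with no predictors has an R^2 of 0.
r2 <- function(vars) {
  if (length(vars) == 0) return(0)
  summary(lm(reformulate(vars, response = "mpg"), data = mtcars))$r.squared
}

vars <- c("wt", "hp", "qsec")  # stand-ins for the Ames predictors
p <- length(vars)

general_dominance <- sapply(vars, function(v) {
  others <- setdiff(vars, v)
  # Average the increment from adding v within each sub-model size ...
  by_size <- sapply(0:(p - 1), function(k) {
    subsets <- if (k == 0) list(character(0)) else combn(others, k, simplify = FALSE)
    mean(sapply(subsets, function(s) r2(c(s, v)) - r2(s)))
  })
  # ... then average those results across sizes.
  mean(by_size)
})

general_dominance
sum(general_dominance)  # matches r2(vars), the full model's R^2
```

Each name's value is its average increment to the $R^2$ across all sub-models of the other names, which is why the values add up to the $R^2$ of the full model.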

There are several features of domir() that change this default behavior and produce a ‘constrained dominance analysis’ (Azen & Budescu, 2003) where a component of the dominance analysis is altered such that specific names are no longer compared to the other names in the same way.

High Priority Covariates

One way in which a user can estimate a constrained dominance analysis is using the .all argument. This option changes the manner in which the formula names are parsed such that names put into the .all argument are given priority for being ascribed the value returned by the function.

The process of increasing the priority of one or more names in the formula results in those increased priority names’ value being captured in a separate sub-model with none of the other names from the formula. In this way, the high priority names are not adjusted for overlap with one another (if multiple names are included in .all) or other names in the formula not included in .all.

Considering again the Ames housing example, we might not want a factor like the year the house was built to be ‘in competition’ with the other factors in the model, as conceptually it would not make sense for the research question to enter it into the model after any of the other factors. Thus, we are interested in evaluating what the other factors’ dominance results are when giving the year the house was built high priority in the model.

domir(
  Sale_Price ~ First_Flr_SF + Second_Flr_SF + Year_Built + Overall_Cond + House_Style,
  \(fml) {
    lm(fml, data = modeldata::ames) |>
      performance::r2() |>
      _[[1]]
  }, 
  .all = ~ Year_Built,
  .cdl = FALSE, .cpt = FALSE
)
## Overall Value:      0.7187646 
## All Subset Value:   0.3118397 
## 
## General Dominance Values:
##               General Dominance Standardized Ranks
## First_Flr_SF         0.26532070   0.36913436     1
## Second_Flr_SF        0.07214273   0.10037046     2
## Overall_Cond         0.02347367   0.03265835     4
## House_Style          0.04598774   0.06398165     3

When used as a high-priority name, Year_Built is removed from the generation of all combinations of names when creating all sub-models which results in a total of $2^4 + 1 = 17$ sub-models estimated where the added 1 is the sub-model associated with Year_Built (i.e., the names in .all).

When including high priority names in .all, the dominance analysis is constrained as these names are included in all other sub-models but the value associated with each of those sub-models is adjusted such that the value associated with the high priority names is subtracted from it. Note here that the $R^2$ value from the model lm(Sale_Price ~ Year_Built, data = modeldata::ames) is 0.3118397 or the all subset value.
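
The subtraction described above can be written out by hand. The sketch below assumes the built-in mtcars data, with wt playing the role of the high-priority name and two other predictors left free. adj_value() and the other object names are hypothetical and are not part of domir().

```r
# Hypothetical helper: R^2 of an lm() with the given predictors.
r2 <- function(vars) {
  summary(lm(reformulate(vars, response = "mpg"), data = mtcars))$r.squared
}

all_names  <- "wt"               # high-priority name, analogous to .all
free_names <- c("hp", "qsec")    # names still compared with one another

all_value <- r2(all_names)       # analogous to the 'All Subset Value'

# Every sub-model includes the high-priority name; its value is then
# reduced by the value of the high-priority name alone.
adj_value <- function(s) r2(c(all_names, s)) - all_value

full <- adj_value(free_names)

# With two free names, each general dominance value averages the
# increment when entering first and the increment when entering second.
gd <- c(
  hp   = mean(c(adj_value("hp"),   full - adj_value("qsec"))),
  qsec = mean(c(adj_value("qsec"), full - adj_value("hp")))
)

gd
sum(gd) + all_value  # matches the R^2 of the model with all three names
```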

The dominance statistics produced when using a high-priority name are also adjusted for the all subset value. As can be seen in the Standardized column, the standardized values no longer sum to 1 because a component of the overall value is ascribed to the all subset value.
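
The arithmetic can be checked directly from the printed results. The values below are copied from the output above; the object names are only illustrative.

```r
# Values copied from the printed output of the .all example.
overall   <- 0.7187646
all_value <- 0.3118397
general <- c(
  First_Flr_SF  = 0.26532070,
  Second_Flr_SF = 0.07214273,
  Overall_Cond  = 0.02347367,
  House_Style   = 0.04598774
)

# Standardized values are the general dominance values over the overall value.
standardized <- general / overall

sum(general) + all_value   # about 0.7187646, the overall value
sum(standardized)          # about 0.566, not 1
all_value / overall        # about 0.434, the share ascribed to Year_Built
```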

The use of high priority names in .all then very literally ‘gives priority’ to these names by including them first in the model and not adjusting the value they obtain for any other names. All other names’ values, however, are adjusted for the high-priority names’ value.

Equal Priority Unions/Groups

Another way in which dominance analysis can be constrained is in using the .set argument. This option, like .all, changes the manner in which the formula names are parsed. .set differs from .all as .set takes a list of separate formulas and binds the names in the separate formulas together to form a union or group of those names.

The process of generating unions of names is that all the names in the union are given equal priority and are always included together in a sub-model. This results in there being no adjustments for overlap between the names within a union but these names are adjusted for overlap with other names or name unions.

In the Ames housing example, we might reconsider the role of Year_Built by giving it equal priority with Overall_Cond as indicators of the condition of the home. Similarly, we might give the two square footage variables equal priority as indicators of the size of the home. In making unions of these variables, we are effectively using them as though they are a single name in the dominance analysis. In fact, by naming the formulas in the list, they will literally have a shared name in the analysis.

domir(
  Sale_Price ~ First_Flr_SF + Second_Flr_SF + Year_Built + Overall_Cond + House_Style,
  \(fml) {
    lm(fml, data = modeldata::ames) |>
      performance::r2() |>
      _[[1]]
  }, 
  .set = 
    list(size = ~ First_Flr_SF + Second_Flr_SF, condit = ~ Year_Built + Overall_Cond),
  .cdl = FALSE, .cpt = FALSE
)
## Overall Value:      0.7187646 
## 
## General Dominance Values:
##             General Dominance Standardized Ranks
## House_Style        0.03713255   0.05166163     3
## size               0.45649084   0.63510484     1
## condit             0.22514116   0.31323353     2

When unionizing, the number of combinations needed to create all sub-models is based on the number of individual names and unions of names, here $2^3 = 8$ sub-models. Combining names in this way can substantially reduce the number of sub-models required to compute dominance statistics and make dominance designations.
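
Treating each union as a single unit can be sketched in base R. The example assumes the built-in mtcars data and arbitrary groupings. The units list and the value() helper are hypothetical stand-ins for what .set expresses, not domir()'s internals, and the sub-model with no units is assumed to have an $R^2$ of 0.

```r
# Hypothetical helper: R^2 of an lm() with the given predictors.
r2 <- function(vars) {
  if (length(vars) == 0) return(0)  # assumption: no predictors gives 0
  summary(lm(reformulate(vars, response = "mpg"), data = mtcars))$r.squared
}

# Two unions and one individual name, analogous to a .set list.
units <- list(size = c("wt", "disp"), power = c("hp", "qsec"), gear = "gear")
ids <- names(units)
p <- length(units)

# A sub-model is defined by which units are selected; all names in a
# selected unit enter the model together.
value <- function(sel) r2(unlist(units[sel], use.names = FALSE))

gd <- sapply(ids, function(u) {
  others <- setdiff(ids, u)
  mean(sapply(0:(p - 1), function(k) {
    subsets <- if (k == 0) list(character(0)) else combn(others, k, simplify = FALSE)
    mean(sapply(subsets, function(s) value(c(s, u)) - value(s)))
  }))
})

gd
2^p      # 8 combinations of three units, counting the empty one
sum(gd)  # matches the R^2 of the model with all five predictors
```

Five predictors taken individually would give $2^5 = 32$ combinations, so the grouping here cuts the count to a quarter.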

Comparing these results to the first dominance analysis with no name unions, the results are broadly consistent with what would be expected from summing the general dominance statistics: First_Flr_SF and Second_Flr_SF would have had the highest value when summed, followed by the sum of Year_Built and Overall_Cond, and then House_Style. The value obtained by the size union in the unionized dominance analysis is, however, larger than would be expected given the separate values obtained by each of the names that comprise it. Correspondingly, the values obtained by the condit union and House_Style are smaller than in the non-unionized analysis.
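
The comparison can be reproduced from the two printed result tables. The values below are copied from the outputs above; the object names are only illustrative.

```r
# General dominance values from the analysis without unions.
separate <- c(
  First_Flr_SF  = 0.33707532,
  Second_Flr_SF = 0.07858055,
  Year_Built    = 0.17628733,
  Overall_Cond  = 0.06582123,
  House_Style   = 0.06100013
)

# General dominance values from the analysis using .set.
unioned <- c(size = 0.45649084, condit = 0.22514116, House_Style = 0.03713255)

# What each union would have received by simply summing its members.
summed <- c(
  size        = sum(separate[c("First_Flr_SF", "Second_Flr_SF")]),
  condit      = sum(separate[c("Year_Built", "Overall_Cond")]),
  House_Style = separate[["House_Style"]]
)

difference <- unioned - summed
difference       # positive for size, negative for condit and House_Style
sum(difference)  # about 0, as both analyses decompose the same overall value
```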

The differences between these unionized and non-unionized results stem from how the equal prioritization given to several of the variables affects the overall extent of overlap between names and name unions. The key point to note is that names placed into a union are not adjusted for overlap with one another. Ignoring overlap between names has a greater impact when the names are more strongly correlated.

more to come
