Features
By default, domir() will use all the names parsed from the formula or
formula_list as separate ‘value generating names’. This means that
each name in the list is ascribed its own component of the returned
value and that this value is adjusted for the overlap the name has with
other names. In practice, when applied to a predictive model, this means
that an independent variable (i.e., value generating name) is ascribed a
component of the fit metric associated with the final model and that the
value ascribed to the independent variable is adjusted for covariance with
other independent variables in the model.
The default method associated with domir() is illustrated in the
linear model-based dominance analysis below. This analysis compares how
well sale price is predicted by two indexes of house size (First_Flr_SF and
Second_Flr_SF), two indexes of the wear and tear of the house
(Year_Built and Overall_Cond), and one index noting the design of
the house (House_Style).
domir(
  Sale_Price ~ First_Flr_SF + Second_Flr_SF + Year_Built + Overall_Cond + House_Style,
  \(fml) {
    lm(fml, data = modeldata::ames) |>
      performance::r2() |>
      _[[1]]
  },
  .cdl = FALSE, .cpt = FALSE
)
## Overall Value:      0.7187646
##
## General Dominance Values:
##               General Dominance Standardized Ranks
## First_Flr_SF         0.33707532   0.46896486     1
## Second_Flr_SF        0.07858055   0.10932724     3
## Year_Built           0.17628733   0.24526437     2
## Overall_Cond         0.06582123   0.09157550     4
## House_Style          0.06100013   0.08486803     5
In the above analysis, all five value generating names/independent
variables receive their own component of the R². Each sub-model was fit
with lm() and the R² was extracted from it as the returned value.
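To make the idea of ‘all sub-models’ concrete, the following base R sketch enumerates every non-empty combination of the five names in the example formula. It only illustrates the combinatorics and is not a description of how domir() builds its sub-models internally; the object names rhs_names and submodels are chosen here for illustration.

```r
# Names taken from the right-hand side of the example formula.
rhs_names <- c("First_Flr_SF", "Second_Flr_SF", "Year_Built",
               "Overall_Cond", "House_Style")

# For each subset size k, build one formula per combination of k names,
# then concatenate the per-size lists into a single list of formulas.
submodels <- do.call(
  c,
  lapply(seq_along(rhs_names), function(k) {
    combn(
      rhs_names, k,
      FUN = function(nms) reformulate(nms, response = "Sale_Price"),
      simplify = FALSE
    )
  })
)

# Number of non-empty combinations of five names: 2^5 - 1.
length(submodels)
```

Each element of submodels is a formula that could be passed to the same function supplied to domir().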
There are several features of domir() that change this default
behavior and produce a ‘constrained dominance analysis’ (Azen & Budescu,
2003) where a component of the dominance analysis is altered such that
specific names are no longer compared to the other names in the same
way.
One way in which a user can estimate a constrained dominance analysis is
using the .all argument. This option changes the manner in which the
formula names are parsed such that names put into the .all argument
are given priority for being ascribed the value returned by the
function.
Increasing the priority of one or more names in the formula results in
the value of those high-priority names being captured in a separate
sub-model with none of the other names from the formula. In this way,
the high-priority names are not adjusted for overlap with one another
(if multiple names are included in .all) or with other names in the
formula not included in .all.
Considering again the Ames housing example, we might not want a factor like the year the house was built to be ‘in competition’ with the other factors in the model. Conceptually, for the research question, it would not make sense to add it to the model after any of the other factors. Thus, we are interested in the other factors’ dominance results when the year the house was built is given high priority in the model.
domir(
  Sale_Price ~ First_Flr_SF + Second_Flr_SF + Year_Built + Overall_Cond + House_Style,
  \(fml) {
    lm(fml, data = modeldata::ames) |>
      performance::r2() |>
      _[[1]]
  },
  .all = ~ Year_Built,
  .cdl = FALSE, .cpt = FALSE
)
## Overall Value:      0.7187646
## All Subset Value:   0.3118397
##
## General Dominance Values:
##               General Dominance Standardized Ranks
## First_Flr_SF         0.26532070   0.36913436     1
## Second_Flr_SF        0.07214273   0.10037046     2
## Overall_Cond         0.02347367   0.03265835     4
## House_Style          0.04598774   0.06398165     3
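When used as a high-priority name, Year_Built is removed from the
generation of all combinations of names when creating all sub-models,
which results in a total of 2^4 - 1 = 15 combinations of the four
remaining names (plus the sub-model that contains only the names in .all).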
When including high priority names in .all, the dominance analysis is
constrained as these names are included in all other sub-models but
the value associated with each of those sub-models is adjusted such that
the value associated with the high priority names is subtracted from it.
Note here that the R² of lm(Sale_Price ~ Year_Built, data = modeldata::ames)
is 0.3118397, which is the all subset value.
The dominance statistics produced when using a high-priority name are also adjusted for the all subset value and, as can be seen with the Standardized result vector, no longer sum to 1 as a component of the overall value is ascribed to the all subset value.
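The accounting described above can be checked with simple arithmetic on the values printed in the output. The sketch below copies those numbers by hand (so they carry the rounding of the printed output) and assumes, consistent with the printed results, that the standardized values are the general dominance values divided by the overall value.

```r
# Values copied from the printed output of the .all example.
overall    <- 0.7187646
all_subset <- 0.3118397
general <- c(First_Flr_SF  = 0.26532070,
             Second_Flr_SF = 0.07214273,
             Overall_Cond  = 0.02347367,
             House_Style   = 0.04598774)

# General dominance values plus the all subset value recover the overall value.
sum(general) + all_subset

# Standardized values are shares of the overall value; they sum to less than 1
# because the remaining share belongs to the all subset value.
standardized <- general / overall
sum(standardized)
all_subset / overall
sum(standardized) + all_subset / overall
```

The last line comes out at 1 up to rounding, which is the sense in which part of the overall value is ascribed to the all subset value rather than to the remaining names.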
The use of high priority names in .all then very literally ‘gives
priority’ to these names by including them first in the model and not
adjusting the value they obtain for any other names, but all other names’
values are adjusted for the high-priority names’ value.
Another way in which dominance analysis can be constrained is in using
the .set argument. This option, like .all, changes the manner in
which the formula names are parsed. .set differs from .all as .set
takes a list of separate formulas and binds the names in the separate
formulas together to form a union or group of those names.
When a union of names is generated, all the names in the union are given equal priority and are always included together in a sub-model. As a result, there are no adjustments for overlap between the names within a union, but these names are adjusted for overlap with other names or name unions.
In the Ames housing example, we might reconsider the role of Year_Built by giving it equal priority with Overall_Cond as indicators of the condition of the home. Similarly, we might give the two square footage variables equal priority as indicators of the size of the home. In making unions of these variables, we are effectively using them as though they are a single name in the dominance analysis. In fact, by naming the formulas in the list, they will literally have a shared name in the analysis.
domir(
  Sale_Price ~ First_Flr_SF + Second_Flr_SF + Year_Built + Overall_Cond + House_Style,
  \(fml) {
    lm(fml, data = modeldata::ames) |>
      performance::r2() |>
      _[[1]]
  },
  .set =
    list(size = ~ First_Flr_SF + Second_Flr_SF, condit = ~ Year_Built + Overall_Cond),
  .cdl = FALSE, .cpt = FALSE
)
## Overall Value:      0.7187646
##
## General Dominance Values:
##             General Dominance Standardized Ranks
## House_Style        0.03713255   0.05166163     3
## size               0.45649084   0.63510484     1
## condit             0.22514116   0.31323353     2
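When unionizing, the number of combinations of names needed to create all
sub-models is determined by the number of individual names and unions of
names. Here there is one individual name and two unions, or
2^3 - 1 = 7 combinations.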
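As a rough illustration of how unions change the set of sub-models, the base R sketch below treats each union as a single unit whose names always enter a formula together. The list units mirrors the grouping passed to .set in the example; the code is an assumption-laden sketch of the idea, not domir()'s internal implementation.

```r
# Each unit is either a single name or a union of names that enter together.
units <- list(
  House_Style = "House_Style",
  size        = c("First_Flr_SF", "Second_Flr_SF"),
  condit      = c("Year_Built", "Overall_Cond")
)

# All non-empty combinations of units (not of the individual names).
unit_combos <- do.call(
  c,
  lapply(seq_along(units), function(k) {
    combn(names(units), k, simplify = FALSE)
  })
)

# Expand each combination of units into a sub-model formula.
union_submodels <- lapply(unit_combos, function(u) {
  reformulate(unlist(units[u], use.names = FALSE), response = "Sale_Price")
})

# Number of sub-models with three units: 2^3 - 1.
length(union_submodels)
```

Because First_Flr_SF and Second_Flr_SF only ever appear together here, no sub-model exists that could separate their contributions from one another.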
Comparing these results to the first dominance analysis with no name unions, the ordering is consistent with what would be expected from summing the general dominance statistics: First_Flr_SF and Second_Flr_SF would have had the highest value when summed, followed by the sum of Year_Built and Overall_Cond, and then House_Style. The value obtained by the size union in the unionized dominance analysis is, however, larger than would be expected given the separate values obtained by each of the names that comprise it. Correspondingly, the values obtained by the condit union and House_Style were smaller than in the non-unionized analysis.
The differences between these unionized and non-unionized results stem from how the equal prioritization given to several of the variables affects the overall extent of overlap between names and name unions. The key point to note is that names placed into a union are not adjusted for overlap between one another. Ignoring overlap between names is more impactful when the names are more strongly correlated.
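The comparison between the two analyses can be reproduced from the printed output alone. The sketch below copies the general dominance values by hand (with the rounding of the printed output) and compares each union's value with the sum of its members' values from the analysis without unions.

```r
# General dominance values from the analysis without unions.
separate <- c(First_Flr_SF  = 0.33707532,
              Second_Flr_SF = 0.07858055,
              Year_Built    = 0.17628733,
              Overall_Cond  = 0.06582123,
              House_Style   = 0.06100013)

# General dominance values from the analysis with unions.
unionized <- c(House_Style = 0.03713255,
               size        = 0.45649084,
               condit      = 0.22514116)

# What each unit would receive if the separate values were simply summed.
summed <- c(
  House_Style = separate[["House_Style"]],
  size        = sum(separate[c("First_Flr_SF", "Second_Flr_SF")]),
  condit      = sum(separate[c("Year_Built", "Overall_Cond")])
)

# Positive: the unit gained value under unionization; negative: it lost value.
diffs <- unionized - summed
diffs
```

The size union gains roughly 0.04 while condit and House_Style lose value, and the differences cancel out because both analyses decompose the same overall value.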
more to come