03-methods.qmd

# Model Design and Methodology {#sec-methods}


```{r packages, file = "R/chapter_start.R", include = FALSE, cache = FALSE}
library(tidyverse)
library(targets)
library(kableExtra)
```

## Model Design

To examine the effects of parameter input sensitivity, we adapted a
trip-based travel demand model from the Roanoke
Valley Transportation Planning Organization
[(RVTPO)](https://github.com/xinwangvdot/rvtpo). The RVTPO model
provides an ideal testing environment for this research because it uses
an integrated mode and destination choice framework common in more
advanced trip-based models. At the same time, its small size
(approximately 215 zones) means the entire model runs in a few minutes 
and thus allows for efficient testing of multiple model runs.

The total passenger trips $T$ traveling from zone $i$ to zone $j$ on the
highway in a period $t$ is 
$$
T_{ijt} = P_i * \mathcal{P}_{\mathrm{auto}}(\beta, C_{ijt}) * \mathcal{P}_j(\gamma, A_j, MCLS_{ijt}) * \Delta_t 
$$ {#eq-trips} 
where $P$ is the productions at zone $i$;
$\mathcal{P}_{\mathrm{car}}$ is the car mode choice probability
determined by utility parameters $\beta$ and the travel costs $C$
between $i$ and $j$ at time period $t$; $\mathcal{P}_{j}$ is the
destination choice probability of choosing destination $j$ given the
utility parameters $\gamma$, attractions $A$, and the impedance as the
mode choice model logsum $MCLS_{ijt}$. A time-of-day and direction
factor $\Delta$ finalizes the total assigned trips.

The productions $P_i$, and attractions $A_j$ were extracted from the
RVTPO Model and held constant. The attractions are determined from the
socioeconomic (SE) data. The SE data included information by TAZ for the
total population, number of households, total workers, and workers by
employment type. The trip productions are organized by TAZ and trip
purpose. The trip purposes used in this model are Home Based Work (HBW),
Home Based Other (HBO), Non-Home Based (NHB), Commercial Vehicles (CV),
Internal-External (IXXI), and External-External (XX). Only the first
three are analysed, but all of the purposes are assigned to the network.
CV, IXXI, and XX trips were kept fixed for this analysis.

The two parameter vectors $\beta$ and $\gamma$ describe the mode choice
model and destination choice model coefficients, respectively. Mode
choice estimates how many trips from $i$ to $j$ will happen on each
available mode $k$.This model analyses three modes of transportation:
auto, non-motorized, and transit. The mode by which a trip is made is
determined by calculated utilities for the three modes. These utilities
take inputs from parameter values and time and distance skims $X$. Skims
are either the time or distance to travel between zone pairs. Travel
time for auto used the single occupancy vehicle peak time, non-motorized
travel time used the distance skim multiplied by a factor of average
walking speed (3 mph), and transit time used the walk to bus peak time.
The mode choice parameters (constants and coefficients) were also
obtained from the RVTPO model. These values are shown in
@tbl-choicecoeff.

```{r choicecoeff, echo = FALSE}
#| label: tbl-choicecoeff
#| tbl-cap: Choice model parameters
tar_load(mc_coeff)
tar_load(mc_const)
tar_load(dc_coeff)


coefficient_table <- bind_rows(
  mc_coeff |> mutate(model = "Mode choice") |> 
    select(Name, HBW, HBO, NHB) |> 
    filter(Name %in% c("CIVTT", "CCOST", "CWALK1", "AUTOCOST")) |> 
    mutate(Name = c("In-vehicle travel time", "Travel cost", "Walk distance", "Auto operating cost (cents/mile)")) |> 
    mutate(variable = c("$\\beta_{ivtt}$", "$\\beta_{tc}$", "$\\beta_{wd}$", "$\\beta_{ac}$")),
  mc_const |> mutate(model = "Mode choice") |> 
    select(Name, HBW, HBO, NHB) |> 
    filter(Name %in% c("K_TRN", "K_NMOT")) |> 
    mutate(Name = c("Transit constant", "NonMotorized constant")) |> 
    mutate(variable = c("$k_{trn}$", "$k_{nmot}$")) ,
  dc_coeff |> mutate(model = "Destination choice") |> 
    select(Name = VAR, HBW, HBO, NHB) |> 
    filter(Name %in% c("HH", "I(OTH_EMP + OFF_EMP)", "OFF_EMP", "OTH_EMP","RET_EMP")) |> 
    mutate(Name = c("Households", "Other + Office", "Office", "Other", "Retail")) |> 
    mutate(variable = c("$\\gamma_{hh}$", "$\\gamma_{oth + off}$", "$\\gamma_{off}$", "$\\gamma_{oth}$", "$\\gamma_{ret}$"))
  )%>%
  select(c(Name, variable, HBW, HBO, NHB))
kbl(coefficient_table, booktabs = TRUE, 
    col.names = c("", "Variable","HBW", "HBO", "NHB"), digits = 4, escape = F) %>%
  pack_rows(index = c("Mode Choice Coefficients" = 4, "Mode Choice Constants" = 2, "Destination Choice Parameters" = 5))

```

The utility equations for the mode choice model are as follows: 
$$
\begin{aligned}
U_{auto} &= \beta_{ivtt} * X_{auto} + \beta_{tc} * \beta_{ac} * X_{dist}\\
U_{nmot} &= k_{nmot} + 20 * \beta_{wd}*X_{nmot}\\
U_{trn} &= k_{trn} + \beta_{ivtt} * X_{trn}
\end{aligned}
$$ 
These utilities are used to calculate the MCLS by: 
$$
MCLS_{ij} = \ln\left(\sum_{k \in K} e^{U_{ijk}}\right).
$$ {#eq-mcls}
If the distance was greater than 2 miles, non-motorized
travel was excluded.

This logsum value is then used as the primary impedance for a
destination choice model [@ben-akiva1985]. Destination choice estimates
travel patterns based on mode choice, trip generators (workers and
households), and destination choice parameters. These parameter values
are also shown in @tbl-choicecoeff. The destination choice utility is
the primary impedance (mode choice logsum value) plus the natural log of
the size term, where the sized term is calculated as: 
$$
A_j = \gamma_{hh} * \mathrm{HH} + \gamma_{off} * \mathrm{OFF} + \gamma_{ret} * \mathrm{RET} + \gamma_{oth} * \mathrm{OTH} + \\ \gamma_{oth+off} * \mathrm{OFFOTH}
$$ {#eq-dcsizeterm} 
HH is the total households in zone $j$. OFF, RET,
and OTH are the jobs in zone $j$ by employment type office, retail, and
other respectively. The destination choice utility is then transformed
into a destination choice logsum value with: 
$$
DCLS = \ln \left(\sum_{j \in J} e^{\ln(A_j) + 1* MCLS_{ij}}\right)
$$ {#eq-dcls}

The probability of both the mode choice and destination choice are
calculated using the exponentiated utility divided by the corresponding
logsum. These probabilities in conjunction with the trip productions can
calculate the number of production-attraction (PA) trips between each
zone by each mode and purpose. The auto trips are calculated by
multiplying the probability of the destination by PA pair, the
productions for each origin, and the probability of an auto mode choice
by PA pair. This results in PA auto trips. The same process is followed
for the other two modes. These PA trips are converted into origin
destination (OD) trips by multiplying the trips by corresponding time of
day factors (see #eq-trips). These trips are calculated using Bentley's
CUBE and the RVTPO model. The trips, by time period, are assigned to the
highway network by the shortest path by time using free flow speed and
with link capacity as a restriction.

## Uncertainty Design

Within the mode and destination models there exists uncertainty within
the parameters in @tbl-choicecoeff. Sampling methods can take the
defined uncertainty and choose potential parameter values within the
possible range. Two common methods for parameter sampling include, Monte
Carlo (MC) simulation and Latin hypercube sampling (LHS). MC simulation
draws independently from multiple distributions, while LHS makes draws
that cover the parameter space more efficiently and can capture the
joint distribution between two or more parameter values [@helton2003].
As a result, LHS can reduce the number of draws needed to fully
re-create the statistical variance in a model, but the amount of
reduction is unknown and may not be universal to all problems
[@yang2013].

With the trip-based model described above, MC and LHS methods were used
to develop alternative parameter sets to evaluate uncertainty. To
identify a standard deviation for each parameter, a coefficient of
variation was used. A set coefficient of variation of 0.10 was used for
the four mode choice coefficients and the destination choice parameters.
The mode choice constants were kept the same across all iterations.
Literature had identified a coefficient of variation of 0.30, but for
this analysis that caused an unrealistic value of time, and thus it was
changed to be 0.10 [@zhao2002]. Value of time is a ratio in units of
money per time that should be compared to the regional wage rate. Using
a $c_v$ of 0.30 the value of time range was from $\$2$ to $\$32$ /hr,
whereas using a $c_v$ of 0.10 the range was $\$6$ to $\$14$ /hr. The
latter seemed more rational because it is related to wage rates and thus
a $c_v$ of 0.10 was used for our analysis. The standard deviation was
equal to 0.10 multiplied by the mean, where the mean values in this
situation are the base scenario parameters (as identified in
@tbl-choicecoeff ).

The MC random sampling uses the R function of `rnorm`. LHS uses the
`lhs` package in R. Since this package only chooses variables on a zero
to one scale, the values given use a function to put the random sampling
on the right scale needed for the given parameter. The full code for
both methods can be found in a public [GitHub
repository](https://github.com/natmaegray/sensitivity_thesis). One
hundred and 600 draws of random samples for both methods are generated.
With these generated parameters, the mode choice model step was run for
every set of input parameters for each purpose. The average MCLS value
for each run was determined to compare each continuous draw. This
allowed us to see how many iterations of which sampling type would be
sufficient to show a full range of possible outcomes.

The parameters generated were compared for both sampling methods.
@fig-parameter shows the distributions for the HBW parameters when using
100 and 600 draws. These distributions show that LHS gives normally
distributed parameters with fewer draws than MC sampling: at 100 draws
LHS shows a nearly perfect normal distribution, where there are some
discrepancies for the MC generated parameters. These Figures show that
LHS is likely to estimate the full variance of the results with fewer
draws.

```{r parameter100, echo = FALSE}
#| label: fig-parameter
#| fig-cap: "Sampled mode and destination choice parameters for HBW trip purpose."
#| fig-subcap:
#|   - "100 draws"
#|   - "600 draws"
#| layout-nrow: 2
tar_load(hbw_mc_coeff_lists_100)
list100 <- hbw_mc_coeff_lists_100

Z100 <- bind_rows(list100) %>% 
  mutate(method = c("base", rep("MC", 100), rep("LHS", 100))) %>% 
  mutate(vot = (ivtt/ccost) * (60/100),
         ccost = 100 * ccost) %>%
  pivot_longer(cols = -c("method"))

facet_names <- c('autocost' = "Auto operating cost",
                 'ccost' = "Travel cost (per $)",
                 "ivtt" = "In-vehicle travel time",
                 "vot" = "Value of Time",
                 "walk1" = "Walk distance")

z1 <- Z100 %>%
  filter(!(method %in% "base")) %>%
  filter(!(name %in% c("k_nmot", "k_trn", "walk2"))) %>%
  ggplot() +
  aes(x = value, color = method, fill = method) +
  geom_density(alpha = 0.25) +
  scale_fill_hue(direction = 1) +
  labs(y = "Density",
       x = "Value") +
  theme_bw() +
  theme(legend.position = "bottom") +
  facet_wrap(vars(name), scales = "free", labeller = as_labeller(facet_names)) + 
  scale_y_continuous(breaks = )

tar_load(hbw_mc_coeff_lists_600)
list600 <- hbw_mc_coeff_lists_600

Z600 <- bind_rows(list600) %>% 
  mutate(method = c("base", rep("MC", 600), rep("LHS", 600))) %>%
  mutate(vot = (ivtt/ccost) * (60/100),
         ccost = 100 * ccost) %>%
  pivot_longer(cols = -c("method"))


z6 <- Z600 %>%
  filter(!(method %in% "base")) %>%
    filter(!(name %in% c("k_nmot", "k_trn", "walk2"))) %>%
  ggplot() +
  aes(x = value, color = method, fill = method) +
  geom_density(alpha = 0.25) +
  scale_fill_hue(direction = 1) +
  labs(y = "Density",
       x = "Value") +
  theme_bw() +
  theme(legend.position = "bottom") +
  facet_wrap(vars(name), scales = "free", labeller = as_labeller(facet_names))

z1
z6
```

To determine if LHS is effective at a reasonable amount of iterations,
the cumulative mean and the cumulative standard deviation of the average
MCLS value for every zone (see @eq-mcls ) was calculated for each
additional draw for both sampling methods. MCLS is an impedance term
which is an important value for destination choice and region routing.
The average MCLS, $x$, was used as a measure of outcome possibilities to
simplify a complex term as a single value to compare by across all
iterations. The cumulative mean is calculated as: $$
\mu_i = \frac{x_1 + ... + x_i}{n}
$$ {#eq-cmclsmean} and the cumulative standard deviation is calculated
as: $$
SD_i = \sqrt{\frac{\sum (x_i - \mu_i)^2 }{n-1}}.
$$ {#eq-sdi} The cumulative mean shows how the average MCLS stabilizes
across each iteration, and the cumulative standard deviation is used to
show the 95% confidence interval of that mean. When the cumulative mean
for the draws stabilizes, that shows that the amount of generated
parameters has captured the possible variance of the results. This is
shown for two of the three trip purposes in @fig-cm.

```{r fig-cm}
#| label: fig-cm
#| layout-ncol: 1
#| layout-nrow: 2
#| fig-cap: Average mode choice logsum (impedance) cumulative mean and 95% confidence interval with 100 and 600 draws.
#| fig-subcap: 
#|   - "HBW"
#|   - "HBO"
#| fig-env: sidewaysfigure

## HBW
tar_load(hbw_stats_100)
tar_load(hbw_stats_600)
hbw_stats <- bind_rows(hbw_stats_100, hbw_stats_600, .id = "ndraws") %>%
  mutate(ndraws = case_when(ndraws == "1" ~ "100 Draws",
                          ndraws == "2" ~ "600 Draws"))

hbw_cm <- hbw_stats %>%
  filter(type != "base") %>%
  ggplot() +
    aes(x = draw, y = cummean, ymin = cummean - 1.96*cumvar, ymax = cummean + 1.96*cumvar, colour = type, fill = type, group = type) +
    geom_ribbon(alpha = 0.2, colour = NA) +
    geom_line(size = 0.5) +
#    ylim(0.025, 0.15) +
    labs(x = "Draw", 
         y = "Cumulative Mean of Average MCLS",
         color = "method") +
    scale_color_hue(direction = 1) +
    theme_bw() +
    theme(legend.position = "bottom") +
    facet_wrap(vars(ndraws), scales = "free_x")

## HBO
tar_load(hbo_stats_100)
tar_load(hbo_stats_600)
hbo_stats <- bind_rows(hbo_stats_100, hbo_stats_600, .id = "ndraws") %>%
  mutate(ndraws = case_when(ndraws == "1" ~ "100 Draws",
                          ndraws == "2" ~ "600 Draws"))

hbo_cm <- hbo_stats %>%
  filter(type != "base") %>%
  ggplot() +
    aes(x = draw, y = cummean, ymin = cummean - 1.96*cumvar, ymax = cummean + 1.96*cumvar, colour = type, fill = type, group = type) +
    geom_ribbon(alpha = 0.2, colour = NA) +
    geom_line(size = 0.5) +
#    ylim(0.025, 0.15) +
    labs(x = "Draw", 
         y = "Cumulative Mean of Average MCLS",
         color = "method") +
    scale_color_hue(direction = 1) +
    theme_bw() +
    theme(legend.position = "bottom") +
    facet_wrap(vars(ndraws), scales = "free_x")

## NHB
tar_load(nhb_stats_100)
tar_load(nhb_stats_600)
nhb_stats <- bind_rows(nhb_stats_100, nhb_stats_600, .id = "ndraws") %>%
  mutate(ndraws = case_when(ndraws == "1" ~ "100 Draws",
                          ndraws == "2" ~ "600 Draws"))

nhb_cm <- nhb_stats %>%
  filter(type != "base") %>%
  ggplot() +
    aes(x = draw, y = cummean, ymin = cummean - 1.96*cumvar, ymax = cummean + 1.96*cumvar, colour = type, fill = type, group = type) +
    geom_ribbon(alpha = 0.2, colour = NA) +
    geom_line(size = 0.5) +
#    ylim(0.025, 0.15) +
    labs(x = "Draw", 
         y = "Cumulative Mean of Average MCLS",
         color = "method") +
    scale_color_hue(direction = 1) +
    theme_bw() +
    theme(legend.position = "bottom") +
    facet_wrap(vars(ndraws), scales = "free_x")

## Plot
hbw_cm
hbo_cm
```

For all three trip purposes, both sampling methods had a stabilized mean
by 100 draws. The LHS methods standard deviation ribbon was generally
thinner than the MC method. From the narrowed cumulative standard
deviation, and that the parameter values are better normally distributed
when using LHS, that method of sampling was used for the assignment
analysis of the model. Since LHS captures the possible variance at a
small enough number of iterations, it can be used for large
transportation demand models. From these results it was decided to use
100 LHS samples parameters to evaluate uncertainty within each step of
the model. The next chapter includes the results of applying these
sampled parameters to the travel demand model.