The excel file named "schistosomiasis_count_data.xlsx" contains the demographic features and the "Dataset.xlsx" is the dataset after extra columns where certain variables were grouped and it is what was used for analysis.
The Rmd file contains the model codes used. Six distinct models under different scenarios were used.
The python (jupyter) file contains codes used for slicing dataframes, descriptive statistics and fugure plotting
The standard Poisson, negative binomial (NB), zero-inflated Possion (ZIP), zero-inflated negative binomial (ZINB), hurdle Poisson (HP), hurdle negative binomial (HNB) models are the various models used.
Their corresponding R syntaxes are showing the table that follows below
Model | Syntax |
---|---|
Poisson | glm("dependent variable ~ x1 + x2 + ... + xn", family = 'poisson', data = count_data) |
NB | glm("dependent variable ~ x1 + x2 + ... + xn", family = negative.binomial(1), data = count_data |
ZIP | zeroinfl(dependent variable ~ x1 + x2 + ... + xn, data = count_data, dist = "poisson") |
ZINB | zeroinfl(dependent variable ~ x1 + x2 + ... + xn, data = count_data, dist = "negbin") |
HP | hurdle(dependent variable ~ x1 + x2 + ... + xn, data = count_data, dist = "poisson") |
HNB | hurdle(dependent variable ~ x1 + x2 + ... + xn, data = count_data, dist = "negbin") |
The results output of the each model is generated by "summary(model)"
# using the Possion model
poisson <- glm("dependent variable ~ x1 + x2 + ... + xn", family = 'poisson', data = count_data)
summary(poisson)
The libraries are readxl, mass, pscl
In the "ModelComparison.Rmd" file, "count_data" is the name assigned to the schistosomiasis count data.
-
The data contains the columns; "Age", "Sex", "Class", "Community", "S_haematobium" , "S_mansoni", "Parent_Occupation", "Pipe_borne", "Tanker_treated", "Tanker_Untreated", "River_Stream", "Well_Borehole", "Age_group", "EduLevel", "Area", "schistosomiasis"
-
The five predictors used are; Age_group, Sex, Educational level, Area and Parent's occupation and ten predictors used are; Age_group, Sex, Educational level, Area, Parent's occupation, Pipe_borne, Tanker_treated, Tanker_Untreated, River_Stream and Well_Borehole
-
The different intensity data used are; the all intensity (all sample data) and the low intensity count data named "count_data" and "low_intensity_data" respectively in the Rmd file.
-
The models are assigned names in the format
{modelname}.model_{intensity data type}{number of variable}
. For example, if the Poisson model is applied using all intensity data and 10 predictors then we have the name assign to such syntax aspoisson.model_all10
. For a negative binomial applied with the low intensity data with 5 predictors, we will havenegbin.model_all10
. -
The AIC and BIC are derived with the syntax below
AIC({assigned model name})
andBIC({assigned model name})
respectively.
Model predicted and observed values are derived from ROOTOGRAM model where its output is a plot of the expected values overlayed on the observed values but when an argument plot = FALSE
is set, the values in a table form is assigned to a name given
For the observed and expected values for the model name poisson.model_all10
and others, the observed and expected value is derived with the syntax below
poisson_results <- rootogram(poisson.model_all10, max = 50, style = "standing", plot = FALSE)
negbin_results <- rootogram(negbin.model_all10, max = 50, style = "standing", plot = FALSE)
zip_results <- rootogram(zip.model_all10, max = 50, style = "standing", plot = FALSE)
zinb_results <- rootogram(zinb.model_all10, max = 50, style = "standing", plot = FALSE)
hurdlePoisson_results <- rootogram(hurdlep.model_all10, max = 50, style = "standing", plot = FALSE)
hurdleNB_results <- rootogram(hurdlenb.model_all10, max = 50, style = "standing", plot = FALSE)
and the output results are saved as an excel file with the command
write_xlsx(poisson_results,"C:\\Users\\User\\Desktop\\paper review\\model results\\poisson_results.xlsx")
Calculating the odds ratio and 95% confidence interval for each predictor variable from the hurdle NB model's results
exp(cbind(Odds_Ratio = coef(hurdlenb.model_all10), confint(hurdlenb.model_all10)))
The syntax for finding the likelihood ratio test for nested models (between using 5 and 10 predictors of the same model) to determine if the introduction of extra parameter is necessary for zero-capturing is stated below
# for Poisson between 5 predctors and 10 predictors
lrtest(poisson.model_all5, poisson.model_all10)
The required libraries are pandas (version '2.1.4'), numpy (version '1.26.3'), scipy (version '1.11.4'), statmodels (version '0.14.0'), matplotlib (version '3.8.0') and seaborn (version '0.13.2')
First, the age, community and person's class in school values were grouped into groups which are the age group (<10, 10-15 and >15), area (rural and urban) and educational level (preschool, primary and junior high) respectively.
In this section the AIC values of all scenarios used under different models were uploaded for visual comparison
The folder "model results" contains the each model's expected values.
These values are then plotted against the observed values in 1) a bars against each other in log aixs and 2) the expected values overlaid on the bars of the observed values
The model residual was calculated by deducting the observed values from the expected values. However the corresponding plots shows
and this is defined by the python code below. If
find_sqrt = lambda x: np.sqrt(x) if x>= 0 else -np.sqrt(abs(x))
poisson_r = np.array(list(map(find_sqrt, poisson_residuals))).round(0)
negbin_r = np.array(list(map(find_sqrt, negbin_residuals))).round(0)
zip_r = np.array(list(map(find_sqrt, zip_residuals))).round(0)
zinb_r = np.array(list(map(find_sqrt, zinb_residuals))).round(0)
hurdlePoisson_r = np.array(list(map(find_sqrt, hurdlePoisson_residuals))).round(0)
hurdleNB_r = np.array(list(map(find_sqrt, hurdleNB_residuals))).round(0)