Using-hurdle-models-to-analyze-schistosomiasis-count-data

The excel file named "schistosomiasis_count_data.xlsx" contains the demographic features and the "Dataset.xlsx" is the dataset after extra columns where certain variables were grouped and it is what was used for analysis.

Files

The Rmd file contains the model codes used. Six distinct models under different scenarios were used.

The python (jupyter) file contains codes used for slicing dataframes, descriptive statistics and fugure plotting

Methods

Rmd

The standard Poisson, negative binomial (NB), zero-inflated Possion (ZIP), zero-inflated negative binomial (ZINB), hurdle Poisson (HP), hurdle negative binomial (HNB) models are the various models used.

Their corresponding R syntaxes are showing the table that follows below

Model	Syntax
Poisson	glm("dependent variable ~ x1 + x2 + ... + xn", family = 'poisson', data = count_data)
NB	glm("dependent variable ~ x1 + x2 + ... + xn", family = negative.binomial(1), data = count_data
ZIP	zeroinfl(dependent variable ~ x1 + x2 + ... + xn, data = count_data, dist = "poisson")
ZINB	zeroinfl(dependent variable ~ x1 + x2 + ... + xn, data = count_data, dist = "negbin")
HP	hurdle(dependent variable ~ x1 + x2 + ... + xn, data = count_data, dist = "poisson")
HNB	hurdle(dependent variable ~ x1 + x2 + ... + xn, data = count_data, dist = "negbin")

The results output of the each model is generated by "summary(model)"

# using the Possion model
poisson <- glm("dependent variable ~ x1 + x2 + ... + xn", family = 'poisson', data = count_data)
summary(poisson)

The libraries are readxl, mass, pscl

ModelComparison.Rmd

In the "ModelComparison.Rmd" file, "count_data" is the name assigned to the schistosomiasis count data.

The data contains the columns; "Age", "Sex", "Class", "Community", "S_haematobium" , "S_mansoni", "Parent_Occupation", "Pipe_borne", "Tanker_treated", "Tanker_Untreated", "River_Stream", "Well_Borehole", "Age_group", "EduLevel", "Area", "schistosomiasis"
The five predictors used are; Age_group, Sex, Educational level, Area and Parent's occupation and ten predictors used are; Age_group, Sex, Educational level, Area, Parent's occupation, Pipe_borne, Tanker_treated, Tanker_Untreated, River_Stream and Well_Borehole
The different intensity data used are; the all intensity (all sample data) and the low intensity count data named "count_data" and "low_intensity_data" respectively in the Rmd file.
The models are assigned names in the format {modelname}.model_{intensity data type}{number of variable}. For example, if the Poisson model is applied using all intensity data and 10 predictors then we have the name assign to such syntax as poisson.model_all10. For a negative binomial applied with the low intensity data with 5 predictors, we will have negbin.model_all10.
The AIC and BIC are derived with the syntax below AIC({assigned model name}) and BIC({assigned model name}) respectively.

Rootograms

Model predicted and observed values are derived from ROOTOGRAM model where its output is a plot of the expected values overlayed on the observed values but when an argument plot = FALSE is set, the values in a table form is assigned to a name given

For the observed and expected values for the model name poisson.model_all10 and others, the observed and expected value is derived with the syntax below

poisson_results <- rootogram(poisson.model_all10, max = 50, style = "standing", plot = FALSE)
negbin_results <- rootogram(negbin.model_all10, max = 50, style = "standing", plot = FALSE)
zip_results <- rootogram(zip.model_all10, max = 50, style = "standing", plot = FALSE)
zinb_results <- rootogram(zinb.model_all10, max = 50, style = "standing", plot = FALSE)
hurdlePoisson_results <- rootogram(hurdlep.model_all10, max = 50, style = "standing", plot = FALSE)
hurdleNB_results <- rootogram(hurdlenb.model_all10, max = 50, style = "standing", plot = FALSE)

and the output results are saved as an excel file with the command

write_xlsx(poisson_results,"C:\\Users\\User\\Desktop\\paper review\\model results\\poisson_results.xlsx")

Odds ratio

Calculating the odds ratio and 95% confidence interval for each predictor variable from the hurdle NB model's results

exp(cbind(Odds_Ratio = coef(hurdlenb.model_all10), confint(hurdlenb.model_all10)))

Likelihood ratio test

The syntax for finding the likelihood ratio test for nested models (between using 5 and 10 predictors of the same model) to determine if the introduction of extra parameter is necessary for zero-capturing is stated below

# for Poisson between 5 predctors and 10 predictors
lrtest(poisson.model_all5, poisson.model_all10)

Analysis_and_figures_plotting.ipynb

The required libraries are pandas (version '2.1.4'), numpy (version '1.26.3'), scipy (version '1.11.4'), statmodels (version '0.14.0'), matplotlib (version '3.8.0') and seaborn (version '0.13.2')

First, the age, community and person's class in school values were grouped into groups which are the age group (<10, 10-15 and >15), area (rural and urban) and educational level (preschool, primary and junior high) respectively.

Model's AIC values

In this section the AIC values of all scenarios used under different models were uploaded for visual comparison

Observed and model's expected values

The folder "model results" contains the each model's expected values.

These values are then plotted against the observed values in 1) a bars against each other in log aixs and 2) the expected values overlaid on the bars of the observed values

Model's residuals

The model residual was calculated by deducting the observed values from the expected values. However the corresponding plots shows

$$\displaylines{ residual = expected - observed. \\\ \text{if residual is positive, then the square root is calculated by } \\\ \sqrt{residual} \\\ \text{if residual is negative, then the square root is calculated by } \\\ -\sqrt{|residual|} }$$

and this is defined by the python code below. If

find_sqrt = lambda x: np.sqrt(x) if x>= 0 else -np.sqrt(abs(x))

poisson_r = np.array(list(map(find_sqrt, poisson_residuals))).round(0)

negbin_r = np.array(list(map(find_sqrt, negbin_residuals))).round(0)

zip_r = np.array(list(map(find_sqrt, zip_residuals))).round(0)

zinb_r = np.array(list(map(find_sqrt, zinb_residuals))).round(0)

hurdlePoisson_r = np.array(list(map(find_sqrt, hurdlePoisson_residuals))).round(0)

hurdleNB_r = np.array(list(map(find_sqrt, hurdleNB_residuals))).round(0)

Name		Name	Last commit message	Last commit date
Latest commit History 48 Commits
using-hurdle-models		using-hurdle-models
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Using-hurdle-models-to-analyze-schistosomiasis-count-data

Files

Methods

Rmd

ModelComparison.Rmd

Rootograms

Odds ratio

Likelihood ratio test

Analysis_and_figures_plotting.ipynb

Model's AIC values

Observed and model's expected values

Model's residuals

About

Releases

Packages

Languages

kojonketia/Using-hurdle-models-to-analyze-schistosomiasis-count-data

Folders and files

Latest commit

History

Repository files navigation

Using-hurdle-models-to-analyze-schistosomiasis-count-data

Files

Methods

Rmd

ModelComparison.Rmd

Rootograms

Odds ratio

Likelihood ratio test

Analysis_and_figures_plotting.ipynb

Model's AIC values

Observed and model's expected values

Model's residuals

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages