Skip to content

The count data and model codes for analysis for the paper "Using zero-inflated and hurdle models to analyze schistosomiasis count data of school children in the southern areas of Ghana" are presented here. The codes are a combination of Python and R codes.

Notifications You must be signed in to change notification settings

kojonketia/Using-hurdle-models-to-analyze-schistosomiasis-count-data

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

48 Commits
 
 
 
 

Repository files navigation

Using-hurdle-models-to-analyze-schistosomiasis-count-data

The excel file named "schistosomiasis_count_data.xlsx" contains the demographic features and the "Dataset.xlsx" is the dataset after extra columns where certain variables were grouped and it is what was used for analysis.

Files

The Rmd file contains the model codes used. Six distinct models under different scenarios were used.

The python (jupyter) file contains codes used for slicing dataframes, descriptive statistics and fugure plotting

Methods

Rmd

The standard Poisson, negative binomial (NB), zero-inflated Possion (ZIP), zero-inflated negative binomial (ZINB), hurdle Poisson (HP), hurdle negative binomial (HNB) models are the various models used.

Their corresponding R syntaxes are showing the table that follows below

Model Syntax
Poisson glm("dependent variable ~ x1 + x2 + ... + xn", family = 'poisson', data = count_data)
NB glm("dependent variable ~ x1 + x2 + ... + xn", family = negative.binomial(1), data = count_data
ZIP zeroinfl(dependent variable ~ x1 + x2 + ... + xn, data = count_data, dist = "poisson")
ZINB zeroinfl(dependent variable ~ x1 + x2 + ... + xn, data = count_data, dist = "negbin")
HP hurdle(dependent variable ~ x1 + x2 + ... + xn, data = count_data, dist = "poisson")
HNB hurdle(dependent variable ~ x1 + x2 + ... + xn, data = count_data, dist = "negbin")

The results output of the each model is generated by "summary(model)"

# using the Possion model
poisson <- glm("dependent variable ~ x1 + x2 + ... + xn", family = 'poisson', data = count_data)
summary(poisson)

The libraries are readxl, mass, pscl

ModelComparison.Rmd

In the "ModelComparison.Rmd" file, "count_data" is the name assigned to the schistosomiasis count data.

  • The data contains the columns; "Age", "Sex", "Class", "Community", "S_haematobium" , "S_mansoni", "Parent_Occupation", "Pipe_borne", "Tanker_treated", "Tanker_Untreated", "River_Stream", "Well_Borehole", "Age_group", "EduLevel", "Area", "schistosomiasis"

  • The five predictors used are; Age_group, Sex, Educational level, Area and Parent's occupation and ten predictors used are; Age_group, Sex, Educational level, Area, Parent's occupation, Pipe_borne, Tanker_treated, Tanker_Untreated, River_Stream and Well_Borehole

  • The different intensity data used are; the all intensity (all sample data) and the low intensity count data named "count_data" and "low_intensity_data" respectively in the Rmd file.

  • The models are assigned names in the format {modelname}.model_{intensity data type}{number of variable}. For example, if the Poisson model is applied using all intensity data and 10 predictors then we have the name assign to such syntax as poisson.model_all10. For a negative binomial applied with the low intensity data with 5 predictors, we will have negbin.model_all10.

  • The AIC and BIC are derived with the syntax below AIC({assigned model name}) and BIC({assigned model name}) respectively.

Rootograms

Model predicted and observed values are derived from ROOTOGRAM model where its output is a plot of the expected values overlayed on the observed values but when an argument plot = FALSE is set, the values in a table form is assigned to a name given

For the observed and expected values for the model name poisson.model_all10 and others, the observed and expected value is derived with the syntax below

poisson_results <- rootogram(poisson.model_all10, max = 50, style = "standing", plot = FALSE)
negbin_results <- rootogram(negbin.model_all10, max = 50, style = "standing", plot = FALSE)
zip_results <- rootogram(zip.model_all10, max = 50, style = "standing", plot = FALSE)
zinb_results <- rootogram(zinb.model_all10, max = 50, style = "standing", plot = FALSE)
hurdlePoisson_results <- rootogram(hurdlep.model_all10, max = 50, style = "standing", plot = FALSE)
hurdleNB_results <- rootogram(hurdlenb.model_all10, max = 50, style = "standing", plot = FALSE)

and the output results are saved as an excel file with the command

write_xlsx(poisson_results,"C:\\Users\\User\\Desktop\\paper review\\model results\\poisson_results.xlsx")

Odds ratio

Calculating the odds ratio and 95% confidence interval for each predictor variable from the hurdle NB model's results

exp(cbind(Odds_Ratio = coef(hurdlenb.model_all10), confint(hurdlenb.model_all10)))

Likelihood ratio test

The syntax for finding the likelihood ratio test for nested models (between using 5 and 10 predictors of the same model) to determine if the introduction of extra parameter is necessary for zero-capturing is stated below

# for Poisson between 5 predctors and 10 predictors
lrtest(poisson.model_all5, poisson.model_all10) 

Analysis_and_figures_plotting.ipynb

The required libraries are pandas (version '2.1.4'), numpy (version '1.26.3'), scipy (version '1.11.4'), statmodels (version '0.14.0'), matplotlib (version '3.8.0') and seaborn (version '0.13.2')

First, the age, community and person's class in school values were grouped into groups which are the age group (<10, 10-15 and >15), area (rural and urban) and educational level (preschool, primary and junior high) respectively.

Model's AIC values

In this section the AIC values of all scenarios used under different models were uploaded for visual comparison

Observed and model's expected values

The folder "model results" contains the each model's expected values.

These values are then plotted against the observed values in 1) a bars against each other in log aixs and 2) the expected values overlaid on the bars of the observed values

Model's residuals

The model residual was calculated by deducting the observed values from the expected values. However the corresponding plots shows

$$\displaylines{ residual = expected - observed. \\\ \text{if residual is positive, then the square root is calculated by } \\\ \sqrt{residual} \\\ \text{if residual is negative, then the square root is calculated by } \\\ -\sqrt{|residual|} }$$

and this is defined by the python code below. If

find_sqrt = lambda x: np.sqrt(x) if x>= 0 else -np.sqrt(abs(x))

poisson_r = np.array(list(map(find_sqrt, poisson_residuals))).round(0)

negbin_r = np.array(list(map(find_sqrt, negbin_residuals))).round(0)

zip_r = np.array(list(map(find_sqrt, zip_residuals))).round(0)

zinb_r = np.array(list(map(find_sqrt, zinb_residuals))).round(0)

hurdlePoisson_r = np.array(list(map(find_sqrt, hurdlePoisson_residuals))).round(0)

hurdleNB_r = np.array(list(map(find_sqrt, hurdleNB_residuals))).round(0)

About

The count data and model codes for analysis for the paper "Using zero-inflated and hurdle models to analyze schistosomiasis count data of school children in the southern areas of Ghana" are presented here. The codes are a combination of Python and R codes.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published