-
Notifications
You must be signed in to change notification settings - Fork 0
/
final_report.Rmd
153 lines (119 loc) · 4.28 KB
/
final_report.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
---
title: "Health Insurance Report"
author: "Viktoriia Tantsiura, Valeriia Zinoveva, Mercédesz Lehoczky"
output: html_document
date: "Last edited `r Sys.time()`"
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(
echo = TRUE,
warning = FALSE,
message = FALSE,
dpi = 300)
```
```{r data, include = FALSE}
library(readr)
library(rio)
library(ggplot2)
library(tidyverse)
library(Hmisc)
library(broom)
library(ggpubr)
library(corrplot)
health_insurance <- read_csv("data/insurance.csv")
```
# Health Insurance Dataset Cost Prediction
For Class Exercise 1 we took `health_insurance` dataset from [the Kaggle](https://www.kaggle.com/datasets/annetxu/health-insurance-cost-prediction). The dataset has 1338 observations on insurance expenditure for people of different age, sex, body mass index (BMI) and other. ![](https://media.istockphoto.com/id/1005014108/photo/individual-health-insurance-policy-and-stethoscope.jpg?s=612x612&w=0&k=20&c=m10Up9PPs23BOOvTSKPCBhhNr0PneobgjLAHW05G2ek=)
```{r}
# Browse the dataset
health_insurance
```
## Exploratory analysis
```{r}
# Load packages
library(rio)
library(ggplot2)
library(tidyverse)
library(broom)
library(ggpubr)
library(corrplot)
```
We explored descriptive statistics of different variables included in `health_insurance` and picked the charges as dependent variable and the age and sex as independent variables of interest.
### Expenditure
``` {r}
# Exploring charges as dependent variable
```
``` {r}
summary(health_insurance$charges)
```
Now we want to see the distibution of the data.
```{r}
plot(density(health_insurance$charges))
```
The distribution is right-skewed. What is affecting the distribution?
### Age
``` {r}
# Exploring age as independent variable
```
```{r}
summary(health_insurance$age)
ggplot(health_insurance, aes(age)) +
geom_histogram(binwidth = 1)
```
The distribution of the age is almost uniform.
### Sex
``` {r}
# Exploring age as independent variable
```
``` {r}
health_insurance$sex <- factor(health_insurance$sex)
summary(health_insurance$sex)
plot(health_insurance$sex)
```
We have almost equal amount of women and men in dataset.
## Hypothesis testing
Let's assume that age is affecting the health insurance expenditure. It's a good practice to visualize this thought before making hypothesis.
```{r}
# Making a scatterplot
ggplot(health_insurance, aes(age, charges)) +
geom_point() +
labs(x = "Age", y = "Charges") +
ggtitle("Charges vs. Age")
```
We can see that there is possibly some positive correlation between charges and age.
Will there be any interesting findings if we split by sex?
```{r}
# Split by sex
ggplot(health_insurance, aes(age, charges)) +
geom_point() +
labs(x = "Age", y = "Charges") +
ggtitle("Charges vs. Age by Sex") +
scale_color_manual(values = c("red", "blue")) +
aes(color = sex)
```
Even though there may not be a noticeable relationship between the variables of sex and charges, we can still utilize this dataset to improve our data manipulation skills. One way to do this is by isolating the female population and excluding the male observations from our analysis.
### Research question
Do females spend more money on insurance after age of 35?
```{r}
# Filtering the data
health_insurance_f <- health_insurance %>%
filter(sex == "female") %>%
mutate(group = ifelse(age >= 35, "Above 35", "Below 35"))
health_insurance_f %>%
count(group)
```
```{r}
# Calculating groupwise summary statistics
health_insurance_f %>%
group_by(group) %>%
summarise(mean_charges = mean(charges))
```
Is this difference is statistically significant or could it be explained by sampling variability?
### Empirical hypothesis
$H_0$ = Females after age 35 spend less than or same amount of money as females before 35 age
$H_a$ = Females after age 35 spend more money than females before 35 age
The Welch t-test was chosen to o compare the means between two independent groups.
```{r}
t.test(charges ~ group, data = health_insurance_f, alternative = "greater")
```
P-value is less that 0.05, therefore we reject the null hypothesis and conclude that there is evidence to support the alternative hypothesis that females after age 35 spend more money on healthcare expenses than females before 35 years of age. This change is statistically significant and cannot be explained by sampling variability alone.