final_report.Rmd

---
title: "Health Insurance Report"
author: "Viktoriia Tantsiura, Valeriia Zinoveva, Mercédesz Lehoczky"
output: html_document
date: "Last edited `r Sys.time()`"
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(
  echo = TRUE, 
  warning = FALSE, 
  message = FALSE, 
  dpi = 300)
```

```{r data, include = FALSE}
library(readr)
library(rio)
library(ggplot2)
library(tidyverse)
library(Hmisc)
library(broom)
library(ggpubr)
library(corrplot)
health_insurance <- read_csv("data/insurance.csv")
```

# Health Insurance Dataset Cost Prediction

For Class Exercise 1 we took `health_insurance` dataset from [the Kaggle](https://www.kaggle.com/datasets/annetxu/health-insurance-cost-prediction). The dataset has 1338 observations on insurance expenditure for people of different age, sex, body mass index (BMI) and other. ![](https://media.istockphoto.com/id/1005014108/photo/individual-health-insurance-policy-and-stethoscope.jpg?s=612x612&w=0&k=20&c=m10Up9PPs23BOOvTSKPCBhhNr0PneobgjLAHW05G2ek=)

```{r}
# Browse the dataset
health_insurance
```

## Exploratory analysis

```{r}
# Load packages
library(rio)
library(ggplot2)
library(tidyverse)
library(broom)
library(ggpubr)
library(corrplot)
```

We explored descriptive statistics of different variables included in `health_insurance` and picked the charges as dependent variable and the age and sex as independent variables of interest.

### Expenditure
``` {r}
# Exploring charges as dependent variable
```
``` {r}
summary(health_insurance$charges)
```

Now we want to see the distibution of the data.

```{r}
plot(density(health_insurance$charges))
```

The distribution is right-skewed. What is affecting the distribution?

### Age
``` {r}
# Exploring age as independent variable
```
```{r}
summary(health_insurance$age)
ggplot(health_insurance, aes(age)) +
  geom_histogram(binwidth = 1) 
```

The distribution of the age is almost uniform.

### Sex
``` {r}
# Exploring age as independent variable
```
``` {r}
health_insurance$sex <- factor(health_insurance$sex)
summary(health_insurance$sex)
plot(health_insurance$sex)
```

We have almost equal amount of women and men in dataset.

## Hypothesis testing

Let's assume that age is affecting the health insurance expenditure. It's a good practice to visualize this thought before making hypothesis.

```{r}
# Making a scatterplot
ggplot(health_insurance, aes(age, charges)) + 
  geom_point() +
  labs(x = "Age", y = "Charges") + 
  ggtitle("Charges vs. Age")
```
We can see that there is possibly some positive correlation between charges and age.

Will there be any interesting findings if we split by sex?
```{r}
# Split by sex
ggplot(health_insurance, aes(age, charges)) +
  geom_point() +
  labs(x = "Age", y = "Charges") +
  ggtitle("Charges vs. Age by Sex") +
  scale_color_manual(values = c("red", "blue")) +  
  aes(color = sex)
```

Even though there may not be a noticeable relationship between the variables of sex and charges, we can still utilize this dataset to improve our data manipulation skills. One way to do this is by isolating the female population and excluding the male observations from our analysis.

### Research question

Do females spend more money on insurance after age of 35?

```{r}
# Filtering the data
health_insurance_f <- health_insurance %>% 
  filter(sex == "female") %>% 
  mutate(group = ifelse(age >= 35, "Above 35", "Below 35"))
   

health_insurance_f %>%
  count(group)  
```

```{r}
# Calculating groupwise summary statistics
health_insurance_f %>% 
  group_by(group) %>%
  summarise(mean_charges = mean(charges))
```
Is this difference is statistically significant or could it be explained by sampling variability?

### Empirical hypothesis 

$H_0$ = Females after age 35 spend less than or same amount of money as females before 35 age
$H_a$ = Females after age 35 spend more money than females before 35 age

The Welch t-test was chosen to o compare the means between two independent groups.

```{r}
t.test(charges ~ group, data = health_insurance_f, alternative = "greater")
```

P-value is less that 0.05, therefore we reject the null hypothesis and conclude that there is evidence to support the alternative hypothesis that females after age 35 spend more money on healthcare expenses than females before 35 years of age. This change is statistically significant and cannot be explained by sampling variability alone.