-
Notifications
You must be signed in to change notification settings - Fork 0
/
01_clean-data.Rmd
145 lines (117 loc) · 3.71 KB
/
01_clean-data.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
---
title: "01_clean-data"
author: "Francesco Grassi"
date: "2022-10-16"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
# Aim
This script cleans up the file containing raw data about deaths in the hometown.
The file comes from: https://www.istat.it/storage/dati_mortalita/decessi-comunali-giornalieri-regioni(excel)-21-07-2022-5.zip
The link downloads a .zip folder with an .xlsx file for each Italian region. Each file contains data regarding the daily number of deaths in each town of the region in the period 01/01/2011 - 31/07/2022.
The file of interest is "comuni_giornaliero_31luglio2022_Puglia.xlsx", which has been renamed in this project as "raw_all-region".
The file contains the following variables:
- REG: Code indicating Italian Region
- PROV: Code indicating Italian Province within Region
- NOME_REGIONE: Name of the Region
- NOME_PROVINCIA: Name of the Province
- NOME_COMUNE: Name of the city/town
- COD_PROVCOM: Code indicating the city/town
- TIPO_COMUNE: 1 if data is available for the entire assessment period; 2 if data from 2022 is missing
- CL_ETA: Age range of the deceased person
+ 0: 0
+ 1: 1-4
+ 2: 5-9
+ 3: 10-14
+ 4: 15-19
+ 5: 20-24
+ 6: 25-29
+ 7: 30-34
+ 8: 35-39
+ 9: 40-44
+ 10: 45-49
+ 11: 50-54
+ 12: 55-59
+ 13: 60-64
+ 14: 65-69
+ 15: 70-74
+ 16: 75-79
+ 17: 80-84
+ 18: 85-89
+ 19: 90-94
+ 20: 95-99
+ 21: 100+
- GE: day of death (MM/DD)
- Many columns with name structure "SEX_YY", indicating number of deaths for a given SEX ("M": Male; "F": Female, "T": Total) for the year YY (range: 2011-2022)
# Libraries
```{r}
library(tidyverse)
library(readxl)
library(lubridate)
```
# First cleaning
- Load raw data
- Select only the town of "Castelluccio dei Sauri"
- Get rid of all Region/Province names and codes, as well as "Total" death count columns
- Convert variable "GE" in date format "%d-%m"
- Replace "n.d." in death counts with NAs
- Convert death counts columns to numeric
```{r}
raw_data <- read_excel(path = "data/raw_all-region.xlsx", trim_ws = TRUE)
raw_data <- raw_data %>%
filter(NOME_COMUNE == "Castelluccio dei Sauri") %>%
select(!c(REG:TIPO_COMUNE, starts_with("T_"))) %>%
mutate(GE = format(as.Date(GE, format = "%m%d"), "%d-%m")) %>%
na_if("n.d.") %>%
mutate_at(vars(-CL_ETA, -GE), as.numeric)
```
# Deal with Dates
While column "GE" contains the day and month of decease, the year is stored in the variable name of the death counts columns. In order to obtain a proper date variable:
- Convert data.frame to long format, splitting death count variable names in "Sex" and "Year"
- Combine "GE" and "Year" into one variable and convert to date format
```{r}
raw_data_long <- raw_data %>%
pivot_longer(!c(CL_ETA, GE),
names_to = c("Sex", "Year"),
names_sep = "_",
values_to = "Count") %>%
unite("Date", GE, Year, sep = "-") %>%
mutate(Date = dmy(Date))
```
# Final cleaning
- Remove rows with no death (0 or NAs in "Count")
- Replace "CL_ETA" with corresponding age range
- Rename variables
- Save final data.frame
```{r}
raw_data_long <- raw_data_long %>%
filter(Count > 0) %>%
mutate(CL_ETA = recode(CL_ETA,
`0` = "0",
`1` = "1-4",
`2` = "5-9",
`3` = "10-14",
`4` = "15-19",
`5` = "20-24",
`6` = "25-29",
`7` = "30-34",
`8` = "35-39",
`9` = "40-44",
`10` = "45-49",
`11` = "50-54",
`12` = "55-59",
`13` = "60-64",
`14` = "65-69",
`15` = "70-74",
`16` = "75-79",
`17` = "80-84",
`18` = "85-89",
`19` = "90-94",
`20` = "95-99",
`21` = "100+"
)) %>%
rename(Age_Range = CL_ETA)
write_csv(raw_data_long, file = "data/clean-data.csv")
```