data.qmd

---
title: "DATA"
---

# DATA CLEANING AND EDA ANALYSIS

Data source: [TÜİK](https://biruni.tuik.gov.tr/medas/?kn=95&locale=tr)

We transferred the data we received from **TÜİK** as an Excel file onto R with the **`read_excel( )`** function in the **`readxl`** library.

In this code, we created "region2id" column for cities in population data set by using region26 data set. Our aim is to have common area (column) for all the data sets (population, region26 and migration) to study on the data sets comfortable. Additionaly, we renamed the columns to make it easier to work in R by using `names()` function. In this code, we created "region2id" column for cities in population data set by using region26 data set. Our aim is to have common area (column) for all the data sets (population, region26 and migration) to study on the data sets comfortable. Additionaly, We renamed the columns to make it easier to work in R by using `names()` function and we tidy up our data sets with the help of the **`pivot_longer()`** function to work with these datasets more flexibly and facilitate visualization. The tidy data sets names are **last_migration_tidy_data, tidy_region26 and last_population_tidy_data** respectively. Appart from this operation,using the population data set, we added the **city column** and **region column** to both the migration data set in tidy form and the migration data set in wide form to do better visualization.

In population data frame, we used 31 columns. **IDD** is the code that identifies the district or region. It will be used in migration analysis to track migration patterns between different regions. **Region ID** and **regions** reflect the region ID and name. Migration data will be examined on a regional basis to understand population movements between different regions. **Totalpop**, **male**, **female**, corresponds to total population, male population and female population. It will be used in migration analysis to understand how populations change from one region to another.

-   **Age groups** (age04, age59, age1014, age1519, age2024, age2529, age3034, age3539, age4044, other age) symbolize the population distribution of different age groups. Migration data will be used to examine migration trends of certain age groups from one region to another. **Unknowns** indicate unknown or uncertain population information. When analyzing migration, the impact of missing data and how this data may affect migration trends will be important.

-   **Education levels** (doctorate, primary edu, primary edu, elementary edu, highschool, literatebutnoschool, notliterate, middleschool, master university) give the population distribution at different education levels. It will be used to analyze the cities receiving immigrants according to their education levels. **Fertility rate**, the effect of the fertility rate, which is one of the factors affecting the tendency of individuals to migrate from one region to another, will be examined.

-   **Electricity** reflects Total Electricity Consumption (Kwh) per capita data. Migration may be towards areas with better infrastructure or living conditions. Migration may occur due to **housing sales numbers**, changes in the housing market, and trends in housing sales will be important in migration analysis.

In migration data frame, we used 28 columns. Since it contains common columns with the population dataframe, we explain below the columns that are different from the population data frame.

-   **Turnbackfamily** will provide information about the return loss of families, the rate of families migrating, or the return rates of families migrating. **Betterlifecond**, the search for better living conditions may be one of the reasons why people migrate. **Others** reflect other reasons for migration. This category includes a variety of reasons that are not specifically defined but may be associated with migration. **Retirement** refers to relocation after retirement or migration to preferred provinces for retirement.

-   **Buyhome** reflects home buying data. An increase in housing sales in a particular region will indicate that migration to that region has increased and people prefer that region. **Familymig** describes migrations that occur for family reasons. It represents migrations due to family reunification or family relationships.

-   **Finjob** indicates migrations for job opportunities or finding a job. People may be more likely to migrate for work or career opportunities. It describes situations such as change of marital status, marriage, and divorce. **Health** refers to migrations that occur for health reasons. **Appointment** represents a new business opportunity or change.

There are 5 columns in our region 26 data frame. We explain the columns that are different from our population and migration data frame below.

-   **Region2id** is the second region code specified. **Workforce15plus** and **workforce1564** reflect workforce data aged 15 and over and workforce data between ages 15-64, respectively. This data can influence workforce dynamics in a particular region. The labor force participation rate or workforce structure of migrating individuals may be important in terms of economic impacts and changes in employment. **Usableincome** symbolizes disposable income or income level. This data can affect income levels and economic well-being. Migration in a particular region can cause changes in income distribution and living standards in that region.

```{r}
#| code-fold: true
#| code-summary: "Click the see the code"
#| warning: false

# importing necessary packages
library(tidyverse) 
library(readxl)
library(readr)

population <- read_excel("C:\\Users\\beyza\\Desktop\\DataVizards_FinalDataFrame.xlsx")
names(population) <- c('IDD','city','regionid','regions','totalpop','male','female','age04','age59','age1014','age1519','age2024','age2529','age3034','age3539','age4044','otherage','unknowns','doctorate','primaryedu','elementaryedu','highschool','literatebutnoschool','notliterate','middleschool','master','university','fertilityrate','electricity','numberofattempts','housingsalesnumbers')

region26 <- read_excel("C:\\Users\\beyza\\Desktop\\DataVizards_FinalDataFrame.xlsx", sheet = "Bolge26")
names(region26) <- c('region','region2id','workforce15plus','workforce1564','usableincome')

migration <- read_excel("C:\\Users\\beyza\\Desktop\\DataVizards_FinalDataFrame.xlsx", sheet = "Goc Bilgileri")
names(migration) <- c('IDD','male2','female2','turnbackfamily','unknowns2','betterlifecond','others','education','retirement','buyhome','familymig','finjob','maritalstatuschange','health','appointment','age04_2','age59_2','age1014_2','age1519_2','age2024_2','age2529_2','age3034_2','age3539_2','age4044_2','university2','highschool2','middleschool2', 'elementaryschool2')


# tidy migration data set
migration$city <- population$city
migration$regions <- population$regions

tidy_data_gender <- migration |> pivot_longer(c(male2,female2),names_to = "gender2",values_to = "gender_value2")

tidy_data_causes <- tidy_data_gender |> pivot_longer(c(turnbackfamily,unknowns2,betterlifecond,others,education,
                                                retirement,buyhome,familymig,finjob,maritalstatuschange,health,appointment),names_to = "migrationcauses",values_to = "migrationcauses_value")

tidy_data_age <- tidy_data_causes |> pivot_longer(c(age04_2,age59_2,age1014_2,age1519_2,
                                                    age2024_2,age2529_2,age3034_2,age3539_2,age4044_2),names_to = "agerange2",values_to = "agerange_value2")

last_migration_tidy_data <- tidy_data_age |> pivot_longer(c(university2,highschool2,middleschool2,elementaryschool2),names_to = "education2",values_to = "education_value2")
head(last_migration_tidy_data)

# tidy region26 data set
tidy_region26 <- region26 |> pivot_longer(c(workforce15plus,workforce1564),names_to = "workforce",values_to = "workforce_values")
head(tidy_region26)

# tidy population data set
tidy_data_gender2 <- population |> pivot_longer(c(male,female),names_to = "gender",values_to = "gender_value") 

tidy_data_age <- tidy_data_gender2 |> pivot_longer(c(age04,age59,age1014,age1519,
                                                    age2024,age2529,age3034,age3539,age4044,otherage),names_to = "agerange",values_to = "agerange_value")

tidy_literate <- tidy_data_age |> pivot_longer(c(unknowns,literatebutnoschool,notliterate),names_to = "literate",values_to = "literate_values")

last_tidy_data_education2 <- tidy_literate|> pivot_longer(c(university,highschool,middleschool,elementaryedu,doctorate,primaryedu,master),names_to = "education",values_to = "education_value")

last_population_tidy_data <- last_tidy_data_education2 |> pivot_longer(c(fertilityrate,electricity,numberofattempts,housingsalesnumbers),names_to = "othervariables",values_to = "othervariables_value")
head(last_population_tidy_data)

population$`region2id`<- ""
for (i in 1:nrow(population)) {
  city <- population$city[i]
  
  for (j in 1:length(region26$region)) {
    if (grepl(city, region26$region[j])) {
      population$`region2id`[i] <- paste(region26$`region2id`[j])
      break 
    }
  }
}

```