data.qmd

---
title: Data
description: We describe the sources of our data and the cleaning process.
toc: true
draft: false
---

![](images/data-import-cheatsheet-thumbs.png)


This comes from the file `data.qmd`.

Your first steps in this project will be to find data to work on.

I recommend trying to find data that interests you and that you are knowledgeable about. A bad example would be if you have no interest in video games but your data set is about video games. I also recommend finding data that is related to current events, social justice, and other areas that have an impact.


Initially, you will study _one dataset_ but later you will need to combine that data with another dataset. For this reason, I recommend finding data that has some date and/or location components. These types of data are conducive to interesting visualizations and analysis and you can also combine this data with other data that also has a date or location variable.
Data from the census, weather data, economic data, are all relatively easy to combine with other data with time/location components.


## What makes a good data set?

* Data you are interested in and care about.
* Data where there are a lot of potential questions that you can explore.
* A data set that isn't completely cleaned already.
* Multiple sources for data that you can combine.
* Some type of time and/or location component.


## Where to keep data?


Below 50mb: In `dataset` folder

Above 50mb: In `dataset_ignore` folder. This folder will be ignored by `git` so you'll have to manually sync these files across your team.

### Sharing your data


For small datasets (<50mb), you can use the `dataset` folder that is tracked by github. Add the files just like you would any other file.

If you create a folder named `data` this will cause problems.

For larger datasets, you'll need to create a new folder in the project root directory named `dataset-ignore`. This will be ignored by git (based off the `.gitignore` file in the project root directory) which will help you avoid issues with Github's size limits. Your team will have to manually make sure the data files in `dataset-ignore` are synced across team members.

Your [load_and_clean_data.R](/scripts/load_and_clean_data.R) file is how you will load and clean your data. Here is a an example of a very simple one.

```{r}
source(
  "scripts/load_and_clean_data.R",
  echo = TRUE # Use echo=FALSE or omit it to avoid code output  
)
```
You should never use absolute paths (eg. `/Users/danielsussman/path/to/project/` or `C:\MA415\\Final_Project\`).

You might consider using the `here` function from the [`here` package](https://here.r-lib.org/articles/here.html) to avoid path problems.

### Load and clean data script

The idea behind this file is that someone coming to your website could largely replicate your analyses after running this script on the original data sets to clean them.
This file might create a derivative data set that you then use for your subsequent analysis.
Note that you don't need to run this script from every post/page.
Instead, you can load in the results of this script, which could be plain text files or `.RData` files. In your data page you'll describe how these results were created. If you have a very large data set, you might save smaller data sets that you can use for exploration purposes.
To link to this file, you can use `[cleaning script](/scripts/load_and_clean_data.R)` wich appears as [cleaning script](/scripts/load_and_clean_data.R). 

----

## Rubric: On this page

You will

* Describe where/how to find data.
  * You must include a link to the original data source(s). Make sure to provide attribution to those who collected the data.
  * Why was the data collected/curated? Who put it together? (This is important, if you don't know why it was collected then that might not be a good dataset to look at.
* Describe the different data files used and what each variable means. 
  * If you have many variables then only describe the most relevant ones and summarize the rest.
* Describe any cleaning you had to do for your data.
  * You *must* include a link to your `load_and_clean_data.R` file.
  * Rrename variables and recode factors to make data more clear.
  * Also, describe any additional R packages you used outside of those covered in class.
  * Describe and show code for how you combined multiple data files and any cleaning that was necessary for that.
  * Some repetition of what you do in your `load_and_clean_data.R` file is fine and encouraged if it helps explain what you did.
* Organization, clarity, cleanliness of the page
  * Make sure to remove excessive warnings, use clean easy-to-read code (without side scrolling), organize with sections, use bullets and other organization tools, etc.
  * This page should be self-contained.