This repository contains a collection of Jupyter notebooks showcasing data cleaning and dataset creation projects built with Pandas and a range of Python web scraping libraries.
Data cleaning and analysis projects (in the `analyze_data/` directory):
- Project 1: Roller Coaster Data
- Project 2: FredAPI Unemployment Data (see the sketch after this list)
- Project 3: Social Security Registered Names Data
- Project 4: IMF Exchange Rate Data
- Project 5: Stanford Open Policing Project
- Project 6: TED Talks Dataset
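
The unemployment figures in Project 2 come from the FRED (Federal Reserve Economic Data) service. A minimal sketch of what that retrieval might look like with the `fredapi` package is shown below; the API key placeholder and the choice of the `UNRATE` series are illustrative assumptions, not details taken from the notebook.

```python
import pandas as pd
from fredapi import Fred

# Assumed setup: a personal FRED API key (placeholder below) and the UNRATE
# series, which is the monthly US civilian unemployment rate.
fred = Fred(api_key="YOUR_FRED_API_KEY")

# get_series returns a pandas Series indexed by observation date.
unemployment = fred.get_series("UNRATE")

# Convert to a DataFrame for the usual Pandas cleaning workflow.
df = unemployment.to_frame(name="unemployment_rate")
df.index.name = "date"
print(df.tail())
```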
Dataset creation projects (in the `generate_data/` directory):
- Project 1: 100 Meter Olympics Dataset using `read_html` (see the sketch after this list)
- Project 2: ChatGPT Twitter Dataset using snscrape
- Project 3: Social Security Registered Names using wget
- Project 4: IMF Exchange Rates Dataset using `read_html`
- Project 5: Airline Safety using CSV
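
Several of these projects scrape HTML tables with `pandas.read_html`, which parses every `<table>` element on a page into a list of DataFrames. A minimal sketch under assumed inputs is shown below; the Wikipedia URL and the table index are illustrative, not the exact values used in the notebooks.

```python
import pandas as pd

# Hypothetical source page; the notebooks may use a different URL.
URL = "https://en.wikipedia.org/wiki/100_metres"

# read_html returns a list of DataFrames, one per HTML table found on the page
# (it requires an HTML parser such as lxml or html5lib to be installed).
tables = pd.read_html(URL)
print(f"Found {len(tables)} tables")

# Picking the first table is an assumption; inspect the list to find the one you want.
df = tables[0]
df.to_csv("data/generated/raw/raw_data.csv", index=False)
```

The IMF exchange-rates project follows the same pattern; presumably only the source URL and the selected table change.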
The repository is organized as follows:

pandas_and_beyond/
├── analyze_data/
│   ├── __init__.py
│   ├── 00_project.ipynb
│   ├── 01_project.py
│   ├── ...
├── data/
│   ├── __init__.py
│   ├── external/
│   │   ├── external_data.csv
│   │   └── ...
│   └── generated/
│       ├── raw/
│       │   ├── raw_data.csv
│       │   └── ...
│       ├── cleaned/
│       │   ├── cleaned_data.csv
│       │   └── ...
│       └── ...
├── generate_data/
│   ├── __init__.py
│   ├── 00_create_dataset.ipynb
│   ├── 01_web_scrape.ipynb
│   ├── ...
│   └── using_csv/
│       ├── __init__.py
│       ├── 00_read_csv.ipynb
│       ├── ...
│       └── ...
├── helper/
│   ├── __init__.py
│   ├── helper_function.ipynb
│   ├── helper_module.py
│   └── ...
└── tests/
    ├── __init__.py
    ├── test_.py
    └── ...
To run the notebooks, you will need:
- Python 3.6 or higher
- Pandas 1.0 or higher
- Jupyter Notebook
To get started, clone or download this repository to your local machine. Then navigate to the `analyze_data` directory and open the corresponding Jupyter notebook.
This repository is part of my continuous learning journey, inspired by the valuable contributions of many members of the Kaggle community. If you have a data cleaning project implemented with Pandas that you would like to add, please create a new branch and submit a pull request. Your contributions are highly appreciated and will help other learners looking to enhance their data cleaning skills with Pandas.