datasets

U.S. Census Bureau Data: https://www.census.gov/

World Values Survey: https://www.worldvaluessurvey.org/

Pew Research Center Data: https://www.pewresearch.org/

Human Rights Data Analysis Group (HRDAG): https://hrdag.org/

Global Terrorism Database: https://www.start.umd.edu/gtd/

National Crime Victimization Survey: https://www.bjs.gov/ncvs/

Twitter API: https://developer.twitter.com/en/docs/twitter-api

OpenML datasets: https://www.openml.org/search?type=data

winequalityN comes from: https://www.kaggle.com/datasets/shelvigarg/wine-quality-dataset

data.csv comes from: https://www.kaggle.com/datasets/shree1992/housedata

cars.csv comes from: https://www.kaggle.com/datasets/abineshkumark/carsdata

Spotify Classifier: https://www.kaggle.com/datasets/geomack/spotifyclassification

https://sports-statistics.com/sports-data/sports-data-sets-for-data-modeling-visualization-predictions-machine-learning/
Michael Jordan and Shaquille O'Neal Career Stats: Classification (Win is the target variable)
NBA shot logs: Data on shots taken during the 2014-2015 season, including which player took the shot, where on the floor it was taken from, who the nearest defender was, how far away that defender was, the time on the shot clock, and much more.

Diamonds: Multi-class Classification and/or Regression
This is the diamonds dataset. With more than 50k samples it is a good size for practice, and it has multiple targets you can predict as a regression or a multi-class classification task.
🎯 Targets: ‘carat’ or ‘price’
🔗 Link: Kaggle
📦Dimensions: (53940, 10)
⚙Missing values: No
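
The Dimensions and Missing values fields used throughout this list can be re-checked after downloading a file. The following is a minimal sketch using only the Python standard library. The `summarize_csv` helper, the set of tokens treated as missing, and the three-row sample are assumptions made for illustration; none of them come from the datasets themselves.

```python
import csv
import io


def summarize_csv(text):
    """Return ((n_rows, n_columns), has_missing) for CSV text with a header row."""
    rows = list(csv.reader(io.StringIO(text)))
    header, body = rows[0], rows[1:]
    # Assumption: empty cells and these placeholders count as missing values.
    missing_tokens = {"", "NA", "NaN", "?"}
    has_missing = any(
        cell.strip() in missing_tokens for row in body for cell in row
    )
    return (len(body), len(header)), has_missing


# Hypothetical sample for illustration only, not real diamonds data.
sample = "carat,cut,price\n0.23,Ideal,326\n0.21,Premium,\n0.29,Good,334\n"
shape, has_missing = summarize_csv(sample)
print(shape, has_missing)
```

For a downloaded file, the same helper could be called on the text returned by `open(path, newline="").read()`, where `path` is wherever the CSV was saved.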

Abalone Dataset: Classification / Regression (Male/Female or Age)
This is a unique dataset from the field of zoology. The task is to predict the age of abalone (a type of mollusk) from several physical measurements. Traditionally, their age is found by cutting through the cone of the shell, staining it, and counting the number of rings under a microscope.
🎯 Target: ‘Rings’
🔗 Link: Kaggle
📦Dimensions: (4177, 9)
⚙Missing values: No

King County Real Estate Dataset:
This is the dataset for those who are still interested in real estate and house price regression.
🎯 Target: ‘price’
🔗 Link: Kaggle
📦Dimensions: (21613, 17)
⚙Missing values: Yes

Cancer death rate Dataset
This dataset challenges you to predict the cancer mortality rate per capita (per 100,000 people) using several demographic variables. These data were aggregated from a number of sources including the American Community Survey (census.gov), clinicaltrials.gov, and cancer.gov. Most of the data preparation process can be viewed here.
🎯 Target: ‘TARGET_deathRate’
🔗 Link: Data.world
📦Dimensions: (3047, 33)
⚙Missing values: Yes

Life Expectancy (WHO)
How long will a person live? This is one of the hardest unanswered questions in science. Several studies have been undertaken to understand human life and longevity, and this dataset provided by WHO (World Health Organization) is one of them.
🎯 Target: ‘Life expectancy.’
🔗 Link: Kaggle
📦Dimensions: (2938, 21)
⚙Missing values: Yes

Car prices
The title says it all: predict car prices using variables like mileage, fuel type, transmission, and several domain-specific features. This is also an excellent dataset for flexing your feature engineering muscles.
🎯 Target: ‘selling_price’
🔗 Link: Kaggle
📦Dimensions: (8128, 12)
⚙Missing values: Yes

Binary classification

NBA rookie stats
The first binary classification dataset in the list requires you to predict whether a rookie basketball player will last more than 5 years in the league.
🎯 Target: ‘TARGET_5Yrs’
🔗 Link: Data.world
📦Dimensions: (8128, 12)
⚙Missing values: Yes

Stroke prediction
Another medical dataset asks you to predict whether a patient will have a stroke, based on their medical history and other interesting features.
🎯 Target: ‘stroke’
🔗 Link: Kaggle
📦Dimensions: (5110, 11)
⚙Missing values: Yes

Water potability
Safe drinking water is the most basic human right and a major influence on health. Using this dataset, you should classify water bodies as potable (drinkable) or not potable using several chemical properties.
🎯 Target: ‘Potability’
🔗 Link: Kaggle
📦Dimensions: (3276, 10)
⚙Missing values: Yes

Smart grid stability
This is an augmented version of the “Electrical Grid Stability Simulated Dataset” created by Vadim Arzamasov. It was donated to UCI and made available on Kaggle. You will be predicting the stability of 4-node smart grid systems (whatever they mean).
🎯 Target: ‘stabf’
🔗 Link: Kaggle
📦Dimensions: (60000, 13)
⚙Missing values: No

IBM HR analytics & employee attrition
This fictional dataset, created by IBM data scientists, tasks you with uncovering which factors lead to employee attrition (whether employees will leave their role).
🎯 Target: ‘Attrition’
🔗 Link: Kaggle
📦Dimensions: (1470, 35)
⚙Missing values: No

Can I eat this mushroom?
Another one-of-a-kind dataset asks you to classify mushrooms as edible or poisonous. It also presents a unique challenge: all features are categorical.
🎯 Target: ‘class’
🔗 Link: Kaggle
📦Dimensions: (8124, 23)
⚙Missing values: Yes
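
Because every mushroom feature is categorical, most models need the values converted to numbers first. One common option is one-hot encoding, sketched below in plain Python. The `one_hot_encode` helper and the tiny sample rows are hypothetical and only illustrate the idea; the column names and single-letter codes are stand-ins, not a claim about the exact contents of the Kaggle file.

```python
def one_hot_encode(rows, columns):
    """One-hot encode the given categorical columns of a list of dict rows.

    Returns (feature_names, encoded_rows), where each encoded row is a list
    of 0/1 integers aligned with feature_names.
    """
    # Collect the distinct values of each column in a stable (sorted) order.
    categories = {c: sorted({row[c] for row in rows}) for c in columns}
    feature_names = [f"{c}={v}" for c in columns for v in categories[c]]
    encoded = [
        [1 if row[c] == v else 0 for c in columns for v in categories[c]]
        for row in rows
    ]
    return feature_names, encoded


# Hypothetical rows for illustration only.
rows = [
    {"cap-shape": "x", "odor": "p"},
    {"cap-shape": "b", "odor": "a"},
    {"cap-shape": "x", "odor": "a"},
]
names, encoded = one_hot_encode(rows, ["cap-shape", "odor"])
print(names)
print(encoded)
```

Each original column becomes one 0/1 column per distinct value, so a dataset with many categories per feature grows considerably wider after encoding.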

Banknote authentication
Even though this dataset has very few features, I wanted to include it because the task is really interesting: using physical attributes of banknotes, you should classify them as forged or original.
🎯 Target: ‘class’
🔗 Link: Kaggle
📦Dimensions: (1372, 5)
⚙Missing values: No

Adult income dataset
Predict whether a person earns more than $50K a year using factors like age, education, background, gender, marital status, etc.
🎯 Target: ‘income’
🔗 Link: Kaggle
📦Dimensions: (48842, 15)
⚙Missing values: Yes

Multi-class classification datasets

Yeast classification
This dataset will give you a small taste of the world of microbiology. You are tasked with classifying proteins of a fungus called yeast by their localization site in the cell.
🎯 Target: ‘class_protein_localization’
🔗 Link: OpenML
📦Dimensions: (1484, 9)
⚙Missing values: No

mlb_salaries_2014.csv Salaries of players in Major League Baseball at the start of the 2014 season, from the Lahman Baseball Database.
disease_democ.csv Data illustrating a controversial theory suggesting that the emergence of democratic political systems has depended largely on nations having low rates of infectious disease, from the Global Infectious Diseases and Epidemiology Network and Democratization: A Comparative Analysis of 170 Countries.
gdp_pc.csv World Bank data on 2014 Gross Domestic Product (GDP) per capita for the world’s nations, in current international dollars, corrected for purchasing power in different territories.
nations.csv Data from the World Bank Indicators portal, which is an incredibly rich resource. Fields include iso2c and iso3c, the two- and three-letter codes for each country, assigned by the International Organization for Standardization.
oil_production.csv Data on oil production by world region from 2000 to 2014, in thousands of barrels per day, from the U.S. Energy Information Administration.
ucb_stanford_2014.csv Data on federal government grants to UC Berkeley and Stanford University in 2014, downloaded from USASpending.gov.
urls.xls A spreadsheet that we’ll use in web scraping.

Data used in reporting this story, which revealed that some of the doctors paid as “experts” by the drug company Pfizer had troubling disciplinary records:

pfizer.csv Payments made by Pfizer to doctors across the United States in the second half of 2009.
fda.csv Data on warning letters sent to doctors by the U.S. Food and Drug Administration, because of problems in the way in which they ran clinical trials testing experimental treatments. Contains the following variables:
food_stamps.csv U.S. Department of Agriculture data on the number of participants, in millions, and costs, in $ billions, of the federal Supplemental Nutrition Assistance Program from 1969 to 2015.
kindergarten.csv Data from the California Department of Public Health, documenting enrollment and the number of children with complete immunizations at entry into kindergartens in California from 2001 to 2015.
gdp_pc.csv gdp_pc.csvt CSV file with World Bank data on GDP per capita for the world’s nations in 2014, plus an ancillary file that lets QGIS understand the data types for each field.
warming.csv NASA data on the annual average global temperature, from 1880 to 2015, compared with the average from 1951-1980.
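
The warming.csv description refers to temperatures expressed relative to a 1951-1980 average. As a rough sketch of that arithmetic, an anomaly for a given year is the year's value minus the mean over the baseline period. The `anomalies` function and the temperature values below are hypothetical and chosen only to make the calculation easy to follow; they are not NASA figures.

```python
def anomalies(temps_by_year, base_start=1951, base_end=1980):
    """Return each year's temperature minus the mean over the baseline years."""
    base = [t for y, t in temps_by_year.items() if base_start <= y <= base_end]
    if not base:
        raise ValueError("no data points fall inside the baseline period")
    baseline = sum(base) / len(base)
    # Round to two decimals to hide floating-point noise in the display.
    return {y: round(t - baseline, 2) for y, t in temps_by_year.items()}


# Hypothetical annual averages in degrees Celsius, for illustration only.
temps = {1950: 13.8, 1951: 14.0, 1980: 14.2, 2015: 14.9}
result = anomalies(temps)
print(result)
```

With this sample, only 1951 and 1980 fall inside the baseline period, so the baseline is their mean and every other year is reported as a difference from it.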

Global Terrorism Database
Maintained by the National Consortium for the Study of Terrorism and Responses to Terrorism (START) at the University of Maryland in College Park, the Global Terrorism Database contains information on more than 150,000 terrorist attacks from 1970 to 2015. It is a rich source of information on terrorist groups across the globe, and the attacks they are responsible for.

You can download the data from here: https://gtd.terrorismdata.com/, selecting the Download full GTD dataset option. An extensive codebook details all of the fields in the data.

The data is provided as a series of spreadsheets in .xlsx format. I suggest that you import this data into OpenRefine before processing it any further, and create a new field giving the date of each event in standard YYYY-MM-DD format. The date can be derived from the eventid field.
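
For readers who prefer a script to OpenRefine, the date derivation might look like the sketch below. It assumes, based on my reading of the GTD codebook, that eventid is a 12-digit number whose first eight digits are the date as yyyymmdd, and that an unknown month or day is stored as 00. Please confirm both points against the codebook before relying on this. The `eventid_to_date` name and the sample IDs are made up for illustration.

```python
from datetime import date


def eventid_to_date(eventid):
    """Convert a GTD-style eventid to 'YYYY-MM-DD', or None if the date is partial.

    Assumption: the first eight digits of the 12-digit eventid encode yyyymmdd.
    """
    s = str(eventid)
    if len(s) != 12 or not s.isdigit():
        raise ValueError(f"expected a 12-digit eventid, got {eventid!r}")
    year, month, day = int(s[:4]), int(s[4:6]), int(s[6:8])
    # Assumption: 00 marks an unknown month or day, so no full date exists.
    if month == 0 or day == 0:
        return None
    # date() raises ValueError for impossible dates such as month 13.
    return date(year, month, day).isoformat()


# Hypothetical IDs for illustration only.
print(eventid_to_date(201511130008))
print(eventid_to_date(197000000001))
```

Returning `None` for partial dates keeps the new field strictly in YYYY-MM-DD format; another reasonable choice would be to keep the year alone in a separate column.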

Do take care to read the Terms of Use and instructions for citing the source of the GTD data.

