U.S. Census Bureau Data: https://www.census.gov/
World Values Survey: https://www.worldvaluessurvey.org/
Pew Research Center Data: https://www.pewresearch.org/
Human Rights Data Analysis Group (HRDAG): https://hrdag.org/
Global Terrorism Database: https://www.start.umd.edu/gtd/
National Crime Victimization Survey: https://www.bjs.gov/ncvs/
Twitter API: https://developer.twitter.com/en/docs/twitter-api
https://www.openml.org/search?type=data
winequalityN comes from: https://www.kaggle.com/datasets/shelvigarg/wine-quality-dataset
data.csv comes from: https://www.kaggle.com/datasets/shree1992/housedata
cars.csv comes from: https://www.kaggle.com/datasets/abineshkumark/carsdata
Spotify Classifier: https://www.kaggle.com/datasets/geomack/spotifyclassification
https://sports-statistics.com/sports-data/sports-data-sets-for-data-modeling-visualization-predictions-machine-learning/
Michael Jordan and Shaquille O'Neal Career Stats: Classification (Win is the Target Variable)
NBA shot logs: Data on shots taken during the 2014-2015 season, which player took the shot, where on the floor was the shot taken from, who was the nearest defender, how far away was the nearest defender, time on the shot clock, and much more.
Diamonds : Multiclass Classification and/or Regression
This is the diamonds dataset. Its size is ideal for practice (50k+ samples), and it has multiple targets that you can predict as a regression or a multi-class classification task.
🎯 Targets: ‘carat’ or ‘price’
🔗 Link: Kaggle
📦Dimensions: (53940, 10)
⚙Missing values: No
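The "Dimensions" and "Missing values" lines given for each dataset in this list can be checked after download. Below is a minimal sketch using only the standard library. The helper name `summarize_csv`, the set of tokens treated as missing, and the three-row sample are assumptions made for illustration; the sample is not real data from the diamonds file.

```python
import csv
import io


def summarize_csv(text, missing_tokens=("", "NA", "NaN", "null")):
    """Return ((n_rows, n_cols), has_missing) for CSV text with a header row."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    n_rows = 0
    has_missing = False
    for row in reader:
        n_rows += 1
        # A short row or a blank/NA-like cell counts as a missing value.
        if len(row) < len(header) or any(
            cell.strip() in missing_tokens for cell in row
        ):
            has_missing = True
    return (n_rows, len(header)), has_missing


# Made-up sample, not real rows from the diamonds dataset.
sample = "carat,cut,price\n0.23,Ideal,326\n0.21,Premium,\n0.29,Good,334\n"
shape, has_missing = summarize_csv(sample)
print(shape, has_missing)
```

For a downloaded file, pass the file's contents (for example `open(path, newline="").read()`) instead of the sample string.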
Abalone Dataset: Classification / Regression (Male/Female or Age)
This is a unique dataset from the field of zoology. The task is to predict the age of Abalone shells (a type of mollusk) using several physical measurements. Traditionally, their age is found by cutting through their cone, staining them, and counting the number of rings inside the shell under a microscope.
🎯 Target: ‘Rings’
🔗 Link: Kaggle
📦Dimensions: (4177, 9)
⚙Missing values: No
King County Real Estate Dataset:
This is the dataset for those who are still interested in real estate and house price regression.
🎯 Target: ‘price’
🔗 Link: Kaggle
📦Dimensions: (21613, 17)
⚙Missing values: Yes
Cancer death rate Dataset
This dataset challenges you to find the cancer mortality rate per capita (100,000) using several demographic variables. These data were aggregated from a number of sources including the American Community Survey (census.gov), clinicaltrials.gov, and cancer.gov. Most of the data preparation process can be viewed here.
🎯 Target: ‘TARGET_deathRate’
🔗 Link: Data.world
📦Dimensions: (3047, 33)
⚙Missing values: Yes
Life Expectancy (WHO)
How long will a person live? This is one of the hardest unanswered questions in science. Several studies have been undertaken to understand human life and longevity, and this dataset provided by WHO (World Health Organization) is one of them.
🎯 Target: ‘Life expectancy.’
🔗 Link: Kaggle
📦Dimensions: (2938, 21)
⚙Missing values: Yes
Car prices
The title says it all: predict car prices using variables like mileage, fuel type, transmission, and several domain-specific features. This is also an excellent dataset for flexing your feature engineering muscles.
🎯 Target: ‘selling_price’
🔗 Link: Kaggle
📦Dimensions: (8128, 12)
⚙Missing values: Yes
NBA rookie stats
The first binary classification dataset in the list requires you to predict whether a rookie basketball player will last more than 5 years in the league:
🎯 Target: ‘TARGET_5Yrs’
🔗 Link: Data.world
📦Dimensions: (8128, 12)
⚙Missing values: Yes
Stroke prediction
Another medical dataset asks you to predict whether a patient will have a stroke or not based on their history with interesting features:
🎯 Target: ‘stroke’
🔗 Link: Kaggle
📦Dimensions: (5110, 11)
⚙Missing values: Yes
Water potability
Safe drinking water is the most basic human right and a major influencer on health. Using this dataset, you should classify water bodies into potable (drinkable) and not potable using several chemical properties:
🎯 Target: ‘Potability’
🔗 Link: Kaggle
📦Dimensions: (3276, 10)
⚙Missing values: Yes
Smart grid stability
This is an augmented version of the “Electrical Grid Stability Simulated Dataset” created by Vadim Arzamasov. It was donated to UCI and made available on Kaggle. You will be predicting the stability of 4-node smart grid systems (whatever they mean):
🎯 Target: ‘stabf’
🔗 Link: Kaggle
📦Dimensions: (60000, 13)
⚙Missing values: No
IBM HR analytics & employee attrition
This fictional dataset created by IBM tasks you with uncovering which factors lead to employee attrition (whether they will leave their role):
🎯 Target: ‘Attrition’
🔗 Link: Kaggle
📦Dimensions: (1470, 35)
⚙Missing values: No
Can I eat this mushroom?
Another one-of-a-kind dataset is classifying mushrooms into edible and poisonous. It also presents a unique challenge, since all features are categorical:
🎯 Target: ‘class’
🔗 Link: Kaggle
📦Dimensions: (8124, 23)
⚙Missing values: Yes
Banknote authentication
Even though this dataset has very few features, I wanted to include it because the task is really interesting: using physical attributes of banknotes, you should classify them into forged or original:
🎯 Target: ‘class’
🔗 Link: Kaggle
📦Dimensions: (1372, 5)
⚙Missing values: No
Adult income dataset
Predict whether a person will end up earning more than 50k using factors like age, education, background, gender, marital status, etc.:
🎯 Target: ‘income’
🔗 Link: Kaggle
📦Dimensions: (48842, 15)
⚙Missing values: Yes
Yeast classification
This dataset will give you a small taste from the world of microbiology. You are tasked to classify a fungus called yeast into species:
🎯 Target: ‘class_protein_localization’
🔗 Link: OpenML
📦Dimensions: (1484, 9)
⚙Missing values: No
- mlb_salaries_2014.csv Salaries of players in Major League Baseball at the start of the 2014 season, from the Lahman Baseball Database.
- disease_democ.csv Data illustrating a controversial theory suggesting that the emergence of democratic political systems has depended largely on nations having low rates of infectious disease, from the Global Infectious Diseases and Epidemiology Network and Democratization: A Comparative Analysis of 170 Countries.
- gdp_pc.csv World Bank data on 2014 Gross Domestic Product (GDP) per capita for the world’s nations, in current international dollars, corrected for purchasing power in different territories.
- nations.csv Data from the World Bank Indicators portal, which is an incredibly rich resource. Contains the following fields: iso2c, iso3c: two- and three-letter codes for each country, assigned by the International Organization for Standardization.
- oil_production.csv Data on oil production by world region from 2000 to 2014, in thousands of barrels per day, from the U.S. Energy Information Administration.
- ucb_stanford_2014.csv Data on federal government grants to UC Berkeley and Stanford University in 2014, downloaded from USASpending.gov.
- urls.xls A spreadsheet that we’ll use in webscraping.
Data used in reporting this story, which revealed that some of the doctors paid as “experts” by the drug company Pfizer had troubling disciplinary records:
- pfizer.csv Payments made by Pfizer to doctors across the United States in the second half of 2009.
- fda.csv Data on warning letters sent to doctors by the U.S. Food and Drug Administration, because of problems in the way in which they ran clinical trials testing experimental treatments. Contains the following variables:
- food_stamps.csv U.S. Department of Agriculture data on the number of participants, in millions, and costs, in $ billions, of the federal Supplemental Nutrition Assistance Program from 1969 to 2015.
- kindergarten.csv Data from the California Department of Public Health, documenting enrollment and the number of children with complete immunizations at entry into kindergartens in California from 2001 to 2015.
- gdp_pc.csv, gdp_pc.csvt CSV file with World Bank data on GDP per capita for the world’s nations in 2014, plus an ancillary file for QGIS to understand the data types for each field.
- warming.csv NASA data on the annual average global temperature, from 1880 to 2015, compared to the average from 1951-1980.
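For the .csvt file mentioned in the list above: it is a one-line sidecar that sits next to the CSV, has the same base name, and lists one quoted type name per column (for example "String", "Integer", "Real") so that QGIS does not treat every field as text. The sketch below guesses those types from the data. The column names and rows are hypothetical, not taken from gdp_pc.csv, and the type-guessing rule is a simplification.

```python
import csv
import io


def infer_csvt_type(values):
    """Pick a CSVT type name ("Integer", "Real" or "String") for one column."""

    def all_parse(cast):
        try:
            for v in values:
                cast(v)
            return True
        except ValueError:
            return False

    if values and all_parse(int):
        return "Integer"
    if values and all_parse(float):
        return "Real"
    return "String"


def csvt_line(csv_text):
    """Build the single line of a .csvt file for CSV text with a header row."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    columns = [[row[i] for row in data] for i in range(len(header))]
    return ",".join(f'"{infer_csvt_type(col)}"' for col in columns)


# Hypothetical columns and rows, for illustration only.
sample = "iso3c,country,gdp_percap\nABC,Exampleland,12345.6\nXYZ,Samplestan,789\n"
print(csvt_line(sample))
```

Writing the returned line to a file named after the CSV with a .csvt extension is what pairs the two files.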
Global Terrorism Database
Maintained by the National Consortium for the Study of Terrorism and Responses to Terrorism (START) at the University of Maryland in College Park, the Global Terrorism Database contains information on more than 150,000 terrorist attacks from 1970 to 2015. It is a rich source of information on terrorist groups across the globe, and the attacks they are responsible for.
You can download the data from here: https://gtd.terrorismdata.com/, selecting the Download full GTD dataset option. An extensive codebook details all of the fields in the data.
The data is provided as a series of spreadsheets in .xlsx format. I suggest that you import this data into OpenRefine before processing any further, and create a new field giving the date of each event in standard YYYY-MM-DD format. The date can be derived from the eventid field.
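If you prefer to build the date field in code rather than in OpenRefine, here is a minimal Python sketch. It assumes the eventid is a 12-digit number whose first eight digits are the date as yyyymmdd and whose last four digits are a per-day sequence number; check this against the codebook before relying on it. The function name and the example ids are illustrative, not real GTD records.

```python
from datetime import date
from typing import Optional


def eventid_to_date(eventid) -> Optional[str]:
    """Derive a YYYY-MM-DD date string from a 12-digit eventid.

    Assumes digits 1-8 are yyyymmdd and digits 9-12 a sequence number.
    Returns None if the id is malformed or the month/day is not a valid
    calendar value (for example 00 for an unknown day).
    """
    # Spreadsheet readers often hand back large ids as floats.
    if isinstance(eventid, float) and eventid.is_integer():
        eventid = int(eventid)
    digits = str(eventid).strip()
    if len(digits) != 12 or not digits.isdigit():
        return None
    year, month, day = int(digits[:4]), int(digits[4:6]), int(digits[6:8])
    try:
        return date(year, month, day).isoformat()
    except ValueError:
        return None


# Illustrative ids only.
print(eventid_to_date(197001010002))
print(eventid_to_date("197000000001"))
```

Returning None for incomplete dates keeps them visible as gaps, so you can decide separately how to handle events whose exact day is unknown.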
Do take care to read the Terms of Use and instructions for citing the source of the GTD data.