- Andrea Dacy
- Laura Lohr
- Stephanie Perillo
- Amy Tisland
Plastic pollution threatens food safety and quality, human health, and coastal tourism, and it contributes to climate change. Plastic pollution in the ocean has a devastating impact on marine life and ecosystems.
The purpose of this project is to analyze data on mismanaged plastic in oceans.
We hope to answer the following questions:
- What are the most common types of plastic found in the ocean?
- Which countries produce the most plastic pollution?
- Is there a correlation between a country's GDP (Gross Domestic Product) and ocean plastic pollution?
We chose to use PostgreSQL and various machine learning models. We then created a dashboard in Tableau.
- https://www.kaggle.com/code/mihailpavlyuk/world-map-plasticwaste
- https://wesr.unep.org/downloader (Plastic on beach tonnes)
- https://www.kaggle.com/datasets/maartenvandevelde/marine-litter-watch-19502021
- https://ourworldindata.org/grapher/per-capita-plastic-waste-vs-gdp-per-capita
- Dropping columns/excluding data
- Eliminating null values
- Renaming columns
- Assigning new values to Country codes and plastic pollution
- Created a diagram to combine tables for PostgreSQL
- The image below shows the common connection between our datasets, which is country:
The Data table actually contains 164 rows, each corresponding to a different type of waste. This ERD shows only a sample of this data.
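The cleaning steps listed above can be sketched in pandas. This is a minimal illustration, not the project's actual notebook code: the column names, the sample rows, and the code remapping (`DLT` to `DEL`) are all hypothetical.

```python
import pandas as pd

# Hypothetical raw rows; column names and values are made up for illustration.
raw = pd.DataFrame(
    {
        "Entity": ["Aland", "Borduria", "Cascadia", "Deltora"],
        "Code": ["ALD", "BRD", None, "DLT"],
        "Mismanaged waste (tonnes)": [1200.0, None, 310.0, 45.5],
        "Notes": ["a", "b", "c", "d"],
    }
)


def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the cleaning steps from the list above to one table."""
    return (
        df.drop(columns=["Notes"])  # drop columns that are not needed
        .dropna()  # eliminate rows that contain null values
        .rename(
            columns={
                "Entity": "country",
                "Code": "country_code",
                "Mismanaged waste (tonnes)": "mismanaged_tonnes",
            }
        )
        # Assign new values to country codes (hypothetical remapping).
        .assign(country_code=lambda d: d["country_code"].replace({"DLT": "DEL"}))
        .reset_index(drop=True)
    )


cleaned = clean(raw)
print(cleaned)
```

Keeping the steps in one function makes it easy to apply the same treatment to each CSV file before loading it into the database.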
- Read in four CSV files from S3 buckets
- Connected to the AWS RDS instance and wrote each DataFrame to its own table (four tables in total)
- A PostgreSQL database, "plasticpollutiondb", was created along with ten tables
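The load-and-join pattern described above can be sketched as follows. To keep the example runnable anywhere, an in-memory SQLite database stands in for the PostgreSQL instance on AWS RDS, and two small hypothetical DataFrames stand in for the CSV files read from S3. The table and column names are assumptions. Against a real PostgreSQL database, `to_sql` would typically be given a SQLAlchemy engine instead of a `sqlite3` connection.

```python
import sqlite3

import pandas as pd

# Hypothetical stand-ins for two of the cleaned CSV files.
gdp = pd.DataFrame(
    {"country": ["Aland", "Borduria"], "gdp_per_capita": [5200.0, 41000.0]}
)
waste = pd.DataFrame(
    {"country": ["Aland", "Borduria"], "mismanaged_tonnes": [900.0, 150.0]}
)

# In-memory SQLite stands in for the PostgreSQL database on AWS RDS.
conn = sqlite3.connect(":memory:")
gdp.to_sql("gdp", conn, index=False)
waste.to_sql("waste", conn, index=False)

# Join the tables on the shared country column.
joined = pd.read_sql_query(
    """
    SELECT g.country, g.gdp_per_capita, w.mismanaged_tonnes
    FROM gdp AS g
    JOIN waste AS w ON w.country = g.country
    ORDER BY g.country
    """,
    conn,
)
conn.close()
print(joined)
```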
What machine learning models did we use?
Primarily supervised learning models. We used K-means clustering (unsupervised learning) along with linear regression and logistic regression (both supervised learning). We also used the Balanced Random Forest Classifier and Easy Ensemble Classifier, as well as oversampling, undersampling, SMOTE oversampling, and SMOTEENN.
Why did we choose the models we did?
We used linear regression because it is one of the simplest and most popular models for examining relationships between variables.
We used logistic regression to test whether a country's GDP or population could predict how much plastic waste it had. For this classification task, we also used the Balanced Random Forest Classifier, Easy Ensemble Classifier, oversampling, undersampling, SMOTE oversampling, and SMOTEENN because we wanted to compare a variety of methods.
We used various models because we wanted to see which model would give us the best results. We had previously used several supervised and unsupervised models in our class modules. We wanted to find the one that would have the best performance with our particular dataset and questions.
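Of the resampling techniques mentioned above, plain random oversampling is the simplest to illustrate. The sketch below is hand-rolled in pandas and is not the project's code; the data, the column names, and the `random_oversample` helper are hypothetical. It duplicates randomly chosen rows from the smaller class until every class is as large as the largest one.

```python
import pandas as pd

# Hypothetical, deliberately imbalanced training data.
train = pd.DataFrame(
    {
        "gdp_per_capita": [900, 1500, 2100, 3200, 4800, 7500, 30000, 52000],
        "waste_level": ["High"] * 6 + ["Low"] * 2,
    }
)


def random_oversample(df: pd.DataFrame, label_col: str, seed: int = 0) -> pd.DataFrame:
    """Duplicate random minority rows until every class matches the largest class."""
    target = df[label_col].value_counts().max()
    parts = []
    for _, group in df.groupby(label_col):
        parts.append(group)  # keep every original row
        shortfall = target - len(group)
        if shortfall > 0:
            # Sample with replacement so small classes can be grown.
            parts.append(group.sample(n=shortfall, replace=True, random_state=seed))
    return pd.concat(parts, ignore_index=True)


balanced = random_oversample(train, "waste_level")
print(balanced["waste_level"].value_counts())
```

Resampling of this kind is normally applied to the training split only, so that duplicated rows do not leak into the test set.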
What was our process? How did we do it? What data did we use?
We merged two data sets: one containing population and GDP, and the other containing plastic waste in metric tons. For the logistic regression, we binned each country's total metric tonnage into four classes: Low, Medium, High, and Extreme. Using these categories, we ran the data through the models to try to determine whether there was any correlation.
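The merge and binning steps can be sketched with `pandas.merge` and `pandas.cut`. The country names, the values, and especially the bin edges below are hypothetical; the thresholds used in the project are not stated here.

```python
import pandas as pd

# Hypothetical stand-ins for the two data sets.
economy = pd.DataFrame(
    {
        "country": ["Aland", "Borduria", "Cascadia", "Deltora", "Elbonia"],
        "population": [2_000_000, 45_000_000, 9_000_000, 130_000_000, 600_000],
        "gdp": [1.1e10, 8.0e11, 3.5e11, 2.2e11, 4.0e9],
    }
)
waste = pd.DataFrame(
    {
        "country": ["Aland", "Borduria", "Cascadia", "Deltora"],
        "metric_tons": [500.0, 5_000.0, 50_000.0, 2_000_000.0],
    }
)

# Inner join keeps only countries present in both data sets.
merged = economy.merge(waste, on="country", how="inner")

# Hypothetical bin edges, in metric tons.
edges = [0, 1_000, 10_000, 100_000, float("inf")]
labels = ["Low", "Medium", "High", "Extreme"]
merged["waste_level"] = pd.cut(
    merged["metric_tons"], bins=edges, labels=labels, include_lowest=True
)
print(merged[["country", "metric_tons", "waste_level"]])
```

The resulting `waste_level` column is the categorical target that a classifier such as logistic regression would then be trained to predict.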
What did our models find?
Our models were not conclusive.
Although the results were not conclusive, our models did seem to indicate that countries with lower GDP had as much mismanaged waste as higher-GDP countries, or more.
What we found was not what we expected. We expected that the higher the GDP, and therefore the higher the consumerism, the higher the plastic waste.
If we had more time, what would we explore next?
If we had more time, we would explore the reasons why we did not find what we thought we would and look into other dynamics that our data did not illuminate for us. Where is this plastic waste coming from? Is it landfills? Sewers? Which industries produce the most plastic waste? Are countries exporting waste to other countries?
What were the limitations of our data/machine learning models? What challenges did we have with creating/applying machine learning?
Our data had only 492 rows. If we were to dig into this topic more robustly, we would likely want to look at larger data sets. Another issue was that our data sets did not line up. For example, some variables were available only for certain years, while other variables we wanted to explore covered different years. This complicated our process. We had already cleaned our data and prepared it for analysis before we realized that our data set was not as complete as we would have liked.
China, India, Brazil, Indonesia, Nigeria, Pakistan, Bangladesh, and Egypt were among the highest contributors of mismanaged waste. We were able to see this through our clustering.
Mismanaged waste does not increase proportionally with GDP. There are outliers, but our data did not support a direct correlation.
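One way to check a claim like this is to compute the Pearson correlation coefficient and fit a least-squares line. The sketch below uses NumPy and made-up numbers; it is not the project's analysis, and the values are not taken from the project data.

```python
import numpy as np

# Hypothetical values, one entry per country; not taken from the project data.
gdp_per_capita = np.array([1200.0, 3500.0, 8000.0, 15000.0, 42000.0, 60000.0])
mismanaged_tonnes = np.array([900.0, 2400.0, 700.0, 1800.0, 300.0, 250.0])

# Pearson correlation coefficient between the two variables.
r = np.corrcoef(gdp_per_capita, mismanaged_tonnes)[0, 1]

# Ordinary least-squares line: mismanaged_tonnes ~ slope * gdp_per_capita + intercept.
slope, intercept = np.polyfit(gdp_per_capita, mismanaged_tonnes, deg=1)

# R^2 of the fitted line; for a one-variable fit it equals r squared.
predicted = slope * gdp_per_capita + intercept
ss_res = np.sum((mismanaged_tonnes - predicted) ** 2)
ss_tot = np.sum((mismanaged_tonnes - mismanaged_tonnes.mean()) ** 2)
r_squared = 1.0 - ss_res / ss_tot

print(f"r = {r:.3f}, slope = {slope:.5f}, R^2 = {r_squared:.3f}")
```

A value of `r` near zero, or a low `R^2`, would be consistent with the observation that waste does not scale directly with GDP.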
- An interactive dashboard was created in Tableau
An example of some of the features of our dashboard:
Our analysis revealed many interesting findings. In Europe, cigarette butts and filters were by far the most common type of plastic waste collected on beaches. Plastic and polystyrene pieces of various sizes were also among the most common types, followed by plastic caps/lids, shopping bags, and food packaging. Spain and Romania contributed the most to the number of cigarette butts and filters found on European beaches. We also found that the countries with the highest amounts of mismanaged plastic waste do not necessarily have the highest GDP. China, which has the highest population and a relatively low GDP per capita, produces the most mismanaged plastic waste. On the other hand, the United States has a high population and GDP but a disproportionately low amount of mismanaged plastic waste. One likely explanation is that the US and other countries ship their waste to other countries to be processed.
- Spend more time discovering data sets
- Choose more robust data sets so that machine learning models are more effective
- Examine how much waste countries export to other countries
- Find data on the types of plastic pollution found in areas outside of Europe
- Make additional predictions that account for The Ocean Cleanup's efforts to remove ocean garbage and intercept river waste before it enters the oceans