Ocean Plastic Pollution

Collaborators:

Andrea Dacy
Laura Lohr
Stephanie Perillo
Amy Tisland

Project Overview

Plastic pollution threatens food safety and quality, human health, and coastal tourism, and it contributes to climate change. Plastic pollution in the ocean has a devastating impact on marine life and ecosystems.

The purpose of this project is to analyze data on mismanaged plastic in oceans.

We hope to answer the following questions:

What are the most common types of plastic found in the ocean?
Which countries contribute the most plastic pollution?
Is there a correlation between a country's GDP (Gross Domestic Product) and ocean plastic pollution?

We chose to use PostgreSQL and various machine learning models. We then created a dashboard in Tableau.

Links to Tableau & Google Slides presentation:

Click here for Dashboard

Click here for Presentation

Datasets:

Analysis & Results

Initial Data exploration phase

Dropping columns/excluding data
Eliminating null values
Renaming columns
Assigning new values to country codes and plastic pollution
Creating an entity relationship diagram (ERD) to combine tables for PostgreSQL
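
The cleaning steps listed above can be sketched with pandas roughly as follows. This is a minimal illustration, not the project's actual notebook: the column names, country names, and values are all hypothetical placeholders.

```python
import pandas as pd

# Hypothetical raw tables; every name and number here is made up.
gdp_raw = pd.DataFrame({
    "Entity": ["Country A", "Country B", "Country C"],
    "Code": ["AAA", "BBB", None],
    "Year": [2010, 2010, 2010],
    "GDP per capita": [1200.0, 35000.0, 8000.0],
    "Notes": ["", "", ""],
})
waste_raw = pd.DataFrame({
    "Entity": ["Country A", "Country B", "Country C"],
    "Code": ["AAA", "BBB", "CCC"],
    "Mismanaged waste (tonnes)": [50000.0, None, 7000.0],
})


def clean(df, drop_cols, rename_map):
    """Drop unused columns, remove rows containing nulls, and rename columns."""
    out = df.drop(columns=drop_cols)
    out = out.dropna()
    return out.rename(columns=rename_map)


gdp = clean(
    gdp_raw,
    ["Notes"],
    {"Entity": "country", "Code": "country_code",
     "Year": "year", "GDP per capita": "gdp_per_capita"},
)
waste = clean(
    waste_raw,
    [],
    {"Entity": "country", "Code": "country_code",
     "Mismanaged waste (tonnes)": "mismanaged_tonnes"},
)

# An inner join on the shared country key keeps only countries
# that survive cleaning in both tables.
merged = gdp.merge(waste, on=["country", "country_code"], how="inner")
print(merged)
```

Dropping nulls before merging means a country missing from either table disappears from the result, so it is worth checking the row count after each step.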

The image below shows the common connection between our datasets: country.

The Data table actually contains 164 rows, each corresponding to a different type of waste. This ERD shows only a sample of that data.

Amazon Web Services (AWS) RDS instance & Database

Read in data from four CSV files stored in S3 buckets
Connected to the AWS RDS instance and wrote the four dataframes into four tables

Click here for File

A PostgreSQL database, "plasticpollutiondb", was created along with ten tables.
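
Writing dataframes into database tables can be sketched as below. To keep the example runnable without AWS credentials, it uses an in-memory SQLite connection as a stand-in; against the RDS instance, a SQLAlchemy engine pointing at the PostgreSQL database would be passed to `to_sql` instead. The table names, column names, and values are hypothetical.

```python
import sqlite3

import pandas as pd


def write_tables(frames, conn):
    """Write each DataFrame in `frames` to a table named after its key."""
    for table_name, df in frames.items():
        df.to_sql(table_name, conn, if_exists="replace", index=False)


# Made-up frames standing in for the CSV files loaded from S3.
frames = {
    "country": pd.DataFrame({
        "country_code": ["AAA", "BBB"],
        "country": ["Country A", "Country B"],
    }),
    "waste": pd.DataFrame({
        "country_code": ["AAA", "BBB"],
        "mismanaged_tonnes": [50000.0, 7000.0],
    }),
}

conn = sqlite3.connect(":memory:")  # stand-in for the RDS connection
write_tables(frames, conn)

# Join the tables on the shared country key to confirm the load worked.
joined = pd.read_sql(
    "SELECT c.country, w.mismanaged_tonnes "
    "FROM country AS c JOIN waste AS w ON c.country_code = w.country_code "
    "ORDER BY c.country",
    conn,
)
print(joined)
conn.close()
```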

Machine Learning Model

What machine learning models did we use?

We used primarily supervised learning models: linear regression and logistic regression. We also used K-means clustering, an unsupervised model. In addition, we used the Balanced Random Forest Classifier and the Easy Ensemble Classifier, along with several resampling techniques: oversampling, undersampling, SMOTE oversampling, and SMOTEENN.

Why did we choose the models we did?

We used linear regression because it is one of the simplest and most widely used models for examining relationships between variables.

We used logistic regression to try to predict whether a country’s GDP or population would determine how much plastic waste it had. Alongside logistic regression, we used the Balanced Random Forest Classifier, Easy Ensemble Classifier, oversampling, undersampling, SMOTE oversampling, and SMOTEENN because we wanted to compare a variety of methods.
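
Resampling techniques such as oversampling are generally used when some classes have far fewer rows than others. The sketch below shows the idea behind random oversampling using plain pandas; it is not the imbalanced-learn implementation and not the project's code, and the column names and values are made up.

```python
import pandas as pd

# Made-up labelled data in which "Low" is common and the other classes are rare.
df = pd.DataFrame({
    "gdp_per_capita": [900, 1500, 2100, 3000, 4200, 800, 52000, 61000],
    "waste_class": ["Low", "Low", "Low", "Low", "Low",
                    "Extreme", "Extreme", "High"],
})


def random_oversample(data, label_col, seed=0):
    """Resample every class with replacement up to the size of the largest class."""
    target = data[label_col].value_counts().max()
    return (
        data.groupby(label_col, group_keys=False)
        .sample(n=target, replace=True, random_state=seed)
        .reset_index(drop=True)
    )


balanced = random_oversample(df, "waste_class")
print(balanced["waste_class"].value_counts())
```

In practice, resampling is normally applied only to the training split, so that duplicated rows do not leak into the test set.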

We used various models because we wanted to see which model would give us the best results. We had previously used several supervised and unsupervised models in our class modules. We wanted to find the one that would have the best performance with our particular dataset and questions.

What was our process? How did we do it? What data did we use?

We merged two data sets: one containing population and GDP, and the other containing plastic waste in metric tons. For the logistic regression, we created bins to classify each country's total metric tonnage as Low, Medium, High, or Extreme. Using these categories, we were able to run the data through the models and try to determine whether there was any correlation.
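
The binning step can be illustrated with `pandas.cut`. The four labels come from the description above, but the bin edges, country names, and tonnage values below are hypothetical; the thresholds the project actually used are not stated here.

```python
import pandas as pd

# Made-up totals in metric tons.
waste = pd.DataFrame({
    "country": ["Country A", "Country B", "Country C", "Country D"],
    "mismanaged_tonnes": [5_000, 80_000, 900_000, 6_000_000],
})

# Hypothetical thresholds; the last bin is open-ended.
bin_edges = [0, 10_000, 100_000, 1_000_000, float("inf")]
labels = ["Low", "Medium", "High", "Extreme"]

# Each bin includes its upper edge; include_lowest=True also keeps a total of 0.
waste["waste_class"] = pd.cut(
    waste["mismanaged_tonnes"],
    bins=bin_edges,
    labels=labels,
    include_lowest=True,
)
print(waste)
```

The choice of edges determines how balanced the classes are, which in turn affects how much resampling a classifier needs.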

What did our models find?

Our models were not conclusive.

Although the results were not conclusive, our models did seem to indicate that countries with lower GDP had amounts of mismanaged waste that were higher than, or equal to, those of higher-GDP countries.

This was not what we expected. We expected that a higher GDP, and therefore higher consumerism, would mean more plastic waste.

If we had more time, what would we explore next?

If we had more time, we would explore the reasons why we did not find what we thought we would and look into other dynamics that our data did not illuminate for us. Where is this plastic waste coming from? Is it landfills? Sewers? Which industries produce the most plastic waste? Are countries exporting waste to other countries?

What were the limitations of our data/machine learning models? What challenges did we have with creating/applying machine learning?

Our data had only 492 rows. If we were to dig into this topic more robustly, we would likely want to look at larger data sets. An issue we ran into was that our data sets did not match. For example, for some of the years we had metrics on some variables but the other variables we wanted to explore were for other years. This complicated our process. We had already cleaned our data and prepared it for analysis before we realized that our data set was not as complete as we would have liked.

Conclusions for Machine Learning

China, India, Brazil, Indonesia, Nigeria, Pakistan, Bangladesh, and Egypt were some of the highest contributors of mismanaged waste. We were able to see that through our clustering.

Mismanaged waste does not increase proportionally with GDP. There are outliers, but our data did not support a direct correlation.
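
One simple way to check a relationship like this is to compute correlation coefficients directly. The sketch below uses made-up numbers and hypothetical column names purely to show the mechanics; it does not reproduce the project's results.

```python
import pandas as pd

# Made-up values chosen only to show the mechanics.
df = pd.DataFrame({
    "gdp_per_capita": [1_000, 2_000, 5_000, 20_000, 40_000, 60_000],
    "mismanaged_tonnes": [300_000, 900_000, 450_000, 120_000, 500_000, 80_000],
})

# Pearson measures linear association and is sensitive to outliers.
pearson = df["gdp_per_capita"].corr(df["mismanaged_tonnes"])

# Spearman is Pearson computed on ranks, so it reflects only monotonic ordering.
spearman = df["gdp_per_capita"].rank().corr(df["mismanaged_tonnes"].rank())

print(f"Pearson r = {pearson:.2f}, Spearman rho = {spearman:.2f}")
```

Comparing the two coefficients gives a rough sense of whether a few extreme countries are driving the result, since the rank-based value is less affected by outliers.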

Dashboard

An interactive dashboard was created in Tableau.

An example of some of the features of our dashboard:

Analysis Results

Our analysis revealed several notable findings. In Europe, cigarette butts and filters were by far the most common type of plastic waste collected on beaches. Plastic and polystyrene pieces of various sizes were also among the most common types, followed by plastic caps/lids, shopping bags, and food packaging. Spain and Romania contributed the most cigarette butts and filters found on European beaches.

We also found that the countries with the highest amounts of mismanaged plastic waste do not necessarily have the highest GDP. China, which has the highest population and a relatively low GDP, produces the most mismanaged plastic waste. On the other hand, the United States has a high population and GDP but a disproportionately low amount of mismanaged plastic waste. This is likely because the US and other countries ship their waste to other countries to be processed.

Recommendations and improvements for future analysis:

Allowing more time to find data sets
Choosing more robust data sets so that machine learning models are more effective
Examining how much waste countries export to other countries
Finding data on the types of plastic pollution found in areas outside of Europe
Making additional predictions that account for The Ocean Cleanup's efforts to remove ocean garbage and intercept river waste before it enters the oceans
