This project analyzes Enron persons of interest (POIs) using machine learning algorithms. It was created for Udacity's Data Analyst Nanodegree.
Access the final report here: https://sbsousa.github.io/EnronML
Per Udacity, the goal of this project is to "play detective and put your machine learning skills to use by building an algorithm to identify Enron Employees who may have committed fraud based on the public Enron financial and email dataset."
First, I performed an exploratory data analysis (EDA) to better understand the data. Next, I shaped the data and removed outliers. Then, I used SelectKBest to choose the best features for the machine learning algorithms. Finally, I built Naive Bayes and Decision Tree classifiers to process the data.
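The feature-selection and modeling steps can be sketched roughly as follows. This is a minimal illustration, not the code in poi_id.py: the data is synthetic rather than the Enron dataset, and the helper `build_pipeline`, the `f_classif` score function, and `k=2` are assumptions chosen only for the example.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (NOT the Enron dataset): 200 samples, 6 features.
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=200)
features = rng.normal(size=(200, 6))
# Shift only features 0 and 1 by the label so they are the informative ones.
features[:, 0] += 3.0 * labels
features[:, 1] -= 3.0 * labels


def build_pipeline(classifier, k=2):
    """Hypothetical helper: keep the k best-scoring features, then classify."""
    return Pipeline([
        ("select", SelectKBest(score_func=f_classif, k=k)),
        ("clf", classifier),
    ])


pipelines = {
    "naive_bayes": build_pipeline(GaussianNB()),
    "decision_tree": build_pipeline(DecisionTreeClassifier(random_state=0)),
}
for name, pipeline in pipelines.items():
    pipeline.fit(features, labels)

# Column indices that SelectKBest kept for the Naive Bayes pipeline.
selected = pipelines["naive_bayes"].named_steps["select"].get_support(indices=True)
print("Selected feature indices:", list(selected))
```

Wrapping SelectKBest and the classifier in one Pipeline means the selection is refit whenever the model is refit, which keeps feature selection from seeing held-out data during cross-validation.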
The project was built in Python using a Jupyter Notebook. It relies on several Python packages, including scikit-learn, Pandas, NumPy, Matplotlib, and Seaborn. The final Jupyter report is provided in HTML format.
The Udacity files were modified to work with Python 3.9 and current packages. They may not work unless you recreate my environment using the packages listed in requirements.txt.
- poi_id.py: creates the pickle (pkl) files
- tester.py: validates the selected machine learning algorithms against the pkl files and returns metrics (Accuracy, Precision, Recall, and F1)
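For readers unfamiliar with the four metrics that tester.py reports, here is a minimal sketch of how they are defined for a binary POI / non-POI label. It is not the Udacity tester code; the function `classification_metrics` and the example labels are hypothetical and exist only to show the arithmetic.

```python
def classification_metrics(true_labels, predicted_labels):
    """Compute accuracy, precision, recall, and F1 for binary labels (1 = POI)."""
    pairs = list(zip(true_labels, predicted_labels))
    true_pos = sum(1 for t, p in pairs if t == 1 and p == 1)
    true_neg = sum(1 for t, p in pairs if t == 0 and p == 0)
    false_pos = sum(1 for t, p in pairs if t == 0 and p == 1)
    false_neg = sum(1 for t, p in pairs if t == 1 and p == 0)

    total = len(pairs)
    predicted_pos = true_pos + false_pos
    actual_pos = true_pos + false_neg

    # Fall back to 0.0 whenever a denominator would be zero.
    accuracy = (true_pos + true_neg) / total if total else 0.0
    precision = true_pos / predicted_pos if predicted_pos else 0.0
    recall = true_pos / actual_pos if actual_pos else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0

    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up labels for illustration only (1 = POI, 0 = non-POI).
true_labels = [1, 0, 0, 1, 0, 0, 1, 0]
predicted_labels = [1, 0, 1, 0, 0, 0, 1, 0]
metrics = classification_metrics(true_labels, predicted_labels)
print(metrics)
```

Because POIs are a small minority of the people in the dataset, accuracy alone can look good for a model that rarely flags anyone, which is why precision, recall, and F1 are reported alongside it.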
This project is publicly available for educational purposes. Please acknowledge this source if you use it.
The Python scripts were provided by Udacity:
https://www.udacity.com/course/data-analyst-nanodegree--nd002
Udacity code that I modified is commented.
Additional sources are acknowledged in the code and report.