by Yoon-gu Hwang, November 15, 2015
In 2000, Enron was one of the largest companies in the United States. By 2002, it had collapsed into bankruptcy due to widespread corporate fraud. In the resulting Federal investigation, a significant amount of typically confidential information entered into the public record, including tens of thousands of emails and detailed financial data for top executives. In this project, you will play detective, and put your new skills to use by building a person of interest identifier based on financial and email data made public as a result of the Enron scandal. To assist you in your detective work, we've combined this data with a hand-generated list of persons of interest in the fraud case, which means individuals who were indicted, reached a settlement or plea deal with the government, or testified in exchange for prosecution immunity.
Summarize for us the goal of this project and how machine learning is useful in trying to accomplish it. As part of your answer, give some background on the dataset and how it can be used to answer the project question. Were there any outliers in the data when you got it, and how did you handle those? [relevant rubric items: “data exploration”, “outlier investigation”]
The goal of this project is to use the email and financial data to identify persons of interest (POIs) in the fraud case. The dataset contains information on 146 people, including their names. Each person's record contains 21 features (the poi label + 14 financial + 6 email features). There are 18 POIs and 128 non-POIs.
Looking carefully at the dataset, I found the following 3 outliers and deleted them:

- `TOTAL`: just the spreadsheet total row of the dataset, not a person.
- `LOCKHART EUGENE E`: no information; all features are 'NaN'.
- `THE TRAVEL AGENCY IN THE PARK`: not a person.
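Removing them is a one-liner against the project's `data_dict` (a minimal sketch, assuming the dictionary loaded from the project's pickle file):

```python
# Drop the three bad records from data_dict (person name -> feature dict).
for key in ('TOTAL', 'LOCKHART EUGENE E', 'THE TRAVEL AGENCY IN THE PARK'):
    data_dict.pop(key, 0)
```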
The table below lists, for each feature, how many of the 146 original records have a non-NaN value:

feature | # of non-NaN values |
---|---|
poi | 146 |
total_stock_value | 126 |
total_payments | 125 |
email_address | 111 |
restricted_stock | 110 |
exercised_stock_options | 102 |
salary | 95 |
expenses | 95 |
other | 93 |
to_messages | 86 |
shared_receipt_with_poi | 86 |
from_messages | 86 |
from_this_person_to_poi | 86 |
from_poi_to_this_person | 86 |
bonus | 82 |
long_term_incentive | 66 |
deferred_income | 49 |
deferral_payments | 39 |
restricted_stock_deferred | 18 |
director_fees | 17 |
loan_advances | 4 |
What features did you end up using in your POI identifier, and what selection process did you use to pick them? Did you have to do any scaling? Why or why not? As part of the assignment, you should attempt to engineer your own feature that does not come ready-made in the dataset -- explain what feature you tried to make, and the rationale behind it. (You do not necessarily have to use it in the final analysis, only engineer and test it.) In your feature selection step, if you used an algorithm like a decision tree, please also give the feature importances of the features that you use, and if you used an automated feature selection function like SelectKBest, please report the feature scores and reasons for your choice of parameter values. [relevant rubric items: “create new features”, “properly scale features”, “intelligently select feature”]
With `SelectKBest` in sklearn, I investigated the scores of all features; they are listed in the table below.
feature | score |
---|---|
exercised_stock_options | 24.541 |
total_stock_value | 23.904 |
bonus | 20.525 |
salary | 18.004 |
deferred_income | 11.321 |
long_term_incentive | 9.772 |
restricted_stock | 9.079 |
total_payments | 8.673 |
shared_receipt_with_poi | 8.433 |
loan_advances | 7.125 |
expenses | 5.955 |
from_poi_to_this_person | 5.142 |
other | 4.129 |
from_this_person_to_poi | 2.339 |
director_fees | 2.145 |
to_messages | 1.594 |
deferral_payments | 0.230 |
from_messages | 0.175 |
restricted_stock_deferred | 0.066 |
I did not choose the features whose score is below 2.00: `to_messages`, `deferral_payments`, `from_messages`, and `restricted_stock_deferred`.
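For reference, here is a minimal sketch of how such scores can be computed. It assumes `features`, `labels`, and `feature_names` come from the project's `featureFormat`/`targetFeatureSplit` helpers, and uses `f_classif`, the `SelectKBest` default for classification:

```python
from sklearn.feature_selection import SelectKBest, f_classif

# Score every feature against the POI labels; k='all' keeps all features,
# since only the fitted scores_ are needed for ranking.
selector = SelectKBest(f_classif, k='all')
selector.fit(features, labels)

for name, score in sorted(zip(feature_names, selector.scores_),
                          key=lambda pair: pair[1], reverse=True):
    print('{:<30s} {:.3f}'.format(name, score))
```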
Also, I created 2 new features, `total_incentive` and `total_income`. My reasoning is that the more financial incentive someone receives, the more likely he or she is to be a POI. Similarly, total income could be an important factor for detecting POIs, because people involved in fraud tend to have very large incomes.
- `total_incentive` = `bonus` + `long_term_incentive`
- `total_income` = `salary` + `total_stock_value`
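A minimal sketch of adding these to the project's `data_dict`, assuming the dataset's convention that missing values are the string 'NaN':

```python
def add_engineered_features(data_dict):
    """Add total_incentive and total_income to each person's record."""
    for record in data_dict.values():
        bonus = record.get('bonus', 'NaN')
        incentive = record.get('long_term_incentive', 'NaN')
        salary = record.get('salary', 'NaN')
        stock = record.get('total_stock_value', 'NaN')

        # Only sum when both components are present.
        record['total_incentive'] = (bonus + incentive
                                     if 'NaN' not in (bonus, incentive) else 'NaN')
        record['total_income'] = (salary + stock
                                  if 'NaN' not in (salary, stock) else 'NaN')
    return data_dict
```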
As the following table shows, the overall evaluation metrics improve as the number of features increases up to 7. After 7, however, recall suddenly drops, so I chose the top 7 features for my classifier. To make this table, I used a very simple `GaussianNB` classifier (a sketch of this sweep follows the tables below).
# of features | accuracy | precision | recall |
---|---|---|---|
1 | 0.82909 | 0.56608 | 0.25700 |
2 | 0.83962 | 0.46275 | 0.26400 |
3 | 0.84077 | 0.47559 | 0.34100 |
4 | 0.85185 | 0.52782 | 0.35100 |
5 | 0.85636 | 0.49639 | 0.37800 |
6 | 0.84979 | 0.46680 | 0.36200 |
7 | 0.85021 | 0.47004 | 0.38050 |
8 | 0.84040 | 0.37356 | 0.29100 |
9 | 0.83580 | 0.35719 | 0.28950 |

feature set | accuracy | precision | recall |
---|---|---|---|
Top 7 features | 0.85021 | 0.47004 | 0.38050 |
Top 7 features + 2 new features | 0.84671 | 0.45866 | 0.40500 |
Adding the 2 new features involves a trade-off: I got a higher recall value, but accuracy and precision became slightly lower than before.
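The sweep behind the first table above can be reproduced in a few lines. Here `ranked_features` (feature names sorted by descending SelectKBest score) and `my_dataset` are assumed names, and `test_classifier` is the evaluation helper from `tester.py`:

```python
from sklearn.naive_bayes import GaussianNB
from tester import test_classifier

# Evaluate a plain GaussianNB on the k best features, for k = 1..9.
for k in range(1, 10):
    feature_list = ['poi'] + ranked_features[:k]  # the 'poi' label must come first
    test_classifier(GaussianNB(), my_dataset, feature_list)
```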
The features span very different ranges of values, so I needed to normalize them. For the final analysis I used `StandardScaler`, adding it to a `Pipeline`.
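A minimal sketch of that setup, with the final logistic regression classifier plugged in after the scaler:

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Standardize each feature to zero mean and unit variance
# before it reaches the classifier.
clf = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression()),
])
```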
When I used a `DecisionTreeClassifier`, its feature importance analysis was as shown below (a sketch of how to obtain these follows the table).

feature | importance |
---|---|
total_payments | 0.191 |
total_stock_value | 0.167 |
expenses | 0.140 |
other | 0.139 |
from_messages | 0.127 |
bonus | 0.095 |
from_this_person_to_poi | 0.076 |
salary | 0.059 |
exercised_stock_options | 0.000 |
deferred_income | 0.000 |
long_term_incentive | 0.000 |
restricted_stock | 0.000 |
shared_receipt_with_poi | 0.000 |
loan_advances | 0.000 |
from_poi_to_this_person | 0.000 |
director_fees | 0.000 |
to_messages | 0.000 |
deferral_payments | 0.000 |
restricted_stock_deferred | 0.000 |
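A minimal sketch of reading importances off a fitted tree; `features_train`, `labels_train`, and `feature_names` are assumed from the steps above:

```python
from sklearn.tree import DecisionTreeClassifier

# Fit a plain decision tree and rank the features by importance.
tree = DecisionTreeClassifier(random_state=42)
tree.fit(features_train, labels_train)

for name, importance in sorted(zip(feature_names, tree.feature_importances_),
                               key=lambda pair: pair[1], reverse=True):
    print('{:<30s} {:.3f}'.format(name, importance))
```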
What algorithm did you end up using? What other one(s) did you try? How did model performance differ between algorithms? [relevant rubric item: “pick an algorithm”]
I tried 4 algorithms to achieve the best identification performance; the results are listed below.
algorithm | accuracy | precision | recall |
---|---|---|---|
DecisionTreeClassifier | 0.81293 | 0.29205 | 0.28300 |
StandardScaler + LogisticRegression | 0.85360 | 0.34295 | 0.10700 |
GaussianNB | 0.81927 | 0.31280 | 0.29700 |
PCA + GaussianNB | 0.84467 | 0.37705 | 0.25300 |
DecisionTree, GaussianNB, and PCA are covered in the class, so I applied them and checked their performance. It was surprising that the simple `GaussianNB` performed so well without any trial and error.
Logistic regression is well suited to binary (True/False) classification, which is exactly what we have in this project: POI or non-POI. So I ended up using the logistic regression algorithm.
What does it mean to tune the parameters of an algorithm, and what can happen if you don’t do this well? How did you tune the parameters of your particular algorithm? (Some algorithms do not have parameters that you need to tune -- if this is the case for the one you picked, identify and briefly explain how you would have done it for the model that was not your final choice or a different model that does utilize parameter tuning, e.g. a decision tree classifier). [relevant rubric item: “tune the algorithm”]
Tuning the parameters of an algorithm means searching for the parameter values that give the best performance for the given dataset and model. If this process is skipped or done poorly, the model can perform far worse than expected.
In this project, I tuned several parameter sets for 2 algorithms (`LogisticRegression` and `PCA + GaussianNB`). `GaussianNB` has no parameters to tune, so I built a pipeline (`PCA` + `GaussianNB`) and tuned the parameters of `PCA`.
`LogisticRegression`

- `penalty`: specifies the norm used in the penalization; the newton-cg and lbfgs solvers support only l2 penalties.
- `C`: inverse of regularization strength; must be a positive float. As in support vector machines, smaller values specify stronger regularization.
- `tol`: tolerance for the stopping criteria.

`PCA`

- `n_components`: number of components to keep.
The final tuned parameters are listed below. I avoided tedious manual trials by using `GridSearchCV` (see the sketch after the list).
`LogisticRegression`

- `penalty`: l2
- `C`: 1e-12
- `tol`: 1e-12

`PCA`

- `n_components`: None
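A minimal sketch of the search for the logistic regression pipeline. The parameter grid here is a hypothetical example rather than the exact grid I searched, `features`/`labels` are assumed from the earlier steps, and in sklearn 0.18+ the import moves to `sklearn.model_selection`:

```python
from sklearn.grid_search import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression()),
])

# Hypothetical grid; the 'clf__' prefix routes each parameter
# to the pipeline step named 'clf'.
param_grid = {
    'clf__penalty': ['l1', 'l2'],
    'clf__C': [1e-12, 1e-6, 1e-3, 1.0],
    'clf__tol': [1e-12, 1e-6, 1e-3],
}

# scoring='f1' here; in practice I passed the custom scorer defined
# later in this report.
grid = GridSearchCV(pipe, param_grid, scoring='f1')
grid.fit(features, labels)
print(grid.best_params_)
```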
What is validation, and what’s a classic mistake you can make if you do it wrong? How did you validate your analysis? [relevant rubric item: “validation strategy”]
Validation is how we confirm the robustness of a classifier for a given dataset and model. The classic mistake is overfitting: an overfitted classifier fits the training set so closely that it cannot perform well on the test set. To avoid this problem, the dataset is usually partitioned into three sets (train, validation, test). However, this drastically reduces the number of samples available for learning, and the results can depend on the particular random choice of the (train, validation) pair.
One solution is cross-validation. In this process, the dataset (except the test set) is split into k smaller sets (k folds), and the following procedure is iterated:

- A model is trained on (k-1) folds.
- The resulting model is validated on the remaining fold.

Iterating this procedure over all folds gives a more robust estimate of the classifier's performance.
I used `StratifiedShuffleSplit`, which is a merged version of `ShuffleSplit` and `StratifiedKFold`: first the samples are shuffled, then stratified folds are made as `StratifiedKFold` does. I used 10% of the dataset as the test set and 1000 splits. This scheme is also used in `test_classifier()` in `tester.py`.
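A sketch of that setup with the `sklearn.cross_validation` API of the time (`labels` and `features` are assumed from the earlier steps; newer sklearn versions move this class to `sklearn.model_selection` with a different interface):

```python
from sklearn.cross_validation import StratifiedShuffleSplit

# 1000 shuffled, stratified splits with 10% held out each time,
# mirroring test_classifier() in tester.py.
cv = StratifiedShuffleSplit(labels, n_iter=1000, test_size=0.1, random_state=42)

for train_idx, test_idx in cv:
    features_train = [features[i] for i in train_idx]
    labels_train = [labels[i] for i in train_idx]
    # ... fit the classifier on each split and accumulate the metrics ...
```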
Give at least 2 evaluation metrics and your average performance for each of them. Explain an interpretation of your metrics that says something human-understandable about your algorithm’s performance. [relevant rubric item: “usage of evaluation metrics”]
I selected precision and recall as evaluation metrics. The definitions follow.

- precision: the number of true positives divided by the number of true positives plus the number of false positives.
- recall: the number of true positives divided by the number of true positives plus the number of false negatives.
In short, high precision is equivalent to a low false-alarm rate: when the algorithm flags someone as a POI, it is usually right. High recall means a high true-positive rate: the algorithm identifies as many of the real POIs as possible. For example, with precision 0.479 and recall 0.438 (the logistic regression row below), roughly 48 of every 100 people the classifier flags are actual POIs, and it catches roughly 44 of every 100 actual POIs.
algorithm | accuracy | precision | recall |
---|---|---|---|
StandardScaler + LogisticRegression | 0.85157 | 0.47869 | 0.43800 |
PCA + GaussianNB | 0.84893 | 0.46169 | 0.34650 |
I made my own score function, averaging precision and recall, and passed it to `GridSearchCV`. That way I could find the optimal parameters for the above 2 algorithms.
```python
from sklearn.metrics import make_scorer, precision_recall_fscore_support

# I want a classifier whose precision and recall are both high at the same time.
def my_score(y_true, y_pred, labels=None, pos_label=1, average='binary', sample_weight=None):
    p, r, _, _ = precision_recall_fscore_support(y_true, y_pred,
                                                 labels=labels,
                                                 pos_label=pos_label,
                                                 average=average,
                                                 sample_weight=sample_weight)
    if p < 0.3 or r < 0.3:  # To achieve better than 0.3 precision and recall
        return 0.
    return (p + r) / 2.  # Average the two metrics into a single score

# Wrap it so GridSearchCV accepts it as a scoring function.
my_scorer = make_scorer(my_score)
```
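This scorer then plugs into the grid search sketched earlier, e.g. `GridSearchCV(pipe, param_grid, scoring=my_scorer)`.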
The result of `GaussianNB` was impressive, because it performed quite well even though I did no tuning at all. But `LogisticRegression` gave the best performance among all the trials I did, with high and well-balanced precision and recall values.
My final classifier is logistic regression, since its performance was the best among the many trials. In addition, the algorithm is specialized for binary classification, which makes it a natural fit for identifying POIs in the Enron dataset.