Predicting churn from user click-level data with distributed computation
- Python 3.6.3
- PySpark 2.3.2
- Pandas 0.20.3
- Matplotlib 3.0.2
- seaborn 0.9.0
I am always eager to learn new frameworks and expand my capabilities, so when I heard about the possibility of a project utilizing Apache Spark and Hadoop I was immediately intrigued. Having learned the basics of Apache Spark's PySpark API, I felt there was no better way to demonstrate machine learning skills than in a big data context. This project revolves around a key business question that many firms face: which customers are about to leave, and how can the marketing department target them? Business applications are what excite me most about data science, and showing that I can glean valuable insights from corporate-sized data sources lets me treat Big Data as more than just a buzzword.
.
├──
│ ├── Sparkify.ipynb # initial development & EDA on smaller subset of data
│ ├── Sparkify-Viz.ipynb # visualization of data from EDA and AWS
│ ├── sparkify_full.ipynb # AWS Implementation and final version
│ └── sparkify_full.html # final version of notebook in html format
│
└── ...
The final model is a Random Forest classifier, chosen because it was the fastest to train and to generate predictions, roughly twice as fast as the next fastest model. It still achieved exceptional performance even though it predicted on a dataset of reduced dimensionality: the features were projected with principal component analysis onto components that together explain more than 97% of the variance in the dataset.
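As a rough sketch of that reduction step (not the exact code from the notebooks; the toy data and the `features` / `pca_features` column names are assumptions), `pyspark.ml.feature.PCA` projects an assembled feature vector onto k components and reports how much variance they retain:

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import PCA
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.appName("pca-sketch").getOrCreate()

# Toy stand-in for the engineered user-level feature table.
df = spark.createDataFrame(
    [
        (Vectors.dense([1.0, 0.0, 7.0]),),
        (Vectors.dense([2.0, 1.0, 5.0]),),
        (Vectors.dense([0.0, 1.0, 6.0]),),
        (Vectors.dense([3.0, 0.0, 4.0]),),
    ],
    ["features"],
)

# Project the assembled feature vector onto k principal components.
pca = PCA(k=2, inputCol="features", outputCol="pca_features")
model = pca.fit(df)

# Fraction of variance retained by the kept components (target here: > 97%).
print(model.explainedVariance.toArray().sum())

reduced = model.transform(df)
reduced.show(truncate=False)
```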
The most interesting part of examining the model is that it uses a much simpler configuration than the out-of-the-box random forest. Instead of building 32 decision trees to classify each instance, it builds only 10 during training and lets those trees vote at prediction time. The fact that this model made no mistakes on the validation or test datasets suggests that it is a robust model.
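A minimal sketch of a forest of that size in `pyspark.ml` (the toy data and the `label` / `pca_features` column names are assumptions, not the notebook's exact setup):

```python
from pyspark.sql import SparkSession
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.appName("rf-sketch").getOrCreate()

# Toy stand-in for the PCA-reduced training data: (label, pca_features).
train = spark.createDataFrame(
    [
        (0.0, Vectors.dense([0.1, 1.2])),
        (1.0, Vectors.dense([2.3, 0.4])),
        (0.0, Vectors.dense([0.2, 1.0])),
        (1.0, Vectors.dense([2.1, 0.5])),
    ],
    ["label", "pca_features"],
)

# A deliberately small forest: only 10 trees vote on each instance.
rf = RandomForestClassifier(
    labelCol="label", featuresCol="pca_features", numTrees=10, seed=42
)
model = rf.fit(train)

predictions = model.transform(train)
predictions.select("label", "prediction", "probability").show(truncate=False)
```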
- PySpark Window Functions
- Udacity DSND Apache Spark Lesson
- PySpark Documentation
- Calculating Differences Between Rows
- PySpark ML Package
- Converting rows to Vector type
- What is Apache Spark persistence
- PySpark ML vs PySpark MLLib
- ROC vs PR Metrics
- Choosing Scalers
- Extending Transformer Class
I encountered various errors when using AWS's EMR managed Hadoop framework that I could only attribute to problems in the backend rather than in my code. What complicated matters was that the AWS-hosted notebook crashed fairly frequently.
The errors are listed below for future reference:
'StructField' object has no attribute '_get_object_id'
Traceback (most recent call last):
File "<stdin>", line 195, in custom_evaluation
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/mllib/evaluation.py", line 204, in __init__
StructField("label", DoubleType(), nullable=False)]))
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/context.py", line 307, in createDataFrame
return self.sparkSession.createDataFrame(data, schema, samplingRatio, verifySchema)
File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1248, in __call__
args_command, temp_args = self._build_args(*args)
File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1212, in _build_args
(new_args, temp_args) = self._get_args(args)
File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1199, in _get_args
temp_arg = converter.convert(arg, self.gateway_client)
File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_collections.py", line 501, in convert
java_list.add(element)
File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1248, in __call__
args_command, temp_args = self._build_args(*args)
File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1218, in _build_args
[get_command_part(arg, self.pool) for arg in new_args])
File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 298, in get_command_part
command_part = REFERENCE_TYPE + parameter._get_object_id()
AttributeError: 'StructField' object has no attribute '_get_object_id'
----------------------------------------
Exception happened during processing of request from ('127.0.0.1', 57960)
----------------------------------------
Traceback (most recent call last):
File "/usr/lib64/python2.7/SocketServer.py", line 290, in _handle_request_noblock
self.process_request(request, client_address)
File "/usr/lib64/python2.7/SocketServer.py", line 318, in process_request
self.finish_request(request, client_address)
File "/usr/lib64/python2.7/SocketServer.py", line 331, in finish_request
self.RequestHandlerClass(request, client_address, self)
File "/usr/lib64/python2.7/SocketServer.py", line 652, in __init__
self.handle()
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/accumulators.py", line 266, in handle
poll(authenticate_and_accum_updates)
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/accumulators.py", line 241, in poll
if func():
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/accumulators.py", line 254, in authenticate_and_accum_updates
received_token = self.rfile.read(len(auth_token))
TypeError: object of type 'NoneType' has no len()
Exception in thread cell_monitor-10:
Traceback (most recent call last):
File "/opt/conda/lib/python3.6/threading.py", line 916, in _bootstrap_inner
self.run()
File "/opt/conda/lib/python3.6/threading.py", line 864, in run
self._target(*self._args, **self._kwargs)
File "/opt/conda/lib/python3.6/site-packages/awseditorssparkmonitoringwidget-1.0-py3.6.egg/awseditorssparkmonitoringwidget/cellmonitor.py", line 178, in cell_monitor
job_binned_stages[job_id][stage_id] = all_stages[stage_id]
KeyError: 1148
- BinaryClassificationEvaluator always returned metric areaUnderPR = 1.0, no matter the performance of the classifier
- AWS notebook stopped working: Session '0' is not active
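One possible way to cross-check those evaluator scores (an assumption on my part, not the custom_evaluation routine from the traceback above) is the RDD-based pyspark.mllib.evaluation.BinaryClassificationMetrics, which takes (score, label) pairs rather than DataFrame columns. Continuing from the predictions DataFrame in the random-forest sketch above:

```python
from pyspark.mllib.evaluation import BinaryClassificationMetrics

# 'predictions' is assumed to be the output of model.transform(),
# with a 'probability' vector column and a numeric 'label' column.
score_and_labels = predictions.select("probability", "label").rdd.map(
    lambda row: (float(row.probability[1]), float(row.label))
)

metrics = BinaryClassificationMetrics(score_and_labels)
print("areaUnderPR :", metrics.areaUnderPR)
print("areaUnderROC:", metrics.areaUnderROC)
```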