A solution for customer churn prediction. The dataset used can be found here.
- Python 3.9
- Spark 3.2.1
To get a local copy of the repository, run the following commands in your terminal:
$ cd <folder>
$ git clone https://github.com/charapennikaurm/telco_customer_churn
$ cd telco_customer_churn
$ python -m venv venv
$ source venv/bin/activate
$ pip install -r requirements.txt
Spark also needs the AWS SDK, hadoop-aws, and Kinesis connector jars on its classpath. Make sure $SPARK_HOME points to your Spark 3.2.1 installation, then run (all jars must match the Scala 2.12 build of Spark):
$ wget https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk/1.12.211/aws-java-sdk-1.12.211.jar -P $SPARK_HOME/jars/
$ wget https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-s3/1.12.211/aws-java-sdk-s3-1.12.211.jar -P $SPARK_HOME/jars/
$ wget https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-core/1.12.211/aws-java-sdk-core-1.12.211.jar -P $SPARK_HOME/jars/
$ wget https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-dynamodb/1.12.211/aws-java-sdk-dynamodb-1.12.211.jar -P $SPARK_HOME/jars/
$ wget https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.3.1/hadoop-aws-3.3.1.jar -P $SPARK_HOME/jars/
$ wget https://repo1.maven.org/maven2/org/apache/spark/spark-streaming-kinesis-asl_2.12/3.2.1/spark-streaming-kinesis-asl_2.12-3.2.1.jar -P $SPARK_HOME/jars/
$ wget https://repo1.maven.org/maven2/com/qubole/spark/spark-sql-kinesis_2.12/1.2.0_spark-3.0/spark-sql-kinesis_2.12-1.2.0_spark-3.0.jar -P $SPARK_HOME/jars/
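Once the jars are in place, a quick sanity check confirms that Spark starts and the S3A connector is on the classpath. The snippet below is a minimal sketch: the bucket path is a placeholder, and credentials are assumed to come from the standard AWS env variables.

from pyspark.sql import SparkSession

# Start a local session; this fails at startup if the jars conflict.
spark = (
    SparkSession.builder
    .appName("jar-sanity-check")
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    .getOrCreate()
)
print(spark.version)  # expected: 3.2.1

# Uncomment and point at a bucket you own to exercise hadoop-aws + the AWS SDK:
# spark.read.text("s3a://<your-bucket>/<some-file>").show()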
To run training, set your own env variables in src/models/example.env
and then run the following commands:
$ source src/models/example.env
$ python src/models/run.py
- Hyperopt is used for hyperparameter optimization (see the sketch after this list)
- Knock Knock is used to send Telegram notifications about training runs
- The best trained model can be saved locally as well as to Amazon S3
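For orientation, the sketch below shows roughly how Hyperopt and Knock Knock fit together. It is a minimal illustration, not the actual code in src/models/run.py: the search space and objective are stubs, and TELEGRAM_TOKEN / TELEGRAM_CHAT_ID are hypothetical env variable names.

import os
from hyperopt import Trials, fmin, hp, tpe
from knockknock import telegram_sender

# Stub search space -- the real one lives in src/models.
space = {
    "maxDepth": hp.choice("maxDepth", [3, 5, 7]),
    "stepSize": hp.uniform("stepSize", 0.01, 0.3),
}

def objective(params):
    # Train a model with `params` and return the validation loss.
    # Stubbed here; the repo trains a Spark ML model instead.
    return params["stepSize"]

# Knock Knock messages the chat when the run starts, finishes, or crashes.
# TELEGRAM_TOKEN / TELEGRAM_CHAT_ID are hypothetical env variable names.
@telegram_sender(token=os.environ["TELEGRAM_TOKEN"], chat_id=int(os.environ["TELEGRAM_CHAT_ID"]))
def run_search():
    trials = Trials()
    best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=50, trials=trials)
    return best  # the return value is included in the final notification

if __name__ == "__main__":
    print(run_search())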
This option starts a process that connects to AWS Kinesis using Spark Streaming.
The code is located in src/inference.
To run it, set the env variables in src/inference/example.env
and then run the following commands:
$ source src/inference/example.env
$ python src/inference/inference.py
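With the qubole spark-sql-kinesis connector installed above, reading the stream looks roughly like this. This is a minimal sketch, not the actual src/inference/inference.py: KINESIS_STREAM_NAME is a hypothetical env variable, the endpoint is a placeholder region, and the console sink stands in for the real scoring logic.

import os
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("churn-inference").getOrCreate()

# Option names follow the qubole spark-sql-kinesis connector.
records = (
    spark.readStream
    .format("kinesis")
    .option("streamName", os.environ["KINESIS_STREAM_NAME"])  # hypothetical env var
    .option("endpointUrl", "https://kinesis.us-east-1.amazonaws.com")  # placeholder region
    .option("awsAccessKeyId", os.environ["AWS_ACCESS_KEY_ID"])
    .option("awsSecretKey", os.environ["AWS_SECRET_ACCESS_KEY"])
    .option("startingposition", "LATEST")
    .load()
)

# `data` holds the raw record payload as bytes; cast, parse, and score it.
query = (
    records.selectExpr("CAST(data AS STRING) AS payload")
    .writeStream
    .format("console")  # stand-in for the real sink / model scoring
    .start()
)
query.awaitTermination()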
The code in src/lambda_inference is used to run inference with AWS Lambda.
The Lambda function is triggered by the Kinesis stream when records arrive.
A Docker container image is used because PySpark is too large to upload to AWS Lambda as a zip package.
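The handler follows the standard shape of a Kinesis-triggered Lambda: records arrive base64-encoded inside the event. Below is a minimal sketch of that shape, not the actual handler in src/lambda_inference; predict() is a placeholder for loading and applying the trained model.

import base64
import json

def predict(features):
    # Placeholder: in the repo, the trained model scores the record here.
    return {"churn_probability": 0.0}

def handler(event, context):
    """Entry point for the Kinesis-triggered Lambda."""
    predictions = []
    for record in event["Records"]:
        # Kinesis delivers each payload base64-encoded.
        payload = base64.b64decode(record["kinesis"]["data"])
        features = json.loads(payload)
        predictions.append(predict(features))
    return {"statusCode": 200, "body": json.dumps(predictions)}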