# Open lab for Python and Spark

Your task this time is to create a user retention solution based on Spark. We will first collect the data, then analyze it to find patterns that help us detect users who should receive bonuses.

## Spark Batch Analysis: Creating and analysing an offline datastore

Your target is to read an ORC-formatted data store and detect users who had a significant number of losses in their last games (let's say, 70% losses in their last 10 games).

  1. Create a connection to ORC based on: https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.2/bk_spark-guide/content/ch_orc-spark.html
  2. Create a new table that can store: user id, action date, game type, result (amount of win/loss)
  3. Insert random values for 1000 users, with 100 games each and results between -10 and 10, using http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html
  4. Find the users who lost 70% of their last 10 games using the RDD API
  5. Do the same using DataFrames
  6. Compare the performance of the two approaches (a sketch of steps 2-5 follows this list)
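
A minimal PySpark sketch of steps 2-5, assuming Spark 2.x. The ORC path `/data/game_results_orc`, the column names, and the game types are illustrative, not prescribed by the lab:

```python
import random
from datetime import date, timedelta

from pyspark.sql import Row, SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("retention-batch").getOrCreate()

# Step 3: 1000 users x 100 games each, results between -10 and 10.
rows = [
    Row(user_id=u,
        action_date=str(date(2017, 1, 1) + timedelta(days=g)),
        game_type=random.choice(["poker", "slots", "roulette"]),
        result=random.randint(-10, 10))
    for u in range(1000) for g in range(100)
]
df = spark.createDataFrame(rows)

# Step 2: persist and re-read the data as an ORC store.
df.write.mode("overwrite").orc("/data/game_results_orc")
games = spark.read.orc("/data/game_results_orc")

# Step 4: RDD version -- keep each user's 10 most recent games and flag
# users who lost at least 7 of them (70% of 10).
losers_rdd = (
    games.rdd
    .map(lambda r: (r.user_id, (r.action_date, r.result)))
    .groupByKey()
    .mapValues(lambda gs: sorted(gs)[-10:])   # ISO dates sort correctly
    .filter(lambda kv: sum(1 for _, res in kv[1] if res < 0) >= 7)
    .keys()
)
print(losers_rdd.take(5))

# Step 5: DataFrame version -- rank games per user by recency, keep the
# last 10, and count the losses.
w = Window.partitionBy("user_id").orderBy(F.col("action_date").desc())
losers_df = (
    games.withColumn("rn", F.row_number().over(w))
    .filter(F.col("rn") <= 10)
    .groupBy("user_id")
    .agg(F.sum(F.when(F.col("result") < 0, 1).otherwise(0)).alias("losses"))
    .filter(F.col("losses") >= 7)
)
losers_df.show(5)
```

For step 6, time both jobs (for example with `time.time()` around each action) and compare; the query plans in the Spark UI also show how differently the two versions shuffle.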

## Spark Streaming Analysis: Detection on the fly

Your target is to monitor a Kafka stream and detect users who had a significant number of losses in their last games (let's say, 70% losses in their last 10 games). Use the following as guidance: https://github.com/apache/spark/blob/master/examples/src/main/python/streaming/stateful_network_wordcount.py

  1. Create a simulator, using a Kafka producer, that generates a stream of actions (user id, date, game type, result) with results between -10 and 10 and sends them to Kafka (see the producer sketch after this list). Take a look here: https://github.com/dpkp/kafka-python
  2. Create a Spark micro job, based on a Kafka consumer, that saves the stream to an ORC file
  3. Use the following example for the Kafka consumer: http://rustyrazorblade.com/2015/05/spark-streaming-with-python-and-kafka/
  4. Create a Spark Streaming micro job that runs in a 5-minute sliding window and detects users who had significant losses during that time (use reduceByKeyAndWindow as described here: http://spark.apache.org/docs/latest/streaming-programming-guide.html)
  5. Create a Spark Streaming micro job that detects users in this situation even if it happened before the sliding-window timeframe, using updateStateByKey (see https://github.com/apache/spark/blob/v2.1.0/examples/src/main/python/streaming/stateful_network_wordcount.py; a consumer sketch covering steps 2, 4, and 5 follows this list)
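
For step 1, a minimal simulator sketch with kafka-python; the topic name `game_actions`, the broker address, and the JSON message layout are assumptions, not part of the lab:

```python
import json
import random
import time
from datetime import datetime

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Emit a stream of random game actions until interrupted.
while True:
    action = {
        "user_id": random.randrange(1000),
        "date": datetime.utcnow().isoformat(),
        "game_type": random.choice(["poker", "slots", "roulette"]),
        "result": random.randint(-10, 10),   # win/loss between -10 and 10
    }
    producer.send("game_actions", action)
    time.sleep(0.01)
```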
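For steps 2, 4, and 5, a sketch of the DStream consumer, assuming the receiver-based spark-streaming-kafka-0-8 integration used in the linked post; the topic, paths, and thresholds are illustrative:

```python
import json

from pyspark import SparkContext
from pyspark.sql import Row, SparkSession
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="retention-streaming")
spark = SparkSession.builder.getOrCreate()
ssc = StreamingContext(sc, 10)            # 10-second micro-batches
ssc.checkpoint("checkpoint")              # required by window/state ops

# Receiver-based consumer through ZooKeeper, as in the linked post.
stream = KafkaUtils.createStream(
    ssc, "localhost:2181", "retention-group", {"game_actions": 1})
actions = stream.map(lambda kv: json.loads(kv[1]))

# Step 2: append each micro-batch to the ORC store.
def save_batch(time, rdd):
    if not rdd.isEmpty():
        spark.createDataFrame(rdd.map(lambda a: Row(**a))) \
             .write.mode("append").orc("/data/game_actions_orc")
actions.foreachRDD(save_batch)

# Step 4: per-user (losses, games) counts over a 5-minute window that
# slides every 10 seconds; the inverse function lets Spark subtract
# departing batches instead of recomputing the whole window.
pairs = actions.map(lambda a: (a["user_id"],
                               (1 if a["result"] < 0 else 0, 1)))
windowed = pairs.reduceByKeyAndWindow(
    lambda a, b: (a[0] + b[0], a[1] + b[1]),
    lambda a, b: (a[0] - b[0], a[1] - b[1]),
    windowDuration=300, slideDuration=10)
windowed.filter(lambda kv: kv[1][1] >= 10 and
                kv[1][0] / float(kv[1][1]) >= 0.7).pprint()

# Step 5: remember each user's last 10 results across batches, regardless
# of when they arrived.
def update(new_results, last):
    return ((last or []) + new_results)[-10:]

stateful = actions.map(lambda a: (a["user_id"], a["result"])) \
                  .updateStateByKey(update)
stateful.filter(lambda kv: len(kv[1]) == 10 and
                sum(1 for r in kv[1] if r < 0) >= 7).pprint()

ssc.start()
ssc.awaitTermination()
```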

## The Retention Game

Assume your participant has the following behavior:

  1. They start with 1000 units, and they cannot play if they reach 0 (no credit)
  2. They are willing to bet 10 units in every game, in a zero/double game
  3. The player's win rate in each game is 48%
  4. If they win, they have an 80% chance to continue
  5. If they have two or more wins in a row, they have a 90% chance to continue
  6. If they lose, they have a 60% chance to continue
  7. If they lose twice or more in a row, they have a 40% chance to continue
  8. If they get a bonus of $10, they consider it a win

Your tasks:

  1. Create an algorithm that best allocates bonuses to maximize revenue (a simulation sketch of this player model follows below)
  2. Implement it using Spark Streaming and report revenue based on 1000 players and a maximum of 1000 games per user. The winner will be the one with the best revenue.
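
Before moving to Spark Streaming, it can help to test bonus strategies against a local simulation of the player model above. A minimal sketch: `give_bonus` is a hypothetical strategy hook, and treating a bonus as resetting the streak to a single win is one reading of rule 8.

```python
import random

WIN_RATE, STAKE = 0.48, 10

def continue_prob(streak):
    # streak > 0: consecutive wins; streak < 0: consecutive losses
    if streak >= 2:  return 0.90
    if streak == 1:  return 0.80
    if streak == -1: return 0.60
    return 0.40                      # two or more losses in a row

def simulate(give_bonus, games=1000):
    """Return house revenue for one player under a bonus strategy."""
    units, streak, revenue = 1000, 0, 0
    for _ in range(games):
        if units < STAKE:            # no credit: cannot play
            break
        if random.random() < WIN_RATE:   # zero/double game on 10 units
            units += STAKE; revenue -= STAKE
            streak = streak + 1 if streak > 0 else 1
        else:
            units -= STAKE; revenue += STAKE
            streak = streak - 1 if streak < 0 else -1
        if streak < 0 and give_bonus(streak):
            units += 10; revenue -= 10
            streak = 1               # a $10 bonus counts as a win (rule 8)
        if random.random() >= continue_prob(streak):
            break
    return revenue

# Example: no bonuses at all vs. a bonus after every 2-loss streak.
for name, strategy in [("no bonus", lambda s: False),
                       ("after 2 losses", lambda s: s <= -2)]:
    total = sum(simulate(strategy) for _ in range(1000))
    print(name, total)
```

Running many such simulations per candidate strategy gives a revenue estimate to compare against before wiring the winning strategy into the streaming job.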
