Use reddit comments dataset to classify the twitter messages to the subreddit topics. The tweet is classified to a subreddit topic with the highest term frequency–inverse document frequency (tf-idf) metric.
https://www.youtube.com/watch?v=WyDrYVslKXY (or http://twiddit.site)
Knowing what people are talking about and the people's interests are useful for digital marketing and social media monitoring.
See the slides for the data pipeline: https://docs.google.com/presentation/d/1-tYs0eIeKXNV5WvOdctHIi5myJ9o4_DX8Mx5SVk2wug/edit#slide=id.g306586f75a_0_725 (or http://twiddit.site/slides)
Spark - batch process reddit dataset to generate term-frequency table for every word and every subreddit.
Spark Streaming - stream process the twitter message and classify.
Kafka - handle the twitter streaming messages.
Cassandra - a database with high availability for Spark Streaming to query the term-frequency table and class.
Streaming process tiwtter message at the rate of 1000/sec. Classify every tweet into over 34,000 subreddit topics.