Twiddit

Twiddit uses the Reddit comments dataset to classify Twitter messages into subreddit topics. Each tweet is assigned to the subreddit topic with the highest term frequency–inverse document frequency (tf-idf) score.
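To make the scoring idea concrete, here is a minimal plain-Python sketch of tf-idf classification. It is not the project's Spark code: the function names `build_tfidf` and `classify`, the whitespace tokenizer, and the choice to sum the tf-idf weights of a tweet's words per subreddit are all assumptions made for illustration.

```python
import math
from collections import Counter


def build_tfidf(subreddit_docs):
    """Return {subreddit: {term: tf-idf weight}}.

    subreddit_docs maps a subreddit name to a list of comment strings.
    Each subreddit is treated as one document (an assumption).
    """
    # Term counts per subreddit, using a naive lowercase whitespace tokenizer.
    tf = {
        name: Counter(w for c in comments for w in c.lower().split())
        for name, comments in subreddit_docs.items()
    }
    # Document frequency: in how many subreddits does each term appear?
    df = Counter()
    for counts in tf.values():
        df.update(counts.keys())
    n = len(tf)
    weights = {}
    for name, counts in tf.items():
        total = sum(counts.values())
        weights[name] = {
            term: (count / total) * math.log(n / df[term])
            for term, count in counts.items()
        }
    return weights


def classify(tweet, weights):
    """Pick the subreddit whose summed tf-idf over the tweet's words is highest."""
    words = tweet.lower().split()
    scores = {
        name: sum(w.get(word, 0.0) for word in words)
        for name, w in weights.items()
    }
    return max(scores, key=scores.get)


docs = {
    "python": ["python code is fun", "write python code"],
    "cooking": ["bake the bread", "cook the pasta"],
}
weights = build_tfidf(docs)
label = classify("I love python code", weights)
```

In this toy data the two subreddits share no words, so `label` is `"python"`. A term that appears in every subreddit gets an idf of zero and contributes nothing to any score.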

Demo

https://www.youtube.com/watch?v=WyDrYVslKXY (or http://twiddit.site)

Purpose and Use Cases

Knowing what people are talking about and what they are interested in is useful for digital marketing and social media monitoring.

Proposed Architecture

See the slides for the data pipeline: https://docs.google.com/presentation/d/1-tYs0eIeKXNV5WvOdctHIi5myJ9o4_DX8Mx5SVk2wug/edit#slide=id.g306586f75a_0_725 (or http://twiddit.site/slides)

Technologies well-suited to solve the challenges

Spark - batch processes the Reddit dataset to generate a term-frequency table for every word and every subreddit.

Spark Streaming - processes the stream of Twitter messages and classifies each tweet.

Kafka - ingests the streaming Twitter messages.

Cassandra - a high-availability database that Spark Streaming queries for the term-frequency table and subreddit classes.

The primary engineering challenges

Stream processing Twitter messages at a rate of 1,000 per second, and classifying every tweet into one of more than 34,000 subreddit topics.

