This Big Data system analyzes tweets in real time by applying NLP algorithms. The application provides insights about a specific subject or theme.
For example, the app can analyze the sentiment (negative, positive, or neutral) of a set of tweets about a specific topic, brand, or personality.
Reliable and scalable, this system operates in a fully distributed environment.
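
To illustrate the kind of insight described above, here is a minimal sketch in plain Scala that tallies sentiment labels per topic. The `ScoredTweet` case class, its field names, and the `countByTopic` helper are hypothetical and not taken from the app. In the real system the labels would come from the Spark NLP pipeline and the aggregation would run on Spark rather than on an in-memory collection.

```scala
object SentimentSummary {
  // Hypothetical record: a tweet that has already been labelled
  // "positive", "negative", or "neutral" by the NLP stage.
  final case class ScoredTweet(topic: String, sentiment: String)

  // Count how many tweets fall under each sentiment label, per topic.
  def countByTopic(tweets: Seq[ScoredTweet]): Map[String, Map[String, Int]] =
    tweets.groupBy(_.topic).map { case (topic, forTopic) =>
      topic -> forTopic.groupBy(_.sentiment).map { case (label, xs) => label -> xs.size }
    }
}
```

For a given brand, the resulting inner map gives the number of tweets per label, which is the shape of data a Zeppelin chart could plot.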
The app is built with the following tools:
- Apache Kafka (for the data ETL and streaming data source parts)
- Apache Spark (for the data processing (NLP))
- Spark NLP (John Snow Labs)
- HDFS (Hadoop) (to store the app JAR file and the other files required to deploy the app: third-party JAR files, NLP models, etc.)
- MongoDB (to store the tweets and the machine learning computation results)
- Zeppelin (data visualization)
- Eclipse (as IDE)
In terms of computing resources, the app can be deployed in:
- local mode (using a Spark cluster in standalone mode; the app depends on the local machine)
- cluster mode (Mesos cluster, using a ZooKeeper quorum)
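
The two deployment targets above differ mainly in the master URL handed to Spark. The sketch below builds that URL for each mode, following Spark's documented `spark://host:port` and `mesos://zk://host1:port,host2:port/mesos` formats. The `DeployMode` types, the `MasterUrl` helper, and all host names and ports are hypothetical placeholders, not values from this project.

```scala
// Hypothetical model of the two deployment targets.
sealed trait DeployMode
case object LocalStandalone extends DeployMode
final case class MesosCluster(zkQuorum: Seq[String]) extends DeployMode

object MasterUrl {
  // Build the Spark master URL for the chosen mode.
  // Port 7077 is the default for a Spark standalone master.
  def masterUrl(mode: DeployMode,
                standaloneHost: String = "localhost",
                standalonePort: Int = 7077): String = mode match {
    case LocalStandalone =>
      s"spark://$standaloneHost:$standalonePort"
    case MesosCluster(zkQuorum) =>
      require(zkQuorum.nonEmpty, "at least one ZooKeeper host:port is required")
      // The Mesos master is discovered through the ZooKeeper quorum.
      s"mesos://zk://${zkQuorum.mkString(",")}/mesos"
  }
}
```

The returned string could be passed to `spark-submit --master` or to `SparkConf.setMaster`, so the same app JAR can target either environment without code changes.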
The app is written in Scala.
- Kafka Connect: source (Twitter API) and sink (MongoDB) connectors
- MongoDB collections
- Eclipse IDE project (using Maven (pom.xml file))
- Apache Spark (Spark SQL, Spark Streaming, Spark ML, Spark NLP)
- HDFS (folder structure for the app)
- Zeppelin (MongoDB interpreter to read data stored in collections)
- Mesos resource manager (for cluster mode deployment; the cluster is built on AWS EC2 instances)