sparktacusdemo/tweets_realtime_nlp_analysis

Tweets Real-Time NLP Analysis

Project's Presentation

Purpose

This Big Data system analyzes tweets in real time by applying NLP algorithms. The application provides insights about a specific subject or theme. For example, it can classify the sentiment (negative, positive, neutral) of a set of tweets concerning a specific topic, brand, or personality. Designed to be reliable and scalable, the system operates in a fully distributed environment.

Technical Environment

The app is built as a scalable system using the following tools:

  • Apache Kafka (data ETL and streaming data source)
  • Apache Spark (data processing, NLP)
  • Spark NLP (John Snow Labs)
  • HDFS (Hadoop), to store the app jar file and the other files required to deploy the app (third-party jars, NLP models, etc.)
  • MongoDB (to store the tweets and the machine learning computation results)
  • Zeppelin (data visualization)
  • Eclipse (as IDE)
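
As a rough illustration of how the Kafka and Spark pieces listed above can fit together, the sketch below reads tweet text from a Kafka topic with Spark Structured Streaming and prints it to the console. It is not taken from the project's code: the broker address `localhost:9092`, the topic name `tweets`, and the object name `TweetStreamSketch` are assumptions, and the project itself may use the older DStream-based Spark Streaming API instead. The Kafka source also requires the `spark-sql-kafka-0-10` artifact on the classpath.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical sketch: broker address and topic name are placeholders.
object TweetStreamSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("tweets-stream-sketch")
      .getOrCreate()

    // Subscribe to a Kafka topic; each record's value is assumed to hold the tweet payload.
    val tweets = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "tweets")
      .load()
      .selectExpr("CAST(value AS STRING) AS text")

    // Print each micro-batch to the console; a real job would apply NLP and persist results.
    val query = tweets.writeStream
      .format("console")
      .outputMode("append")
      .start()

    query.awaitTermination()
  }
}
```

In a full pipeline, the console sink would be replaced by the NLP stage and a write to MongoDB.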

In terms of computing resources, the app can be deployed in:

  1. local mode (Spark standalone cluster; the app depends on the local machine)
  2. cluster mode (Mesos cluster, using a ZooKeeper quorum)

The app is written in Scala.
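
The two deployment modes differ mainly in the Spark master URL the app is started with. The helper below is a minimal sketch of that mapping, using the standard `spark://` and `mesos://zk://` URL schemes. The type names (`DeployMode`, `Standalone`, `MesosWithZk`), the host names, and the `/mesos` ZooKeeper path are hypothetical and not taken from the project.

```scala
// Hypothetical model of the two deployment modes described above.
sealed trait DeployMode
case class Standalone(host: String, port: Int = 7077) extends DeployMode
case class MesosWithZk(zkHosts: Seq[String], zkPath: String = "/mesos") extends DeployMode

object MasterUrl {
  // Builds the Spark master URL for the given deployment mode.
  def of(mode: DeployMode): String = mode match {
    case Standalone(host, port) =>
      s"spark://$host:$port"
    case MesosWithZk(zkHosts, zkPath) =>
      // ZooKeeper quorum members are comma-separated, followed by the znode path.
      s"mesos://zk://${zkHosts.mkString(",")}$zkPath"
  }
}
```

The resulting string can be passed to `SparkSession.builder().master(...)` or to the `--master` option of `spark-submit`.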

Workflow

[Figure: workflow schema]

Points to set

  • Kafka Connect: source connector (Twitter API) and sink connector (MongoDB)
  • MongoDB collections
  • Eclipse IDE project (using Maven, pom.xml file)
  • Apache Spark (Spark SQL, Spark Streaming, Spark ML, Spark NLP)
  • HDFS (folder structure for the app)
  • Zeppelin (MongoDB interpreter to read the data stored in the collections)
  • Mesos resource manager (for cluster mode deployment only; the cluster is built on AWS EC2 instances)
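
To give an idea of what the Spark NLP part of the setup can look like, here is a minimal sketch that runs a pretrained sentiment pipeline on two sample sentences. It assumes the publicly available `analyze_sentiment` pipeline from John Snow Labs and downloads it on first use; the project itself stores its NLP models on HDFS and may use a different model or a custom pipeline. The object name and sample texts are made up for illustration.

```scala
import com.johnsnowlabs.nlp.SparkNLP
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline

// Hypothetical sketch: not the project's actual pipeline.
object SentimentSketch {
  def main(args: Array[String]): Unit = {
    // Starts (or reuses) a SparkSession configured for Spark NLP.
    val spark = SparkNLP.start()

    // Assumption: the pretrained "analyze_sentiment" pipeline for English.
    val pipeline = PretrainedPipeline("analyze_sentiment", lang = "en")

    // Made-up sample texts standing in for tweets.
    val sampleTweets = Seq("I love this brand", "Worst service ever")

    sampleTweets.foreach { text =>
      // annotate returns a map from output column name to its annotation results.
      val annotations: Map[String, Seq[String]] = pipeline.annotate(text)
      val labels = annotations.getOrElse("sentiment", Seq.empty)
      println(s"$text -> ${labels.mkString(",")}")
    }

    spark.stop()
  }
}
```

For streaming data, the same pipeline can be applied to a DataFrame with a `text` column through `pipeline.transform(...)` rather than one string at a time.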

Zeppelin Notebook

[Figure: Zeppelin dashboard]
