I implemented a Twitter sentiment analysis model on Apache Spark. It uses FastText embeddings and deep learning models (LSTM, GRU, and CNN) built with the Analytics Zoo library. The work also includes a DataFrame-based pre-processing framework that performs much better than RDD-based architectures in both processing time and the volume of data it can handle. In addition, I used MongoDB and Apache Cassandra as the model's databases and compared them with Apache Spark's own file storage and retrieval.
We also published an article introducing the DataFrame-based pre-processing framework, available here: https://jad.shahroodut.ac.ir/article_2394.html
I hope this will be useful for you ;)
- Importing libraries (you will probably need to install some of them, such as Analytics Zoo and findspark)
- Initialize the Apache Spark cluster
- Import and read the Sentiment140 dataset with pandas (you will need to change the dataset's path)
- Import FastText embeddings with gensim
- Pre-process tweets, including cleansing, tokenizing, padding, and vectorizing (this step is implemented in two ways: RDD-based and DataFrame-based)
- Configuration of Apache Cassandra and MongoDB on Apache Spark
- Sentiment Analysis models
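
The pre-processing step listed above (cleansing, tokenizing, padding, vectorizing) can be sketched in plain Python as follows. This is a minimal illustration, not the code from this repository: the function names, the `<pad>` token, the cleaning rules, and the toy embedding table are all assumptions made for the example. A plain dict stands in for the FastText vectors loaded with gensim.

```python
import re

PAD_TOKEN = "<pad>"  # hypothetical padding token, chosen for this sketch


def clean_tweet(text):
    """Lowercase a tweet and strip URLs, @mentions, and non-letter characters."""
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # remove URLs
    text = re.sub(r"@\w+", " ", text)                   # remove user mentions
    text = re.sub(r"[^a-z\s]", " ", text)               # drop digits, punctuation, '#'
    return re.sub(r"\s+", " ", text).strip()            # collapse whitespace


def tokenize(text):
    """Split a cleaned tweet into whitespace-separated tokens."""
    return text.split()


def pad(tokens, length):
    """Truncate or right-pad a token list to exactly `length` tokens."""
    tokens = tokens[:length]
    return tokens + [PAD_TOKEN] * (length - len(tokens))


def vectorize(tokens, embeddings, dim):
    """Map each token to its embedding; unknown and padding tokens become zeros."""
    zero = [0.0] * dim
    return [list(embeddings[tok]) if tok in embeddings else list(zero)
            for tok in tokens]


def preprocess(text, embeddings, dim, length):
    """Run the full pipeline: cleanse, tokenize, pad, vectorize."""
    return vectorize(pad(tokenize(clean_tweet(text)), length), embeddings, dim)


# Toy 2-dimensional embedding table standing in for real FastText vectors.
toy_embeddings = {"love": [1.0, 0.5], "this": [0.2, 0.1]}
matrix = preprocess("@user I LOVE this!! http://t.co/abc", toy_embeddings, dim=2, length=5)
```

Each tweet ends up as a fixed-size `length` x `dim` matrix, which is the shape a recurrent or convolutional model expects. In a DataFrame-based variant, a function like `preprocess` could be wrapped as a Spark UDF and applied to the tweet column, whereas an RDD-based variant would apply it with `map`.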