Text stream analysis. Technologies used : Spark, Kafka, Cassandra, Web technologies

Lal4Tech/DeepTextStreamAnalyzer

Deep Text Stream Analysis

Introduction

Extracting valid, novel, actionable, and understandable information from large amounts of data is important in many domains. The data can come from various sources, can be structured or unstructured, and can be either static or streaming. Collecting, storing, pre-processing, and analyzing the data, and communicating the results, bring many challenges. The methods used at each stage of processing vary with the behaviour of the data.

In this project we analyze a stream of text data, with emphasis on both the pre-processing and analysis stages. Where possible, we also implement visualizations that showcase statistics and analysis results.

System Architecture

The architecture of the system is given below.

[Figure: system architecture]

The system contains the following main components:

  • Stream generator
  • Message broker
  • Master application
  • Sentiment analyzer
  • Data storage

The stream generator reads text data from the dataset [1] and creates a JSON string in the following sample format:

{ id = "1467815924", date = "Mon Apr 06 22:19:49 PDT 2009", text = "@alielayus I want to go to promote GEAR AND GROOVE but unfortunately no ride there..." }
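The message construction described above can be sketched in Python as follows. This is a minimal illustration, not the project's generator (which is run with sbt). It assumes the Sentiment140 CSV column order polarity, id, date, query, user, text, and it emits standard JSON syntax (quoted keys, colons), which is an assumption about the wire format. The function names, the polarity value, and the user name in the sample row are hypothetical placeholders.

```python
import csv
import io
import json

# Assumed Sentiment140 column order (an assumption, not taken from the project code).
SENTIMENT140_COLUMNS = ["polarity", "id", "date", "query", "user", "text"]


def row_to_message(row):
    """Build the JSON message for one CSV row, keeping only id, date and text."""
    record = dict(zip(SENTIMENT140_COLUMNS, row))
    return json.dumps(
        {"id": record["id"], "date": record["date"], "text": record["text"]}
    )


def messages_from_csv(lines):
    """Yield one JSON string per CSV line (any iterable of lines, e.g. an open file)."""
    for row in csv.reader(lines):
        if len(row) == len(SENTIMENT140_COLUMNS):  # skip malformed rows
            yield row_to_message(row)


# Sample row: polarity "0" and user "some_user" are placeholders for illustration.
sample = io.StringIO(
    '"0","1467815924","Mon Apr 06 22:19:49 PDT 2009","NO_QUERY","some_user",'
    '"@alielayus I want to go to promote GEAR AND GROOVE but unfortunately no ride there..."\n'
)
for message in messages_from_csv(sample):
    print(message)
```

Each yielded string would then be published to the Kafka topic by a producer of your choice.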

This JSON is then sent to the message broker, Kafka. The master application receives the messages and converts the underlying RDDs to DataFrames. As part of pre-processing we remove URLs from the text, as they seldom contribute to the analysis. This step adds a new column named formatted_text to the DataFrame, which is then passed to the sentiment analysis stage. For sentiment estimation we use Stanford NLP [2], which uses Recursive Neural Tensor Networks and the Sentiment Treebank. The sentiment analysis is integrated into the Spark Streaming application through a user-defined function (UDF). The resulting sentiment is added to the DataFrame, which is then written to the Cassandra database.

For analytics, a Python notebook is used, which shows some basic analysis. It reads the data stored in the Cassandra database and transforms it into a pandas DataFrame to carry out the operations.
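The URL-removal step can be illustrated with a small Python sketch. The actual application performs this step on a Spark DataFrame; the regular expression and the function names below are assumptions chosen for illustration, and only the formatted_text field name comes from the description above.

```python
import re

# Matches http(s):// links and bare www. links up to the next whitespace.
# The exact pattern used by the application is not documented; this is an assumption.
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)


def remove_urls(text):
    """Return text with URLs removed and leftover whitespace collapsed."""
    without_urls = URL_PATTERN.sub("", text)
    return " ".join(without_urls.split())


def add_formatted_text(record):
    """Return a copy of the record with a formatted_text field added."""
    return {**record, "formatted_text": remove_urls(record["text"])}


if __name__ == "__main__":
    tweet = {"id": "1", "text": "reading http://example.com/post right now"}
    print(add_formatted_text(tweet)["formatted_text"])
```

In a Spark job the same function could be wrapped in a UDF and applied to the text column, in the same way the sentiment analyzer is attached.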

Tools

  • Big Data: Apache Spark, Cassandra, Kafka
  • Development tools: Scala, Python
  • Natural Language Processing Algorithms : Stanford NLP

Data Set

Source: Sentiment140 dataset with 1.6 million tweets [3]

Running the application

Kafka

//export environment variables

export KAFKA_HOME="/usr/local/kafka"

export PATH=$KAFKA_HOME/bin:$PATH

//start zookeeper

$KAFKA_HOME/bin/zookeeper-server-start.sh $KAFKA_HOME/config/zookeeper.properties

//start kafka server

$KAFKA_HOME/bin/kafka-server-start.sh $KAFKA_HOME/config/server.properties

//create Kafka topic

$KAFKA_HOME/bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic text_anlyz

Cassandra

//export environment variables

export CASSANDRA_HOME="/usr/local/cassandra"

export PYTHON_HOME="/home/harilal/anaconda2"

export PATH=$PYTHON_HOME/bin:$CASSANDRA_HOME/bin:$PATH

//Start Cassandra in the foreground

$CASSANDRA_HOME/bin/cassandra -f

//Start the cqlsh prompt

$CASSANDRA_HOME/bin/cqlsh

//Create keyspace

create keyspace textanlyz_space with replication = {'class': 'SimpleStrategy', 'replication_factor': 1};

//Create table

use textanlyz_space;

CREATE TABLE IF NOT EXISTS textanlyz_space.txt_anlyz_stats (id text PRIMARY KEY, date text, text_data text, formatted_text text, sentiment text);

desc txt_anlyz_stats;

//Check table content

select * from txt_anlyz_stats;

select id, date, text_data, sentiment from txt_anlyz_stats limit 10;

Stream

//Generate streaming input

cd Project/DeepTextStreamAnalyzer/src/main/generator

sbt run

Application

//Run the application

cd Project/DeepTextStreamAnalyzer/src/main/analyzer

sbt run

Analytics Notebook

Install the Cassandra Python driver (pip install cassandra-driver) before running the notebook commands.
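As a rough sketch of what the notebook does, the following Python reads rows from the table created in the Cassandra section and loads them into pandas. It assumes Cassandra is running on 127.0.0.1 with the default port and that the cassandra-driver and pandas packages are installed. The function names are hypothetical, and the notebook in the repository may be organised differently.

```python
from collections import Counter

KEYSPACE = "textanlyz_space"
TABLE = "txt_anlyz_stats"


def fetch_rows(host="127.0.0.1", limit=1000):
    """Read up to `limit` rows from Cassandra and return them as a list of dicts."""
    # Imported lazily so the pure helpers below work without the driver installed.
    from cassandra.cluster import Cluster

    cluster = Cluster([host])
    try:
        session = cluster.connect(KEYSPACE)
        query = "SELECT id, date, text_data, sentiment FROM {} LIMIT {}".format(
            TABLE, int(limit)
        )
        # The driver's default row factory returns named tuples.
        return [row._asdict() for row in session.execute(query)]
    finally:
        cluster.shutdown()


def to_dataframe(rows):
    """Convert the fetched rows into a pandas DataFrame."""
    import pandas as pd

    return pd.DataFrame(rows)


def sentiment_counts(rows):
    """Count how many rows carry each sentiment label."""
    return Counter(row["sentiment"] for row in rows)


# Typical notebook usage (requires a running Cassandra instance):
#   rows = fetch_rows()
#   df = to_dataframe(rows)
#   print(sentiment_counts(rows))
```

Keeping the counting helper independent of the driver makes it easy to try out on a handful of hand-written rows before connecting to the database.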

References

[1]. "Sentiment140 - A Twitter Sentiment Analysis Tool", Sentiment140, help.sentiment140.com/for-students.

[2]. Socher, Richard, et al. "Recursive deep models for semantic compositionality over a sentiment treebank." Proceedings of the 2013 conference on empirical methods in natural language processing. 2013.

[3]. Go, A., Bhayani, R. and Huang, L., 2009. "Twitter sentiment classification using distant supervision." CS224N Project Report, Stanford, 1(2009), p.12.
