This project streams and ingests the Twitter feed using Apache Flume. The tweets are stored in a Hive data lake in Avro format. The data can be cleansed with tools such as OpenRefine or Pig, and the cleansed data can then be used for visualization.
- `twitter.conf` stores all the configuration required for ingesting tweets.
- `TwitterDataAvroSchema.avsc` contains the Avro schema.
- `avrodataread.q` creates a staging table using the Avro SerDe.
- `create_tweets_avro_table.q` creates a processing table with a well-defined DDL.
To run this software you need the following:
- Linux
- Hadoop 2.0
- Hive 2.0
- Flume
- Twitter Developer App Credentials
- Get credentials for developing Twitter apps.
- Write a `twitter.conf` file and replace the placeholder variables with the secret keys issued by Twitter, as sketched below.
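  A minimal sketch of what `twitter.conf` might contain, assuming Flume's bundled `org.apache.flume.source.twitter.TwitterSource`, a memory channel, and an HDFS sink; the channel and sink names are illustrative and the key values are placeholders:

  ```properties
  TwitterAgent.sources  = Twitter
  TwitterAgent.channels = MemChannel
  TwitterAgent.sinks    = HDFS

  # Twitter source: emits tweets as Avro events
  TwitterAgent.sources.Twitter.type = org.apache.flume.source.twitter.TwitterSource
  TwitterAgent.sources.Twitter.channels = MemChannel
  TwitterAgent.sources.Twitter.consumerKey = YOUR_CONSUMER_KEY
  TwitterAgent.sources.Twitter.consumerSecret = YOUR_CONSUMER_SECRET
  TwitterAgent.sources.Twitter.accessToken = YOUR_ACCESS_TOKEN
  TwitterAgent.sources.Twitter.accessTokenSecret = YOUR_ACCESS_TOKEN_SECRET

  # HDFS sink: lands the Avro files under /user/flume/tweets
  TwitterAgent.sinks.HDFS.type = hdfs
  TwitterAgent.sinks.HDFS.channel = MemChannel
  TwitterAgent.sinks.HDFS.hdfs.path = /user/flume/tweets
  TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
  TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
  TwitterAgent.sinks.HDFS.hdfs.rollCount = 10000

  # In-memory channel between the source and the sink
  TwitterAgent.channels.MemChannel.type = memory
  TwitterAgent.channels.MemChannel.capacity = 10000
  TwitterAgent.channels.MemChannel.transactionCapacity = 1000
  ```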
- Start the Flume agent, pointing it at `twitter.conf`:

  ```sh
  flume-ng agent -n TwitterAgent -f $FLUME_CONF_DIR/twitter.conf
  ```
- Get the schema from the Avro log file. Avro container files embed the writer schema as JSON in the file header, so `head` is enough to reveal it:

  ```sh
  hdfs dfs -cat /user/flume/tweets/FlumeData.* | head
  ```
- Copy the schema and save it in a file called `TwitterDataAvroSchema.avsc`.
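  The exact fields depend on what the source emits; trimmed to a few representative fields, the saved schema might look like this (the field names here are illustrative):

  ```json
  {
    "type": "record",
    "name": "Doc",
    "fields": [
      {"name": "id",               "type": "string"},
      {"name": "user_screen_name", "type": "string"},
      {"name": "created_at",       "type": "string"},
      {"name": "text",             "type": "string"},
      {"name": "retweet_count",    "type": "long"}
    ]
  }
  ```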
- Edit the file for readability.
- Write an HQL file called `avrodataread.q` that creates the `tweets` table using the AvroSerDe, pointing `TBLPROPERTIES` at the Avro schema file.
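  A sketch of the kind of DDL this file might contain, assuming the `.avsc` file has been uploaded to HDFS (the `avro.schema.url` path is an assumption; adjust it to wherever you put the schema):

  ```sql
  -- External staging table over the raw Flume output.
  -- Columns are derived from the Avro schema rather than declared here.
  CREATE EXTERNAL TABLE tweets
  ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
  STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
  LOCATION '/user/flume/tweets'
  TBLPROPERTIES ('avro.schema.url'='hdfs:///user/flume/TwitterDataAvroSchema.avsc');
  ```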
- Execute the file in the terminal:

  ```sh
  hive -f "FlumeHiveTwitterApp/Hive scripts/avrodataread.q"
  ```
- To create a table for processing and visualization, execute the file named `create_tweets_avro_table.q`:

  ```sh
  hive -f "FlumeHiveTwitterApp/Hive scripts/create_tweets_avro_table.q"
  ```
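  A sketch of what this script might contain, assuming it declares the columns explicitly and repopulates from the staging table (the table and column names are illustrative):

  ```sql
  -- Processing table with a well-defined DDL (STORED AS AVRO needs Hive 0.14+).
  CREATE TABLE tweets_processed (
    id               STRING,
    user_screen_name STRING,
    created_at       STRING,
    text             STRING,
    retweet_count    BIGINT
  )
  STORED AS AVRO;

  -- Populate it from the staging table created above.
  INSERT OVERWRITE TABLE tweets_processed
  SELECT id, user_screen_name, created_at, text, retweet_count
  FROM tweets;
  ```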
- Cleanse the data using tools like Pig or OpenRefine; a minimal Pig sketch follows.
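  For example, a minimal Pig pass, assuming a Pig version with built-in `AvroStorage` (the output path and field names are illustrative):

  ```pig
  -- Load the raw Avro tweets; the schema is embedded in the files.
  tweets = LOAD '/user/flume/tweets' USING AvroStorage();

  -- Drop empty tweets and keep only the fields of interest.
  clean = FILTER tweets BY text IS NOT NULL;
  slim  = FOREACH clean GENERATE id, user_screen_name, created_at, text;

  STORE slim INTO '/user/flume/tweets_clean' USING AvroStorage();
  ```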
- Visualize the data in a dashboard using tools like Tableau or d3.js.