Tweets-Analysis-in-Hadoop

Analyze a large file of tweet data in JSON format using Hive in Hadoop.

Note: I was given the JSON file (tweet). See my Sentiment Analysis with R project for how I scrape online reviews with R. Use process.py to transform the tweet data into a JSON format that the JSON SerDe can read. Tweet-json-hive is actually written in Hive's query language, not Java; I just saved it as a Java file so that it gets syntax highlighting in Sublime Text.
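The contents of process.py are not reproduced here, so the following is only a minimal sketch of the kind of transformation it might perform. It assumes the raw file holds either a single tweet object or a JSON array of tweets, and it relies on the fact that JSON SerDes for Hive generally expect one complete JSON object per line. The function name `to_json_lines` is hypothetical.

```python
import json


def to_json_lines(raw_text):
    """Convert raw tweet JSON into one compact JSON object per line.

    Hypothetical helper: assumes raw_text is a JSON array of tweets
    or a single tweet object.
    """
    data = json.loads(raw_text)
    # Accept either a list of tweets or a single tweet object.
    tweets = data if isinstance(data, list) else [data]
    # json.dumps escapes embedded newlines as \n, so every record
    # stays on a single physical line.
    return "\n".join(json.dumps(t, ensure_ascii=False) for t in tweets)
```

Writing the returned string to a file gives a line-delimited input that a Hive table declared with a JSON SerDe can be pointed at.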

Some interesting things to share:

Look at the data carefully. It was a huge file with thousands of rows and dozens of attributes for each tweet. I even discovered 2 other types of tweets (retweeted and quoted) that were nested inside the direct tweets, and I almost missed them.
Non-printing characters can mess up your results (like "\n" for a new line). Look at the NULL values, find the inconsistency in the text (that "\n"), and use regexp_replace to replace it.
Difference between CREATE TABLE and CREATE VIEW. Creating a view saved me a lot of processing time (especially with Hive) for non-key data tables, because a view stores only the query definition instead of writing out a new copy of the data.
SQL, SQL, and SQL. In the end, Hive's syntax is very similar to SQL. I'm glad I got to revisit several SQL functions that I had only "known" about before.
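The first two points can be illustrated with a short Python sketch. It is not the Hive code from this repository: it uses Python's `re.sub` as a stand-in for Hive's regexp_replace, and it assumes the nested tweets live under the keys `retweeted_status` and `quoted_status` and that each tweet has `id` and `text` fields, as in the classic Twitter API layout. The function names are hypothetical.

```python
import re

# Assumed field names, following the classic Twitter API tweet layout.
NESTED_KEYS = ("retweeted_status", "quoted_status")


def clean_text(text):
    """Replace runs of newlines, carriage returns and tabs with one space."""
    return re.sub(r"[\r\n\t]+", " ", text).strip()


def iter_tweets(tweet, kind="direct"):
    """Yield (kind, id, cleaned text) for a tweet and any tweets nested in it."""
    yield kind, tweet.get("id"), clean_text(tweet.get("text") or "")
    for key in NESTED_KEYS:
        nested = tweet.get(key)
        # Only descend when the nested tweet is actually present.
        if isinstance(nested, dict):
            yield from iter_tweets(nested, kind=key)
```

Calling `list(iter_tweets(tweet))` on a parsed tweet returns one row for the direct tweet plus one row for each retweeted or quoted tweet nested inside it, with line breaks already removed from the text.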
