Questions to be solved in this assignment:
- Get data from Stack Exchange: acquire the top 200,000 posts by view count (see the notes on Data Acquisition).
- Load them with Pig: using Pig or MapReduce, extract, transform, and load the data as applicable.
- Query them with Hive: using Hive and/or MapReduce, get:
  - I. The top 10 posts by score
  - II. The top 10 users by post score
  - III. The number of distinct users who used the word “Hadoop” in one of their posts
- Calculate TF-IDF with MapReduce: using MapReduce, calculate the per-user TF-IDF (submit only the top 10 terms for each of the top 10 users from Query 3.II). Note: plenty of versions of the code are available online in both Java and Python; just acknowledge the source and the changes you had to make to it.
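
To make the per-user TF-IDF step more concrete, here is a minimal pure-Python sketch that imitates the map and reduce phases in memory. It is not the assignment's reference solution and does not run on Hadoop. It assumes that each user's posts together form one "document", that term frequency is the term count divided by the user's total term count, and that inverse document frequency is `log(number of users / number of users who used the term)`. The function names (`tokenize`, `map_term_counts`, `reduce_term_counts`, `per_user_tfidf`) and the `(user, body)` input shape are hypothetical, chosen only for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def map_term_counts(posts):
    """Map phase: emit ((user, term), 1) for every term in every post.

    `posts` is an iterable of (user, body) pairs (an assumed input shape).
    """
    for user, body in posts:
        for term in tokenize(body):
            yield (user, term), 1


def reduce_term_counts(pairs):
    """Reduce phase: sum the emitted counts for each (user, term) key."""
    counts = Counter()
    for key, value in pairs:
        counts[key] += value
    return counts


def per_user_tfidf(posts, top_n=10):
    """Return {user: [(term, score), ...]} with the top_n terms per user."""
    counts = reduce_term_counts(map_term_counts(posts))

    terms_per_user = Counter()         # total number of terms written by each user
    users_per_term = defaultdict(set)  # which users used each term
    for (user, term), n in counts.items():
        terms_per_user[user] += n
        users_per_term[term].add(user)

    num_users = len(terms_per_user)
    scores = defaultdict(list)
    for (user, term), n in counts.items():
        tf = n / terms_per_user[user]
        idf = math.log(num_users / len(users_per_term[term]))
        scores[user].append((term, tf * idf))

    # Sort by descending score, breaking ties alphabetically by term.
    return {
        user: sorted(term_scores, key=lambda pair: (-pair[1], pair[0]))[:top_n]
        for user, term_scores in scores.items()
    }
```

Calling `per_user_tfidf(posts)` on the posts of the top 10 users from Query 3.II would give the ten highest-scoring terms for each of them. In a real MapReduce job the same three quantities (per-user term counts, per-user totals, and per-term user counts) are typically computed in separate chained jobs rather than in one process.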