By analyzing a massive collection of education-related tweets, the project explores whether higher tweet volumes correspond to significant trends in the education sector.
- Python Programming Language
- PySpark
- Google Cloud Platform
- Big Data Analysis
- Twitter user (author) identification, location analysis, timeline analysis, and tweet uniqueness analysis
1. Data Collection and Preprocessing:
- The dataset was provided by my university (University of Chicago). It consists of ~100 million tweets (~500 GB). The tweets were collected on topics such as education, schools, universities, learning, and knowledge sharing, but only a fraction of them are directly related to primary, secondary, or higher education.
- Combine individual JSON files and process them for analysis.
- Discard irrelevant tweets to focus on education-related content.
2. Exploratory Data Analysis (EDA):
- Conduct a comprehensive EDA to identify key variables suitable for profiling Twitter users.
- Identify fields that provide insights into message volume, retweets, and more.
- Discard poorly populated variables to streamline analysis.
3. Perform analysis on the following topics:
- Author Identification
- Geographical Distribution Analysis
- Timeline Analysis
- Message Uniqueness Analysis
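
The steps above can be illustrated with a minimal sketch. It is written in plain Python so that it runs on a handful of records without a cluster; the project itself used PySpark on Google Cloud Platform, and this is not the project's actual code. The keyword list, the helper names (`parse_tweets`, `is_education_related`, `summarize`), and the assumed fields (`text`, `created_at`, `user.screen_name`, `user.location`) are illustrative assumptions about the dataset's schema and filtering rules.

```python
import json
from collections import Counter
from datetime import datetime

# Hypothetical keyword list: the project's real relevance filter is not specified.
EDU_KEYWORDS = ("school", "university", "college", "education", "teacher", "student")

# Timestamp layout assumed for the `created_at` field,
# e.g. "Wed Oct 10 20:19:24 +0000 2018".
CREATED_AT_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def parse_tweets(lines):
    """Yield one dict per newline-delimited JSON record, skipping bad lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            yield json.loads(line)
        except json.JSONDecodeError:
            continue  # drop malformed records instead of failing the whole run


def is_education_related(tweet):
    """Keep a tweet if its text contains any of the assumed keywords."""
    text = (tweet.get("text") or "").lower()
    return any(keyword in text for keyword in EDU_KEYWORDS)


def summarize(tweets):
    """Count authors, locations, and daily volume, and measure text uniqueness."""
    authors, locations, per_day = Counter(), Counter(), Counter()
    distinct_texts = set()
    total = 0
    for tweet in tweets:
        if not is_education_related(tweet):
            continue
        total += 1
        user = tweet.get("user") or {}
        authors[user.get("screen_name") or "unknown"] += 1
        locations[user.get("location") or "unknown"] += 1
        created = datetime.strptime(tweet["created_at"], CREATED_AT_FORMAT)
        per_day[created.date().isoformat()] += 1
        # Normalize case and whitespace so trivial copies count as duplicates.
        distinct_texts.add((tweet.get("text") or "").strip().lower())
    return {
        "total": total,
        "authors": authors,
        "locations": locations,
        "per_day": per_day,
        "uniqueness": len(distinct_texts) / total if total else 0.0,
    }
```

Calling `summarize(parse_tweets(open("tweets.json")))` on a hypothetical newline-delimited file returns the four summaries in one pass. At the full ~100 million tweet scale, the same logic would more plausibly be expressed as PySpark DataFrame filters and `groupBy` aggregations.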