Scraping Reddit for insights/sentiments on the Israel-Palestine conflict

aloysiusng/Reddit-big-data-pipeline

is459-assignment2

Problem Statement

The Israel-Palestine conflict has drawn widespread attention recently due to the numerous reported war crimes and inhumane acts by the armed forces involved.

Bystanders, particularly those in other countries, may have less empathy for the people affected and may not understand what is happening or why.

Many of us turn to social media, such as forums like Reddit, to find out more; it is therefore essential to identify unreliable threads online so that our understanding is not biased.

Proposed Solution

A big data pipeline combined with NLP to tag posts and to provide insights on the topic as a whole.

Batch data ingestion and data processing, run daily.
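
The repository does not spell out the tagging model, but a minimal sketch of the idea, assuming NLTK's VADER sentiment scorer, placeholder read-only Reddit credentials, and an illustrative subreddit/search query, could look like this:

# Illustrative only: tag Reddit post titles with a sentiment label.
# Requires: pip install praw nltk, then nltk.download("vader_lexicon")
import praw
from nltk.sentiment.vader import SentimentIntensityAnalyzer

reddit = praw.Reddit(
    client_id="your_client_id",          # placeholder credentials
    client_secret="your_client_secret",
    user_agent="your_user_agent",
)
analyzer = SentimentIntensityAnalyzer()

for post in reddit.subreddit("worldnews").search("Israel Palestine", limit=5):
    score = analyzer.polarity_scores(post.title)["compound"]
    tag = "positive" if score > 0.05 else "negative" if score < -0.05 else "neutral"
    print(f"[{tag}] {post.title}")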

Architecture

[Architecture diagram]

How to set up the project

  1. Clone the repository
  2. Change into the backend/lambda/lambda_layer directory and run the following commands:
chmod +x build.sh
./build.sh
  • 📝 Note: This step can be automated in a CI/CD pipeline. It creates a zip file containing the Python dependencies needed for the Lambda functions to run; to add more libraries, add them to the requirements.txt file and run the build.sh script again.
  3. In the terraform directory (change into it first!), create a terraform.tfvars file with the following content:
AWS_ACCESS_KEY_ID     = "your_aws_access_key_id"
AWS_SECRET_ACCESS_KEY = "your_aws_secret_access_key"
AWS_ACCOUNT_ID        = "your_aws_account_id"
AWS_REGION            = "your_aws_region"
  4. Run the following commands:
cd terraform
terraform init
terraform apply
  5. After the Terraform run is done, go to the AWS console and configure the Reddit credentials (client_id, client_secret, password, username, user_agent) in Secrets Manager; a sketch of how a Lambda function might read these credentials appears after this list.
  • 📝 Note: An alternative to Secrets Manager is to pass the secrets via the CI/CD pipeline directly into the environment variables of the Lambda functions.
  6. After the infrastructure is set up, you can either wait for the pipeline to run at 0000 hrs or trigger it manually from the AWS console (invoke the Lambda function, then run the Glue crawlers); a scripted version of this trigger is sketched after this list.

  7. Once the data is ready, you can query the data using Athena or QuickSight.

    • If using Athena, note that the data is partitioned by year, month, and day; an example partition-filtered query appears after this list.
  8. The prepare_s3_manifest.py script generates the manifest file that points QuickSight at the data in S3. The manifest file can be found in the manifests folder of the data bucket; copy its link and paste it when configuring QuickSight (the general manifest shape is sketched after this list).

  9. Ensure that the QuickSight user has access to the data bucket in S3.
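
For step 5, here is a minimal sketch of how a Lambda function might read the Reddit credentials back from Secrets Manager; the secret name "reddit-credentials" is hypothetical, so substitute whatever name the Terraform scripts create:

# Reads the Reddit credentials stored in step 5 and builds a PRAW client.
import json
import boto3
import praw

def get_reddit_client():
    secrets = boto3.client("secretsmanager")
    secret = json.loads(
        secrets.get_secret_value(SecretId="reddit-credentials")["SecretString"]
    )
    return praw.Reddit(
        client_id=secret["client_id"],
        client_secret=secret["client_secret"],
        username=secret["username"],
        password=secret["password"],
        user_agent=secret["user_agent"],
    )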
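
For step 6, the manual trigger can also be scripted with boto3 instead of clicking through the console; the function and crawler names below are hypothetical, so substitute the ones created by Terraform:

# Invoke the ingestion Lambda, then start the Glue crawlers.
import boto3

boto3.client("lambda").invoke(FunctionName="reddit-ingestion-lambda")

glue = boto3.client("glue")
for crawler in ["reddit-posts-crawler", "reddit-comments-crawler"]:
    glue.start_crawler(Name=crawler)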
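
For step 7, a sketch of an Athena query that filters on the year/month/day partitions; the database, table, and results bucket names are hypothetical:

# Query one day of data; partition filters keep the scan (and cost) small.
import boto3

athena = boto3.client("athena")
athena.start_query_execution(
    QueryString="""
        SELECT title, score
        FROM reddit_db.posts
        WHERE year = '2023' AND month = '11' AND day = '01'
    """,
    ResultConfiguration={"OutputLocation": "s3://your-athena-results-bucket/"},
)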
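
For step 8, the general shape of an S3 manifest that QuickSight accepts is sketched below; the actual prepare_s3_manifest.py in this repository may differ, and the bucket name and data format are hypothetical:

# Writes a QuickSight S3 manifest pointing at the processed-data prefix.
import json

manifest = {
    "fileLocations": [{"URIPrefixes": ["s3://your-data-bucket/processed/"]}],
    "globalUploadSettings": {"format": "CSV"},
}

with open("manifest.json", "w") as f:
    json.dump(manifest, f, indent=2)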

Costs

Do note that costs are incurred when running the pipeline: the Glue crawler will crawl the data in S3, and the Glue job run will also be charged. Link to the AWS cost calculator: https://calculator.aws/#/

Future Works

  • Hopefully, Reddit will embed these sentiments into its UI for posts and comments
  • Make the pipeline more generic so that it can be used for other topics
  • Make the pipeline real-time
  • Add other data sources such as Twitter, Facebook, Instagram, etc.
  • Add more NLP features such as sentiment analysis, topic modelling, etc.
  • Add more visualizations in QuickSight or in other visualization tools such as Tableau, Power BI, etc.
  • Hopefully, the conflict will end soon.

Credits

Followed this guide to set up the Reddit API: https://rymur.github.io/setup

Used the PRAW Python library to interact with the Reddit API (documentation: https://praw.readthedocs.io/en/stable/index.html).
