# Big-Data-ETL

This project performs the ETL (extract, transform, load) process entirely in the cloud and uploads a DataFrame to an Amazon RDS instance.

## Instructions

  1. To run the project, clone the repository containing the `part_one_reviews_us_Software` and `part_one_reviews_us_Grocery` notebooks: `git clone https://github.com/BharatGuturi/Big-Data-ETL.git`

  2. Create an AWS account.

  3. Create an RDS instance and an S3 bucket in AWS.

  4. Open the `.ipynb` files in Google Colab.

  5. Add a bucket policy to the S3 bucket so its objects are readable, for example (replace `<bucket_name>` with your bucket's name):

         {
           "Version": "2012-10-17",
           "Statement": [
             {
               "Sid": "getobject",
               "Effect": "Allow",
               "Principal": "*",
               "Action": "s3:GetObject",
               "Resource": "arn:aws:s3:::<bucket_name>/*"
             }
           ]
         }

  6. Install the required libraries using the following commands:

         pip install pypandoc
         pip install pyspark

  7. In the following commands in the 'Load' section, replace `<endpoint>` with the endpoint of the RDS instance created in AWS and `<database_name>` with the database name. In the config command, use the username and password set when creating the database:

         jdbc_url = "jdbc:postgresql://<endpoint>:5432/<database_name>"

         config = {"user": "<username>", "password": "<password>", "driver": "org.postgresql.Driver"}

  8. Run the notebooks `part_one_reviews_us_Software` and `part_one_reviews_us_Grocery`.