This project performs the ETL process entirely in the cloud and uploads a DataFrame to an Amazon RDS instance.
- To run the project, clone the repository, which contains the files `part_one_reviews_us_Software` and `part_one_reviews_us_Grocery`: `git clone https://github.com/BharatGuturi/Big-Data-ETL.git`
- Create an AWS account.
- Create an RDS instance and an S3 bucket in AWS.
- Open the `.ipynb` files in Google Colab.
- Set the S3 bucket policy so the uploaded objects are publicly readable (replace `<bucket_name>` with the name of your bucket):

        {
          "Version": "2012-10-17",
          "Statement": [
            {
              "Sid": "getobject",
              "Effect": "Allow",
              "Principal": "*",
              "Action": "s3:GetObject",
              "Resource": "arn:aws:s3:::<bucket_name>/*"
            }
          ]
        }
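The notebooks pull the review data from S3 into a Spark DataFrame. A minimal sketch of that Extract step, assuming a tab-separated file uploaded to your own bucket (the URL and file name below are placeholders, not the project's actual paths):

```python
from pyspark.sql import SparkSession
from pyspark import SparkFiles

spark = SparkSession.builder.appName("BigDataETL").getOrCreate()

# Placeholder object URL -- substitute the file in your own bucket.
url = "https://<bucket_name>.s3.amazonaws.com/reviews_us_Software.tsv.gz"
spark.sparkContext.addFile(url)

# The Amazon review datasets are tab-separated with a header row.
df = spark.read.csv(
    SparkFiles.get("reviews_us_Software.tsv.gz"), sep="\t", header=True
)
df.show(5)
```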
- Install the required libraries using the following commands:

        pip install pypandoc
        pip install pyspark
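The Load step also needs the PostgreSQL JDBC driver on Spark's classpath. One way to provide it is to let Spark fetch the driver from Maven when the session is created; the version below is an example, not one pinned by this project:

```python
from pyspark.sql import SparkSession

# Fetch the PostgreSQL JDBC driver at session startup.
# "42.2.16" is an illustrative version, not required by the repo.
spark = (
    SparkSession.builder
    .appName("BigDataETL")
    .config("spark.jars.packages", "org.postgresql:postgresql:42.2.16")
    .getOrCreate()
)
```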
- In the following commands in the 'Load' section, put the endpoint of the RDS instance created in AWS in `<endpoint>` and the database name in `<database_name>`. In the config command, use the username and password chosen when the database was created:

        jdbc_url = "jdbc:postgresql://<endpoint>:5432/<database_name>"
        config = {"user": "<username>", "password": "<password>", "driver": "org.postgresql.Driver"}
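With those values filled in, the transformed DataFrame can be written to RDS through Spark's JDBC writer. A minimal sketch of that Load step (`df` is the DataFrame produced by the notebooks, and the table name is illustrative, not the repo's actual schema):

```python
# Write the transformed DataFrame to the RDS PostgreSQL instance.
# "software_reviews" is an example table name.
df.write.jdbc(
    url=jdbc_url,
    table="software_reviews",
    mode="append",
    properties=config,
)
```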
- Run the notebooks `part_one_reviews_us_Software` and `part_one_reviews_us_Grocery`.