# Big-Data-ETL

This project performs the ETL (extract, transform, load) process entirely in the cloud and uploads a DataFrame to an Amazon RDS instance.

## Instructions

  1. To run the project, clone the repository containing the `part_one_reviews_us_Software` and `part_one_reviews_us_Grocery` notebooks: `git clone https://github.com/BharatGuturi/Big-Data-ETL.git`

  2. Create an AWS account.

  3. Create an RDS instance and an S3 bucket in AWS.

  4. Open the `.ipynb` files in Google Colab.

  5. Add a bucket policy to the S3 bucket so its objects are readable, for example (replace `<bucket_name>` with your bucket's name):

         {
           "Version": "2012-10-17",
           "Statement": [
             {
               "Sid": "getobject",
               "Effect": "Allow",
               "Principal": "*",
               "Action": "s3:GetObject",
               "Resource": "arn:aws:s3:::<bucket_name>/*"
             }
           ]
         }

  6. Install the required libraries using the following commands:

         pip install pypandoc
         pip install pyspark

  7. In the following commands in the 'Load' section, replace `<endpoint>` with the endpoint of the RDS instance created in AWS and `<database_name>` with the database name. In the config command, use the username and password set when creating the database:

         jdbc_url = "jdbc:postgresql://<endpoint>:5432/<database_name>"

         config = {"user": "<username>", "password": "<password>", "driver": "org.postgresql.Driver"}

  8. Run the notebooks `part_one_reviews_us_Software` and `part_one_reviews_us_Grocery`.