Big-Data-ETL

This project performs the ETL process completely in the cloud and uploads a DataFrame to an Amazon RDS instance. PySpark or SQL is then used to perform a statistical analysis of the selected data.

Instructions

  1. Clone the repository, which contains the files "part_one_reviews_us_Software" and "part_one_reviews_us_Grocery":

    git clone https://github.com/BharatGuturi/Big-Data-ETL.git

  2. Create an AWS account.

  3. Create a PostgreSQL RDS instance and an S3 bucket in AWS.

  4. Open the .ipynb files in Google Colab.

  5. Set the S3 bucket policy so the uploaded objects can be read (replace <bucket_name> with the name of your bucket):

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Sid": "getobject",
          "Effect": "Allow",
          "Principal": "*",
          "Action": "s3:GetObject",
          "Resource": "arn:aws:s3:::<bucket_name>/*"
        }
      ]
    }

  6. Install the required libraries using the following commands:

    pip install pypandoc
    pip install pyspark

  7. In the 'Load' section, replace <endpoint> with the RDS endpoint created in AWS and <database_name> with the database name. In the config command, use the username and password that were used to create the database (a usage sketch is given after these instructions):

    jdbc_url = "jdbc:postgresql://<endpoint>:5432/<database_name>"

    config = {"user": "<username>", "password": "<password>", "driver": "org.postgresql.Driver"}

  8. Run the files: "part_one_reviews_us_Software" and "part_one_reviews_us_Grocery"
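
The sketch below is a rough, hypothetical illustration of the flow the notebooks follow: extract one of the review datasets from S3 into a Spark DataFrame, compute a simple statistic with PySpark, and load the result into RDS using the jdbc_url and config from step 7. The bucket URL, file name, table name, and column names are placeholder assumptions, not the exact values used in the notebooks.

    from pyspark.sql import SparkSession
    from pyspark import SparkFiles

    # Start a Spark session with the PostgreSQL JDBC driver available.
    spark = SparkSession.builder \
        .appName("Big-Data-ETL") \
        .config("spark.jars.packages", "org.postgresql:postgresql:42.2.16") \
        .getOrCreate()

    # Extract: download the dataset (hypothetical S3 object URL) and read it into a DataFrame.
    url = "https://<bucket_name>.s3.amazonaws.com/amazon_reviews_us_Software_v1_00.tsv.gz"
    spark.sparkContext.addFile(url)
    df = spark.read.csv(SparkFiles.get("amazon_reviews_us_Software_v1_00.tsv.gz"),
                        sep="\t", header=True, inferSchema=True)

    # Transform: a simple statistic, e.g. the average star rating per product.
    summary_df = df.groupBy("product_id").avg("star_rating")

    # Load: write the result to the RDS database using the values from step 7.
    jdbc_url = "jdbc:postgresql://<endpoint>:5432/<database_name>"
    config = {"user": "<username>", "password": "<password>", "driver": "org.postgresql.Driver"}
    summary_df.write.jdbc(url=jdbc_url, table="review_summary", mode="append", properties=config)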
