This project provisions and runs an AWS data pipeline that:
- Ingests a CSV file uploaded to Amazon S3
- Triggers an AWS Lambda function on `ObjectCreated` events
- Starts an AWS Glue ETL job
- Transforms and loads the data into a DynamoDB table
- Load size achieved: 5.6 million rows successfully written into Amazon DynamoDB.
- End-to-end flow: Upload CSV → S3 event → Lambda → Glue ETL → DynamoDB items.
Note: Final load time and throughput depend on AWS region, Glue worker count, DynamoDB account limits, and dataset size.
- `createService.py`: One-click infrastructure setup. Creates:
  - S3 bucket
  - IAM roles (Lambda role + Glue role)
  - Glue job (ETL)
  - Lambda function (S3 trigger → Glue start)
  - S3 event notification (prefix `raw/` and suffix `.csv`)
  - Uploads the Glue ETL script to S3 under `scripts/…`
- `uploadDatasetToS3.py`: Uploads the local CSV to S3 under `raw/…` to trigger the pipeline.
- `deleteService.py`: Teardown script. Removes S3 notifications and permissions, then deletes:
  - Lambda function
  - Glue job
  - IAM roles
  - (Optionally) DynamoDB table
  - S3 bucket (and all objects in it)
- `dataset/`: Local CSV input (ignored by git).
- `extra/`: Misc/backup files (ignored by git except `extra/dataset link.txt`).
- Create S3 bucket
- Upload Glue ETL script to `s3://<bucket>/scripts/...`
- Create IAM roles
  - Lambda role: permissions for logs + starting Glue job + reading S3 objects
  - Glue role: permissions to read from S3 and write to DynamoDB (and Glue service role access)
- Create Glue job pointing at the uploaded ETL script
- Create Lambda function with environment variables:
  - `GLUE_JOB_NAME`
  - `DYNAMODB_TABLE_NAME`
- Configure S3 event trigger
  - Event: `s3:ObjectCreated:*`
  - Filter: prefix `raw/`, suffix `.csv`
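The event trigger described above can be sketched with boto3. This is a minimal illustration, not the repo's exact code; `build_notification_config` only constructs the request payload, so it can be inspected without AWS access:

```python
def build_notification_config(lambda_arn: str) -> dict:
    """S3 notification payload matching the prefix/suffix filter above."""
    return {
        "LambdaFunctionConfigurations": [
            {
                "LambdaFunctionArn": lambda_arn,
                "Events": ["s3:ObjectCreated:*"],
                "Filter": {
                    "Key": {
                        "FilterRules": [
                            {"Name": "prefix", "Value": "raw/"},
                            {"Name": "suffix", "Value": ".csv"},
                        ]
                    }
                },
            }
        ]
    }

def apply_notification(bucket: str, lambda_arn: str) -> None:
    # boto3 imported lazily so the builder above works without AWS installed
    import boto3
    boto3.client("s3").put_bucket_notification_configuration(
        Bucket=bucket,
        NotificationConfiguration=build_notification_config(lambda_arn),
    )
```

Note that the Lambda function must already grant S3 invoke permission (`lambda:AddPermission`) before this call succeeds.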
- Upload a CSV to `s3://<bucket>/raw/<file>.csv`
- S3 emits an event → Lambda is invoked
- Lambda calls Glue `StartJobRun` and passes:
  - `--S3_BUCKET`
  - `--S3_KEY`
  - `--DYNAMODB_TABLE`
- Glue ETL job:
  - Creates the DynamoDB table if it does not exist (partition key: `id`)
  - Reads the CSV from S3
  - Adds metadata columns (timestamp, source file)
  - Generates `id` if missing, drops duplicates, then writes to DynamoDB
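The Lambda side of this flow could look roughly like the following sketch (the handler name and event parsing are conventional; the argument names match the `--S3_BUCKET`/`--S3_KEY`/`--DYNAMODB_TABLE` flags above):

```python
import os
from urllib.parse import unquote_plus

def build_job_arguments(bucket: str, key: str, table: str) -> dict:
    """Glue job arguments in the --FLAG form start_job_run expects."""
    return {
        "--S3_BUCKET": bucket,
        "--S3_KEY": key,
        "--DYNAMODB_TABLE": table,
    }

def handler(event, context):
    # Hypothetical handler: pull bucket/key out of the S3 event record
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = unquote_plus(record["object"]["key"])  # keys arrive URL-encoded

    import boto3  # deferred so build_job_arguments is usable without AWS
    boto3.client("glue").start_job_run(
        JobName=os.environ["GLUE_JOB_NAME"],
        Arguments=build_job_arguments(
            bucket, key, os.environ["DYNAMODB_TABLE_NAME"]
        ),
    )
```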
- Input format: CSV with header row.
- Upload location: `s3://<bucket>/raw/`
- Trigger rule: only objects that match:
  - Prefix: `raw/`
  - Suffix: `.csv`
The Glue job writes items to a single DynamoDB table:
- Table name: configured in `createService.py` (`DYNAMODB_TABLE_NAME`)
- Partition key: `id` (string)
- Billing mode: `PAY_PER_REQUEST` (on-demand)
- Additional attributes:
  - `ingestion_timestamp`: load time metadata
  - `source_file`: S3 key that produced the record
- Other CSV columns: written as DynamoDB string attributes (via Glue mapping)
If the source CSV does not contain an `id` column, the ETL generates a deterministic ID by hashing the row contents.
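One possible implementation of that deterministic ID (a sketch; the actual hash algorithm and canonicalization in the ETL script may differ):

```python
import hashlib

def deterministic_id(row: dict) -> str:
    """Hash row contents in sorted key order so the same row always
    produces the same id, regardless of column ordering."""
    canonical = "|".join(f"{k}={row[k]}" for k in sorted(row))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Because the ID depends only on the row contents, re-ingesting the same file produces the same IDs, and the duplicate-drop step keeps the load idempotent.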
- S3 bucket:
  - Creates the bucket (or reuses it if it already exists)
  - Uploads the Glue ETL script to `scripts/…`
- IAM roles:
  - Lambda role: CloudWatch Logs + `glue:StartJobRun` + `s3:GetObject`
  - Glue role: read S3 + write DynamoDB + `AWSGlueServiceRole` managed policy
- Glue job:
  - Glue version configured in the script (example: Glue 4.0)
  - Worker configuration configured in the script (worker type + count + timeout)
- Lambda function:
  - Triggered by S3 `ObjectCreated` events filtered to `raw/*.csv`
  - Starts the Glue job and passes S3 bucket/key + DynamoDB table name
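Creating the Lambda role above could look roughly like this (a sketch: the role name is arbitrary, and the managed policy shown covers only CloudWatch logging; the `glue:StartJobRun` and `s3:GetObject` statements would be attached as an inline policy in the real script):

```python
import json

def lambda_trust_policy() -> dict:
    """Trust policy allowing the Lambda service to assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "lambda.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }

def create_lambda_role(role_name: str) -> str:
    import boto3  # deferred so the policy builder works without AWS
    iam = boto3.client("iam")
    resp = iam.create_role(
        RoleName=role_name,
        AssumeRolePolicyDocument=json.dumps(lambda_trust_policy()),
    )
    iam.attach_role_policy(
        RoleName=role_name,
        PolicyArn="arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole",
    )
    return resp["Role"]["Arn"]
```

The Glue role is analogous, with `glue.amazonaws.com` as the trusted principal and the `AWSGlueServiceRole` managed policy attached.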
- Uploads the local CSV file to the configured S3 bucket/key under `raw/…`
- This upload is what triggers the pipeline
- Removes S3 notification configuration and the Lambda invoke permission
- Deletes Lambda, Glue job, IAM roles, DynamoDB table (optional), and S3 bucket contents + bucket
- Requires typing `delete` to confirm
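The trickiest teardown step is the bucket, which must be emptied before it can be deleted. A sketch of that step plus the confirmation gate (names are illustrative):

```python
def confirmed(answer: str) -> bool:
    """Teardown proceeds only when the user types exactly 'delete'."""
    return answer.strip() == "delete"

def empty_and_delete_bucket(bucket_name: str) -> None:
    import boto3  # deferred so confirmed() is usable without AWS
    bucket = boto3.resource("s3").Bucket(bucket_name)
    bucket.objects.all().delete()  # remove every object first
    bucket.delete()                # then the (now empty) bucket
```

If bucket versioning was ever enabled, `bucket.object_versions.all().delete()` is needed instead, since `objects.all()` skips old versions and delete markers.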
- Python 3.x
- AWS credentials configured (AWS CLI, environment variables, or any standard `boto3` credential method)
- IAM permissions to create/manage:
  - S3 bucket + bucket notifications
  - Lambda function + permissions
  - Glue job
  - IAM roles + policies
  - DynamoDB table (if enabled/used)
These values are hard-coded in the scripts and must match across files:
- `AWS_REGION`
- `S3_BUCKET_NAME`
- `GLUE_JOB_NAME`
- `LAMBDA_FUNCTION_NAME`
- `DYNAMODB_TABLE_NAME`
- `CSV_UPLOAD_PREFIX` (expected to be `raw/`)

If you change resource names in `createService.py`, update the same names in `uploadDatasetToS3.py` and `deleteService.py`.
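One way to keep the three scripts in sync is a shared constants module that each imports. The repo hard-codes these values per script, so the module below is a hypothetical refactor, and every value shown is an example placeholder:

```python
# pipeline_config.py (hypothetical shared module; values are examples)
AWS_REGION = "us-east-1"
S3_BUCKET_NAME = "my-csv-pipeline-bucket"
GLUE_JOB_NAME = "csv-to-dynamodb-etl"
LAMBDA_FUNCTION_NAME = "start-glue-on-upload"
DYNAMODB_TABLE_NAME = "pipeline-items"
CSV_UPLOAD_PREFIX = "raw/"  # must stay "raw/" to match the S3 event filter
```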
Run:

```
python createService.py
```

Then run:

```
python uploadDatasetToS3.py
```

- Lambda logs: CloudWatch → Log groups → `/aws/lambda/<lambda-name>`
- Glue job: AWS Glue Console → Jobs → `<job-name>`
- DynamoDB table: DynamoDB Console → Tables → `<table-name>`
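Besides the consoles, the Glue job state can be polled programmatically; a small sketch (helper names are illustrative):

```python
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT", "ERROR"}

def is_terminal(state: str) -> bool:
    """True once a job run has finished, successfully or not."""
    return state in TERMINAL_STATES

def latest_glue_run_state(job_name: str) -> str:
    import boto3  # deferred so is_terminal works without AWS
    runs = boto3.client("glue").get_job_runs(
        JobName=job_name, MaxResults=1
    )["JobRuns"]
    return runs[0]["JobRunState"] if runs else "NO_RUNS"
```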
- Lambda triggered but Glue didn't start
  - Check Lambda logs for missing environment variables (`GLUE_JOB_NAME`, `DYNAMODB_TABLE_NAME`)
  - Confirm the Lambda IAM role allows `glue:StartJobRun`
- Upload doesn't trigger Lambda
  - Confirm the object key is under `raw/` and ends with `.csv`
  - Confirm the bucket notification configuration includes the Lambda ARN and filter rules
- Glue job fails reading CSV
  - Confirm the CSV has a header row
  - Confirm the Glue role has `s3:GetObject` permissions for the bucket
- DynamoDB throttling / slow writes
  - With on-demand tables, spikes may still be throttled depending on account limits
  - Reduce Glue parallelism or adjust DynamoDB write options in the ETL mapping/writer
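One knob for the second option is the DynamoDB writer's `connection_options` in the Glue script. The option names below follow AWS Glue's documented DynamoDB connection options, but verify them against the Glue docs for your Glue version before relying on exact behavior:

```python
def dynamodb_sink_options(table_name: str, write_percent: str = "0.5") -> dict:
    """connection_options for glueContext.write_dynamic_frame with
    connection_type="dynamodb"; lowering write_percent makes Glue
    throttle its own write rate."""
    return {
        "dynamodb.output.tableName": table_name,
        "dynamodb.throughput.write.percent": write_percent,
    }
```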
Running this pipeline may incur charges for:
- AWS Glue (workers and duration)
- Lambda invocations (typically small)
- DynamoDB writes/storage (potentially significant at millions of items)
- S3 storage and requests
Run:

```
python deleteService.py
```

You will be prompted to type `delete` to confirm.
Warning: This permanently deletes the S3 bucket (including all objects) and other pipeline resources.
