Built an ETL (Extract, Transform, Load) pipeline that transfers data from a set of CSV files within a directory into a single streamlined CSV file, which is then used to model and insert data into Apache Cassandra tables. Created different Cassandra data models, each optimized for a specific query we needed to execute.
- `event_data`: this folder contains all the CSV files that are the source of the ETL pipeline. All the data was provided by Udacity.
- `event_datafile_new.csv`: the output file of the ETL pipeline. This file is used to model and insert data into Apache Cassandra tables.
- `Project_1B_ Project_Template.ipynb`: contains the code for the ETL pipeline and the Cassandra data modeling.
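A minimal sketch of what the ETL step looks like, using the `os`, `glob`, and `csv` libraries listed below. The directory and file names come from this repo; the exact column indices and the selected output columns are assumptions for illustration and may differ from the notebook.

```python
import csv
import glob
import os

# Collect every CSV file under the event_data directory
file_paths = glob.glob(os.path.join(os.getcwd(), 'event_data', '*.csv'))

# Read all data rows from the source files into one list
full_data_rows = []
for path in file_paths:
    with open(path, 'r', encoding='utf8', newline='') as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row of each source file
        full_data_rows.extend(reader)

# Write the streamlined file, keeping only the columns the
# Cassandra tables need and skipping rows with no artist value
with open('event_datafile_new.csv', 'w', encoding='utf8', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['artist', 'firstName', 'gender', 'itemInSession',
                     'lastName', 'length', 'level', 'location',
                     'sessionId', 'song', 'userId'])
    for row in full_data_rows:
        if row[0] == '':  # assumed: artist is the first source column
            continue
        writer.writerow((row[0], row[2], row[3], row[4], row[5], row[6],
                         row[7], row[8], row[12], row[13], row[16]))
```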
- Apache Cassandra
- CQL
- Python
- Jupyter NoteBook
- cassandra
- os
- glob
- csv
- prettytable
- ETL: Build an ETL pipeline that transforms data from a set of CSV files within a directory into a streamlined CSV file that can be used to model and insert data into Apache Cassandra tables (sketched above).
- Data Modeling: Based on the ETL output file and the queries we need to run, build different Apache Cassandra data models, each optimized to return the expected output for its query. The most important thing is choosing the Partition Key and Clustering Columns properly, so that data is evenly distributed across the cluster and only the appropriate rows are fetched for the executed query (see the sketch after this list).
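A minimal sketch of the query-first modeling idea with the `cassandra` driver. The keyspace name (`sparkify`), table name (`song_by_session`), localhost address, and the example query ("which song was played at a given session and item in session") are assumptions for illustration, not necessarily the ones used in the notebook.

```python
from cassandra.cluster import Cluster

cluster = Cluster(['127.0.0.1'])  # assumes a local Cassandra instance
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS sparkify
    WITH REPLICATION = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")
session.set_keyspace('sparkify')

# One table per query. Here session_id is the partition key, so all
# rows for one session live in one partition; item_in_session is a
# clustering column, so rows within the partition are sorted by it
# and a single row can be fetched efficiently.
session.execute("""
    CREATE TABLE IF NOT EXISTS song_by_session (
        session_id int,
        item_in_session int,
        artist text,
        song text,
        length float,
        PRIMARY KEY ((session_id), item_in_session)
    )
""")

# Illustrative insert and read-back for the query this table serves
session.execute(
    "INSERT INTO song_by_session "
    "(session_id, item_in_session, artist, song, length) "
    "VALUES (%s, %s, %s, %s, %s)",
    (338, 4, 'Faithless', 'Music Matters (Mark Knight Dub)', 495.3073)
)
rows = session.execute(
    "SELECT artist, song, length FROM song_by_session "
    "WHERE session_id = %s AND item_in_session = %s",
    (338, 4)
)
for row in rows:
    print(row.artist, row.song, row.length)

cluster.shutdown()
```

Because Cassandra restricts `WHERE` clauses to the primary key, each query the application needs generally gets its own table with a primary key shaped for that query.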