GitHub Repository Methodology and File Execution Process for FDA Medical Device Recall Analysis
- dags/etl/load:
  - Contains scripts for pulling raw data from the FDA Open API.
  - Run `extract_data.py` to initiate the data extraction process.
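The repository's `extract_data.py` is not reproduced here; as an illustrative sketch of the extraction step, the snippet below queries openFDA's device-recall endpoint using only the standard library. The endpoint and the `search`/`limit`/`skip` parameters are documented openFDA features, but the helper names and the example search term are assumptions, not the project's actual code.

```python
import json
import urllib.parse
import urllib.request

OPENFDA_DEVICE_RECALL = "https://api.fda.gov/device/recall.json"

def build_query_url(search: str, limit: int = 100, skip: int = 0) -> str:
    """Build an openFDA device-recall query URL with paging parameters."""
    params = urllib.parse.urlencode({"search": search, "limit": limit, "skip": skip})
    return f"{OPENFDA_DEVICE_RECALL}?{params}"

def fetch_recalls(search: str, limit: int = 100, skip: int = 0) -> list:
    """Fetch one page of recall records; returns the JSON 'results' list."""
    with urllib.request.urlopen(build_query_url(search, limit, skip)) as resp:
        payload = json.load(resp)
    return payload.get("results", [])

if __name__ == "__main__":
    # Example (requires network access): first 10 recalls mentioning "pacemaker".
    for rec in fetch_recalls('product_description:"pacemaker"', limit=10):
        print(rec.get("recall_number"))
```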
- dags/etl/transformation:
  - Contains configurations and scripts for loading raw data into Amazon S3.
  - Run `store_in_s3.py` to securely store raw data in Amazon S3.
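For illustration, a storage step like `store_in_s3.py` might look as follows. The bucket layout, the date-partitioned key scheme, and the server-side-encryption choice are assumptions for this sketch; only the `boto3` `put_object` call itself is standard AWS SDK usage.

```python
import datetime

def raw_data_key(dataset: str, when: datetime.date) -> str:
    """Date-partitioned S3 key, e.g. raw/device_recall/2024/01/31.json (assumed layout)."""
    return f"raw/{dataset}/{when:%Y/%m/%d}.json"

def store_in_s3(bucket: str, dataset: str, body: bytes, when=None) -> str:
    """Upload raw bytes to S3 under a date-partitioned key; returns the key used."""
    import boto3  # imported lazily so the key helper stays dependency-free
    when = when or datetime.date.today()
    key = raw_data_key(dataset, when)
    boto3.client("s3").put_object(Bucket=bucket, Key=key, Body=body,
                                  ServerSideEncryption="AES256")
    return key
```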
- dags/etl/sparktransformation:
  - Dockerized scripts for isolating and running the data transformation process.
  - Execute `docker-compose up` in /data_transformation to run the transformation.
- docker-compose:
  - Utilizes Apache Airflow for managing the workflow, including scheduling and monitoring.
  - Use Airflow's web interface to monitor and trigger workflows.
- Execute `python data_ingestion.py` in /data_extraction to pull raw data from the FDA Open API.
  - Ensure proper API authentication and handle potential API rate limits.
  - Monitor extraction logs for any errors.
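openFDA enforces per-key request limits, so rate-limit handling usually means retrying with backoff. How `data_ingestion.py` actually handles this is not shown in the source; the generic sketch below (all names are illustrative) retries a fetch callable with exponential backoff when it signals an HTTP 429.

```python
import time

class RateLimitError(Exception):
    """Raised by a fetch callable when the API answers HTTP 429."""

def with_backoff(fetch, max_attempts: int = 5, base_delay: float = 1.0,
                 sleep=time.sleep):
    """Call fetch(), retrying with exponential backoff on RateLimitError."""
    for attempt in range(max_attempts):
        try:
            return fetch()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # out of retries; surface the rate limit to the caller
            sleep(base_delay * 2 ** attempt)  # wait 1s, 2s, 4s, ...
```

The `sleep` parameter is injected so the delay policy can be tested without actually waiting.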
- Run `python load.py` in /data_storage to store raw data securely in Amazon S3.
  - Implement error handling and logging to capture storage failures.
  - Verify data integrity in the S3 bucket.
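One concrete way to verify integrity after upload (the project's actual check is not shown, so this is a suggested sketch) is to compare the payload's MD5 digest with the object's S3 ETag. This comparison is only valid for single-part uploads, where S3's ETag is the object's MD5; multipart uploads use a different ETag scheme.

```python
import hashlib

def md5_hex(body: bytes) -> str:
    """Hex MD5 digest of the uploaded payload."""
    return hashlib.md5(body).hexdigest()

def matches_etag(body: bytes, etag: str) -> bool:
    """Compare a payload's MD5 with an S3 ETag (single-part uploads only).

    S3 returns the ETag wrapped in double quotes, so strip them first.
    """
    return md5_hex(body) == etag.strip('"')
```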
- Use Docker to create isolated environments for running data transformation scripts.
- Execute `docker-compose up` in /data_transformation to initiate the transformation process.
  - Monitor Docker logs for dependency or runtime issues.
  - Document dependencies and configurations within the Docker containers.
- Utilize Apache Airflow for orchestrating the data processing workflow.
- Access the Airflow web interface to monitor task execution and history.
- Schedule workflows using Airflow's scheduler.
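The actual DAG definitions live under dags/ and are not shown here; as a minimal Airflow 2.x sketch, an extract → transform → load pipeline could be wired up as below. The DAG id, schedule, and task split are assumptions, and the import is guarded so the placeholder callables still load where Airflow is not installed.

```python
from datetime import datetime

def extract():    # placeholder task bodies; real logic lives in the etl scripts
    return "extracted"

def transform():
    return "transformed"

def load():
    return "loaded"

try:
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    with DAG(dag_id="fda_recall_etl", start_date=datetime(2024, 1, 1),
             schedule_interval="@daily", catchup=False) as dag:
        t1 = PythonOperator(task_id="extract", python_callable=extract)
        t2 = PythonOperator(task_id="transform", python_callable=transform)
        t3 = PythonOperator(task_id="load", python_callable=load)
        t1 >> t2 >> t3  # run extraction, then transformation, then load
except ImportError:
    dag = None  # Airflow not installed; the task callables above remain usable
```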
- Load the transformed data into Amazon Redshift for efficient querying.
- Run necessary data quality checks to ensure integrity.
- Implement error handling for potential issues during transformation.
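A data quality check can be as simple as counting rows that are missing required fields and failing the run when the ratio exceeds a threshold. The field names and the 1% threshold below are assumptions for illustration, not the project's actual rules.

```python
# Fields assumed required in each recall record (illustrative choice).
REQUIRED_FIELDS = ("recall_number", "recall_status", "event_date_initiated")

def quality_report(rows):
    """Return (incomplete_rows, total_rows) for a batch of recall records."""
    bad = sum(1 for row in rows
              if any(not row.get(field) for field in REQUIRED_FIELDS))
    return bad, len(rows)

def check_quality(rows, max_bad_ratio: float = 0.01) -> bool:
    """Pass only if at most max_bad_ratio of rows are incomplete."""
    bad, total = quality_report(rows)
    return total > 0 and bad / total <= max_bad_ratio
```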
- Use Amazon EMR for large-scale data processing if needed.
- Execute EMR clusters with appropriate configurations.
- Monitor EMR cluster performance and costs.
- Use Tableau to create interactive dashboards in /visualization.
- Connect Tableau to Amazon Redshift for real-time data visualization.
- Update Tableau visualizations with the latest data.
- Maintain thorough documentation in README files for setup, configuration, and usage of each module.
- Include step-by-step guides for running scripts and processes.
- Document any known issues and troubleshooting steps.
- Utilize Git for version control, creating branches for feature development and bug fixes.
- Use GitHub Actions for continuous integration and automated testing.
- Encourage collaboration through pull requests and utilize GitHub Issues for tracking tasks.
Presentation: https://docs.google.com/presentation/d/1h6vaT9aVDvd0NMPH7OBFYhO0k0P7u95ml7VCzx1z-3s/edit?usp=sharing

Team:
- Charan Kanwal Preet Singh
- Shouvik Sengupta
- Sushil R Deore
- Shivam Sawhney