Welcome to my GitHub profile! I am a Data Engineer with hands-on experience building scalable data pipelines, integrating generative AI into production workflows, and optimizing complex ML architectures. I enjoy solving real-world problems with data, with a focus on performance, reliability, and compliance.
- Languages & Programming: Python, R, SQL, Java, C++, Scala
- Data Engineering & Distributed Systems: PySpark, Hive, Spark, Kafka, Airflow, Flink, Docker, Kubernetes, Jenkins, Git
- Generative AI & ML: Diffusion Models, Large Language Models (LLMs), CNN-RNN Architectures, Self-Attention, TensorFlow, PyTorch
- Cloud Services: AWS (Lambda, Rekognition, SageMaker, Glue), Azure Databricks, GCP (BigQuery, Vertex AI)
- Databases & Data Warehouses: Snowflake, Redshift, BigQuery, MySQL, PostgreSQL, DynamoDB, HBase
- Other Tools: Pentaho-ETL, Presto, Grafana, Datadog
- Overview: Built a CNN-RNN-based captioning model enhanced with self-attention mechanisms to improve feature representation.
- Performance: Leveraged mixed-precision (FP16) training and HPC resources, cutting memory usage by 40% and reducing training time by 50%.
- Key Achievements:
  - 15% improvement in BLEU, CIDEr, METEOR, and ROUGE scores.
  - Deployed real-time captioning on Kubernetes with a microservices architecture for high throughput and robust performance.
- Overview: Implemented a Latent Diffusion Model (LDM) by integrating U-Net, VAE, and CLIP for high-fidelity image generation.
- Performance: Fine-tuned CLIP for better text-image alignment, lowering FID by 20% (lower is better) and achieving sub-100ms inference latency via ONNX/TensorRT.
- Key Achievements:
  - Reduced training compute costs by 50% with FP16, DDP, and gradient checkpointing.
  - Built an efficient data pipeline (DiffusionDB, Parquet) that reduced I/O bottlenecks by 60%.
- Generative AI Pipelines: Expanded a unified data lakehouse on Apache Iceberg and Snowflake for large-scale model training on text, audio, and image data.
- Real-Time Processing: Built a real-time distributed data architecture with Apache Flink and Kafka, enabling near-instant event processing and reducing latency by 60%.
- AWS Rekognition Integration: Implemented real-time triggers with Lambda and Rekognition for image/video analysis, improving data verification accuracy by 25%.
- ETL Optimization & Cloud Migration: Migrated legacy Pentaho workflows to Flink, cutting report generation time by 70% and maintaining 99.9% uptime.
- Dynamic Pricing & Revenue Growth: Deployed a dynamic pricing model on SageMaker, boosting annual revenue by 35% and increasing customer satisfaction by 20%.
- Governance & Compliance: Enforced HIPAA compliance in Snowflake/Iceberg, strengthening stakeholder trust and mitigating regulatory risks.
- Legacy Pipeline Modernization: Migrated a legacy SCD2 pipeline to a modern tech stack with Firebase Authentication and GCP Nearby-Search API.
- API Performance: Improved API response times by 30%, supporting 10,000+ daily searches without compromising reliability.
- BigQuery Integration: Orchestrated real-time patient data in BigQuery, ensuring high performance and data-protection compliance.
Northeastern University
M.S. in Computer Software Engineering (Expected May 2025)
- Relevant Coursework: Generative AI, High Performance Parallel Compute with Deep Learning, Big Data and Indexing
- Activities & Achievements: Co-founder at CareWallet (Healthcare AI Startup), Project Lead at Google Developer Student Club
Ganpat University
B.S. in Computer Science and Engineering, Major in Big Data Analytics (July 2018 – May 2022)
- Relevant Coursework: Probability & Statistics, Advanced Cloud Computing, Advanced Big Data Analytics
- Activities & Achievements: Project Lead at Google Cloud Study Jam, 2x GCP Quest Leader at Google Cloud
- MLOps & Kubernetes: Automating end-to-end ML pipelines, including model versioning, deployment, and monitoring at scale.
- Generative AI: Further refining LLMs and latent diffusion models for text-to-image and text-to-audio synthesis.
- Real-Time Analytics: Experimenting with Apache Flink SQL for streaming data transformations and analytics.
- Email: patel.shrey4@northeastern.edu
- LinkedIn: linkedin.com/in/shreypatel4/
- GitHub: github.com/ShreyPatel4
- Location: Foster City, CA, USA
I love combining my passion for photography with data visualization, capturing moments both in real life and through analytics projects.
Last Updated: February 2025