Projects done while studying big data at IUH
This repository contains projects I completed while studying Big Data at the Industrial University of Ho Chi Minh City (IUH). Each project applies different big data technologies, frameworks, and methodologies to solve real-world problems.
During my studies at IUH, I explored multiple aspects of Big Data, including data collection, storage, processing, analysis, and visualization. These projects demonstrate my understanding and application of big data tools and concepts.
- Web Crawling & Data Scraping
  - Built a crawler to extract data from various news websites using BeautifulSoup and Scrapy.
  - Processed and stored data in CSV and MongoDB for further analysis.
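As a rough illustration of the extract-then-store step described for the crawler, the sketch below pulls headlines out of an HTML snippet and serializes them as CSV. It uses Python's standard-library `html.parser` as a dependency-free stand-in for BeautifulSoup. The sample HTML, the `h2 class="headline"` convention, and the names `HeadlineParser` and `headlines_to_csv` are assumptions made for this example, not code from the repository.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical sample page; a real crawler would fetch this over HTTP.
SAMPLE_HTML = """
<html><body>
  <h2 class="headline"><a href="/news/1">Flood warning issued</a></h2>
  <h2 class="headline"><a href="/news/2">Market closes higher</a></h2>
  <h2 class="other">Not a headline</h2>
</body></html>
"""


class HeadlineParser(HTMLParser):
    """Collects (title, url) pairs from <h2 class="headline"><a href=...>."""

    def __init__(self):
        super().__init__()
        self.in_headline = False
        self.current_url = None
        self.rows = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "h2" and attrs.get("class") == "headline":
            self.in_headline = True
        elif tag == "a" and self.in_headline:
            self.current_url = attrs.get("href")

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_headline = False
            self.current_url = None

    def handle_data(self, data):
        text = data.strip()
        # Keep only link text that sits inside a headline element.
        if self.in_headline and self.current_url and text:
            self.rows.append((text, self.current_url))


def headlines_to_csv(html):
    """Parse the HTML and return the headlines as CSV text with a header row."""
    parser = HeadlineParser()
    parser.feed(html)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["title", "url"])
    writer.writerows(parser.rows)
    return out.getvalue()


csv_text = headlines_to_csv(SAMPLE_HTML)
print(csv_text)
```

Writing to an in-memory buffer keeps the example self-contained; pointing `csv.writer` at an opened file would produce the same rows on disk.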
- Data Processing with Hadoop & Spark
  - Used Apache Hadoop (HDFS, MapReduce) and Apache Spark (PySpark) to process large datasets.
  - Performed ETL (Extract, Transform, Load) operations on structured and unstructured data.
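The MapReduce model mentioned above can be sketched in a few lines: a map step emits key/value pairs, a shuffle step groups them by key, and a reduce step aggregates each group. The word count below is a single-process, pure-Python illustration of that shape only. It does not use the Hadoop or PySpark APIs, and the function names and input lines are hypothetical.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: sum the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over a list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return reduce_phase(shuffle_phase(pairs))


# Hypothetical input; in Hadoop these lines would come from files in HDFS.
counts = word_count(["big data at IUH", "Big Data with Spark"])
print(counts)
```

In PySpark the same pattern is typically written as `flatMap`, `map`, and `reduceByKey` on an RDD, with the framework handling the shuffle across machines.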
- Real-Time Data Streaming
  - Implemented a real-time data processing pipeline using Apache Kafka and Spark Streaming.
  - Analyzed and visualized live data streams from Twitter and IoT sensors.
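Streaming pipelines like the one described above usually aggregate events over time windows. The following is a minimal sketch of a tumbling (fixed, non-overlapping) window count, with an in-memory list standing in for a Kafka topic. The function name, event format, and sensor IDs are assumptions for illustration; this is not the Kafka or Spark Streaming API.

```python
from collections import Counter


def tumbling_window_counts(events, window_seconds):
    """Count events per (window_start, key) using fixed, non-overlapping windows.

    `events` is an iterable of (timestamp_seconds, key) tuples.
    """
    counts = Counter()
    for timestamp, key in events:
        # Round the timestamp down to the start of its window.
        window_start = (timestamp // window_seconds) * window_seconds
        counts[(window_start, key)] += 1
    return dict(counts)


# Hypothetical sensor readings: (epoch seconds, sensor id).
events = [
    (0, "sensor-a"),
    (4, "sensor-a"),
    (9, "sensor-b"),
    (10, "sensor-a"),
    (19, "sensor-b"),
]
result = tumbling_window_counts(events, window_seconds=10)
print(result)
```

A production stream processor also has to deal with late and out-of-order events, which this sketch deliberately ignores.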
- Machine Learning on Big Data
  - Built predictive models using ML algorithms in Scikit-Learn and Spark MLlib.
  - Applied classification, regression, and clustering on large datasets.
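To make the regression task above concrete, here is a small sketch of ordinary least squares for a single feature, written in plain Python so it runs without Scikit-Learn or Spark MLlib. The function name `fit_line` and the toy data are hypothetical; the projects themselves rely on the libraries' own estimators.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares (one feature)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and co-deviation of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Toy data generated from y = 2x + 1 (an assumption for this example).
slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(slope, intercept)
```

The closed-form solution is enough for one feature; library implementations generalize it to many features and to datasets that do not fit in memory.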
- Big Data Visualization
  - Created interactive dashboards with Tableau and Power BI.
  - Used Matplotlib, Seaborn, and D3.js for data visualization.
- Programming Languages: Python, Java, Scala
- Big Data Frameworks: Hadoop, Spark, Kafka
- Databases: MongoDB, MySQL, PostgreSQL
- Web Scraping Tools: BeautifulSoup, Scrapy
- Machine Learning: Scikit-learn, TensorFlow, Spark MLlib
- Visualization Tools: Tableau, Power BI, Matplotlib, Seaborn, D3.js
To run the projects locally:
- Clone the repository:
    git clone https://github.com/your-username/big-data-projects.git
    cd big-data-projects
- Set up a virtual environment (optional but recommended):
    python -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate
- Install dependencies:
    pip install -r requirements.txt
Each project has its own folder with detailed instructions. Navigate to a specific project and follow the README inside for setup and execution.
Contributions are welcome! Feel free to submit issues or pull requests to improve the projects.
This repository is licensed under the MIT License. See LICENSE for details.