tbm077861/Big-Data-Subject
Big-Data-Subject

Projects done while studying big data at IUH

Projects Done While Studying Big Data at IUH

This repository contains projects completed while studying Big Data at the Industrial University of Ho Chi Minh City (IUH). Each project applies a different set of big data technologies, frameworks, and methodologies to a practical problem.

Overview

During my studies at IUH, I explored multiple aspects of Big Data, including data collection, storage, processing, analysis, and visualization. These projects demonstrate my understanding and application of big data tools and concepts.

Projects

  1. Web Crawling & Data Scraping

    • Built a crawler to extract data from various news websites using BeautifulSoup and Scrapy.
    • Processed and stored data in CSV and MongoDB for further analysis.
  2. Data Processing with Hadoop & Spark

    • Used Apache Hadoop (HDFS, MapReduce) and Apache Spark (PySpark) to process large datasets.
    • Performed ETL (Extract, Transform, Load) operations on structured and unstructured data.
  3. Real-Time Data Streaming

    • Implemented a real-time data processing pipeline using Apache Kafka and Spark Streaming.
    • Analyzed and visualized live data streams from Twitter and IoT sensors.
  4. Machine Learning on Big Data

    • Built predictive models using ML algorithms in Scikit-learn and Spark MLlib.
    • Applied classification, regression, and clustering on large datasets.
  5. Big Data Visualization

    • Created interactive dashboards with Tableau and Power BI.
    • Used Matplotlib, Seaborn, and D3.js for data visualization.
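As a rough illustration of the MapReduce model mentioned in project 2, here is a minimal single-process word count in plain Python. It is a sketch of the map, shuffle, and reduce stages only. It does not use the Hadoop or Spark APIs and is not taken from the projects in this repository. The function names (map_phase, shuffle_phase, reduce_phase, word_count) and the sample input are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Emit a (word, 1) pair for every whitespace-separated token."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle_phase(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group emitted values by key, as a framework would between map and reduce."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups: Dict[str, List[int]]) -> Dict[str, int]:
    """Sum the grouped values for each key."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run the three stages in sequence on an in-memory input."""
    return reduce_phase(shuffle_phase(map_phase(lines)))


if __name__ == "__main__":
    # Hypothetical sample input, for illustration only.
    sample = ["big data at IUH", "big data with Spark"]
    print(word_count(sample))
```

In a real cluster the map and reduce stages run in parallel across nodes and the shuffle moves data over the network. The sketch keeps everything in one process so the data flow between stages is easy to follow.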

Technologies Used

  • Programming Languages: Python, Java, Scala
  • Big Data Frameworks: Hadoop, Spark, Kafka
  • Databases: MongoDB, MySQL, PostgreSQL
  • Web Scraping Tools: BeautifulSoup, Scrapy
  • Machine Learning: Scikit-learn, TensorFlow, Spark MLlib
  • Visualization Tools: Tableau, Power BI, Matplotlib, Seaborn, D3.js

Installation & Setup

To run the projects locally:

  1. Clone the repository:
    git clone https://github.com/tbm077861/Big-Data-Subject.git
    cd Big-Data-Subject
  2. Set up a virtual environment (optional but recommended):
    python -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate
  3. Install dependencies:
    pip install -r requirements.txt

Usage

Each project has its own folder with detailed instructions. Navigate to a specific project and follow the README inside for setup and execution.

Contributing

Contributions are welcome! Feel free to submit issues or pull requests to improve the projects.

License

This repository is licensed under the MIT License. See LICENSE for details.
