Playground space for BigData projects
Projects are generally run on Debian under WSL (Windows Subsystem for Linux).
If a project depends on systemd, see: https://devblogs.microsoft.com/commandline/systemd-support-is-now-available-in-wsl/
List of technologies and tools for big data projects:
NOTE: A job count (<num_jobs>) is shown next to some tool names as a rough job-market-demand metric, used to establish a priority order. (only for some)
Related programming languages: Python, Scala, Java
-
Data Ingestion and Collection:
- Apache Kafka (778): A real-time data streaming platform for event sourcing.
- Apache NiFi (34): An integrated data logistics platform for automating data movement.
- Apache Flume (17): Collects, aggregates, and moves large volumes of data.
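Kafka's core abstraction is an append-only, partitioned log that consumers read by offset. Below is a toy in-memory sketch of that model, not the real client API (for actual usage see the kafka-python or confluent-kafka libraries, which need a running broker):

```python
from collections import defaultdict

class ToyTopic:
    """Minimal in-memory stand-in for a single-partition Kafka topic."""
    def __init__(self):
        self.log = []                      # append-only record log
        self.offsets = defaultdict(int)    # consumer group -> next offset to read

    def produce(self, record):
        self.log.append(record)
        return len(self.log) - 1           # offset assigned to the new record

    def consume(self, group):
        """Return unread records for a consumer group and advance its offset."""
        start = self.offsets[group]
        records = self.log[start:]
        self.offsets[group] = len(self.log)
        return records

topic = ToyTopic()
topic.produce({"event": "page_view", "user": "alice"})
topic.produce({"event": "click", "user": "bob"})

print(topic.consume("analytics"))   # both records
print(topic.consume("analytics"))   # [] (offset already advanced)
```

The key property this illustrates: records are never removed on read; each consumer group just tracks its own position in the log.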
-
Data Storage:
- Hadoop Distributed File System (HDFS): A distributed file system for storing large datasets.
- Amazon S3: Cloud-based object storage by Amazon Web Services.
- Azure Data Lake Storage: Cloud-based data storage by Microsoft Azure.
- Apache HBase: A NoSQL database for real-time data access.
- Cassandra: A highly scalable NoSQL database.
- MongoDB: A document-oriented NoSQL database.
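HDFS stores a large file as fixed-size blocks (128 MB by default) replicated across datanodes. A toy sketch of that split-and-replicate bookkeeping, with a tiny block size and made-up datanode names for illustration:

```python
def split_into_blocks(data: bytes, block_size: int):
    """Split a byte payload into fixed-size blocks, HDFS-style."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def assign_replicas(num_blocks, datanodes, replication=3):
    """Round-robin each block onto `replication` distinct datanodes."""
    placement = {}
    for b in range(num_blocks):
        placement[b] = [datanodes[(b + r) % len(datanodes)]
                        for r in range(min(replication, len(datanodes)))]
    return placement

blocks = split_into_blocks(b"x" * 300, block_size=128)   # real HDFS default: 128 MB
print(len(blocks))                                       # 3 blocks: 128 + 128 + 44 bytes
print(assign_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"]))
```

Real HDFS placement is rack-aware rather than round-robin, but the idea is the same: every block lives on several machines, so losing one datanode loses no data.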
-
Data Processing and Batch Processing:
- Apache Hadoop MapReduce: Batch processing and distributed data processing framework.
- Apache Spark: Fast in-memory data processing, batch, and real-time stream processing.
- Apache Hive: SQL-based querying and analysis.
- Presto: Distributed SQL query engine.
- Apache Pig: Data flow scripting and analysis.
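The MapReduce model boils down to a map phase emitting key/value pairs, a shuffle grouping pairs by key, and a reduce phase aggregating each group. A single-process Python sketch of the classic word count (Hadoop runs the same three steps distributed across a cluster):

```python
from collections import defaultdict

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def shuffle(pairs):
    """Shuffle: group values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: aggregate the values of each key."""
    return {key: sum(values) for key, values in groups.items()}

lines = ["big data", "Big Data tools", "data"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts)   # {'big': 2, 'data': 3, 'tools': 1}
```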
-
Data Processing and Stream Processing:
- Apache Kafka Streams (782): Real-time stream processing and event-driven applications.
- Apache Flink (75): Stream processing framework with low-latency and high-throughput capabilities.
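Stream processors like Flink and Kafka Streams continuously aggregate events, typically over time windows. A toy tumbling-window count in plain Python, with event time in seconds and window assignment by integer division (the real engines add watermarks, state backends, and fault tolerance on top of this idea):

```python
from collections import Counter

def tumbling_window_counts(events, window_size):
    """Count events per (window_start, key) for a tumbling event-time window."""
    counts = Counter()
    for timestamp, key in events:
        window_start = (timestamp // window_size) * window_size
        counts[(window_start, key)] += 1
    return counts

events = [(0, "click"), (3, "view"), (7, "click"), (11, "click"), (14, "view")]
print(tumbling_window_counts(events, window_size=10))
# window [0, 10): 2 clicks, 1 view; window [10, 20): 1 click, 1 view
```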
-
Data Orchestration and Workflow:
- Apache Airflow (432): Workflow automation and scheduling.
- Kubeflow Pipelines (32): Management of machine learning workflows on Kubernetes.
- Apache Oozie (5): Job scheduling, workflow management, and automation.
- Dagster: An orchestrator designed for developing and maintaining data assets such as tables, datasets, machine learning models, and reports.
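Airflow, Oozie, and Dagster all model a pipeline as a DAG of tasks and run each task only after its upstream tasks finish. A minimal topological-order executor sketching that idea with stdlib graphlib (no scheduling, retries, or parallelism; the task names here are made up for illustration):

```python
from graphlib import TopologicalSorter

def run_dag(tasks, deps):
    """tasks: name -> callable(results); deps: name -> set of upstream names."""
    order = list(TopologicalSorter(deps).static_order())  # upstreams first
    results = {}
    for name in order:
        results[name] = tasks[name](results)  # each task sees upstream results
    return order, results

tasks = {
    "extract":   lambda r: [3, 1, 2],
    "transform": lambda r: sorted(r["extract"]),
    "load":      lambda r: f"loaded {len(r['transform'])} rows",
}
deps = {"transform": {"extract"}, "load": {"transform"}}
order, results = run_dag(tasks, deps)
print(order)              # ['extract', 'transform', 'load']
print(results["load"])    # loaded 3 rows
```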
-
Machine Learning and AI:
- TensorFlow (330): Open-source machine learning framework.
- PyTorch (326): Deep learning framework.
- Scikit-learn (170): Machine learning library for Python.
- Apache Spark MLlib (2): Machine learning library integrated with Spark.
- Apache Mahout (1): Scalable machine learning and data mining.
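Under the hood, frameworks like TensorFlow and PyTorch automate gradient computation and the optimizer loop. A hand-rolled gradient-descent fit of y = w * x illustrates the loop these libraries generalize (toy data, no autograd):

```python
def fit_slope(xs, ys, lr=0.01, steps=500):
    """Fit y = w * x by minimizing mean squared error with gradient descent."""
    w = 0.0
    n = len(xs)
    for _ in range(steps):
        # d/dw of mean((w*x - y)^2)
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n
        w -= lr * grad
    return w

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]   # true slope: 2
print(round(fit_slope(xs, ys), 3))   # 2.0
```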
-
Data Visualization and Reporting:
- Power BI (2199): Data analytics and visualization by Microsoft.
- Tableau (829): Data visualization and business intelligence tool.
- Looker (674): Data exploration and business intelligence platform.
- Dash (328): Python web application framework for building interactive data dashboards.
- Kibana (93): Data visualization and exploration for the Elastic Stack.
- Apache Superset (1): Data exploration and visualization platform.
-
Data Security and Governance:
- Apache Knox (3): Gateway for securing Hadoop clusters.
- Apache Ranger: Access control, auditing, and data governance.
- Cloudera Navigator: Data governance and management.
-
Containers and Orchestration:
- Docker (2706): Containerization of applications and services.
- Kubernetes (1975): Container orchestration and scaling.
-
Monitoring and Logging:
- Grafana (361): Metrics visualization and alerting.
- Prometheus (304): Metrics collection and monitoring.
- ELK Stack (Elasticsearch, Logstash, Kibana) (206): Log analysis and visualization.
- Kibana (91): Log exploration and dashboarding for Elasticsearch.
- Zabbix (53): Infrastructure and network monitoring with alerting.
-
Resource Management:
- Kubernetes (1975): Container orchestration and resource management.
- Apache YARN (84): Cluster resource management and job scheduling.
- Apache Mesos (32): Cluster resource management.
-
Database and Querying Tools:
- Apache Hive (239): SQL-based querying and data warehousing.
- Presto (136): Distributed SQL query engine.
- Apache Impala (16): High-performance SQL queries on Hadoop.
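Hive, Presto, and Impala all expose plain SQL over distributed data, so the query shape is the same as on any SQL engine. Here is a representative analytical aggregation run against stdlib sqlite3 as a local stand-in (on Hive/Presto only the connection and the table's underlying storage would differ; table and column names are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, action TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [("alice", "buy", 10.0), ("bob", "buy", 5.0), ("alice", "buy", 7.5)],
)

# Typical analytical query: aggregate per key, ordered by the metric.
rows = conn.execute(
    """SELECT user, COUNT(*) AS n, SUM(amount) AS total
       FROM events GROUP BY user ORDER BY total DESC"""
).fetchall()
print(rows)   # [('alice', 2, 17.5), ('bob', 1, 5.0)]
```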
These descriptions provide an overview of each technology's purpose and capabilities. Consult each project's official documentation for more detailed information.