Tech Stack

Matthew Smith edited this page Dec 24, 2025 · 4 revisions

Core Framework

  • Python – Primary language (99.3% of repository)
  • Apache Spark – Distributed processing engine
  • Apache Kafka – Real-time streaming platform
  • Flask – Web server framework

Big Data & Streaming

  • PySpark – Python API for Spark 3.5.1
  • Spark SQL – Structured data processing
  • Spark Structured Streaming – Real-time data streams
  • kafka-python – Python Kafka client (producer)
  • Parquet – Columnar storage format

Machine Learning

  • Spark MLlib – Distributed ML library
  • Gradient Boosted Trees – Primary model
  • Random Forest – Ensemble learning
  • Logistic Regression – Baseline model
  • Pipeline – Preprocessing automation

Data Processing & Analysis

  • Pandas – Data manipulation & analysis
  • NumPy – Numerical computing
  • findspark – Spark initialization

Data Visualization

  • Matplotlib – Core plotting library
  • Seaborn – Statistical visualizations
  • Heatmaps – Feature relationships
  • Confusion Matrix – Model evaluation

Frontend Technologies

  • HTML5 – Web interface structure
  • CSS3 – Styling & layout
  • JavaScript – Interactive visualizations
  • Flexbox – Responsive layout

Database

  • MongoDB – NoSQL database (attempted)
  • Parquet – Production storage (used)

Development Tools

  • Jupyter – Interactive notebooks
  • VS Code – Code editor
  • Live Server – Local web hosting
  • Git – Version control
  • GitHub – Repository hosting

Cloud Platform

  • Google Cloud – VM deployment
  • Compute Engine – Virtual machines

Python Libraries

  • argparse – CLI argument parsing
  • json – Data serialization
  • csv – CSV file handling
  • glob – File pattern matching
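
As an illustration of how these four standard-library modules can fit together, here is a minimal sketch that turns CSV rows into JSON-encoded messages of the kind a Kafka producer would send. It is not taken from the repository: the function names, the `--input` and `--limit` flags, and the `data/*.csv` default are all hypothetical, and no Kafka client is involved.

```python
import argparse
import csv
import glob
import itertools
import json


def csv_rows_as_messages(pattern):
    """Yield every row of every CSV file matching `pattern` as UTF-8 JSON bytes."""
    for path in sorted(glob.glob(pattern)):
        with open(path, newline="", encoding="utf-8") as handle:
            # DictReader uses the header row as keys; all values stay strings.
            for row in csv.DictReader(handle):
                yield json.dumps(row).encode("utf-8")


def parse_args(argv=None):
    """Parse the (hypothetical) command-line flags for this sketch."""
    parser = argparse.ArgumentParser(description="Serialize CSV rows as JSON messages")
    parser.add_argument("--input", default="data/*.csv",
                        help="glob pattern for input CSV files (assumed default)")
    parser.add_argument("--limit", type=int, default=None,
                        help="stop after this many rows")
    return parser.parse_args(argv)


def main(argv=None):
    args = parse_args(argv)
    # islice with a stop of None passes every message through.
    for message in itertools.islice(csv_rows_as_messages(args.input), args.limit):
        print(message.decode("utf-8"))
```

Calling `main(["--input", "data/*.csv", "--limit", "5"])` would print the first five rows as JSON lines. In a real producer, each yielded `bytes` value would be handed to the Kafka client instead of being printed.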

Data Formats

  • CSV – Input dataset
  • JSON – Kafka messages
  • Parquet – Streaming output

Data Processing vs. Web Interface

Backend (Data Processing)          | Frontend (Web Interface)
-----------------------------------|--------------------------------
Apache Spark – Distributed Engine  | HTML5 – Structure
Apache Kafka – Streaming           | CSS3 – Styling
PySpark – Python API               | JavaScript – Interactivity
Spark MLlib – ML Framework         | Flask – Web Server
Pandas – Data Analysis             | Jinja2 – Templating
NumPy – Numerical Ops              | Flexbox – Responsive Layout
Matplotlib – Visualization         | Live Server – Local Hosting
Seaborn – Heatmaps                 | PNG Images – Pre-rendered
Parquet – Storage                  |
MongoDB – Database (attempted)     |
kafka-python – Producer            |
argparse – CLI                     |

ML Pipeline Components

Component              | Technology
-----------------------|------------------------------------------------------------------
Classification Models  | GBT, Random Forest, Logistic Regression
Feature Engineering    | StringIndexer, OneHotEncoder
Feature Transformation | VectorAssembler, StandardScaler
Model Evaluation       | BinaryClassificationEvaluator, MulticlassClassificationEvaluator
Metrics                | AUC, F1 Score, Confusion Matrix
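
To make the three metrics concrete, here is a small plain-Python sketch that computes them for a binary classifier. It is an illustration only, not the project's evaluation code, which relies on the Spark evaluators listed above. The function names, the sample labels and scores, and the 0.5 decision threshold are all assumptions. The F1 shown is the positive-class F1, which can differ from the weighted F1 a multiclass evaluator reports.

```python
from collections import Counter


def confusion_matrix(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels, with 1 as the positive class."""
    counts = Counter(zip(y_true, y_pred))
    return counts[(1, 1)], counts[(0, 1)], counts[(1, 0)], counts[(0, 0)]


def f1_score(y_true, y_pred):
    """Positive-class F1: harmonic mean of precision and recall."""
    tp, fp, fn, _ = confusion_matrix(y_true, y_pred)
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def auc(y_true, scores):
    """Area under the ROC curve as the fraction of (positive, negative) pairs
    ranked correctly, counting ties as half. Quadratic time, fine for a sketch."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


# Hypothetical labels and model scores, thresholded at an assumed 0.5.
y_true = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.3, 0.2, 0.7]
y_pred = [int(s >= 0.5) for s in scores]

print(confusion_matrix(y_true, y_pred))   # (tp, fp, fn, tn)
print(round(f1_score(y_true, y_pred), 3))
print(round(auc(y_true, scores), 3))
```

AUC is computed from the raw scores, so it does not depend on the threshold, while the confusion matrix and F1 change whenever the threshold does.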
