Welcome to my repository of Machine Learning projects, notebooks, and datasets.
This repo showcases my journey and experiments in Machine Learning, Data Science, and Exploratory Data Analysis through real-world problems and custom-created datasets.
1) Digits Prediction
File: digits-prediction.ipynb
What's inside
- Dataset: MNIST (60k train / 10k test); EDA shows sample digits and class balance.
- Preprocessing: normalize pixel values to [0,1], reshape for CNN (`28x28x1`), one-hot encode labels.
- Augmentation (optional): small rotations, shifts, random zoom to improve generalization.
- Models included:
- Baseline MLP (dense → ReLU → dropout → softmax).
- CNN (Conv2D → BatchNorm → ReLU → MaxPool → Dropout → Dense).
- Optionally a Transfer Learning variant using a small pretrained encoder (for experiments).
- Training artifacts: training/validation curves, confusion matrix, per-class accuracy, SavedModel export.
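The preprocessing steps above can be sketched in plain NumPy (array names and the random data are illustrative, not the notebook's actual variables):

```python
import numpy as np

# Hypothetical raw MNIST arrays: uint8 images and integer labels.
X = np.random.randint(0, 256, size=(100, 28, 28), dtype=np.uint8)
y = np.random.randint(0, 10, size=100)

# Normalize pixel values to [0, 1] and add the channel axis the CNN expects.
X = X.astype("float32") / 255.0
X = X.reshape(-1, 28, 28, 1)

# One-hot encode the labels (equivalent to keras.utils.to_categorical).
num_classes = 10
y_onehot = np.eye(num_classes, dtype="float32")[y]

print(X.shape, y_onehot.shape)  # (100, 28, 28, 1) (100, 10)
```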
How the model is trained
- Algorithm: Convolutional Neural Network (Keras/TensorFlow).
- Loss / Optimizer: `categorical_crossentropy` with `Adam` (`lr=1e-3` default).
- Batch size / Epochs: `batch_size=64`, `epochs=20–50` with `EarlyStopping(patience=5, restore_best_weights=True)`.
- Callbacks: `ModelCheckpoint`, `ReduceLROnPlateau`, `TensorBoard` for visual checks.
- Regularization: Dropout (`0.25–0.5`), BatchNorm, L2 (optional).
Data split
- Use MNIST standard split: 60,000 training, 10,000 test.
- Create validation from train (e.g., 90/10): `train=54k`, `val=6k`, `test=10k`.
- Example snippet:
```python
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(
    X_train_full, y_train_full, test_size=0.10,
    stratify=y_train_full, random_state=42,
)
```
2) House Price Prediction
File: house-price.ipynb
What's inside
- Dataset: Kaggle House Prices, a rich tabular dataset with numeric & categorical features.
- EDA: distribution plots, missingness heatmap, correlations, target skew (log-transform).
- Feature engineering: date features, polynomial & interaction terms, categorical encoding (One-Hot / Target Encoding), outlier handling.
- Pipelines: `ColumnTransformer` for numeric + categorical, `Pipeline` for end-to-end training.
- Models compared:
- Linear Regression
- Lasso
- RandomForest
- XGBoost / LightGBM
- Stacking ensemble (meta-model).
- Explainability: feature importance (tree SHAP), partial dependence plots.
How the model is trained
- Algorithm: Ensemble regression models (RandomForest, XGBoost, LightGBM) + stacking.
- Loss / Objective: regression metrics (`MSE`, `RMSE`). Target often `log1p`-transformed to stabilize variance.
- Cross-validation: 5-fold CV with out-of-fold (OOF) predictions for stacking.
- Hyperparameter tuning: `RandomizedSearchCV` or `Optuna` for RF, XGB, LGBM.
- Typical tuning knobs:
  - XGBoost: `n_estimators=200–2000`, `learning_rate=0.01–0.2`, `max_depth=3–10`, `subsample=0.5–1.0`.
Pipeline sketch
```python
import xgboost as xgb
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

preprocessor = ColumnTransformer([...])
model = Pipeline([
    ('prep', preprocessor),
    ('clf', xgb.XGBRegressor(...)),
])
```
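A runnable version of this sketch, swapping in scikit-learn's built-in GradientBoostingRegressor for XGBoost and using TransformedTargetRegressor for the `log1p` target (the columns and data below are invented, not the Kaggle table):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, TransformedTargetRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data standing in for the House Prices table.
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "GrLivArea": rng.integers(500, 4000, 200),
    "Neighborhood": rng.choice(["A", "B", "C"], 200),
})
y = df["GrLivArea"] * 100 + rng.normal(0, 5000, 200)

preprocessor = ColumnTransformer([
    ("num", StandardScaler(), ["GrLivArea"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Neighborhood"]),
])

# log1p-transform the target to stabilize variance; expm1 inverts it.
model = TransformedTargetRegressor(
    regressor=Pipeline([
        ("prep", preprocessor),
        ("reg", GradientBoostingRegressor(random_state=42)),
    ]),
    func=np.log1p, inverse_func=np.expm1,
)
model.fit(df, y)
print(model.predict(df.head(3)))
```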
3) Titanic Survival
File: titanic.ipynb
What's inside
- EDA: survival rate by sex/class/age, missingness patterns (Age, Cabin).
- Feature engineering: extract `Title` from Name, family size (`SibSp + Parch`), deck from Cabin (where possible), fill missing Age via median or model imputation.
- Encoding: Sex → binary, Embarked → one-hot; Fare binned if useful.
- Models compared:
- Logistic Regression (baseline)
- RandomForest
- XGBoost
- Model interpretability: coefficient table for Logistic Regression, SHAP or Permutation Importance for tree-based models.
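The Title and family-size features can be derived with pandas; the rows below are invented examples in the Titanic `Name` format:

```python
import pandas as pd

# Toy rows mimicking the Titanic 'Name', 'SibSp', 'Parch' columns.
df = pd.DataFrame({
    "Name": ["Braund, Mr. Owen Harris",
             "Cumings, Mrs. John Bradley",
             "Heikkinen, Miss. Laina"],
    "SibSp": [1, 1, 0],
    "Parch": [0, 0, 0],
})

# Extract the title between the comma and the period.
df["Title"] = df["Name"].str.extract(r",\s*([^.]+)\.", expand=False)
# Family size = siblings/spouses + parents/children + the passenger.
df["FamilySize"] = df["SibSp"] + df["Parch"] + 1

print(df[["Title", "FamilySize"]])
```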
How the model is trained
- Algorithms: `LogisticRegression` (L2), `RandomForestClassifier`, `XGBClassifier`.
- Cross-validation: `StratifiedKFold(n_splits=5)` to preserve class balance.
- Metrics: Accuracy, Precision, Recall, F1, ROC-AUC, plus a confusion matrix.
- Typical hyperparameters:
  - RandomForest: `n_estimators=100–500`, `max_depth=None` or tuned.
  - XGBoost: `learning_rate=0.01–0.2`, `max_depth=3–8`.
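A minimal sketch of the CV setup, with synthetic data standing in for the engineered Titanic features:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the engineered Titanic feature matrix.
X, y = make_classification(n_samples=300, n_features=8,
                           weights=[0.6, 0.4], random_state=42)

# Stratified folds keep the survival ratio consistent in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LogisticRegression(max_iter=1000),
                         X, y, cv=cv, scoring="roc_auc")
print(scores.mean())
```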
Data split
- Use stratified split to keep survival ratio consistent:
```python
from sklearn.model_selection import train_test_split

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42
)
```
4) Heads or Tails
File: heads-or-tails.ipynb
What's inside
- Problem: classify an image patch as heads or tails. Dataset: custom / cropped images.
- Preprocessing: resize (e.g., `128x128`), normalize, augmentation pipeline (flips, rotations, brightness jitter).
- Model choices:
- Lightweight custom CNN (baseline).
- Transfer learning using MobileNetV2 or EfficientNetB0 with fine-tuning for better accuracy.
- Training utilities: class weights (if imbalance), `ImageDataGenerator` or `tf.data` pipeline, Grad-CAM for explainability.
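If the heads/tails classes are imbalanced, per-class weights can be computed with scikit-learn and passed to Keras via the `class_weight` argument of `fit` (the label counts below are made up):

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical imbalanced labels: 80 heads (0) vs 20 tails (1).
y = np.array([0] * 80 + [1] * 20)

weights = compute_class_weight(class_weight="balanced",
                               classes=np.unique(y), y=y)
class_weight = dict(zip(np.unique(y), weights))
print(class_weight)  # the minority class gets the larger weight
```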
How the model is trained
- Loss / Optimizer: `binary_crossentropy`, Adam or SGD with momentum for fine-tuning.
- Batch size / Epochs: `batch_size=32`, `epochs=20–40`, with `EarlyStopping` + `ReduceLROnPlateau`.
- Fine-tuning schedule: freeze the base model for N epochs, then unfreeze the top k layers and continue training with a lower learning rate (`1e-4` → `1e-5`).
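A minimal Keras sketch of that freeze-then-unfreeze schedule, with a tiny two-layer model standing in for the pretrained backbone (layer sizes and learning rates here are arbitrary):

```python
import tensorflow as tf

# Tiny stand-in for a pretrained backbone (no downloaded weights).
d1 = tf.keras.layers.Dense(16, activation="relu")
d2 = tf.keras.layers.Dense(16, activation="relu")
base = tf.keras.Sequential([tf.keras.Input(shape=(8,)), d1, d2], name="base")
model = tf.keras.Sequential([base, tf.keras.layers.Dense(1, activation="sigmoid")])

# Phase 1: freeze the whole base; only the classification head trains.
base.trainable = False
model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
              loss="binary_crossentropy")

# Phase 2: unfreeze only the top base layer, recompile with a lower LR.
base.trainable = True
d1.trainable = False
model.compile(optimizer=tf.keras.optimizers.Adam(1e-5),
              loss="binary_crossentropy")
```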
Data split
- Example: `train : val : test = 70 : 15 : 15`, or keep `train/val = 80/20` with a separate test set if available.
- Ensure a stratified split by label.
5) Rock vs Mine (RvsM)
File: rock-vs-mine.ipynb (Colab link available)
What's inside
- Dataset: Sonar / echo features (e.g., UCI Sonar dataset or similar).
- Preprocessing: scaling with `StandardScaler`, optional signal preprocessing (smoothing, FFT features).
- Models compared:
- Logistic Regression (baseline)
- SVM (RBF, `probability=True`)
- RandomForest / XGBoost (strong tree baselines)
- Model evaluation: Accuracy, ROC-AUC, confusion matrix, Precision, Recall; all important for safety-critical detection.
How the model is trained
- Algorithms: SVM (RBF), RandomForest, XGBoost.
- Cross-validation: `StratifiedKFold(n_splits=5)` with grid/random search.
- Example hyperparameters:
  - SVM: `C ∈ {0.1, 1, 10}`, `gamma ∈ {'scale', 'auto'}` or tuned via logspace.
  - XGBoost: `learning_rate=0.01–0.1`, `max_depth=3–6`.
- Calibration / Tuning: decision threshold analysis using precision-recall tradeoff.
- If false negatives are costly, optimize for recall with constrained precision.
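The threshold analysis can be sketched with scikit-learn's `precision_recall_curve`; here the goal is the largest threshold that still yields full recall, i.e. no missed mines (the scores below are invented):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Hypothetical predicted probabilities from the mine detector.
y_true = np.array([0, 0, 0, 1, 1, 0, 1, 1, 1, 0])
y_prob = np.array([0.1, 0.3, 0.2, 0.8, 0.6, 0.4, 0.9, 0.55, 0.7, 0.35])

precision, recall, thresholds = precision_recall_curve(y_true, y_prob)

# Pick the largest threshold whose recall is still 1.0: favour recall
# when false negatives are costly, while keeping precision as high
# as that constraint allows.
ok = recall[:-1] >= 1.0
best = thresholds[ok][-1] if ok.any() else thresholds[0]
print(best)
```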
Data split
- Typical:
```python
from sklearn.model_selection import train_test_split

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.10, stratify=y, random_state=1
)
```
Notebooks live inside `/notebooks/` and include well-commented, reproducible code.
1. Superheroes Abilities Dataset
File: superheros-abilities.ipynb
What's inside
- EDA: distribution of powers, missing data handling for free-text attributes.
- Feature engineering: converting textual powers into categorical features (one-hot or embeddings), aggregate scores (e.g., `power_score`).
- Experiments:
- Clustering: KMeans / HDBSCAN to discover archetypes.
- Classification / Regression: predict alignment or power level using RandomForest / XGBoost.
- Dimensionality reduction: PCA / t-SNE / UMAP for visualizations.
How the model is trained
- Supervised tasks: RandomForest / XGBoost with 5-fold CV.
- Clustering: hyperparameter sweep on `k` with silhouette scores / Davies–Bouldin index.
- Text features: simple TF-IDF → PCA, or an embedding pipeline if needed.
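The sweep over `k` might look like this (synthetic blobs stand in for the numeric hero-ability matrix):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in with three well-separated "archetypes".
X, _ = make_blobs(n_samples=200, centers=[[0, 0], [10, 10], [-10, 10]],
                  cluster_std=1.0, random_state=42)

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# Highest silhouette score wins.
best_k = max(scores, key=scores.get)
print(best_k)  # expect 3 for three well-separated blobs
```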
Data split
- Predictive tasks: `train_test_split(X, y, test_size=0.20, random_state=42)` → an 80/20 split.
- Unsupervised tasks: use the full dataset, with holdouts only for downstream validation.
2. Code Variants Dataset
File: code-variants.ipynb
What's inside
- Short summary: many Python solutions for the same problems. Useful for finding similar code, detecting copies, and building a code search tool.
- Quick EDA: count of problems, number of variants per problem, average lines and tokens.
- Preprocessing: remove comments, normalize spacing, optionally rename variables to placeholders, and tokenize code.
- Prepared features:
- Token TF-IDF vectors for a fast baseline.
- Small AST-based counts (e.g., number of `if`, `for`, and `def` nodes).
- Optional pretrained code embeddings for stronger semantic matching.
- Handcrafted metrics such as cyclomatic complexity and number of function calls.
Experiments included
- Fast baseline: TF-IDF on tokens + cosine similarity.
- Pair classifier: embed two snippets and train a model to predict same/different problem.
- Stronger route: fine-tune a pretrained code model for pairwise similarity.
- Structural: convert AST into a simple graph and use a graph model.
- Evaluation: retrieval metrics + ablations comparing token vs. structural approaches.
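The fast baseline fits in a few lines; the snippets below are toy examples, not taken from the dataset:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Three toy snippets: the first two solve the same problem.
snippets = [
    "def add(a, b): return a + b",
    "def add(x, y): return x + y",
    "def read_file(path): return open(path).read()",
]

# token_pattern keeps identifiers (and single letters) as tokens.
vec = TfidfVectorizer(token_pattern=r"\w+")
X = vec.fit_transform(snippets)
sim = cosine_similarity(X)
print(sim[0, 1] > sim[0, 2])  # variants of the same problem score higher
```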
How the model is trained
- Two options:
- Baseline: TF-IDF vectors + cosine similarity (no training).
- Learned model: encoder maps snippets to vectors; trained so same-problem pairs are close, different-problem pairs are far.
- Typical settings:
- Batch size: small to medium depending on hardware.
- Optimizer: Adam or AdamW.
- Early stopping on validation performance.
- Practical tricks:
- Use balanced batches with equal positive and negative pairs.
- Periodically include hard negatives (look similar but are different).
- Use an ANN library for fast nearest-neighbor search at evaluation and deployment.
Simple training loop
```python
for epoch in range(epochs):
    for code_a, code_b, label in dataloader:
        emb_a = encoder(code_a)
        emb_b = encoder(code_b)
        loss = contrastive_or_bce_loss(emb_a, emb_b, label)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```
Data split
- Recommended split by problem so all variants of a problem go to the same partition.
- Example: 70% of problems → train, 15% → validation, 15% → test.
- For pair training:
- Positive pairs: variants of the same problem.
- Negative pairs: different problems.
Ensure test problems are unseen during training for realistic evaluation.
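scikit-learn's GroupShuffleSplit gives this problem-level split directly (the problem ids below are illustrative):

```python
from sklearn.model_selection import GroupShuffleSplit

# Each snippet is tagged with its problem id; all variants of a
# problem must land in the same partition.
snippets = ["snippet_%d" % i for i in range(10)]
problem_ids = [0, 0, 1, 1, 2, 2, 3, 3, 4, 4]

gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(gss.split(snippets, groups=problem_ids))

train_problems = {problem_ids[i] for i in train_idx}
test_problems = {problem_ids[i] for i in test_idx}
print(train_problems & test_problems)  # empty set: no problem leaks
```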
Tool / Library | Purpose |
---|---|
Python | Core programming language |
Pandas, NumPy | Data manipulation & analysis |
Matplotlib, Seaborn | Data visualization |
Scikit-learn | ML modeling, TF-IDF (TfidfVectorizer), pipelines, metrics |
TensorFlow / PyTorch | Deep learning (future work) |
Jupyter Notebook | Interactive code & documentation |
Kaggle API | Dataset handling automation |
Git & GitHub | Version control and collaboration |
NLTK / spaCy | Tokenization, lemmatization, English stopwords lists |
TfidfVectorizer (sklearn) | TF-IDF extraction; supports stop_words='english' or custom lists, n-grams, min_df, max_df |
Custom stopwords (domain) | Add domain-specific tokens (e.g., variable names, common words) to stoplist |
Sentence-Transformers / Transformers | Semantic embeddings for stronger text similarity / classification |
Gensim | Word2Vec / Doc2Vec / fast text embeddings |
Hugging Face Tokenizers | Fast tokenization for transformer models |
imbalanced-learn (SMOTE) | Oversampling for imbalanced classes |
Optuna / RandomizedSearch | Hyperparameter tuning (efficient search for model + vectorizer params) |
FAISS / Annoy | Fast nearest-neighbor search for retrieval tasks |
```shell
# Clone the repository
git clone https://github.com/hemathens/kaggle-projects.git
cd kaggle-projects

# View datasets
cd datasets/

# View notebooks
cd notebooks/
```
View notebooks on GitHub by clicking the .ipynb files; GitHub renders them read-only.
Or use nbviewer: https://nbviewer.org/github/hemathens/kaggle-projects/blob/main/path/to/notebook.ipynb
Or open in Colab (runs in browser):
https://colab.research.google.com/github/hemathens/kaggle-projects/blob/main/notebooks/your_notebook.ipynb
Replace notebooks/your_notebook.ipynb with the real path.
A. Create & activate a virtual environment
Linux / macOS (bash/zsh):

```shell
python3 -m venv .venv
source .venv/bin/activate
```

Windows (PowerShell):

```shell
python -m venv .venv
.\.venv\Scripts\Activate.ps1
```

Windows (CMD):

```shell
python -m venv .venv
.\.venv\Scripts\activate.bat
```
B. Install dependencies
If the repo includes requirements.txt:

```shell
pip install -r requirements.txt
```

If the repo includes environment.yml (conda):

```shell
conda env create -f environment.yml
conda activate <env-name>
```
C. Optional: make the venv available as a Jupyter kernel

```shell
pip install ipykernel
python -m ipykernel install --user --name=kaggle-projects-env --display-name "kaggle-projects-env"
```
Then pick this kernel inside Jupyter/Lab when opening notebooks.
D. Start JupyterLab (recommended) or Notebook

```shell
jupyter lab
# or
jupyter notebook
```
Open the notebook file (for example notebooks/digits-prediction.ipynb) in the browser and run cells.
E. Kaggle dataset (if needed)
Install the Kaggle CLI:

```shell
pip install kaggle
```

Place kaggle.json (your API token) in:
- Linux/macOS: ~/.kaggle/kaggle.json (set permissions: chmod 600 ~/.kaggle/kaggle.json)
- Windows: %USERPROFILE%\.kaggle\kaggle.json

Download a dataset (example):

```shell
kaggle datasets download -d hemajitpatel/code-similarity-dataset-python-variants -p datasets/ --unzip
```
These datasets are published under the Creative Commons Attribution 4.0 International (CC BY 4.0) License.
You are free to:
- Share: copy and redistribute the material in any medium or format
- Adapt: remix, transform, and build upon the material for any purpose
Under the following terms:
- Attribution: you must give appropriate credit by linking to my Kaggle profile, provide a link to the license, and indicate if changes were made.
Full License: https://creativecommons.org/licenses/by/4.0/