
A machine learning library where every algorithm is implemented from scratch, using only NumPy.


🧠 MLfromScratch


MLfromScratch is a library designed to help you learn and understand machine learning algorithms by building them from scratch using only NumPy! No black-box libraries, no hidden magic: just pure Python and math. It's perfect for beginners who want to see what's happening behind the scenes of popular machine learning models.

🔗 Explore the Documentation


📦 Package Structure

Our package structure is designed to look like scikit-learn, so if you're familiar with that, you'll feel right at home!

🔧 Modules and Algorithms (Explained for Beginners)

📈 1. Linear Models (linear_model)

  • LinearRegression: Imagine drawing a straight line through a set of points to predict future values. Linear Regression helps predict something like house prices based on size.

  • SGDRegressor: A fast way to fit Linear Regression using Stochastic Gradient Descent; perfect for large datasets (the core update rule is sketched after this list).

  • SGDClassifier: A classification algorithm that predicts categories like "spam" or "not spam."
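
To make the mechanics concrete, here is a minimal from-scratch sketch of the kind of update an SGD-based regressor performs, assuming a plain squared-error loss. It is illustrative only, not this library's actual implementation.

import numpy as np

# Toy data from the line y = 3x + 2, plus noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=100)
y = 3.0 * X + 2.0 + rng.normal(0, 1, size=100)

# Stochastic gradient descent on squared error: one sample per step.
w, b, lr = 0.0, 0.0, 0.01
for epoch in range(50):
    for i in rng.permutation(len(X)):
        err = (w * X[i] + b) - y[i]   # prediction error for one sample
        w -= lr * err * X[i]          # gradient step for the slope
        b -= lr * err                 # gradient step for the intercept

print(round(w, 2), round(b, 2))       # recovers roughly 3.0 and 2.0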

🌳 2. Decision Trees (tree)

  • DecisionTreeClassifier: Think of this as playing 20 questions to guess something. A decision tree asks a series of yes/no questions to classify data (the split-choosing step is sketched after this list).

  • DecisionTreeRegressor: Predicts a continuous number (like tomorrow's temperature) based on input features.
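
The heart of a decision tree is choosing the question that best separates the data. The sketch below scores candidate thresholds on a single feature by weighted Gini impurity; it is a toy illustration, not this library's code.

import numpy as np

def gini(labels):
    # Gini impurity: 1 minus the sum of squared class proportions.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

x = np.array([1.0, 2.0, 3.0, 8.0, 9.0, 10.0])   # one feature
y = np.array([0, 0, 0, 1, 1, 1])                # two classes

# Score a threshold by the size-weighted impurity of the two sides.
def split_score(t):
    left, right = y[x <= t], y[x > t]
    return (gini(left) * len(left) + gini(right) * len(right)) / len(y)

best_t = min(x[:-1], key=split_score)
print(best_t, split_score(best_t))   # 3.0 splits the classes perfectly (score 0.0)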

👥 3. K-Nearest Neighbors (neighbors)

  • KNeighborsClassifier: Classifies a new point by majority vote among its 'k' nearest neighbors (see the sketch after this list).

  • KNeighborsRegressor: Instead of classifying, it predicts a number by averaging nearby data points.
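
The whole KNN classification idea fits in a few lines of NumPy, assuming Euclidean distance and majority voting (illustrative only, not the library's exact code):

import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    dists = np.linalg.norm(X_train - x_new, axis=1)  # distance to every training point
    nearest = np.argsort(dists)[:k]                  # indices of the k closest points
    return np.bincount(y_train[nearest]).argmax()    # majority vote among neighbors

X_train = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]])
y_train = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X_train, y_train, np.array([2, 2])))  # -> 0 (nearest the first group)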

🧮 4. Naive Bayes (naive_bayes)

  • GaussianNB: Works well for continuous features that roughly follow a normal distribution (bell-shaped curve); the core computation is sketched after this list.

  • MultinomialNB: Ideal for count-based features, such as word counts in text classification tasks like spam detection.
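
The Gaussian variant boils down to fitting a mean and variance per class and feature, then scoring each class by log-likelihood plus log-prior. A compact sketch (illustrative, not the library's exact code):

import numpy as np

X = np.array([[1.0], [1.2], [0.8], [5.0], [5.2], [4.8]])  # one feature
y = np.array([0, 0, 0, 1, 1, 1])

def predict(x):
    scores = []
    for c in np.unique(y):
        Xc = X[y == c]
        mu, var = Xc.mean(axis=0), Xc.var(axis=0) + 1e-9   # per-class Gaussian fit
        log_lik = -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var)
        log_prior = np.log(len(Xc) / len(X))
        scores.append(log_lik + log_prior)
    return int(np.argmax(scores))

print(predict(np.array([1.1])))  # -> 0 (closest to the first cluster)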

📊 5. Clustering (cluster)

  • KMeans: Groups data into 'k' clusters based on similarity (see the sketch after this list).

  • AgglomerativeClustering: Starts with every point as its own cluster and repeatedly merges the closest clusters until a single large cluster is formed.

  • DBSCAN: Groups points that are densely packed together and flags sparse points as noise. No need to specify the number of clusters!

  • MeanShift: Shifts data points toward areas of high density to find clusters.
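
KMeans, the simplest of these, alternates between two steps: assign each point to its nearest centroid, then move each centroid to the mean of its assigned points. A bare-bones sketch (illustrative only):

import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (20, 2)),   # cluster near (0, 0)
               rng.normal(5, 0.5, (20, 2))])  # cluster near (5, 5)

centroids = X[[0, 20]]                        # one starting centroid from each half
for _ in range(10):
    # Assignment step: label each point with its nearest centroid.
    labels = np.argmin(np.linalg.norm(X[:, None] - centroids[None], axis=2), axis=1)
    # Update step: move each centroid to the mean of its assigned points.
    centroids = np.array([X[labels == k].mean(axis=0) for k in range(2)])

print(np.round(centroids, 1))                 # ends up near (0, 0) and (5, 5)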

🌲 6. Ensemble Methods (ensemble)

  • RandomForestClassifier: Combines multiple decision trees, each trained on a random subset of the data, to make stronger decisions.

  • RandomForestRegressor: Predicts continuous values using an ensemble of decision trees.

  • GradientBoostingClassifier: Builds trees sequentially, each one correcting the errors made by the last.

  • VotingClassifier: Combines the predictions of multiple models into a final prediction (majority voting is sketched after this list).
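
Majority voting, the idea behind VotingClassifier (and the aggregation step of a random forest), is just a per-sample vote count. A tiny sketch using hypothetical predictions from three already-trained models:

import numpy as np

preds = np.array([
    [0, 1, 1, 0, 1],   # predictions from model A
    [0, 1, 0, 0, 1],   # predictions from model B
    [1, 1, 1, 0, 0],   # predictions from model C
])

# Majority vote down each column (one column per sample).
final = np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, preds)
print(final)  # [0 1 1 0 1]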

📏 7. Metrics (metrics)

Measure your model's performance:

  • accuracy_score: The fraction of predictions your model got right (see the sketch after this list).

  • f1_score: Balances precision and recall into a single score.

  • roc_curve: Shows the trade-off between the true positive rate and the false positive rate as the decision threshold varies.
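
The arithmetic behind the first two metrics fits in a few lines of NumPy (this mirrors the standard definitions, not necessarily this library's exact code):

import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 1])
y_pred = np.array([1, 0, 0, 1, 1, 1])

accuracy = np.mean(y_true == y_pred)                   # fraction of correct predictions
tp = np.sum((y_pred == 1) & (y_true == 1))             # true positives
precision = tp / np.sum(y_pred == 1)                   # how trustworthy the positive calls are
recall = tp / np.sum(y_true == 1)                      # how many actual positives were found
f1 = 2 * precision * recall / (precision + recall)     # harmonic mean of the two

print(round(accuracy, 2), round(f1, 2))                # 0.67 and 0.75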

⚙️ 8. Model Selection (model_selection)

  • train_test_split: Splits your data into training and test sets (a from-scratch version is sketched after this list).

  • KFold: Splits the data into 'k' folds and trains 'k' times, each time validating on a different fold, for a more reliable performance estimate.
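
What train_test_split does is easy to see from scratch: shuffle the indices, then slice off a test fraction. An illustrative reimplementation (not necessarily the library's code):

import numpy as np

def train_test_split(X, y, test_size=0.25, seed=0):
    idx = np.random.default_rng(seed).permutation(len(X))  # shuffled sample indices
    cut = int(len(X) * (1 - test_size))                    # boundary between the splits
    return X[idx[:cut]], X[idx[cut:]], y[idx[:cut]], y[idx[cut:]]

X, y = np.arange(20).reshape(10, 2), np.arange(10)
X_train, X_test, y_train, y_test = train_test_split(X, y)
print(len(X_train), len(X_test))  # 7 3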

🔍 9. Preprocessing (preprocessing)

  • StandardScaler: Standardizes your data so each feature has a mean of 0 and a standard deviation of 1.

  • LabelEncoder: Converts text labels into numeric labels (e.g., "cat" → 0, "dog" → 1). Both are sketched after this list.
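
Both transforms are essentially one-liners in NumPy, which is roughly all that happens under the hood (illustrative sketch):

import numpy as np

# StandardScaler: subtract the per-feature mean, divide by the per-feature std.
X = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))  # ~[0, 0] and [1, 1]

# LabelEncoder: map each distinct label to an integer code.
labels = np.array(["cat", "dog", "cat", "bird"])
classes, encoded = np.unique(labels, return_inverse=True)
print(classes, encoded)  # ['bird' 'cat' 'dog'] [1 2 1 0]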

🧩 10. Dimensionality Reduction (decomposition)

Dimensionality Reduction helps in simplifying data while retaining most of its valuable information. By reducing the number of features (dimensions) in a dataset, it makes data easier to visualize and speeds up machine learning algorithms.

  • PCA (Principal Component Analysis): Reduces the number of dimensions by finding new, uncorrelated variables called principal components. It projects your data onto a lower-dimensional space while retaining as much variance as possible.

    • How It Works: PCA finds the axes (principal components) that maximize the variance in your data. The first principal component captures the most variance, and each subsequent component captures progressively less. See the sketch below.
    • Use Case: Use PCA when you have many features and want to simplify your dataset for better visualization or faster computation. It is particularly useful when features are highly correlated.
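
The classic eigenvector route makes the idea concrete: center the data, compute the covariance matrix, and project onto its top eigenvectors. A from-scratch sketch (illustrative, not necessarily this library's implementation):

import numpy as np

rng = np.random.default_rng(0)
# 3-D data whose columns have very different spreads (std ~3, ~1, ~0.1).
X = rng.normal(size=(100, 3)) * np.array([3.0, 1.0, 0.1])

Xc = X - X.mean(axis=0)                  # center the data
cov = Xc.T @ Xc / (len(Xc) - 1)          # covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]        # rank components by variance, descending
components = eigvecs[:, order[:2]]       # keep the top 2 principal components

X_reduced = Xc @ components              # project 3-D data down to 2-D
print(X_reduced.shape)                   # (100, 2)
print(np.round(eigvals[order] / eigvals.sum(), 3))  # variance explained per component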

🎯 Why Use This Library?

  • Learning-First Approach: If you're a beginner and want to understand machine learning, this is the library for you. No hidden complexity, just code.
  • No Hidden Magic: Everything is written from scratch, so you can see exactly how each algorithm works.
  • Lightweight: Uses only NumPy, making it fast and easy to run.

🚀 Getting Started

# Clone the repository
git clone https://github.com/adityajn105/MLfromScratch.git

# Navigate to the project directory
cd MLfromScratch

# Install the required dependencies
pip install -r requirements.txt



๐Ÿ‘จโ€๐Ÿ’ป Author

This project is maintained by Aditya Jain.

๐Ÿง‘โ€๐Ÿ’ป Contributors

Contributor: Subrahmanya Gaonkar

We welcome contributions from everyone, especially beginners! If you're new to open source, don't worry; feel free to ask questions, open issues, or submit a pull request.

🤝 How to Contribute

  1. Fork the repository.
  2. Create a new branch (git checkout -b feature-branch).
  3. Make your changes and commit (git commit -m "Added new feature").
  4. Push the changes (git push origin feature-branch).
  5. Submit a pull request and explain your changes.

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.