This repository contains the materials for a project developed at the Indian Institute of Information Technology, Surat, by students Ankit Gondha and Jitanshu Raut under the guidance of Dr. Pradeep Kumar Roy and Dr. Rajesh K. Ahir. The project focuses on detecting SQL injection attacks using a hybrid approach that combines machine learning, natural language processing, and regular expression analysis.
SQL injection (SQLi) is a prevalent security threat to web applications in which an attacker manipulates SQL statements to execute malicious queries and gain unauthorized access to database resources. Our project addresses this risk by developing a detection system that identifies SQL injection attempts with high accuracy.
Web applications are susceptible to various cyber-attacks, and SQL injection is among the most effective because it exploits weaknesses in how an application builds its database queries. Traditional defenses such as input validation and parameterized queries can fall short when they are applied inconsistently or bypassed by sophisticated techniques. An approach that can evolve and adapt to new SQL injection strategies is therefore required.
The main objectives of our project are to:
- Develop a system that effectively detects SQL injection attacks.
- Combine regular expression analysis with advanced machine learning techniques to improve detection accuracy.
- Utilize the BERT model for processing SQL queries to enhance the understanding of complex queries.
Our approach involves three primary techniques:
- Regular Expression Analysis: This technique uses regular expressions to detect patterns and keywords that are commonly associated with SQL injection attacks. It provides a first line of defense by filtering out obviously malicious queries.
- Machine Learning Techniques: We employ several machine learning models to learn from both benign and malicious SQL queries. These models help identify less obvious patterns that might not be caught by regex analysis.
- Natural Language Processing (NLP): Using the BERT model, our system gains a deeper understanding of the context within SQL queries, which helps in distinguishing between harmful and normal activities more accurately.
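The regular expression stage described above can be sketched as follows. This is a minimal illustration, not the project's actual rule set: the pattern list and the names `SUSPICIOUS_PATTERNS`, `regex_flags`, and `is_suspicious` are assumptions chosen for the example.

```python
import re
from typing import List

# Illustrative patterns only (assumed for this sketch, not taken from the project).
SUSPICIOUS_PATTERNS = [
    # UNION-based data extraction, e.g. "UNION SELECT" or "UNION ALL SELECT"
    re.compile(r"\bunion\b\s+(all\s+)?\bselect\b", re.IGNORECASE),
    # Numeric tautology such as OR 1=1 or OR '1'='1
    re.compile(r"\bor\b\s+'?\d+'?\s*=\s*'?\d+'?", re.IGNORECASE),
    # Comment markers often used to cut off the rest of a query
    re.compile(r"(--|#|/\*)"),
    # Stacked statement that modifies data or schema
    re.compile(r";\s*(drop|delete|insert|update|alter)\b", re.IGNORECASE),
    # Keywords used in time-based probing
    re.compile(r"\b(sleep|benchmark|waitfor\s+delay)\b", re.IGNORECASE),
]


def regex_flags(query: str) -> List[str]:
    """Return the pattern source of every rule that matches the query."""
    return [p.pattern for p in SUSPICIOUS_PATTERNS if p.search(query)]


def is_suspicious(query: str) -> bool:
    """Flag the query if at least one pattern matches."""
    return bool(regex_flags(query))
```

A filter like this is deliberately coarse: it can flag harmless queries (for example, any query containing `#`) and miss obfuscated payloads, which is why flagged and unflagged queries alike would still be passed to the learned models.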
- Regex Analysis:
  - Regex patterns are used to match known malicious SQL query structures.
  - This method quickly flags potentially dangerous queries for further analysis.
- Machine Learning Models:
  - Random Forest: A robust ensemble model that classifies queries based on features extracted from the data. It is known for strong accuracy and resistance to overfitting.
  - Support Vector Machines (SVMs): Chosen for their effectiveness in high-dimensional feature spaces, SVMs are used to classify complex query structures.
  - BERT Model: A transformer-based model that excels at capturing the context and semantics of text, which is crucial for analyzing SQL queries.
- Data Preprocessing and Feature Extraction:
  - We use techniques like TF-IDF, bag of words, and word embeddings to convert SQL queries into a format that can be processed by machine learning models.
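To make the feature extraction step concrete, here is a small standard-library sketch of bag-of-words counts and TF-IDF weights for SQL queries. The tokenizer and the function names `tokenize`, `bag_of_words`, and `tfidf` are assumptions for illustration; the project may well use a library vectorizer, and library implementations often apply smoothing that this unsmoothed variant omits.

```python
import math
import re
from collections import Counter
from typing import Dict, List

# Words, numbers, and single punctuation symbols each become one token.
TOKEN_RE = re.compile(r"[A-Za-z_]+|\d+|[^\sA-Za-z_\d]")


def tokenize(query: str) -> List[str]:
    """Lowercase the query and split it into words, numbers, and symbols."""
    return TOKEN_RE.findall(query.lower())


def bag_of_words(query: str) -> Counter:
    """Count how often each token occurs in a single query."""
    return Counter(tokenize(query))


def tfidf(corpus: List[str]) -> List[Dict[str, float]]:
    """Weight each token by tf * idf, with idf = ln(N / df) and no smoothing."""
    bags = [bag_of_words(q) for q in corpus]
    n_docs = len(bags)
    # Iterating a Counter yields each distinct token once, so this counts
    # the number of queries that contain the token (document frequency).
    doc_freq = Counter(token for bag in bags for token in bag)
    vectors = []
    for bag in bags:
        total = sum(bag.values())
        vectors.append({
            token: (count / total) * math.log(n_docs / doc_freq[token])
            for token, count in bag.items()
        })
    return vectors
```

Tokens that appear in every query (such as `select`) receive a weight of zero, while tokens confined to a few queries (such as the `or` and `1` of a tautology) receive positive weight, which is the property that makes these features useful to a classifier.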
Our models are trained and tested on a variety of datasets that include benign and malicious SQL queries. These datasets are sourced from public platforms such as GitHub and Kaggle, providing a diverse range of scenarios for robust testing.
We evaluate our models based on several metrics:
- Accuracy: Overall correctness of the model.
- Precision and Recall: Precision is the share of queries flagged as malicious that really are malicious; recall is the share of malicious queries that the model flags.
- F1 Score: Harmonic mean of precision and recall, providing a balance between the two.
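The metrics above can be computed from true and predicted labels as in the following sketch. It assumes binary labels with 1 meaning malicious and 0 meaning benign; the function name `classification_metrics` is hypothetical, and undefined ratios are reported as 0.0 by convention.

```python
from typing import Dict, Sequence


def classification_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Compute accuracy, precision, recall, and F1 (label 1 = malicious)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # malicious, flagged
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # benign, flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # malicious, missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # benign, passed

    n = len(pairs)
    accuracy = (tp + tn) / n if n else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```

Reporting all four values matters for this task because a detector can reach very high precision while still missing a sizeable share of attacks, which only recall and F1 reveal.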
Our system demonstrated promising results in detecting SQL injection attacks:
- Random Forest showed an accuracy of 81.33%, with a precision of 99.00% and an F1-score of 84.13%.
- SVMs achieved a precision of 93.60%, though they had lower recall, indicating room for improvement.
- BERT Model excelled with an accuracy of 98.11%, a near-perfect precision of 99.99%, and an F1-score of 98.07%, indicating very high efficacy in detecting SQL injections.
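Recall is not listed for every model, but where precision $P$ and the F1-score are both reported it can be estimated by rearranging the F1 definition. The values below are derived from the rounded figures above, so they are approximations rather than measured results:

$$
F_1 = \frac{2PR}{P + R} \quad\Longrightarrow\quad R = \frac{F_1 \, P}{2P - F_1}
$$

$$
R_{\text{Random Forest}} \approx \frac{0.8413 \times 0.9900}{2 \times 0.9900 - 0.8413} \approx 0.731,
\qquad
R_{\text{BERT}} \approx \frac{0.9807 \times 0.9999}{2 \times 0.9999 - 0.9807} \approx 0.962
$$

Under this estimate, Random Forest rarely raises false alarms but misses roughly a quarter of malicious queries, while BERT keeps both error types low.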