The rise of online platforms has created new avenues for expression but also increased the risk of cyberbullying, which negatively impacts mental health and social well-being.
SafeSpace is a machine learning and NLP-based framework designed to detect cyberbullying behaviour in real time, with a particular focus on Hinglish (Hindi-English code-mixed) text.
The purpose of this project is to:
- Detect and classify online bullying behaviour in digital conversations.
- Provide instant feedback to discourage harmful content.
- Contribute towards creating safe, inclusive, and empathetic online environments.
- Hinglish Bullying Detection Dataset (public repository)
- Contains text entries labeled as bullying or non-bullying.
- Includes multilingual content with colloquial Hinglish phrases.
- Additional preprocessing resources:
  - `stopwords.txt` – custom stopword list
  - `final_dataset_hinglish.csv` – cleaned dataset used in training
The methodology followed in this project consists of:
- **Data Cleaning & Preprocessing**
- Removal of special characters, punctuation, extra spaces
- Stopword removal
- Tokenization using TensorFlow’s Keras Tokenizer
- Padding sequences to a maximum length of 200 tokens
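The preprocessing steps above can be sketched as follows. This is a minimal stand-in using only the standard library (the project itself uses the Keras Tokenizer and `pad_sequences`, which pad at the front by default); the stopword set, sample vocabulary, and post-padding are illustrative assumptions:

```python
import re

STOPWORDS = {"hai", "ka", "ki", "the", "is"}  # stand-in for stopwords.txt
MAX_LEN = 200                                 # maximum sequence length from the README

def preprocess(text, word_index):
    # Remove special characters and punctuation, collapse extra spaces
    text = re.sub(r"[^\w\s]", " ", text.lower())
    # Tokenize on whitespace and drop stopwords
    tokens = [t for t in text.split() if t not in STOPWORDS]
    # Map tokens to integer ids (0 for out-of-vocabulary words)
    ids = [word_index.get(t, 0) for t in tokens]
    # Pad (or truncate) to a fixed length of 200 tokens
    return (ids + [0] * MAX_LEN)[:MAX_LEN]

vocab = {"tu": 1, "bahut": 2, "accha": 3}     # illustrative word index
seq = preprocess("Tu bahut accha hai!!", vocab)
print(len(seq), seq[:4])  # 200 [1, 2, 3, 0]
```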
- **Model Architecture – Hybrid CNN-BiLSTM**
- Embedding Layer → Converts words into 128-dimensional vectors
- CNN Layer → Extracts local features from text
- Max Pooling → Reduces dimensionality
- BiLSTM Layer → Captures long-term sequential dependencies
- Dense + Softmax Layer → Classifies as bullying or non-bullying
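The layer stack above can be sketched in Keras as below. The embedding dimension (128) and input length (200) follow the README; the convolution filter count, kernel size, and LSTM units are placeholder assumptions, as is the vocabulary size:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

VOCAB_SIZE = 10000  # placeholder; set from the fitted tokenizer in practice
MAX_LEN = 200

model = models.Sequential([
    layers.Input(shape=(MAX_LEN,)),
    layers.Embedding(VOCAB_SIZE, 128),                 # 128-dimensional word vectors
    layers.Conv1D(64, 5, activation="relu"),           # local n-gram features
    layers.MaxPooling1D(4),                            # reduce sequence length
    layers.Bidirectional(layers.LSTM(64)),             # long-range context, both directions
    layers.Dense(2, activation="softmax"),             # bullying / non-bullying
])
print(model.output_shape)  # (None, 2)
```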
- **Training Setup**
- Optimizer: Adam (learning rate = 0.001)
- Loss: Sparse categorical crossentropy
- Epochs: 10 | Batch Size: 32
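The compile/fit calls corresponding to this setup look roughly as follows. The optimizer, learning rate, loss, and batch size are from the README; the tiny stand-in model, random dummy data, and single epoch (the project uses 10) are purely for illustration:

```python
import numpy as np
import tensorflow as tf

# Tiny stand-in model so the snippet runs quickly; the real project
# uses the CNN-BiLSTM architecture described earlier.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(200,)),
    tf.keras.layers.Embedding(1000, 16),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(2, activation="softmax"),
])
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),  # README setting
    loss="sparse_categorical_crossentropy",                   # README setting
    metrics=["accuracy"],
)

rng = np.random.default_rng(0)
X = rng.integers(0, 1000, size=(64, 200))  # dummy padded token-id sequences
y = rng.integers(0, 2, size=(64,))         # dummy 0/1 labels
history = model.fit(X, y, epochs=1, batch_size=32, verbose=0)
```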
- **Evaluation Metrics**
- Accuracy, Precision, Recall, F1-Score
- Training Accuracy: 99.66%
- Validation Accuracy: 91.02%
- Precision: 91.45%
- Recall: 90.65%
- F1-Score: 91.05%
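For reference, the reported precision, recall, and F1 relate to each other as computed below from confusion-matrix counts. The counts here are illustrative, not the project's actual confusion matrix:

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp)            # how many flagged messages were truly bullying
    recall = tp / (tp + fn)               # how many bullying messages were caught
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
    return precision, recall, f1

# Illustrative counts only
p, r, f1 = precision_recall_f1(tp=50, fp=5, fn=10)
print(round(p, 4), round(r, 4), round(f1, 4))
```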
These results show that the model detects bullying behaviour effectively while keeping false positives low, though the gap between training accuracy (99.66%) and validation accuracy (91.02%) suggests some overfitting.
- The CNN-BiLSTM hybrid model is effective in handling multilingual and informal Hinglish text.
- Most misclassifications occurred with ambiguous or idiomatic expressions.
- Python 3.8+
- Jupyter Notebook
- Libraries: `numpy`, `pandas`, `scikit-learn`, `tensorflow`/`keras`, `nltk`
```bash
# Clone the repository
git clone https://github.com/VaishnavThorwat/SafeSpace__Digital_Behaviour_Monitor.git
cd SafeSpace__Digital_Behaviour_Monitor

# Create a virtual environment (optional but recommended)
python -m venv venv
source venv/bin/activate   # macOS/Linux
venv\Scripts\activate      # Windows

# Install dependencies
pip install -r requirements.txt

# Launch Jupyter Notebook
jupyter notebook
```

- Train or retrain the model: run `SafeSpaceModel.ipynb`
- Simulate client-side inputs: run `Client.ipynb`
- Run server-side classification: run `Server.ipynb`
- The pretrained model (`CNNBILSTM.pkl`) and TF-IDF vocabulary are provided for direct inference.
- Performance decreases with ambiguous or idiomatic Hinglish expressions.
- Trained only on a Hinglish dataset; multilingual scalability needs improvement.
- Integrating transformer models (BERT, DistilBERT) for better contextual handling.
- Extending to regional languages (Marathi, Bengali, etc.).
- Developing a real-time dashboard with visualization of bullying patterns.
- Embedding ethical safeguards – data anonymization, privacy-first architecture.
- **File Structure**
  - `SafeSpaceModel.ipynb` – model training & evaluation
  - `Client.ipynb` – simulated client interactions
  - `Server.ipynb` – backend inference pipeline
  - `CNNBILSTM.pkl` – pre-trained model
  - `tfidf_vector_vocabulary.pkl` – vocabulary file
  - `final_dataset_hinglish.csv` – training dataset
  - `stopwords.txt` – custom stopword list
- **Research Paper**
- Title: SafeSpace: Digital Behaviour Monitor
- Published: Terna Engineering College, 2025
Licensed under the MIT License. See LICENSE for details.



