This project performs sentiment analysis on customer reviews. It covers data preprocessing, TF-IDF vectorization, and classification with logistic regression, random forest, and XGBoost models. Logistic regression performed best, with an AUC of 0.878 and an accuracy of 78.4%.

Features:

- Preprocess customer reviews by removing duplicates, converting text to lowercase, removing punctuation, and applying lemmatization and stemming
- Vectorize textual data using TF-IDF
- Train and evaluate logistic regression, random forest, and XGBoost models
- Perform cross-validation and hyperparameter tuning using GridSearchCV
- Display model performance metrics including AUC score and accuracy
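
The first preprocessing steps listed above (deduplication, lowercasing, punctuation removal) can be sketched with pandas as follows. This is a minimal illustration, not the project's actual code: the function name `clean_reviews` and the `review` column name are assumptions. Lemmatization and stemming are left out to keep the sketch free of corpus downloads; they would slot in after this step, for example with NLTK's `WordNetLemmatizer` and `PorterStemmer`.

```python
import string

import pandas as pd

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_reviews(df: pd.DataFrame, text_col: str = "review") -> pd.DataFrame:
    """Drop duplicate reviews, lowercase the text, and strip punctuation.

    `text_col` is a hypothetical column name; adjust it to the real dataset.
    """
    # Remove rows whose review text is an exact duplicate of an earlier row.
    out = df.drop_duplicates(subset=text_col).copy()
    out[text_col] = (
        out[text_col]
        .str.lower()                  # convert to lowercase
        .str.translate(_PUNCT_TABLE)  # remove punctuation
        .str.split()                  # split on any whitespace
        .str.join(" ")                # rejoin with single spaces
    )
    return out.reset_index(drop=True)
```

Deduplicating before normalization only removes exact duplicates; reviews that differ only in case or punctuation would survive, so the order could be reversed if that matters for the dataset.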

Technologies:

- Python
- Scikit-learn
- XGBoost
- NLTK
- Pandas
- Jupyter Notebook
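
The vectorization, tuning, and evaluation steps can be sketched with scikit-learn as shown here. This is a hedged example under stated assumptions rather than the project's implementation: the function name `fit_and_score`, the 75/25 split, the 3-fold cross-validation, and the grid over `C` are all illustrative choices that do not come from the project.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline


def fit_and_score(texts, labels, seed=0):
    """Tune a TF-IDF + logistic regression pipeline and score it on held-out data.

    `texts` is a list of cleaned review strings and `labels` a list of 0/1
    sentiment labels (both hypothetical inputs).
    """
    X_train, X_test, y_train, y_test = train_test_split(
        texts, labels, test_size=0.25, stratify=labels, random_state=seed
    )

    # Putting the vectorizer inside the pipeline means it is refit on each
    # cross-validation fold, so no vocabulary leaks from validation data.
    pipeline = Pipeline([
        ("tfidf", TfidfVectorizer()),
        ("clf", LogisticRegression(max_iter=1000)),
    ])

    # Illustrative grid: only the regularization strength C is tuned.
    search = GridSearchCV(
        pipeline,
        param_grid={"clf__C": [0.1, 1.0, 10.0]},
        scoring="roc_auc",
        cv=3,
    )
    search.fit(X_train, y_train)

    # AUC uses the predicted probability of the positive class;
    # accuracy uses the hard class predictions.
    proba = search.predict_proba(X_test)[:, 1]
    preds = search.predict(X_test)
    return {
        "best_params": search.best_params_,
        "auc": roc_auc_score(y_test, proba),
        "accuracy": accuracy_score(y_test, preds),
    }
```

To compare the other models, the `clf` step could be swapped for `RandomForestClassifier` from scikit-learn or `XGBClassifier` from XGBoost, with a parameter grid suited to each.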