
Token Reducer - Politeness Stripper

 ____        ___        __                                           
/\  _`\     /\_ \    __/\ \__                                        
\ \ \L\ \___\//\ \  /\_\ \ ,_\    __    ___      __    ____    ____  
 \ \ ,__/ __`\\ \ \ \/\ \ \ \/  /'__`\/' _ `\  /'__`\ /',__\  /',__\ 
  \ \ \/\ \L\ \\_\ \_\ \ \ \ \_/\  __//\ \/\ \/\  __//\__, `\/\__, `\
   \ \_\ \____//\____\\ \_\ \__\ \____\ \_\ \_\ \____\/\____/\/\____/
    \/_/\/___/ \/____/ \/_/\/__/\/____/\/_/\/_/\/____/\/___/  \/___/ 
                                                                     
                                                                     
 ____    __                                               
/\  _`\ /\ \__         __                                 
\ \,\L\_\ \ ,_\  _ __ /\_\  _____   _____      __   _ __  
 \/_\__ \\ \ \/ /\`'__\/\ \/\ '__`\/\ '__`\  /'__`\/\`'__\
   /\ \L\ \ \ \_\ \ \/ \ \ \ \ \L\ \ \ \L\ \/\  __/\ \ \/ 
   \ `\____\ \__\\ \_\  \ \_\ \ ,__/\ \ ,__/\ \____\\ \_\ 
    \/_____/\/__/ \/_/   \/_/\ \ \/  \ \ \/  \/____/ \/_/ 
                              \ \_\   \ \_\               
                               \/_/    \/_/               

Note: This project is currently under active development. Features and usage may change.

Token Reducer is a Python CLI tool that automatically removes polite expressions and greetings from text. It is especially useful as a preprocessing step in NLP applications or for reducing token counts.

Features

  • Automatically removes polite expressions (e.g., "thanks", "best regards", "please", etc.)
  • Cleans up greetings and sign-offs
  • Easy to use from the command line
  • Can be integrated into pipelines with stdin/stdout
  • The model is trained on the Intel/polite-guard dataset
  • Provides detailed reports of removed polite expressions

Technologies & Algorithms

  • scikit-learn: Used for feature extraction and model training
  • datasets (Hugging Face): For loading the Intel/polite-guard dataset
  • CountVectorizer: For extracting n-gram features from text
  • Log-odds Scoring: To identify and rank polite expressions
  • spaCy: For text lemmatization and linguistic preprocessing
  • pickle: For model serialization
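
For illustration, here is a minimal sketch of how log-odds scoring over CountVectorizer n-grams can rank polite expressions. The variable names, label convention (1 = polite), and the (1, 3) n-gram range are assumptions for the sketch, not the project's exact training code.

# Minimal sketch: ranking polite n-grams by log-odds (illustrative only).
# Assumes `texts` and `labels`, where label 1 marks polite examples.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

def top_polite_ngrams(texts, labels, k=250):
    vec = CountVectorizer(ngram_range=(1, 3), lowercase=True)
    X = vec.fit_transform(texts)
    labels = np.asarray(labels)

    # Per-n-gram counts in the polite vs. non-polite subsets (+1 smoothing).
    polite_counts = np.asarray(X[labels == 1].sum(axis=0)).ravel() + 1
    other_counts = np.asarray(X[labels == 0].sum(axis=0)).ravel() + 1

    # Log-odds of seeing each n-gram in polite text relative to other text.
    log_odds = (np.log(polite_counts / polite_counts.sum())
                - np.log(other_counts / other_counts.sum()))

    terms = np.array(vec.get_feature_names_out())
    top = np.argsort(log_odds)[::-1][:k]
    return dict(zip(terms[top], log_odds[top]))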

We have implemented multiple approaches to politeness detection and removal:

  1. Original Approach: Used CountVectorizer and log-odds scoring to select the top 250 polite n-grams and build a regex-based stripper.
  2. TF–IDF + Logistic Regression pipeline: Added a TfidfVectorizer followed by a LogisticRegression model inside a scikit-learn Pipeline (see the sketch after this list), enabling:
    • Weighted n-gram features (TF–IDF) for more nuanced text representation.
    • L2 regularization in logistic regression to prevent overfitting.
    • Single-step fit/predict calls and seamless integration with GridSearchCV for hyperparameter tuning.
  3. Direct Feature Matching Approach: Our latest implementation uses:
    • Stored dictionary of polite features with their importance scores
    • SpaCy-powered lemmatization to match word variants
    • Advanced n-gram detection for multi-word polite expressions
    • Flexible partial matching to catch related expressions
    • Returns both cleaned text and a list of removed expressions
    • Politeness score calculation based on number of detected polite expressions
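
A minimal sketch of the approach 2 pipeline, assuming a binary polite/non-polite label; the hyperparameter values shown are illustrative and may differ from what train_model.py actually uses.

# Minimal sketch of approach 2: TF-IDF features + L2-regularized logistic
# regression in a single scikit-learn Pipeline (illustrative settings).
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("clf", LogisticRegression(penalty="l2", max_iter=1000)),
])

# Single-step fit/predict, plus hyperparameter tuning with GridSearchCV.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1.0, 10.0],
}
search = GridSearchCV(pipeline, param_grid, scoring="f1_macro", cv=5)
# search.fit(train_texts, train_labels)
# predictions = search.predict(test_texts)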

The direct feature matching approach offers several advantages:

  • More transparent results (shows exactly what was removed)
  • Better handling of context through n-gram detection
  • More flexible matching through lemmatization
  • Simpler configuration through adjustable thresholds
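
For illustration, here is a minimal sketch of the direct feature matching idea described above: lemmatize the input with spaCy, scan for stored polite n-grams, and return the cleaned text, the removed expressions, and a politeness score. The function name strip_polite, the feature-dictionary format, and the scoring formula are assumptions for the sketch and may not match the project's strip_polite.py exactly.

# Minimal sketch of direct feature matching (names are illustrative).
import pickle
import spacy

nlp = spacy.load("en_core_web_sm")

def strip_polite(text, polite_features, threshold=0.0):
    """Remove stored polite n-grams from `text` via lemma matching.

    `polite_features` maps lemmatized n-grams to importance scores,
    e.g. {"please": 2.1, "thank you": 3.4}; only features whose score
    exceeds `threshold` are removed.
    Returns (cleaned_text, removed_expressions, politeness_score).
    """
    doc = nlp(text)
    lemmas = [tok.lemma_.lower() for tok in doc]

    removed, drop = [], set()
    # Check every n-gram (longest first) against the stored polite features.
    for n in range(3, 0, -1):
        for i in range(len(lemmas) - n + 1):
            span = range(i, i + n)
            if any(j in drop for j in span):
                continue
            ngram = " ".join(lemmas[i:i + n])
            if polite_features.get(ngram, 0.0) > threshold:
                removed.append(" ".join(doc[j].text for j in span))
                drop.update(span)

    cleaned = " ".join(tok.text for k, tok in enumerate(doc) if k not in drop)
    score = len(removed) / max(len(lemmas), 1)
    return cleaned, removed, score

# Usage (model/polite_features.pkl holds the feature dictionary):
# polite_features = pickle.load(open("model/polite_features.pkl", "rb"))
# cleaned, removed, score = strip_polite("Could you please help, thank you", polite_features)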

Performance Metrics

Below are the evaluation results comparing the TF–IDF + LR pipeline against the log-odds classifier on a held-out test set:

Evaluation figures (see the repository): Confusion Matrix (Pipeline), Confusion Matrix (Log-Odds), Precision-Recall Curves, and ROC Curves.

Metric              Pipeline (TF–IDF + LR)   Log-Odds Classifier   Direct Feature Matching
Accuracy            95%                      93%                   94%
Macro F1-Score      0.95                     0.93                  0.94
ROC AUC             0.989                    0.973                 N/A
Average Precision   0.99                     0.97                  N/A
Interpretability    Low                      Medium                High

The direct feature matching approach provides an excellent balance between accuracy and interpretability.
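
For reference, here is a minimal sketch of how the reported metrics can be computed with scikit-learn. The actual evaluate.py also renders the plots listed above; this snippet only shows the metric calls, and the function name is illustrative.

# Minimal sketch of computing the reported metrics with scikit-learn.
from sklearn.metrics import (accuracy_score, f1_score, roc_auc_score,
                             average_precision_score, confusion_matrix)

def report(y_true, y_pred, y_score=None):
    metrics = {
        "accuracy": accuracy_score(y_true, y_pred),
        "macro_f1": f1_score(y_true, y_pred, average="macro"),
        "confusion_matrix": confusion_matrix(y_true, y_pred),
    }
    # ROC AUC and average precision need probability scores, which the
    # direct feature matcher does not produce (hence N/A in the table).
    if y_score is not None:
        metrics["roc_auc"] = roc_auc_score(y_true, y_score)
        metrics["avg_precision"] = average_precision_score(y_true, y_score)
    return metrics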

Installation

  1. Install the required packages:
    pip install -r requirements.txt
  2. Install spaCy model:
    python -m spacy download en_core_web_sm
  3. Ensure the model directory exists:
    mkdir -p model
  4. Train the model (if not using the pre-trained model):
    python train_model.py

Usage

Web Interface (Streamlit App)

The project includes a web-based interface built with Streamlit that makes it even easier to use:

streamlit run app.py

This will start a local web server and open the interface in your browser. The web UI features:

  • Interactive text input and file upload options
  • Real-time politeness analysis with metrics
  • Side-by-side text comparison
  • Visual representation of removed expressions
  • Token reduction statistics
  • Adjustable politeness threshold
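
A minimal sketch of what such a Streamlit front end can look like; the widget layout and the strip_polite import are assumptions for the sketch and may not match the actual app.py.

# Minimal Streamlit sketch (illustrative; the actual app.py may differ).
# Assumes a strip_polite(text, features, threshold) helper as sketched above.
import pickle
import streamlit as st
from strip_polite import strip_polite

st.title("Token Reducer - Politeness Stripper")

threshold = st.slider("Politeness threshold", 0.0, 1.0, 0.0)
text = st.text_area("Enter text with polite expressions")

if st.button("Strip politeness") and text:
    features = pickle.load(open("model/polite_features.pkl", "rb"))
    cleaned, removed, score = strip_polite(text, features, threshold)

    col1, col2 = st.columns(2)
    col1.subheader("Original")
    col1.write(text)
    col2.subheader("Cleaned")
    col2.write(cleaned)

    st.metric("Politeness score", f"{score:.4f}")
    st.metric("Tokens removed", len(text.split()) - len(cleaned.split()))
    st.write("Removed expressions:", removed or "none")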

Command Line Interface

The project includes a user-friendly command-line interface that allows you to interactively strip polite expressions from text:

  1. Run the CLI tool:

    python CLI.py
  2. The CLI will display a welcome banner and prompt you to enter sentences.

  3. Type your text with polite expressions and press Enter to see the cleaned version.

  4. Type 'exit' when you want to quit the program.

Example session:

>  Could you please help me with this task, thank you

Politeness Score: 0.2000

Cleaned Sentence:
Could you help me with this task

Removed Features:
- please
- thank you

> exit
Goodbye!

The CLI uses Rich for colored output to enhance readability of the results.
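
A minimal sketch of the interactive loop, assuming the strip_polite helper sketched earlier; the prompts and Rich markup are illustrative rather than CLI.py's exact implementation.

# Minimal sketch of the CLI loop (illustrative; CLI.py may differ).
import pickle
from rich.console import Console
from strip_polite import strip_polite

console = Console()
features = pickle.load(open("model/polite_features.pkl", "rb"))

console.print("[bold]Token Reducer - Politeness Stripper[/bold]")
while True:
    sentence = console.input("> ")
    if sentence.strip().lower() == "exit":
        console.print("Goodbye!")
        break
    cleaned, removed, score = strip_polite(sentence, features)
    console.print(f"Politeness Score: [cyan]{score:.4f}[/cyan]")
    console.print(f"Cleaned Sentence:\n{cleaned}")
    if removed:
        console.print("Removed Features:")
        for feat in removed:
            console.print(f"- [red]{feat}[/red]")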

Files

  • strip_polite.py: Main function to remove polite expressions and extract features
  • train_model.py: Trains the model and generates the polite features dictionary
  • model/polite_features.pkl: Trained model with polite expression weights
  • CLI.py: Interactive command-line interface
  • evaluate.py: Script for detailed performance evaluation and visualization
  • requirements.txt: Required Python packages

License

This project is licensed under the MIT License.
