This repository contains the full codebase and outputs for the thesis project:
"Trait Prediction from Dutch Company Names using Surface and Semantic Features."
We investigate whether letter-level patterns and semantic embeddings can predict human perceptions of four traits: Femininity, Evilness, Trustworthiness, and Smartness.
We trained and evaluated models using two algorithms:
- ElasticNet (for interpretable linear baselines)
- Feedforward Neural Networks (FFNNs) (for non-linear modeling)
We tested four feature sets:
- 🟦 Unigrams (letter counts)
- 🟩 Bigrams (letter pairs)
- 🟨 RobBERT (Dutch semantic embeddings)
- 🟥 Combined (all features together)
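The two surface feature sets above can be illustrated with a minimal sketch. It is not the code from `main.ipynb`: the function names, the restriction to plain `a-z`, and the choice to count bigrams only within words (not across spaces) are all assumptions made for illustration.

```python
import re
import string
from collections import Counter

# Assumption: only plain a-z letters are counted; accented characters,
# digits, and punctuation are ignored in this simplified sketch.
ALPHABET = string.ascii_lowercase


def _words(name: str) -> list[str]:
    """Lowercase the name and split it into runs of a-z letters."""
    return re.findall(r"[a-z]+", name.lower())


def unigram_counts(name: str) -> list[int]:
    """Return a 26-dimensional vector of letter counts (a..z)."""
    counts = Counter("".join(_words(name)))
    return [counts[letter] for letter in ALPHABET]


def bigram_counts(name: str) -> Counter:
    """Count adjacent letter pairs inside each word of the name."""
    counts: Counter = Counter()
    for word in _words(name):
        counts.update(word[i:i + 2] for i in range(len(word) - 1))
    return counts


example = "Albert Heijn"
print(unigram_counts(example))
print(bigram_counts(example).most_common(3))
```

A fixed-length bigram vector could be built the same way by indexing the counter against a fixed vocabulary of letter pairs.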
| File/Folder | Description |
|---|---|
| `main.ipynb` | Full pipeline for all experiments and models |
| `models/` | Trained models, metrics, feature importances, and plots |
⚠️ Some preprocessing steps are not in the notebook, but all final modeling data is included.
With this repository, you can:
- ✔️ Compare model performance across all four traits
- 🔍 See ElasticNet feature importances per trait (letters, bigrams, or embeddings)
- 🧪 View charts not shown in the thesis
- 🔁 Adapt the code to predict traits in other domains (e.g., product names)
We trained ElasticNet regressors and Feedforward Neural Networks (FFNNs) on each of the four feature sets: Unigram, Bigram, RobBERT, and Combined.
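As a rough illustration of the ElasticNet side of this setup, the sketch below fits a scikit-learn `ElasticNet` on synthetic data and reads off the largest coefficients. The data, the `alpha` and `l1_ratio` values, and the split are placeholders chosen for the example, not the thesis settings.

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for a feature matrix: 200 "names" x 26 letter counts.
X = rng.poisson(lam=0.5, size=(200, 26)).astype(float)

# Synthetic target: only three features carry signal (illustrative only).
true_coef = np.zeros(26)
true_coef[[0, 4, 8]] = [0.8, -0.5, 0.3]
y = X @ true_coef + rng.normal(scale=0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Hypothetical hyperparameters; the thesis values may differ.
model = ElasticNet(alpha=0.01, l1_ratio=0.5)
model.fit(X_train, y_train)

test_r2 = r2_score(y_test, model.predict(X_test))

# Rank features by absolute coefficient size, largest first.
top = np.argsort(np.abs(model.coef_))[::-1][:3]
print(f"test R^2: {test_r2:.3f}")
print("top feature indices:", top.tolist())
```

Because the model is linear, each coefficient maps directly to one input feature, which is what makes per-trait feature importances straightforward to read.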
Each model's results (train/val/test R², feature count, etc.) are stored in:
- `models/{model_name}/ffnn_metrics.csv`
- `models/{model_name}/all_models_metrics.csv`
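To locate these metric files programmatically, something like the following could work. It relies only on the two file names listed above; the helper name `find_metric_files` is hypothetical, and the columns inside each CSV are not assumed here.

```python
from __future__ import annotations

from pathlib import Path

# File names as listed in this README.
METRIC_FILENAMES = ("ffnn_metrics.csv", "all_models_metrics.csv")


def find_metric_files(models_dir: str | Path = "models") -> dict[str, list[Path]]:
    """Map each model folder name to the metric CSVs found inside it."""
    root = Path(models_dir)
    if not root.is_dir():
        return {}  # nothing to report if the folder is missing

    found: dict[str, list[Path]] = {}
    for model_dir in sorted(root.iterdir()):
        if not model_dir.is_dir():
            continue
        files = [
            model_dir / name
            for name in METRIC_FILENAMES
            if (model_dir / name).is_file()
        ]
        if files:
            found[model_dir.name] = files
    return found
```

Each returned path can then be loaded with `pandas.read_csv` to compare models side by side.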
All test plots and coefficient visualizations are in the `models/` folder, including additional ones not shown in the thesis.
Tested on Windows 10, with Python 3.13.2.
Install the dependencies with `pip install -r requirements.txt`. The requirements include:
- scikit-learn==1.4.2
- torch==2.7.0
- transformers==4.51.3
- pandas, numpy, matplotlib, joblib
This work explores how linguistic cues in names, from individual letters to semantic embeddings, shape trait perception. All results are explained in detail in the accompanying thesis.
If this repo helps your research, please cite the thesis or link back to this repository.