PH Language Similarity Analysis

This project analyzes the textual similarity of 16 languages: 13 Philippine languages plus English, Spanish, and Yami. The goal is to quantify how similar these languages are by comparing their character-level n-gram profiles, which capture orthographic and distributional patterns in the text.

The analysis is based on a 1.2-million-word multilingual corpus collected from online Bible texts. Each language is represented by its character 3-gram (trigram) features, weighted by TF-IDF, and every pair of languages is then compared using cosine similarity.
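To make the comparison concrete, here is a minimal standard-library sketch of character trigram profiles and cosine similarity. It uses raw trigram counts rather than TF-IDF weights, and the three sample strings are toy inputs invented for illustration, not text from the project corpus. The function names `char_trigrams` and `cosine` are hypothetical and do not come from the notebooks.

```python
import math
import re
from collections import Counter


def char_trigrams(text: str) -> Counter:
    """Count overlapping character 3-grams in lowercased, letters-only text."""
    # Keep letters and whitespace only (assumption: non-ASCII letters are kept).
    cleaned = re.sub(r"[^\w\s]|[\d_]", "", text.lower())
    cleaned = re.sub(r"\s+", " ", cleaned).strip()
    return Counter(cleaned[i:i + 3] for i in range(len(cleaned) - 2))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse trigram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy inputs for illustration only.
sample_a = char_trigrams("magandang umaga sa inyong lahat")
sample_b = char_trigrams("magandang gabi sa inyong lahat")
sample_c = char_trigrams("good morning to all of you")

sim_ab = cosine(sample_a, sample_b)
sim_ac = cosine(sample_a, sample_c)
print(f"a vs b: {sim_ab:.3f}")
print(f"a vs c: {sim_ac:.3f}")
```

Texts that share many trigrams score closer to 1, and texts with little overlap score closer to 0. The project adds TF-IDF weighting on top of this idea so that trigrams common to every language contribute less.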

Final Results

Language Similarity Heatmap

Language Family Dendrogram

Technologies Used

Python
Data Collection: Selenium, BeautifulSoup
Data Analysis: Pandas, NumPy
NLP/ML: scikit-learn (TF-IDF, Cosine Similarity), SciPy (Hierarchical Clustering)
Data Visualization: Matplotlib, Seaborn
Development Environment: Jupyter Notebook

Data Pipeline

The project follows a 6-step data science pipeline:

Data Collection: A custom web scraper (using Selenium and BeautifulSoup) collected a 1.2-million-word corpus from online Bible texts in 16 languages.
Text Cleaning: The raw text for each language was normalized using regex to remove all punctuation, numbers, and special characters, leaving only lowercase alphabetic characters and whitespace.
N-Gram Generation: Each cleaned language corpus was tokenized into character 3-grams (trigrams). A master vocabulary of all unique n-grams was created (6,378 features).
Feature Engineering (TF-IDF): A TF-IDF (Term Frequency-Inverse Document Frequency) matrix was constructed from the n-gram frequencies. This weights each 3-gram by how characteristic it is of a specific language relative to the corpus as a whole.
Similarity Analysis: Cosine similarity was applied to the TF-IDF matrix to produce a 16x16 similarity matrix containing a score for every pair of languages.
Visualization: The final similarity matrix was visualized as a heatmap (using Seaborn) and a dendrogram (using SciPy) to model the language family relationships.
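The cleaning, feature, similarity, and clustering steps above can be sketched end to end with scikit-learn and SciPy. This is an illustration under stated assumptions, not the code from the notebooks: the three-entry `corpus` is toy data, `clean_text` is a hypothetical helper, and the choices of `analyzer="char"` (trigrams may span spaces), keeping non-ASCII letters such as ñ, and average linkage are guesses that the description does not confirm.

```python
import re

import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import squareform
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def clean_text(raw: str) -> str:
    """Lowercase and keep only letters and single spaces (hypothetical helper)."""
    letters_only = re.sub(r"[^\w\s]|[\d_]", "", raw.lower())
    return re.sub(r"\s+", " ", letters_only).strip()


# Toy stand-in for the per-language corpora (assumption, not project data).
corpus = {
    "lang_a": "Magandang umaga sa inyong lahat! 123",
    "lang_b": "Magandang gabi sa inyong lahat.",
    "lang_c": "Good morning to all of you.",
}
names = list(corpus)
docs = [clean_text(corpus[name]) for name in names]

# One row per language, one column per character trigram, TF-IDF weighted.
vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(3, 3))
tfidf = vectorizer.fit_transform(docs)

# Pairwise cosine similarity: an N x N matrix for N languages.
similarity = cosine_similarity(tfidf)

# Convert similarity to distance for hierarchical clustering.
distance = np.clip(1.0 - similarity, 0.0, None)
np.fill_diagonal(distance, 0.0)
tree = linkage(squareform(distance, checks=False), method="average")

# no_plot=True returns the leaf order without requiring Matplotlib.
leaf_order = dendrogram(tree, labels=names, no_plot=True)["ivl"]
print(np.round(similarity, 3))
print(leaf_order)
```

With the real corpus, `similarity` would be the 16x16 matrix passed to the Seaborn heatmap, and `tree` would be the linkage passed to SciPy's `dendrogram` for plotting.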

How to Run

The project is broken into several Jupyter notebooks that should be run in order:

scraper.ipynb: Collects the raw text data.
cleaner.ipynb: Cleans the raw text files.
ngrams_and_vocab.ipynb: Generates 3-grams and the master vocabulary.
feature_engineering.ipynb: Calculates the TF-IDF matrix.
similarity_matrix.ipynb: Computes the final cosine similarity matrix and generates the heatmap.
dendrogram.ipynb: Generates the hierarchical clustering dendrogram.
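If you prefer to run the notebooks non-interactively, the order above could be scripted with `jupyter nbconvert`. The sketch below is a convenience suggestion rather than part of the project: it assumes the notebooks live in a `notebooks/` folder and that Jupyter is installed, and the helper names `build_commands` and `run_all` are hypothetical. Note that the scraper notebook fetches data from the web, so you may want to skip it if the raw text is already present.

```python
import subprocess
from pathlib import Path

# Execution order taken from the list above.
NOTEBOOK_ORDER = [
    "scraper.ipynb",
    "cleaner.ipynb",
    "ngrams_and_vocab.ipynb",
    "feature_engineering.ipynb",
    "similarity_matrix.ipynb",
    "dendrogram.ipynb",
]


def build_commands(notebook_dir: str = "notebooks") -> list[list[str]]:
    """Build one nbconvert command per notebook, in pipeline order."""
    return [
        [
            "jupyter", "nbconvert",
            "--to", "notebook",
            "--execute", "--inplace",
            str(Path(notebook_dir) / name),
        ]
        for name in NOTEBOOK_ORDER
    ]


def run_all(notebook_dir: str = "notebooks") -> None:
    """Execute every notebook in order, stopping at the first failure."""
    for command in build_commands(notebook_dir):
        subprocess.run(command, check=True)


# Usage (not executed here): run_all() from the repository root.
```

Because `check=True` raises on a non-zero exit code, a failure in an early notebook stops the run before later steps consume incomplete outputs.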
