This repository implements a geoparsing pipeline for extracting and disambiguating geographic location entities from text using NLP and deep learning techniques. The system combines sequence labeling models for location extraction with gazetteer-based heuristics for coordinate resolution, enabling accurate mapping of textual location mentions to real-world geographic coordinates.
This project was developed as part of an undergraduate honours research project and evaluates classical ML and deep learning models across both extraction and disambiguation tasks.
The pipeline consists of two core stages:

1. **Location Extraction**: identifies geographic entities in text using IOB tagging with:
   - Baseline lexical models
   - SVM
   - CRF
   - BiLSTM-GRU
   - RoBERTa (BERT-based)

2. **Location Disambiguation**: resolves extracted toponyms to geographic coordinates using:
   - Gazetteer lookup (GeoNames)
   - Population-based heuristics
   - Distance-based heuristics
   - A combined scoring model
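As a rough illustration of how the two stages fit together, the sketch below extracts IOB-tagged toponyms and then resolves each one against a toy gazetteer using a combined population/distance score. All names, weights, and data here are illustrative assumptions and do not reflect the repository's actual API:

```python
from math import radians, sin, cos, asin, sqrt

def extract_toponyms(tokens, iob_tags):
    """Group B-LOC/I-LOC tagged tokens into toponym strings (IOB scheme)."""
    spans, current = [], []
    for tok, tag in zip(tokens, iob_tags):
        if tag == "B-LOC":
            if current:
                spans.append(" ".join(current))
            current = [tok]
        elif tag == "I-LOC" and current:
            current.append(tok)
        else:
            if current:
                spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return spans

def haversine_km(a, b):
    """Great-circle distance between two (lat, lon) points, in kilometres."""
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(h))

def disambiguate(name, gazetteer, context_point, w_pop=0.5, w_dist=0.5):
    """Pick the gazetteer candidate with the best combined score:
    a weighted sum of normalized population and inverse distance
    to a known context point (illustrative weights)."""
    candidates = gazetteer.get(name, [])
    if not candidates:
        return None
    max_pop = max(c["population"] for c in candidates) or 1
    def score(c):
        pop_score = c["population"] / max_pop
        dist_score = 1 / (1 + haversine_km((c["lat"], c["lon"]), context_point))
        return w_pop * pop_score + w_dist * dist_score
    return max(candidates, key=score)
```

For example, with a context point near London, the entry for Paris, France (high population, ~340 km away) outscores Paris, Texas under this heuristic.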
Dependencies:
- keras
- tensorflow
- spacy
- torch
- TorchCRF
- transformers
- scikit-learn
- Due to GitHub's limits on repository file sizes, please download the following sources and add them to the appropriate directories:
  - `data/saved_data/geo_data/GeoNames`: `allCountries.zip`, `alternateNamesV2.zip`, `hierarchy.zip`, `featureCodes_en.txt` (GeoNames data)
  - `data/saved_data/glove.6B/glove.6B.50d.txt` (GloVe embeddings)
- Please run `Prediction_Evaluation.ipynb` first to set up all preprocessing modules and initialize/create the model weights, then refer to `Pipeline_Evaluation.ipynb` to extract and map coordinates directly from a given text input.
- `Pre/preprocess.py`: preprocessing script that converts the dataset into a corpus with the relevant tokens, features, and IOB labels.
- `Gaz/Gazetteer.py`: preprocessing script for the GeoNames data; extracts the relevant location names and metadata.
- `Gaz/BKTree.py`: class used for quick string matching within the gazetteer, following a BK-tree implementation.
- `Dis/Disambiguation_Manager.py`: main script for running the disambiguation module.
- `ML/Baseline_Manager.py`: main script for training and predicting with the baseline classifier.
- `ML/BERT_Manager.py`: main script for training and predicting with the custom RoBERTa classifier; weights are saved to `data/saved_data/model_checkpoints`.
- `ML/BI_LSTM_Manager.py`: main script for training and predicting with the BiLSTM classifier; weights are saved to `data/saved_data/model_checkpoints`.
- `ML/CRF_Manager.py`: main script for training and predicting with the CRF classifier; weights are saved to `data/saved_data/model_checkpoints`.
- `ML/SVM_Manager.py`: main script for training and predicting with the SVM classifier; weights are saved to `data/saved_data/model_checkpoints`.
- `Results.xlsx`: evaluation metric results for each model on the disambiguation and extraction tasks.
- `data/dataset`: contains the LGL XML file dataset.
- `data/geo_data`: contains all GeoNames files and scripts to create pickle files from them.
- `data/glove.6B`: contains 50-dimensional GloVe embedding data.
- `data/saved_data`: contains all saved model weights along with the saved preprocessed GeoNames and dataset data.
- `data/utility`: contains txt files that list stopwords and prepositions.
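`Gaz/BKTree.py` is described above as a BK-tree for quick string matching within the gazetteer. A minimal sketch of that data structure, assuming Levenshtein edit distance as the metric (an illustration of the technique, not the repository's actual class):

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

class BKTree:
    """Metric tree keyed on edit distance: each child edge is labelled with
    the distance from its parent, so fuzzy lookups can prune whole subtrees
    via the triangle inequality."""
    def __init__(self, words):
        it = iter(words)
        self.root = (next(it), {})  # (word, children-by-distance)
        for w in it:
            self._add(w)

    def _add(self, word):
        node, children = self.root
        while True:
            d = levenshtein(word, node)
            if d == 0:
                return  # already present
            if d in children:
                node, children = children[d]
            else:
                children[d] = (word, {})
                return

    def search(self, word, max_dist):
        """Return all (distance, word) pairs within max_dist edits."""
        out, stack = [], [self.root]
        while stack:
            node, children = stack.pop()
            d = levenshtein(word, node)
            if d <= max_dist:
                out.append((d, node))
            # only children with edge labels in [d - max_dist, d + max_dist]
            # can possibly contain a match
            for k, child in children.items():
                if d - max_dist <= k <= d + max_dist:
                    stack.append(child)
        return sorted(out)
```

This is why a BK-tree suits gazetteer lookup: a misspelled toponym such as `"lonndon"` can still be matched to `"london"` without scanning every entry.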
