This repository implements a geoparsing pipeline for extracting and disambiguating geographic location entities from text using NLP and deep learning techniques. The system combines sequence labeling models for location extraction with gazetteer-based heuristics for coordinate resolution, enabling accurate mapping of textual location mentions to real-world geographic coordinates.
This project was developed as part of an undergraduate honours research project and evaluates classical ML and deep learning models across both extraction and disambiguation tasks.
The pipeline consists of two core stages:

1. **Location Extraction**: identifies geographic entities in text using IOB tagging with:
   - Baseline lexical models
   - SVM
   - CRF
   - BiLSTM-GRU
   - RoBERTa (BERT-based)

2. **Location Disambiguation**: resolves extracted toponyms to geographic coordinates using:
   - Gazetteer lookup (GeoNames)
   - Population-based heuristics
   - Distance-based heuristics
   - A combined scoring model
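As a rough illustration of how the two stages fit together, the sketch below extracts IOB-tagged toponyms and then resolves each one against a toy gazetteer using a combined population/distance score. All names, weights, and data here are illustrative assumptions and do not reflect the repository's actual API:

```python
from math import radians, sin, cos, asin, sqrt

def extract_toponyms(tokens, iob_tags):
    """Group B-LOC/I-LOC tagged tokens into toponym strings (IOB scheme)."""
    spans, current = [], []
    for tok, tag in zip(tokens, iob_tags):
        if tag == "B-LOC":
            if current:
                spans.append(" ".join(current))
            current = [tok]
        elif tag == "I-LOC" and current:
            current.append(tok)
        else:
            if current:
                spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return spans

def haversine_km(a, b):
    """Great-circle distance between two (lat, lon) points, in kilometres."""
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(h))

def disambiguate(name, gazetteer, context_point, w_pop=0.5, w_dist=0.5):
    """Pick the gazetteer candidate with the best combined score:
    a weighted sum of normalized population and inverse distance
    to a known context point (illustrative weights)."""
    candidates = gazetteer.get(name, [])
    if not candidates:
        return None
    max_pop = max(c["population"] for c in candidates) or 1
    def score(c):
        pop_score = c["population"] / max_pop
        dist_score = 1 / (1 + haversine_km((c["lat"], c["lon"]), context_point))
        return w_pop * pop_score + w_dist * dist_score
    return max(candidates, key=score)
```

For example, with a context point near London, the entry for Paris, France (high population, ~340 km away) outscores Paris, Texas under this heuristic.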
Dependencies:
- keras
- tensorflow
- spacy
- torch
- TorchCRF
- transformers
- scikit-learn
- Due to GitHub's limits on repository file sizes, please download the following sources and add them to the appropriate directories:
  - `data/saved_data/geo_data/GeoNames`: `allCountries.zip`, `alternateNamesV2.zip`, `hierarchy.zip`, `featureCodes_en.txt` (GeoNames data)
  - `data/saved_data/glove.6B/glove.6B.50d.txt` (GloVe embeddings)
- Please run `Prediction_Evaluation.ipynb` first to set up all preprocessing modules and initialize/create the model weights, then refer to `Pipeline_Evaluation.ipynb` to extract and map coordinates directly from a given text input.
- `Pre/preprocess.py`: preprocessing script that converts the dataset into a corpus with the relevant tokens, features, and IOB labels.
- `Gaz/Gazetteer.py`: preprocessing script for the GeoNames data; extracts the relevant location names and metadata.
- `Gaz/BKTree.py`: class used for quick string matching within the gazetteer, following a BK-tree implementation.
- `Dis/Disambiguation_Manager.py`: main script for running the disambiguation module.
- `ML/Baseline_Manager.py`: main script for training and predicting with the baseline classifier.
- `ML/BERT_Manager.py`: main script for training and predicting with the custom RoBERTa classifier; weights are saved to `data/saved_data/model_checkpoints`.
- `ML/BI_LSTM_Manager.py`: main script for training and predicting with the BiLSTM classifier; weights are saved to `data/saved_data/model_checkpoints`.
- `ML/CRF_Manager.py`: main script for training and predicting with the CRF classifier; weights are saved to `data/saved_data/model_checkpoints`.
- `ML/SVM_Manager.py`: main script for training and predicting with the SVM classifier; weights are saved to `data/saved_data/model_checkpoints`.
- `Results.xlsx`: evaluation metric results for each model on the disambiguation and extraction tasks.
- `data/dataset`: contains the LGL XML file dataset.
- `data/geo_data`: contains all GeoNames files and scripts to create pickle files from them.
- `data/glove.6B`: contains 50-dimensional GloVe embedding data.
- `data/saved_data`: contains all saved model weights along with the saved preprocessed GeoNames and dataset data.
- `data/utility`: contains txt files that list stopwords and prepositions.
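`Gaz/BKTree.py` is described above as a BK-tree for quick string matching within the gazetteer. A minimal sketch of that data structure, assuming Levenshtein edit distance as the metric (an illustration of the technique, not the repository's actual class):

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

class BKTree:
    """Metric tree keyed on edit distance: each child edge is labelled with
    the distance from its parent, so fuzzy lookups can prune whole subtrees
    via the triangle inequality."""
    def __init__(self, words):
        it = iter(words)
        self.root = (next(it), {})  # (word, children-by-distance)
        for w in it:
            self._add(w)

    def _add(self, word):
        node, children = self.root
        while True:
            d = levenshtein(word, node)
            if d == 0:
                return  # already present
            if d in children:
                node, children = children[d]
            else:
                children[d] = (word, {})
                return

    def search(self, word, max_dist):
        """Return all (distance, word) pairs within max_dist edits."""
        out, stack = [], [self.root]
        while stack:
            node, children = stack.pop()
            d = levenshtein(word, node)
            if d <= max_dist:
                out.append((d, node))
            # only children with edge labels in [d - max_dist, d + max_dist]
            # can possibly contain a match
            for k, child in children.items():
                if d - max_dist <= k <= d + max_dist:
                    stack.append(child)
        return sorted(out)
```

This is why a BK-tree suits gazetteer lookup: a misspelled toponym such as `"lonndon"` can still be matched to `"london"` without scanning every entry.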
