Geoparsing: Location Entity Extraction & Disambiguation

This repository implements a geoparsing pipeline for extracting and disambiguating geographic location entities from text using NLP and deep learning techniques. The system combines sequence labeling models for location extraction with gazetteer-based heuristics for coordinate resolution, enabling accurate mapping of textual location mentions to real-world geographic coordinates.

This project was developed as part of an undergraduate honours research project and evaluates classical ML and deep learning models across both extraction and disambiguation tasks.


Project Overview

The pipeline consists of two core stages:

  1. Location Extraction
    Identifies geographic entities in text using IOB tagging with:

    • Baseline lexical models
    • SVM
    • CRF
    • BiLSTM-GRU
    • RoBERTa (BERT-based)
  2. Location Disambiguation
    Resolves extracted toponyms to geographic coordinates using:

    • Gazetteer lookup (GeoNames)
    • Population-based heuristics
    • Distance-based heuristics
    • A combined scoring model
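The extraction stage above frames the task as IOB sequence labeling: each token gets a `B-LOC`, `I-LOC`, or `O` tag. As a minimal illustration of the label scheme (toy sentence and spans, not the repository's actual models or data):

```python
# Toy illustration of IOB labels for location extraction.
# B-LOC marks the first token of a location mention, I-LOC a
# continuation token, and O any non-location token.

def spans_to_iob(tokens, location_spans):
    """Convert (start, end) token spans into IOB tags (end exclusive)."""
    tags = ["O"] * len(tokens)
    for start, end in location_spans:
        tags[start] = "B-LOC"
        for i in range(start + 1, end):
            tags[i] = "I-LOC"
    return tags

tokens = ["Flooding", "hit", "New", "York", "and", "Lagos", "."]
# Hypothetical gold spans: "New York" (2-4) and "Lagos" (5-6)
print(spans_to_iob(tokens, [(2, 4), (5, 6)]))
# -> ['O', 'O', 'B-LOC', 'I-LOC', 'O', 'B-LOC', 'O']
```

The listed models (SVM, CRF, BiLSTM-GRU, RoBERTa) are all trained to predict exactly this kind of tag sequence, so they can be swapped behind a common interface.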

Sample


Python Dependencies

keras
tensorflow
spacy
torch
TorchCRF
transformers
scikit-learn
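The dependencies above can be installed in one step; the list does not pin versions, so recent releases are assumed to work:

```shell
# Install the Python dependencies listed above (versions unpinned).
pip install keras tensorflow spacy torch TorchCRF transformers scikit-learn
```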

To Replicate

  • Due to GitHub's file-size limits on repositories, please download the following sources and add them to the appropriate directories:
    • data/saved_data/geo_data/GeoNames
      • allCountries.zip, alternateNamesV2.zip, hierarchy.zip, featureCodes_en.txt: GeoNames Data
    • data/saved_data/glove.6B/glove.6B.50d.txt
  • Run Prediction_Evaluation.ipynb to set up all preprocessing modules and initialize the model weights, then use Pipeline_Evaluation.ipynb to extract and map coordinates directly from text input.
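Once the pipeline extracts a toponym, the disambiguation heuristics above (population, distance, and a combined score) pick among gazetteer candidates. The following is a rough sketch of that idea only; the candidate format, weights, and helper names are illustrative assumptions, not the repository's actual implementation:

```python
import math

# Hypothetical GeoNames candidates for the toponym "Paris":
# (name, latitude, longitude, population)
CANDIDATES = [
    ("Paris, France", 48.8566, 2.3522, 2_140_000),
    ("Paris, Texas", 33.6609, -95.5555, 24_000),
]

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * 6371 * math.asin(math.sqrt(a))

def combined_score(cand, anchor, pop_weight=0.5):
    """Blend a population prior with proximity to an already-resolved anchor."""
    _, lat, lon, pop = cand
    pop_score = math.log10(pop + 1) / 8            # crude normalisation
    dist = haversine_km(lat, lon, anchor[0], anchor[1])
    dist_score = 1 / (1 + dist / 1000)             # closer -> higher
    return pop_weight * pop_score + (1 - pop_weight) * dist_score

anchor = (50.8503, 4.3517)  # e.g. "Brussels", resolved earlier in the same text
best = max(CANDIDATES, key=lambda c: combined_score(c, anchor))
print(best[0])  # -> Paris, France (wins on both population and proximity)
```

The combined score illustrates why a single heuristic is insufficient: population alone always picks the largest city, while distance alone can be misled when the text mentions far-apart places.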

File Descriptions

  • Pre/preprocess.py: Preprocessing script that converts the dataset into a corpus with relevant tokens, features, and IOB labels.
  • Gaz/Gazetteer.py: Preprocessing script for the GeoNames data that extracts the relevant location names and metadata.
  • Gaz/BKTree.py: BK-tree class used for fast approximate string matching within the gazetteer.
  • Dis/Disambiguation_Manager.py: Main script for running the disambiguation module.
  • ML/Baseline_Manager.py: Main script for training and predicting with the baseline classifier.
  • ML/BERT_Manager.py: Main script for training and predicting with the custom RoBERTa classifier; weights are saved to data/saved_data/model_checkpoints.
  • ML/BI_LSTM_Manager.py: Main script for training and predicting with the BiLSTM classifier; weights are saved to data/saved_data/model_checkpoints.
  • ML/CRF_Manager.py: Main script for training and predicting with the CRF classifier; weights are saved to data/saved_data/model_checkpoints.
  • ML/SVM_Manager.py: Main script for training and predicting with the SVM classifier; weights are saved to data/saved_data/model_checkpoints.
  • Results.xlsx: contains evaluation metrics for each model on the extraction and disambiguation tasks.
  • data/dataset: contains the LGL XML file dataset.
  • data/geo_data: contains all GeoNames files and scripts to create pickle files from them.
  • data/glove.6B: contains the 50-dimensional GloVe embedding data.
  • data/saved_data: contains all saved weights for models along with the saved preprocessed GeoNames and dataset data.
  • data/utility: contains txt files that list stopwords and prepositions.
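The BK-tree mentioned for Gaz/BKTree.py indexes strings by edit distance, so near-matches for a (possibly misspelled) toponym can be found without scanning the whole gazetteer: the triangle inequality prunes subtrees whose distance to the query cannot fall within the tolerance. A minimal sketch of the idea, independent of the repository's implementation:

```python
def edit_distance(a, b):
    """Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

class BKTree:
    """Burkhard-Keller tree keyed by edit distance."""
    def __init__(self, words):
        it = iter(words)
        self.root = (next(it), {})  # node = (word, {distance: child})
        for w in it:
            self._add(w)

    def _add(self, word):
        node = self.root
        while True:
            d = edit_distance(word, node[0])
            if d == 0:
                return              # already present
            if d in node[1]:
                node = node[1][d]   # descend along the matching edge
            else:
                node[1][d] = (word, {})
                return

    def search(self, word, max_dist):
        """Return all stored words within max_dist edits of `word`."""
        results, stack = [], [self.root]
        while stack:
            node = stack.pop()
            d = edit_distance(word, node[0])
            if d <= max_dist:
                results.append(node[0])
            # Triangle inequality: only children with edge distance in
            # [d - max_dist, d + max_dist] can contain matches.
            for child_d, child in node[1].items():
                if d - max_dist <= child_d <= d + max_dist:
                    stack.append(child)
        return results

tree = BKTree(["Toronto", "Torino", "London", "Londrina"])
print(tree.search("Torronto", 2))  # fuzzy lookup of a misspelled toponym
```

For gazetteer-scale data, each tree node would typically also carry the GeoNames record (coordinates, population) so that search results feed directly into disambiguation.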
