This project implements a pipeline for extracting domain-specific terminology from a corpus of scientific abstracts using statistical measures such as Domain Relevance (DR) and Domain Consensus (DC). The pipeline identifies terms that are both frequent within the domain corpus and consistently distributed across its documents, relative to a reference corpus.
Terminology-Extraction
├── data/
│   └── gold_terminology_abstracts.txt # required for evaluation
├── output/ # results directory
│   ├── exp_1.tsv
│   └── ...
├── src/ # source code
│   ├── main.py # CLI entry point
│   ├── processing.py # preprocessing & vectorization
│   ├── computations.py # metric calculations & extraction logic
│   └── utils.py
├── tests/ # unit tests
│   ├── __init__.py
│   ├── conftest.py # fixtures
│   ├── test_computations.py # tests for extraction logic
│   └── test_processing.py # tests for preprocessing pipeline
├── README.md # documentation
├── requirements.txt # pip requirements
└── requirements_conda.txt # conda requirements

- Parsing abstracts of scientific publications from a BibTeX file
- Corpus preprocessing & vectorization
- Candidate bigram selection based on token POS tags
- Computation of DR & DC metrics (see the formulas below)
- Extraction of domain terminology from the set of candidate bigrams based on the hyperparameters alpha and theta
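As a rough reference, the metrics follow Velardi et al. (2001); the exact formulation lives in src/computations.py and may differ in detail. For a candidate term t, a domain corpus D_k, and contrastive corpora D_1, ..., D_n:

$$DR_{t,k} = \frac{P(t \mid D_k)}{\max_{1 \le j \le n} P(t \mid D_j)} \qquad\qquad DC_{t,k} = - \sum_{d \in D_k} P_t(d)\,\log P_t(d)$$

where P_t(d) is the normalized probability of t occurring in document d. A candidate is extracted if its weighted score reaches the threshold theta, assuming the usual convex combination alpha * DR + (1 - alpha) * DC (see the CLI options below).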
- Python >=3.9
- Libraries:
numpy, pandas, scipy, scikit-learn, spacy, click
- Clone the repository:
git clone https://github.com/psandhaas/terminology-extraction.git
cd terminology-extraction
- Install dependencies:
- using pip:
pip install -r requirements.txt
- or using conda:
conda create -n myenv --file requirements_conda.txt
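Then activate the environment before continuing:
conda activate myenv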
- Download the default spaCy model (or provide a different one):
python -m spacy download en_core_web_md

Run the end-to-end terminology extraction pipeline using the CLI:
python ./src/main.py extract -d <path/to/bibtex.bib> -a 0.5 -t 2.0

- --domain_bibtex: Path to a BibTeX file containing abstracts.
- --alpha: Hyperparameter weighing the relative contributions of a candidate's DR and DC values to its terminology score.
- --theta: Threshold for candidates' terminology scores; any candidate whose weighted terminology score is equal to or greater than theta is extracted (see the sketch below).
- --save: Whether to save the extracted terms to the ./output/ directory.
- --help: Show the CLI help message and exit.
- --gold_filepath: Path to a TXT file containing one gold-standard term per line.
The extracted terms are saved as TSV files in the ./output/ directory.
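To illustrate how alpha and theta interact, here is a minimal sketch of the scoring and thresholding step, assuming the weighted score is the convex combination alpha * DR + (1 - alpha) * DC; the DR/DC values below are made up, and the actual logic in src/computations.py may differ in detail:

    # Illustrative scoring & thresholding of candidate bigrams (hypothetical DR/DC values).
    alpha, theta = 0.5, 0.6  # example hyperparameters (CLI options -a / -t)

    candidates = {
        ("domain", "ontology"): {"dr": 0.92, "dc": 0.85},
        ("text", "processing"): {"dr": 0.40, "dc": 0.30},
    }

    # Keep every candidate whose weighted score reaches the threshold theta.
    extracted = {
        bigram: round(alpha * m["dr"] + (1 - alpha) * m["dc"], 3)
        for bigram, m in candidates.items()
        if alpha * m["dr"] + (1 - alpha) * m["dc"] >= theta
    }
    print(extracted)  # {('domain', 'ontology'): 0.885}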
Run tests from the terminal:
python -m pytest

- Paola Velardi, Paolo Fabriani, and Michele Missikoff. 2001. Using text processing techniques to automatically enrich a domain ontology. In Proceedings of the international conference on Formal Ontology in Information Systems - Volume 2001 (FOIS '01). Association for Computing Machinery, New York, NY, USA, 270–284. https://doi.org/10.1145/505168.505194
- Wendt, M., Buscher, C., Herta, C., Gerlach, M., Messner, M., Kemmerer, S., and Tietze, W. 2009. Extracting domain terminologies from the World Wide Web. In Web as Corpus Workshop (WAC5), 79.