Discourse Segmenter

A collection of various discourse segmenters (with pre-trained models for German texts).

Description

This Python module currently comprises three discourse segmenters: edseg, bparseg, and mateseg.

edseg: a rule-based system that uses shallow discourse-oriented parsing to determine the boundaries of elementary discourse units in text. The rules are hard-coded in the submodule's file and are only applicable to German input.
bparseg: an ML-based segmentation module that operates on syntactic constituency trees (output from BitPar) and uses a pre-trained linear SVM model to decide whether a syntactic constituent initiates a discourse segment. This model was trained on the German PCC corpus, but you can also train your own classifier for any language using your own training data (cf. discourse_segmenter --help for further instructions on how to do that).
mateseg: an ML-based segmentation module that operates on syntactic dependency trees (output from Mate) and uses a pre-trained linear SVM model to decide whether a sub-structure of the dependency graph initiates a discourse segment. Again, this model was trained on the German PCC corpus.

Since the current model is a serialized file and is therefore likely to be incompatible with future releases of `numpy`, we will probably remove the model files from future versions of this package, ship the source data instead, and perform training during installation.
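To make the decision step of the ML-based segmenters more concrete, here is a minimal sketch of how a linear model turns a feature vector into a boundary decision. The feature names, the weights, and the function `starts_segment` are invented for illustration only; they are not taken from the trained models or from the dsegmenter API.

```python
# Hypothetical weights of a linear model. Real weights are learned from
# training data; these values are made up for illustration.
WEIGHTS = {"is_clause": 1.5, "follows_comma": 0.8, "is_noun_phrase": -1.2}
BIAS = -1.0


def starts_segment(features):
    """Return True if the linear score w.x + b is positive.

    `features` maps hypothetical feature names to numeric values.
    Unknown features contribute nothing to the score.
    """
    score = BIAS + sum(
        WEIGHTS.get(name, 0.0) * value for name, value in features.items()
    )
    return score > 0


# A clause that follows a comma: 1.5 + 0.8 - 1.0 = 1.3, so it starts a segment.
clause_decision = starts_segment({"is_clause": 1, "follows_comma": 1})
# A bare noun phrase: -1.2 - 1.0 = -2.2, so it does not.
noun_phrase_decision = starts_segment({"is_noun_phrase": 1})
```

A linear SVM makes the same kind of sign-based decision at prediction time; only the features and weights differ.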

Installation

To install this package from the PyPI index, run:

pip install dsegmenter

Alternatively, you can install it directly from the source repository by executing:

git clone git@github.com:discourse-lab/DiscourseSegmenter.git
pip install -r DiscourseSegmenter/requirements.txt DiscourseSegmenter/ --user

Usage

After installation, you can import the module in your Python scripts (see an example here), e.g.:

from dsegmenter.bparseg import BparSegmenter

segmenter = BparSegmenter()

or use the bundled front-end script discourse_segmenter to process your parsed input data, e.g.:

discourse_segmenter bparseg segment DiscourseSegmenter/examples/bpar/maz-8727.exb.bpar

Note that this script requires two mandatory arguments: the type of the segmenter to use (bparseg in the above case) and the operation to perform (the available operations are specific to each segmenter).
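The two-argument layout described above can be sketched with `argparse` as follows. This is only an illustration of the command-line shape, not the actual implementation of discourse_segmenter: the program name `discourse_segmenter_sketch`, the function `build_parser`, and the free-form `operation` and `files` arguments are assumptions, since the real set of operations depends on the segmenter (see discourse_segmenter --help).

```python
import argparse


def build_parser():
    """Build a parser that mirrors the '<segmenter> <operation> [files]' shape."""
    parser = argparse.ArgumentParser(prog="discourse_segmenter_sketch")
    # First mandatory argument: the type of the segmenter.
    segmenters = parser.add_subparsers(dest="segmenter", required=True)
    for name in ("edseg", "bparseg", "mateseg"):
        sub = segmenters.add_parser(name)
        # Second mandatory argument: the operation to perform. It is left
        # unconstrained here because the valid operations differ per segmenter.
        sub.add_argument("operation")
        # Remaining arguments are treated as input files (an assumption).
        sub.add_argument("files", nargs="*")
    return parser


args = build_parser().parse_args(["bparseg", "segment", "maz-8727.exb.bpar"])
print(args.segmenter, args.operation, args.files)
```

Using one subparser per segmenter keeps segmenter-specific options separate, which fits the note that operations vary between segmenters.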

Evaluation

Intrinsic evaluation scores of the machine learning models on the predicted vectors will be printed when training and evaluating a segmentation model.

Extrinsic evaluation scores on the predicted segmentation trees can be calculated with the evaluation script:

evaluation {FOLDER:TRUE} {FOLDER:PRED}

Note that the script internally calls the DKPro Agreement library, which requires Java 8.
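Because of the Java requirement, it can help to check that a Java runtime is reachable before running the evaluation. The following is a small sketch of such a check; the helper `java_version_banner` is hypothetical and not part of this package, and it only reports what the `java` executable on your PATH says about itself.

```python
import shutil
import subprocess


def java_version_banner():
    """Return the version text reported by `java -version`, or None if no
    `java` executable is found on the PATH."""
    java = shutil.which("java")
    if java is None:
        return None
    result = subprocess.run(
        [java, "-version"], capture_output=True, text=True, timeout=30
    )
    # `java -version` conventionally writes its banner to stderr.
    banner = (result.stderr or result.stdout).strip()
    return banner or None


banner = java_version_banner()
if banner is None:
    print("No Java runtime found on PATH.")
else:
    print(banner)
```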
