WeSearch_SentenceSegmentation

Desiderata

Some useful features of a sentence segmentation tool (not necessarily important for Lars Jørgen's thesis):

Domain/genre independent
Identification of non-linguistic segments
Mark-up aware
Mark-up normalisation
Stand-off annotation

Related Work

Punkt

Kiss, T. and Strunk, J. 2006. Unsupervised Multilingual Sentence Boundary Detection. Computational Linguistics 32(4).

RASP Sentence Boundary Detection

Briscoe, T., Carroll, J. and Watson, R. 2006. The second release of the RASP system. Proceedings of the COLING/ACL 2006 Interactive Presentation Sessions.

Uses deterministic finite-state rules based on the immediate context (capitalisation, adjacent punctuation, etc.) to distinguish periods that end sentences from periods that end abbreviations (including titles and initials). The program assumes a sentence boundary wherever there is a blank line, or whitespace preceded by valid sentence-final punctuation and followed by a capital letter. Jonathon has the source code... anyone know Flex!?
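The boundary assumption described above can be sketched in a few lines of Python. This is only an illustration of that one rule, not the RASP Flex source: the pattern, the choice of `.`, `!` and `?` as sentence-final punctuation, and the name `split_spans` are all assumptions. It deliberately omits abbreviation handling, so a string like "Dr. Smith" would be split incorrectly.

```python
import re

# Assumed rule: a boundary is either whitespace preceded by sentence-final
# punctuation and followed by a capital letter, or a blank line.
BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z])|\n\s*\n")


def split_spans(text):
    """Return (start, end) character offsets of each segment in text."""
    spans = []
    start = 0
    for match in BOUNDARY.finditer(text):
        if match.start() > start:
            spans.append((start, match.start()))
        start = match.end()
    if start < len(text):
        spans.append((start, len(text)))
    return spans


if __name__ == "__main__":
    sample = "It rained. We stayed in.\n\nno capital here"
    for s, e in split_spans(sample):
        print(s, e, repr(sample[s:e]))
```

Returning offsets rather than strings keeps the source text untouched, which fits the stand-off annotation desideratum listed earlier.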

Stanford CoreNLP ssplit

Only the usage is documented, but it seems to rely on sets of (1) acceptable sentence boundary tokens; (2) tokens commonly following sentence boundaries; and (3) sentence boundary tokens to ignore. A major advantage is that it returns sentences with character offsets pointing back to the source text.
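A rough Python sketch of how three such token sets might interact is given below. It is a guess at the general idea, not CoreNLP's implementation: the set contents, the toy tokenizer, and the reading of set (3) as tokens that force a boundary but are dropped from the output are all assumptions.

```python
import re

BOUNDARY = {".", "!", "?"}      # (1) tokens that may end a sentence (assumed)
FOLLOWERS = {")", "]", '"', "'"}  # (2) tokens that may trail a boundary (assumed)
DISCARD = {"<p>", "</p>"}       # (3) boundary tokens left out of the output (assumed)


def tokenize(text):
    """Yield (token, start, end) with character offsets into text."""
    for m in re.finditer(r"</?p>|\w+|[^\w\s]", text):
        yield m.group(), m.start(), m.end()


def ssplit(text):
    """Return (start, end) character offsets of each sentence."""
    sentences, current = [], []
    closed = False  # True once a boundary token has been seen

    def flush():
        if current:
            sentences.append((current[0][1], current[-1][2]))
            current.clear()

    for tok in tokenize(text):
        word = tok[0]
        if word in DISCARD:
            flush()
            closed = False
            continue
        # The sentence ends at the first token that is neither a follower
        # nor another boundary token.
        if closed and word not in FOLLOWERS and word not in BOUNDARY:
            flush()
            closed = False
        current.append(tok)
        if word in BOUNDARY:
            closed = True
    flush()
    return sentences


if __name__ == "__main__":
    sample = '<p>He said "Go." (Really?) Then left.</p>'
    for s, e in ssplit(sample):
        print(s, e, sample[s:e])
```

Because each sentence is reported as a pair of offsets, the original text, including its mark-up, can be recovered exactly from the output.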
