WeSearch_SentenceSegmentation

JonathonRead edited this page Jun 7, 2012 · 23 revisions

Desiderata

Some useful features of a sentence segmentation tool (not necessarily important for Lars Jørgen's thesis):

  • Domain/genre independent
  • Identification of non-linguistic segments
  • Mark-up aware
  • Mark-up normalisation
  • Stand-off annotation

Related Work

Punkt

Kiss, T. and Strunk, J. 2006. Unsupervised Multilingual Sentence Boundary Detection. Computational Linguistics 32(4).

Implemented in the NLTK.

RASP Sentence Boundary Detection

Briscoe, T., Carroll, J. and Watson, R. 2006. The second release of the RASP system. Proceedings of the COLING/ACL 2006 Interactive Presentation Sessions.

Uses deterministic finite-state rules based on the immediate context (capitals, other punctuation, etc.) to distinguish periods that end sentences from periods that end abbreviations (including titles and initials). The program assumes a sentence boundary wherever there is a blank line, or whitespace preceded by valid sentence-final punctuation and followed by a capital letter. Jonathon has the source code... does anyone know Flex?
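To make the heuristic concrete, here is a minimal Python sketch of the two boundary conditions described above (blank line; sentence-final punctuation, then whitespace, then a capital letter). It is not the RASP Flex rule set: the abbreviation list, the treatment of single capital letters as initials, and the names `segment` and `is_abbreviation` are all assumptions made for illustration. It returns character offsets rather than strings, in the spirit of stand-off annotation.

```python
import re

# Hypothetical abbreviation list; the real RASP rules are not reproduced here.
ABBREVIATIONS = {"dr", "mr", "mrs", "ms", "prof", "etc"}

# Sentence-final punctuation followed by whitespace and a capital letter.
CANDIDATE = re.compile(r"[.!?]+(?=\s+[A-Z])")
# A blank line (two newlines, optionally separated by whitespace).
BLANK_LINE = re.compile(r"\n\s*\n")


def is_abbreviation(text, punct_start):
    """True if the token just before the period is a listed abbreviation or an initial."""
    token = re.search(r"\S*$", text[:punct_start]).group(0).lstrip("(\"'")
    return token.lower() in ABBREVIATIONS or (len(token) == 1 and token.isupper())


def segment(text):
    """Return a list of (start, end) character offsets, one per sentence."""
    cuts = set()
    for m in BLANK_LINE.finditer(text):
        cuts.add(m.start())
    for m in CANDIDATE.finditer(text):
        # Only a lone period can mark an abbreviation or initial.
        if m.group(0) == "." and is_abbreviation(text, m.start()):
            continue
        cuts.add(m.end())

    spans, start = [], 0
    for cut in sorted(cuts) + [len(text)]:
        s, e = start, cut
        # Trim surrounding whitespace so offsets cover only the sentence itself.
        while s < e and text[s].isspace():
            s += 1
        while e > s and text[e - 1].isspace():
            e -= 1
        if s < e:
            spans.append((s, e))
        start = cut
    return spans


text = "Dr. Smith arrived on Monday. He met J. Read there.\n\nno capital here"
for s, e in segment(text):
    print((s, e), repr(text[s:e]))
```

Because the output is a list of offsets, the source text is never modified, and mark-up or non-linguistic segments could be handled by a separate layer over the same character positions.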

Stanford CoreNLP ssplit

Only the usage is documented, but it seems to rely on three sets of tokens: (1) acceptable sentence boundary tokens; (2) tokens that commonly follow sentence boundaries; and (3) sentence boundary tokens to ignore. A major advantage is that it returns sentences with character offsets pointing back to the source text.
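As a rough illustration of how three such token sets could drive a splitter, here is a small Python sketch. It is a guess at the general idea, not CoreNLP's implementation or API: the contents of the sets, the toy tokenizer, and the interpretation that "ignored" boundary tokens end a sentence without being included in it are all assumptions, and the names `tokenize` and `split_sentences` are hypothetical.

```python
import re

BOUNDARY = {".", "!", "?"}        # (1) tokens that may end a sentence (assumed)
FOLLOWERS = {")", "]", '"', "'"}  # (2) tokens that often trail a boundary (assumed)
DISCARD = {"\n"}                  # (3) boundary tokens dropped from the output (assumed)

# Toy tokenizer: newlines, runs of word characters, or single punctuation marks.
TOKEN = re.compile(r"\n|\w+|[^\w\s]")


def tokenize(text):
    """Return (token, start, end) triples with offsets into the source text."""
    return [(m.group(0), m.start(), m.end()) for m in TOKEN.finditer(text)]


def split_sentences(text):
    """Return (start, end) character offsets, one pair per sentence."""
    sentences, current = [], []
    after_boundary = False

    def flush():
        if current:
            sentences.append((current[0][1], current[-1][2]))
            current.clear()

    for tok in tokenize(text):
        word = tok[0]
        if word in DISCARD:
            # Ends the sentence but is not part of it.
            flush()
            after_boundary = False
            continue
        if after_boundary and word not in FOLLOWERS and word not in BOUNDARY:
            # First ordinary token after a boundary starts a new sentence.
            flush()
            after_boundary = False
        current.append(tok)
        if word in BOUNDARY:
            after_boundary = True
    flush()
    return sentences


text = "He asked (why?) Then left.\nNo period here"
for s, e in split_sentences(text):
    print((s, e), repr(text[s:e]))
```

Note that this sketch has no abbreviation handling, so it would split after "Dr." in "Dr. Smith"; it is only meant to show how boundary, follower, and discarded tokens interact while offsets into the source text are preserved.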