WeSearch_SentenceSegmentation
Some useful features of a sentence segmentation tool (not necessarily important for Lars Jørgen's thesis):
- Domain/genre independent
- Identification of non-linguistic segments
- Mark-up aware
- Mark-up normalisation
- Stand-off annotation
Kiss, T. and Strunk, J. 2006. Unsupervised Multilingual Sentence Boundary Detection. Computational Linguistics 32(4).
Implemented in NLTK as the Punkt sentence tokenizer.
Briscoe, T., Carroll, J. and Watson, R. 2006. The second release of the RASP system. Proceedings of the COLING/ACL 2006 Interactive Presentation Sessions.
Uses deterministic finite-state rules based on the immediate context (capitals, other punctuation, etc.) to distinguish periods that end sentences from periods that end abbreviations (including titles and initials). The program assumes there is a sentence boundary wherever there is a blank line, or whitespace preceded by valid sentence-final punctuation and followed by a capital letter. Jonathon has the source code... anyone know Flex!?
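The boundary heuristic described above can be approximated in a few lines of Python. This is only a rough sketch of the idea, not the RASP rules themselves (those are written in Flex and are not reproduced here). The abbreviation list, the regular expression and the function names are all assumptions made for illustration.

```python
import re

# Hypothetical abbreviation list (assumption); not taken from RASP.
ABBREVIATIONS = {"dr", "mr", "mrs", "ms", "prof", "etc", "e.g", "i.e"}

# Candidate boundary: sentence-final punctuation, optional closing quotes or
# brackets, whitespace, and then a capital letter (checked via lookahead).
CANDIDATE = re.compile(r"[.!?]['\")\]]*(\s+)(?=[A-Z])")


def is_abbreviation_or_initial(text, punct_index):
    """Return True if the period at punct_index ends an abbreviation or initial."""
    if text[punct_index] != ".":
        return False
    start = punct_index
    while start > 0 and not text[start - 1].isspace():
        start -= 1
    word = text[start:punct_index].lstrip("([\"'")
    if len(word) == 1 and word.isupper():  # an initial such as "J."
        return True
    return word.lower() in ABBREVIATIONS


def split_sentences(text):
    """Split on blank lines and on punctuation + whitespace + capital letter."""
    sentences = []
    for paragraph in re.split(r"\n\s*\n", text):  # a blank line is always a boundary
        start = 0
        for match in CANDIDATE.finditer(paragraph):
            if is_abbreviation_or_initial(paragraph, match.start()):
                continue
            sentences.append(paragraph[start:match.start(1)].strip())
            start = match.end()
        tail = paragraph[start:].strip()
        if tail:
            sentences.append(tail)
    return sentences


if __name__ == "__main__":
    sample = "Dr. Smith met J. R. Jones. They talked at length.\n\nA new paragraph"
    for sentence in split_sentences(sample):
        print(repr(sentence))
```

Like any purely local rule, this sketch misses a boundary when a sentence genuinely ends in an abbreviation or a single capital letter; a longer abbreviation list only shifts that trade-off.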
Only the usage is documented, but the tool seems to rely on sets of (1) acceptable sentence boundary tokens; (2) tokens commonly following sentence boundaries; and (3) sentence boundary tokens to ignore. A major advantage is that it returns sentences with character offsets pointing back to the source text.
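To make the offset idea concrete, here is a minimal sketch of stand-off segmentation driven by three token sets. Since only the usage of the tool is documented, everything here is guesswork: the contents of the sets, the way they interact, and the function name `segment_with_offsets` are all assumptions, not the tool's actual behaviour.

```python
import re

# All three sets are hypothetical examples (assumptions).
BOUNDARY_TOKENS = {".", "!", "?"}                 # (1) acceptable boundary tokens
COMMON_STARTERS = {"The", "A", "It", "He", "She"}  # (2) tokens often following a boundary
IGNORED_TOKENS = {"Dr.", "Mr.", "e.g.", "etc."}    # (3) boundary tokens to ignore


def segment_with_offsets(text):
    """Return (start, end) character offsets of sentences; text is left untouched."""
    tokens = [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]
    spans = []
    sent_start = None
    for i, (tok, start, end) in enumerate(tokens):
        if sent_start is None:
            sent_start = start
        if tok[-1] not in BOUNDARY_TOKENS:
            continue
        next_tok = tokens[i + 1][0] if i + 1 < len(tokens) else None
        # An ignored token only counts as a boundary when a common
        # sentence starter follows it (assumed interaction of the sets).
        if tok in IGNORED_TOKENS and next_tok not in COMMON_STARTERS:
            continue
        spans.append((sent_start, end))
        sent_start = None
    if sent_start is not None:  # trailing material without final punctuation
        spans.append((sent_start, len(text.rstrip())))
    return spans


if __name__ == "__main__":
    source = "Dr. Smith arrived.  He sat down. Bring pens etc. The end"
    for start, end in segment_with_offsets(source):
        print(start, end, repr(source[start:end]))
```

Because the result is a list of offsets rather than modified strings, the source text (including any mark-up and irregular whitespace) stays intact, which is what makes stand-off annotation possible.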