script that dehyphenates words at the end of the line

mridpin/cr_dehyphenator

How to use

Put the documents you want to dehyphenate in the input_docs directory, then run the script. The output is written to output_docs.

python3 cr_dehyphenator_es-ES.py
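
The input_docs to output_docs flow described above could look roughly like the sketch below. It is not taken from the script itself: the function name `process_directory`, the `transform` parameter, the UTF-8 encoding, and the flat (non-recursive) directory layout are all assumptions made for illustration.

```python
from pathlib import Path
from typing import Callable, List


def process_directory(
    transform: Callable[[str], str],
    input_dir: str = "input_docs",
    output_dir: str = "output_docs",
) -> List[str]:
    """Apply `transform` to every file in input_dir and write the result
    under the same file name in output_dir (hypothetical helper)."""
    src = Path(input_dir)
    dst = Path(output_dir)
    dst.mkdir(parents=True, exist_ok=True)  # create output_docs if missing

    written = []
    for path in sorted(src.iterdir()):
        if not path.is_file():
            continue  # assumption: subdirectories are ignored
        text = path.read_text(encoding="utf-8")  # assumption: UTF-8 documents
        (dst / path.name).write_text(transform(text), encoding="utf-8")
        written.append(path.name)
    return written
```

Any function that maps a document's text to its dehyphenated text can be passed as `transform`.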

DEHYPHENATOR

This script removes the hyphens that split words at the end of a line. Words are usually split this way to keep the text aligned to the right margin when the word processor's "justify" alignment is not used. In NLP work, these splits introduce noise into the text, so the script attempts to remove them.

Rules

  1. The last character of the stripped line must be a hyphen "-".
  2. The character before the hyphen must be alphabetic.
  3. The first character of the next line must be alphabetic.
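
The three rules can be sketched in Python as follows. This is a minimal illustration rather than the script's actual code: the names `should_join` and `dehyphenate` are hypothetical, and two details are assumptions, namely that leading whitespace on the next line is ignored and that the whole next line is appended to the current one when a split word is rejoined.

```python
def should_join(line: str, next_line: str) -> bool:
    """Return True when all three rules hold for two consecutive lines."""
    stripped = line.rstrip()
    following = next_line.lstrip()  # assumption: leading whitespace is ignored
    return (
        len(stripped) >= 2
        and stripped[-1] == "-"      # rule 1: stripped line ends with a hyphen
        and stripped[-2].isalpha()   # rule 2: character before the hyphen is alphabetic
        and bool(following)
        and following[0].isalpha()   # rule 3: next line starts with an alphabetic character
    )


def dehyphenate(text: str) -> str:
    """Rejoin words split by a hyphen at the end of a line."""
    lines = text.split("\n")
    result = []
    current = lines[0]
    for following in lines[1:]:
        if should_join(current, following):
            # Drop the trailing hyphen and glue the next line onto this one.
            current = current.rstrip()[:-1] + following.lstrip()
        else:
            result.append(current)
            current = following
    result.append(current)
    return "\n".join(result)
```

Because `str.isalpha` is Unicode-aware, accented Spanish letters such as "á" or "ñ" count as alphabetic. A line ending in a hyphen followed by a line that starts with a digit, as in a date range, is left untouched.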

Future improvements

  1. Prevent dehyphenating hyphenated proper nouns.
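
One possible heuristic for this improvement, given purely as an assumption and not as the author's planned approach: keep the hyphen when the continuation line starts with an uppercase letter, since a word split mid-syllable normally continues in lowercase. The names `looks_like_proper_noun_break` and `join_pair` are hypothetical, and `join_pair` assumes the caller has already checked that the line ends with a hyphen preceded by a letter.

```python
def looks_like_proper_noun_break(line: str, next_line: str) -> bool:
    """Heuristic: a hyphen followed by an uppercase letter on the next line
    probably belongs to a hyphenated proper noun."""
    head = line.rstrip()
    tail = next_line.lstrip()
    return (
        len(head) >= 2
        and head.endswith("-")
        and head[-2].isalpha()
        and bool(tail)
        and tail[0].isupper()
    )


def join_pair(line: str, next_line: str) -> str:
    """Join two lines that were split at a hyphen, keeping the hyphen
    when the break looks like part of a proper noun."""
    head = line.rstrip()
    tail = next_line.lstrip()
    if looks_like_proper_noun_break(line, next_line):
        return head + tail       # keep the hyphen
    return head[:-1] + tail      # drop the hyphen
```

For example, `join_pair("Castilla-", "La Mancha")` keeps the hyphen, while `join_pair("sal-", "vado la vida")` removes it. The heuristic misfires on text written entirely in capitals and on proper nouns whose second part is lowercase, so it would need refinement before real use.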

DESGUIONADOR

This script removes the hyphen that splits words at the end of a line. Words are normally split this way to keep the text aligned to the right margin, instead of using the word processor's "justified" alignment. When doing NLP on a text, this kind of split introduces noise into the document.

Rules

  1. The last character of the line must be a hyphen "-".
  2. The character before the hyphen must be alphabetic.
  3. The first character of the next line must be alphabetic.

Example

"este script me ha sal- vado la vida" --> "este script me ha salvado la vida" (in English: "this script has saved my life")
