How to use

Put the documents you want to dehyphenate in the input_docs directory, then run the script. The output is written to output_docs.

python3 cr_dehyphenator_es-ES.py
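The workflow above can be pictured roughly as follows. This is only a sketch, not the script's actual code: the function name `process_directory`, the `transform` parameter, the UTF-8 encoding, and the choice to keep each file's name unchanged are all assumptions made for illustration. Only the directory names input_docs and output_docs come from this README.

```python
from pathlib import Path
from typing import Callable, List


def process_directory(input_dir: str, output_dir: str,
                      transform: Callable[[str], str]) -> List[str]:
    """Apply `transform` to every file in input_dir and write the result
    under the same file name in output_dir (hypothetical helper)."""
    src = Path(input_dir)
    dst = Path(output_dir)
    dst.mkdir(parents=True, exist_ok=True)  # create output_docs if missing

    written = []
    for path in sorted(src.iterdir()):
        if not path.is_file():
            continue  # skip subdirectories
        # Assumption: the documents are UTF-8 encoded plain text.
        text = path.read_text(encoding="utf-8")
        (dst / path.name).write_text(transform(text), encoding="utf-8")
        written.append(path.name)
    return written
```

Under these assumptions, a call such as `process_directory("input_docs", "output_docs", my_dehyphenate)` would mirror the behaviour described above, where `my_dehyphenate` stands in for whatever function performs the dehyphenation.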

DEHYPHENATOR

This script removes hyphens that split words at the ends of lines. Words are usually split like this to keep the text aligned to the right margin when the word processor's "justify" alignment is not used. When doing NLP work on a text, these splits introduce noise into the file, so the script attempts to remove them.

Rules

The last character of the stripped line must be a hyphen "-".
The character before the hyphen must be alphabetic.
The first character of the next line must be alphabetic.
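The three rules can be expressed as a small check over consecutive lines. The following is a minimal sketch under stated assumptions, not the script's actual implementation: the names `should_join` and `dehyphenate` are hypothetical, and the lines are assumed to arrive without trailing newline characters (for example from `str.splitlines()`).

```python
from typing import List


def should_join(line: str, next_line: str) -> bool:
    """Return True when all three rules hold for a pair of lines."""
    stripped = line.strip()
    return (
        len(stripped) >= 2
        and stripped[-1] == "-"        # rule 1: stripped line ends with a hyphen
        and stripped[-2].isalpha()     # rule 2: character before the hyphen is alphabetic
        and next_line[:1].isalpha()    # rule 3: next line starts with an alphabetic character
    )


def dehyphenate(lines: List[str]) -> List[str]:
    """Merge each line that ends in a word-splitting hyphen with the next one."""
    result: List[str] = []
    for line in lines:
        if result and should_join(result[-1], line):
            # Drop trailing whitespace and the hyphen, then append the continuation.
            result[-1] = result[-1].rstrip()[:-1] + line
        else:
            result.append(line)
    return result
```

Python's `str.isalpha()` returns True for accented letters such as "á" or "ñ", which suits Spanish text. Because a merged line is checked again against the following line, a word split across more than two lines is also rejoined in this sketch.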

Future improvements

Prevent dehyphenating hyphenated proper nouns.
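One possible direction for this improvement, offered only as an illustrative heuristic and not as something the script does: keep the hyphen when the continuation starts with an uppercase letter, on the assumption that such splits more often belong to hyphenated proper nouns. The function name `join_fragments` is hypothetical, and the heuristic would misfire on ordinary words that happen to be capitalised.

```python
def join_fragments(head: str, tail: str) -> str:
    """Join a line fragment ending in '-' with its continuation.

    Heuristic (an assumption, not part of the script): if the continuation
    begins with an uppercase letter, treat the hyphen as part of a proper
    noun and keep it; otherwise drop it.
    """
    if not head.endswith("-"):
        raise ValueError("head must end with a hyphen")
    if tail[:1].isupper():
        return head + tail       # keep the hyphen: likely a hyphenated name
    return head[:-1] + tail      # drop the hyphen: ordinary split word
```

For example, `join_fragments("Castilla-", "La Mancha")` keeps the hyphen, while `join_fragments("sal-", "vado")` removes it.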

DESGUIONADOR

This script removes the hyphen that splits words at the end of a line. Words are usually split this way to keep the text aligned to the right margin, instead of using the word processor's "justified" alignment. When doing NLP on a text, this kind of split introduces noise into the document.

Rules

The last character of the line must be a hyphen "-".
The character before the hyphen must be alphabetic.
The first character of the next line must be alphabetic.

Example / Ejemplo

"este script me ha sal- vado la vida" --> "este script me ha salvado la vida"


mridpin/cr_dehyphenator
