textinfo

charfreq.py

There are a few scripts in the python directory that report information about text files.

The first is charfreq.py. It takes an input folder, reads every file in that folder, and counts every occurrence of each character. It writes two reports in CSV format: character_summary.csv contains the total count of each character across all files, and character_report.csv breaks the counts down by individual file.
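The counting and reporting described above can be sketched roughly as follows. This is a minimal illustration, not the actual contents of charfreq.py: the function names, the CSV column layout, and the assumption that the input files are UTF-8 are all hypothetical.

```python
import csv
from collections import Counter
from pathlib import Path


def count_characters(input_folder):
    """Return {filename: Counter of characters} for each file in the folder."""
    per_file = {}
    for path in sorted(Path(input_folder).iterdir()):
        if path.is_file():
            # Assumption: files are UTF-8; undecodable bytes are replaced.
            text = path.read_text(encoding="utf-8", errors="replace")
            per_file[path.name] = Counter(text)
    return per_file


def write_reports(per_file, output_folder):
    """Write character_summary.csv (totals) and character_report.csv (per file)."""
    out = Path(output_folder)
    out.mkdir(parents=True, exist_ok=True)

    # Sum the per-file counts to get the totals across all files.
    totals = Counter()
    for counts in per_file.values():
        totals.update(counts)

    # Hypothetical column layout: character, code point, count.
    with open(out / "character_summary.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["character", "codepoint", "count"])
        for char, count in totals.most_common():
            writer.writerow([char, f"U+{ord(char):04X}", count])

    with open(out / "character_report.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["filename", "character", "codepoint", "count"])
        for name, counts in per_file.items():
            for char, count in counts.most_common():
                writer.writerow([name, char, f"U+{ord(char):04X}", count])

    return totals
```

Including the code point next to each character makes invisible or combining characters easier to identify when the report is opened in a spreadsheet.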

To do:
It would be better to replace character_summary.csv with a fixed Excel sheet that calculates the totals.

transliterate.py

transliterate.py uses a hard-coded transliteration table to convert Arabic characters whose Unicode script is 'Inherited' to other Arabic characters. It is an attempt to work around a bug (now fixed) in SentencePiece that broke words when it encountered a character with the Inherited script.
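A table-driven replacement of this kind can be sketched with `str.translate`. The single mapping below is purely illustrative and is not taken from transliterate.py: it maps U+0670 (ARABIC LETTER SUPERSCRIPT ALEF, which to my knowledge has the Inherited script) to U+0627 (ARABIC LETTER ALEF). The names `INHERITED_TO_ARABIC` and `transliterate` are hypothetical.

```python
# Hypothetical, illustrative table; the real table in transliterate.py may differ.
INHERITED_TO_ARABIC = {
    "\u0670": "\u0627",  # ARABIC LETTER SUPERSCRIPT ALEF -> ARABIC LETTER ALEF
}

# Build the translation table once so it can be reused for every string.
_TABLE = str.maketrans(INHERITED_TO_ARABIC)


def transliterate(text):
    """Replace every character listed in the table; leave all others unchanged."""
    return text.translate(_TABLE)
```

Characters that are not in the table pass through untouched, so the function is safe to run over mixed-script text.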

findnames.py

findnames.py looks for words in a text file that begin with an uppercase letter. Words that occur only in the first position are removed; the remaining words are a first approximation of proper nouns. These are written to an output file on the same line as where they were found in the input. The hope is that this helps to improve the NLP processing of names that are not to be translated. It might also help with manual post-editing.
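The filtering idea can be sketched as a two-pass routine. This is a rough illustration under stated assumptions, not the code in findnames.py: "first position" is taken to mean the first word of a line, words are taken to be runs of letters, and the name `find_names` is hypothetical.

```python
import re

# Runs of letters only (no digits, underscores, or punctuation).
WORD = re.compile(r"[^\W\d_]+")


def find_names(lines):
    """Return, for each input line, the capitalized words kept as name candidates."""
    tokenized = []
    seen_non_first = set()

    # Pass 1: record capitalized words that appear somewhere other than first.
    for line in lines:
        words = WORD.findall(line)
        tokenized.append(words)
        for word in words[1:]:
            if word[0].isupper():
                seen_non_first.add(word)

    # Pass 2: keep a capitalized word only if it was seen in a non-first
    # position at least once; words seen only in first position are dropped.
    return [
        [w for w in words if w[0].isupper() and w in seen_non_first]
        for words in tokenized
    ]
```

For example, given the lines `"The boy saw Peter"`, `"Peter went to Rome"`, and `"The end"`, the sketch returns `[["Peter"], ["Peter", "Rome"], []]`: "The" is dropped because it only ever appears first, while "Peter" is kept even where it starts a line. The result has one entry per input line, so writing each entry on its own line keeps the output aligned with the source text.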
