metalex is general tool in AGPL License for lexicographics and metalexicographics activities. For current development version of this tool, see MetaLex/v2.0
- metalex proceeds in this way (written in french)
- This is an example of this process used with metalex
I am a metalexicographer or linguist and I have paper dictionaries. I want to perform a diachronic study of the evolution of the wording of definitions in a collection of dictionaries available from period A to period B.
- Traditionally or at best, the contemporary metalexicographer (according to our point of view) would apply the following methodology :
1- Scanning of printed materials (Scan) and enhance its qualities 2- Optical reading of the pictures (Ocrisation) = extract articles content 3- Manual Error Corrections of text articles 4- Marking of the articles with regular standard 5- Metalexographical analysis / decryption of articles
- metalex through its modules operates in the same way by successively executing each of these tasks automatically.
1) MetaLex enhances the quality of dictionary images **metalex.ocrtext.normalizeImage.EnhanceImages().filter(f.DETAIL)** 2) MetaLex extract from dictionary images all dictionary articles **metalex.ocrtext.make_ocr.image_to_text()** 3) MetaLex corrects dictionary articles **metalex.ocrtext.make_text_well()** 4) MetaLex marking dictionary articles depending of some standard **metalex.xmlised.put_xml('tei') or MetaLex.xmlised.put_xml('lmf')** 5) MetaLex generates some metalexicographics analysis of part of content dictionary **metalex.xmlised.handleStat()**
- Some other more complex processes can be done !
MetaLex is developed in Python 2.7 environment, the following packages are required :
- We can install all package dependencies manually
sudo apt-get install build-essential libssl-dev libffi-dev python-dev
sudo pip install Cython
sudo apt-get install libtesseract-dev libleptonica-dev libjpeg-dev zlib1g-dev libpng-dev
sudo apt-get install tesseract-ocr-all
sudo apt-get install python-html5lib
sudo apt-get install python-lxml
sudo apt-get install python-bs4
sudo pip install pillow
sudo pip install --no-cache-dir -I pillow
sudo pip install http://effbot.org/downloads/Imaging-1.1.7.tar.gz
sudo pip install termcolor
sudo CPPFLAGS=-I/usr/local/include pip install tesserocr
- Or follow these steps
sudo ./config.sh # Install linux package dependencies
sudo pip install -r requirements.txt # Install python module dependencies
- Or merely use (the preferred option is to begin with the extra linux packages)
sudo pip install metalex
- Go to the test/ folder and run build help command
python runMetalex.py -h
# metalex -h
---------------------------------------------------------------
| * * * * * * * * * * * * * * * * ** ** |
| * * * * * * * * * * * * * * |
| * * * * * * * * * * * * * * ** ** |
---------------------------------------------------------------
metalex is general tool for lexicographics and metalexicographics activities
optional arguments:
-h, --help show this help message and exit
-v, --version show program's version number and exit
-p PROJECTNAME, --project PROJECTNAME
Defined metalex project name
-c author comment contributors, --confproject author comment contributors
Defined metalex configuration for the current project
-i [IMAGEFILE], --dicimage [IMAGEFILE]
Input one or multiple dictionary image(s) file(s) for
current metalex project
--dld DOWNLOAD Download ocropy model from Github for current metalex
project
-o {ocropy,tesserocr}, --ocrtype {ocropy,tesserocr}
OCR type to use for current metalex project
-m {modeldef,}, --model {modeldef,}
OCR LSTM model to use for current metalex project
-d IMAGESDIR, --imagedir IMAGESDIR
Input folder name of dictionary image files for
current metalex project
--imgalg actiontype value
Set algorithm for enhancing dictionary image files for
current metalex project (actiontype must be : contrast
or bright or filter)
-r FILERULE, --filerule FILERULE
Defined file rules that we use to enhance quality of
OCR result
-l LANG, --lang LANG Set language for optical characters recognition and
others metalex treatment
-x {xml,lmf,tei}, --xml {xml,lmf,tei}
Defined output result treatment of metalex
-s, --save Save output result of the current project in files
-t, --terminal Show result of the current treatment in the terminal
------------------------------------------------------------------------------
metalex project : special Thank to Bill for metalex-vagrant version
------------------------------------------------------------------------------
- Build the file rules of the project.
MetaLex takes files using specific structure to enhance output text of OCR data (from dictionary image files). \W for word replacement, \C for character replacement and \R for regular expression replacement. The spaces between headers are used to to describe remplacement.
\START \MetaLex\project_name\type_of_project\lang\author\date \W /t'/t /{/f. /E./f. \C /i'/i \R /a-z+/ij \END
- If you want to use ocropy OCR, please download its models first : It is save at $home/metalex/models.
# from source file
python runMetalex.py --dld modelDef
# when metalex is installed
metalex --dld modelDef
- Run your project with the default parameters except dictionary images data and save results. You must create a folder containing dictionary image files such as test-files/images/.
# [OCRopy OCR] We defined a folder containing dictionary images for current process
python runMetalex.py -d 'test-files/images' -o ocropy -m modeldef -s
# or metalex -d 'test-files/images' -o ocropy -m modeldef -s
# [Tesserocr OCR] Or you can define a single dictionary image file
python runMetalex.py -i 'test-files/images/LarClasIll_1911_gay-Trouin.jpg' -o tesserocr -m modeldef -s
# or metalex -i 'test-files/images/LarClasIll_1911_gay-Trouin.jpg' -o tesserocr -m modeldef -s
- Run your project with your own set of parameters and save results
python runMetalex.py -p 'projectname' -c 'author' 'comment' 'contributors' -d 'test-files/images' -r 'test-files/file_Rule.dic' -l 'fra' -o tesserocr -m modeldef -s
# metalex -p 'projectname' -c 'author' 'comment' 'contributors' -d 'test-files/images' -r 'test-files/file_Rule.dic' -l 'fra' -o tesserocr -m modeldef -s
- OUTPUT : For the first command (without parameters), the result in the console will produce this. NB : With parameters, error and warning messages will disappear.
Special thank to Bill for MetaLex-vagrant version for windows, Mac OS 6, Linux
Please don't forget to cite this work :
@article{Mboning-Elvis,
title = {Quand le TAL s'empare de la métalexicographie : conception d'un outil pour le métalexicographe},
author = {Mboning, Elvis},
url = {https://github.com/Levis0045/MetaLex},
date = {2017-06-20},
shool = {Université de Lille 3},
year = {2017},
pages = {12},
keywords = {métalexicographie, TAL, fouille de données, extraction d'information, lecture optique, lexicographie, Xmlisation, DTD}
}