This project generates synthetic data for training text localization/detection models, especially for OCR solutions.
Almost all open-source datasets for text localization (ICDAR, SynthText, COCO, etc.) consist of natural images, for instance text on an advertisement board. However, most OCR problems involve extracting text from PDF documents or scanned images, which mostly contain text on white backgrounds. For such a use case we do not really need to train a localizer on lots of natural images; training on data in which text is embedded on a white background makes more sense.
Since scanned documents contain noise, the generator can also add several noise types to the data. Right now the project supports Gaussian, salt & pepper, Poisson and speckle noise.
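The four noise types can be sketched with NumPy roughly as below. The function name, noise parameters (standard deviation, salt/pepper fraction) and the exact formulas are illustrative assumptions, not the project's actual implementation:

```python
import numpy as np

def add_noise(image, kind="gaussian", rng=None):
    """Apply one of the supported noise types to a grayscale uint8 image.
    The noise strengths used here are arbitrary example values."""
    rng = rng or np.random.default_rng()
    img = image.astype(np.float64)
    if kind == "gaussian":
        # Additive zero-mean Gaussian noise.
        noisy = img + rng.normal(0.0, 15.0, img.shape)
    elif kind == "salt_pepper":
        # Flip ~2% of pixels to black and ~2% to white.
        noisy = img.copy()
        mask = rng.random(img.shape)
        noisy[mask < 0.02] = 0      # pepper
        noisy[mask > 0.98] = 255    # salt
    elif kind == "poisson":
        # Treat each pixel intensity as the mean of a Poisson draw.
        noisy = rng.poisson(img).astype(np.float64)
    elif kind == "speckle":
        # Multiplicative noise: pixel * (1 + small Gaussian).
        noisy = img + img * rng.normal(0.0, 0.1, img.shape)
    else:
        raise ValueError(f"unknown noise type: {kind}")
    return np.clip(noisy, 0, 255).astype(np.uint8)
```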
- Python (any version would do)
- numpy
- opencv
- PIL
- tqdm
The synthetic data generator produces random texts from a specified text corpus, sampling from hundreds of font types and font sizes. Specify all the requirements as arguments inside the run.py file and run the script:
python run.py
Note: a separate corpus is provided for dates, since in my use case I had to detect a lot of dates within the images.
The generator saves the synthetic data inside the output folder given as an argument. It also saves a ground-truth file with the same name as the image. The ground-truth file format follows the ICDAR format:
x0,y0,x1,y1,x2,y2,x3,y3,<text>
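A ground-truth line in this format can be parsed as follows. The helper name is illustrative; the key point is splitting on only the first eight commas so the text field itself may contain commas:

```python
def parse_gt_line(line):
    """Split an ICDAR-style ground-truth line into its 8 polygon
    coordinates and the transcription text."""
    parts = line.rstrip("\n").split(",", 8)  # keep commas inside <text>
    coords = [int(p) for p in parts[:8]]
    return coords, parts[8]
```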
Sample1 (with noise) Sample2 (without noise)
Right now this is in a very naive shape; there are many additions that could make the data more useful:
- sampling from font styles
- sampling from font sizes
- sampling text from corpus
- sampling text from wikipedia
- adding more noise types
- option to choose background image
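For the last item, one simple approach is to composite the white-background text image onto an arbitrary background with a multiplicative blend, so white text-image pixels leave the background untouched while dark text pixels darken it. This is only a sketch of one possible design, not part of the current project:

```python
import numpy as np
from PIL import Image

def paste_on_background(text_img, background, position=(0, 0)):
    """Blend a grayscale white-background text image onto `background`
    at `position` using a per-pixel multiply."""
    bg = background.convert("L").copy()
    x, y = position
    region = bg.crop((x, y, x + text_img.width, y + text_img.height))
    # multiply: bg_pixel * text_pixel / 255
    blended = Image.fromarray(
        (np.asarray(region, dtype=np.float32)
         * np.asarray(text_img, dtype=np.float32) / 255.0).astype(np.uint8))
    bg.paste(blended, (x, y))
    return bg
```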