This project generates synthetic data for training text localization/detection models, especially for OCR solutions.
Almost all open-source datasets for text localization (ICDAR, SynthText, COCO, etc.) consist of natural images, for instance text on an advertisement board. However, most OCR problems involve extracting text from PDF documents or scanned images, which mostly contain text on white backgrounds. For such a use case we do not really need to train a localizer on lots of natural images; training on data in which text is embedded on a white background makes more sense.
Since scanned documents contain noise, the generator can also add several noise types to the data. Right now the project supports Gaussian, salt & pepper, Poisson and speckle noise.
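The four noise types can be sketched with NumPy roughly as below. The function name, noise parameters (standard deviation, salt/pepper fraction) and the exact formulas are illustrative assumptions, not the project's actual implementation:

```python
import numpy as np

def add_noise(image, kind="gaussian", rng=None):
    """Apply one of the supported noise types to a grayscale uint8 image.
    The noise strengths used here are arbitrary example values."""
    rng = rng or np.random.default_rng()
    img = image.astype(np.float64)
    if kind == "gaussian":
        # Additive zero-mean Gaussian noise.
        noisy = img + rng.normal(0.0, 15.0, img.shape)
    elif kind == "salt_pepper":
        # Flip ~2% of pixels to black and ~2% to white.
        noisy = img.copy()
        mask = rng.random(img.shape)
        noisy[mask < 0.02] = 0      # pepper
        noisy[mask > 0.98] = 255    # salt
    elif kind == "poisson":
        # Treat each pixel intensity as the mean of a Poisson draw.
        noisy = rng.poisson(img).astype(np.float64)
    elif kind == "speckle":
        # Multiplicative noise: pixel * (1 + small Gaussian).
        noisy = img + img * rng.normal(0.0, 0.1, img.shape)
    else:
        raise ValueError(f"unknown noise type: {kind}")
    return np.clip(noisy, 0, 255).astype(np.uint8)
```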
- Python (any version would do)
- numpy
- opencv
- PIL
- tqdm
The synthetic data generator produces random texts from a specified text corpus, sampling from hundreds of font types and font sizes. Specify all the requirements as arguments inside the run.py file and run the script:
python run.py
Note: a separate corpus is provided for dates, since in my use case I had to detect a lot of dates within the images.
The generator saves the synthetic data inside the output folder given as an argument. It also saves a ground-truth file with the same name as the image. The ground-truth file format follows the ICDAR format:
x0,y0,x1,y1,x2,y2,x3,y3,<text>
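A ground-truth line in this format can be parsed as follows. The helper name is illustrative; the key point is splitting on only the first eight commas so the text field itself may contain commas:

```python
def parse_gt_line(line):
    """Split an ICDAR-style ground-truth line into its 8 polygon
    coordinates and the transcription text."""
    parts = line.rstrip("\n").split(",", 8)  # keep commas inside <text>
    coords = [int(p) for p in parts[:8]]
    return coords, parts[8]
```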
Sample1 (with noise) Sample2 (without noise)
Right now this is in a very naive shape; there are many additions that could make the data more useful:
- sampling from font styles
- sampling from font sizes
- sampling text from corpus
- sampling text from wikipedia
- adding more noise types
- option to choose background image
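For the last item, one simple approach is to composite the white-background text image onto an arbitrary background with a multiplicative blend, so white text-image pixels leave the background untouched while dark text pixels darken it. This is only a sketch of one possible design, not part of the current project:

```python
import numpy as np
from PIL import Image

def paste_on_background(text_img, background, position=(0, 0)):
    """Blend a grayscale white-background text image onto `background`
    at `position` using a per-pixel multiply."""
    bg = background.convert("L").copy()
    x, y = position
    region = bg.crop((x, y, x + text_img.width, y + text_img.height))
    # multiply: bg_pixel * text_pixel / 255
    blended = Image.fromarray(
        (np.asarray(region, dtype=np.float32)
         * np.asarray(text_img, dtype=np.float32) / 255.0).astype(np.uint8))
    bg.paste(blended, (x, y))
    return bg
```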