I no longer work on this project.
To sum up, these changes would likely make it work better:
- Use a bigger model; MobileNet does not seem to suffice.
- Use better training data, with more varied text angles, sizes and lengths.
- Use cleverer loss functions and better ways of combining the losses of the different model branches.
Okay, here's a gif showing the current state of the model.
It ain't much but it's honest work ¯\_(ツ)_/¯
In progress. Current state:
The detection loss started to flatten,
so I experimented with the relative weights of
the losses from both branches when computing gradients.
This shifts the trade-off between
what the model learns more: detection or recognition.
That explains the sudden drop in detection loss.
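
As a rough illustration of that weighting, here is a minimal sketch in plain Python. The names `combine_losses` and `combine_grads` and the weight parameters are hypothetical and not taken from the project code. It assumes the total loss is a simple weighted sum of the two branch losses.

```python
def combine_losses(det_loss, rec_loss, det_weight=1.0, rec_weight=1.0):
    """Weighted sum of the detection and recognition losses (hypothetical helper)."""
    return det_weight * det_loss + rec_weight * rec_loss


def combine_grads(det_grads, rec_grads, det_weight=1.0, rec_weight=1.0):
    """Gradient of the weighted sum with respect to the shared parameters.

    Differentiation is linear, so weighting the losses is the same as
    weighting the per-branch gradients before adding them.
    """
    return [det_weight * gd + rec_weight * gr
            for gd, gr in zip(det_grads, rec_grads)]


# Raising det_weight makes the shared backbone follow the detection gradient more.
print(combine_losses(2.0, 4.0, det_weight=1.0, rec_weight=0.5))  # 4.0
print(combine_grads([1.0, 2.0], [4.0, 0.0], det_weight=2.0, rec_weight=0.5))  # [4.0, 4.0]
```

Only the ratio between the two weights changes the trade-off; scaling both by the same factor behaves like a change in learning rate for plain gradient descent.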
- Backbone model (resnet/mobilenet) feature extraction + fusion
- Text detection part
- RoI rotate part
- Text recognition part
- Data generation
- Fix detection box bug during inference
The problem is that the code is still very messy and poorly structured.
Given that the main project pipeline is working, next steps are:
- Clean and restructure the code
- See if RoiRotatePart can be improved using an affine transformation
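
For the affine transformation idea, the following is one possible sketch and not the existing RoiRotatePart code. The function names and the box convention are assumptions: a box is given by its center `(cx, cy)`, side lengths `w` and `h`, and an angle `theta` in radians such that the width axis points along `(cos(theta), sin(theta))` in image coordinates. The matrix maps image points into an upright crop of fixed height.

```python
import math


def roi_affine_matrix(cx, cy, w, h, theta, out_height):
    """Return (matrix, out_width) mapping image points into an upright crop.

    Hypothetical helper: translate the box center to the origin, rotate by
    -theta, scale so the box height becomes out_height, then shift so the
    box's top-left corner lands on (0, 0).
    """
    s = out_height / h  # uniform scale, keeps the aspect ratio
    c, sn = math.cos(theta), math.sin(theta)
    out_width = s * w
    matrix = [
        [s * c, s * sn, out_width / 2 - s * (cx * c + cy * sn)],
        [-s * sn, s * c, out_height / 2 - s * (-cx * sn + cy * c)],
    ]
    return matrix, out_width


def apply_affine(matrix, x, y):
    """Apply a 2x3 affine matrix to a single point."""
    (a, b, tx), (d, e, ty) = matrix
    return a * x + b * y + tx, d * x + e * y + ty


# Axis-aligned 8x4 box centered at (10, 20), resized to height 8.
matrix, out_width = roi_affine_matrix(cx=10, cy=20, w=8, h=4, theta=0.0, out_height=8)
print(out_width)                    # 16.0
print(apply_affine(matrix, 6, 18))  # (0.0, 0.0)
```

This is the forward mapping (image to crop). Sampling-based warps usually need the inverse mapping (crop to image), so the matrix would have to be inverted before being used to sample features.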
Once the above is finished:
- Explore data augmentation
- Training/testing pipelines
- Documentation