[Paper List-4] Add 10 recog papers #1676

Open · wants to merge 5 commits into base: dev-1.x
@@ -0,0 +1,74 @@
Title: 'A holistic representation guided attention network for scene text recognition'
Abbreviation: Yang et al.
Tasks:
- TextRecog
Venue: Neurocomputing
Year: 2020
Lab/Company:
- School of Computer Science, Northwestern Polytechnical University, Xi'an, China
URL:
Venue: 'https://www.sciencedirect.com/science/article/pii/S0925231220311176'
Arxiv: 'https://arxiv.org/abs/1904.01375'
Paper Reading URL: N/A
Code: N/A
Supported In MMOCR: N/S
PaperType:
- Algorithm
Abstract: 'Reading irregular scene text of arbitrary shape in natural images is
still a challenging problem, despite the progress made recently. Many existing
approaches incorporate sophisticated network structures to handle various shapes,
use extra annotations for stronger supervision, or employ hard-to-train recurrent
neural networks for sequence modeling. In this work, we propose a simple yet
strong approach for scene text recognition. With no need to convert input images
to sequence representations, we directly connect two-dimensional CNN features
to an attention-based sequence decoder which is guided by a holistic representation.
The holistic representation can guide the attention-based decoder to focus on more
accurate areas. As no recurrent module is adopted, our model can be trained in
parallel. It achieves 1.5× to 9.4× acceleration to backward pass and 1.3× to
7.9× acceleration to forward pass, compared with the RNN counterparts. The
proposed model is trained with only word-level annotations. With this simple
design, our method achieves state-of-the-art or competitive recognition
performance on the evaluated regular and irregular scene text benchmark
datasets.'
MODELS:
Architecture:
- Transformer
Learning Method:
- Supervised
Language Modality:
- Implicit Language Model
Network Structure: 'https://user-images.githubusercontent.com/65173622/212223421-55532fde-a4fb-4fd1-ba8c-a1d057b03058.png'
FPS:
DEVICE: N/A
ITEM: N/A
FLOPS:
DEVICE: N/A
ITEM: N/A
PARAMS: N/A
Experiment:
Training DataSets:
- MJ
- ST
Test DataSets:
Avg.: 87.1
IIIT5K:
WAICS: 94.7
SVT:
WAICS: 88.9
IC13:
WAICS: 93.2
IC15:
WAICS: 79.5
SVTP:
WAICS: 80.9
CUTE:
WAICS: 85.4
Bibtex: '@article{yang2020holistic,
title={A holistic representation guided attention network for scene text recognition},
author={Yang, Lu and Wang, Peng and Li, Hui and Li, Zhen and Zhang, Yanning},
journal={Neurocomputing},
volume={414},
pages={67--75},
year={2020},
publisher={Elsevier}
}'
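The entry above describes connecting 2-D CNN features directly to a parallel attention decoder that is guided by a holistic image representation, with no recurrent module. Below is a minimal PyTorch sketch of that idea, not the authors' implementation: the module name, feature sizes, and the use of global average pooling as the holistic vector are assumptions made for illustration.

```python
# Minimal sketch: positional queries over a 2-D feature map, each biased by a
# holistic (globally pooled) vector, so all decoding steps run in parallel.
import torch
import torch.nn as nn


class HolisticGuidedDecoder(nn.Module):
    def __init__(self, feat_dim=512, max_len=25, num_classes=97):
        super().__init__()
        # One learned query embedding per output position (no recurrence).
        self.pos_emb = nn.Embedding(max_len, feat_dim)
        # Projects the holistic vector so it can bias every positional query.
        self.holistic_proj = nn.Linear(feat_dim, feat_dim)
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, feat_2d):
        # feat_2d: (B, C, H, W) CNN feature map, kept two-dimensional.
        b, c, h, w = feat_2d.shape
        keys = feat_2d.flatten(2).transpose(1, 2)                   # (B, H*W, C)
        holistic = feat_2d.mean(dim=(2, 3))                         # (B, C)
        pos = self.pos_emb.weight.unsqueeze(0)                      # (1, T, C)
        # Each positional query is guided by the holistic representation.
        queries = pos + self.holistic_proj(holistic).unsqueeze(1)   # (B, T, C)
        attn = torch.softmax(queries @ keys.transpose(1, 2) / c ** 0.5, dim=-1)
        glimpse = attn @ keys                                        # (B, T, C)
        return self.classifier(glimpse)                              # (B, T, num_classes)


if __name__ == "__main__":
    logits = HolisticGuidedDecoder()(torch.randn(2, 512, 8, 25))
    print(logits.shape)  # torch.Size([2, 25, 97])
```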
@@ -0,0 +1,76 @@
Title: 'From Two to One: A New Scene Text Recognizer with Visual Language Modeling Network'
Abbreviation: VisionLAN
Tasks:
- TextRecog
Venue: ICCV
Year: 2021
Lab/Company:
- University of Science and Technology of China
- Huawei Cloud & AI
URL:
Venue: 'http://openaccess.thecvf.com/content/ICCV2021/html/Wang_From_Two_to_One_A_New_Scene_Text_Recognizer_With_ICCV_2021_paper.html'
Arxiv: 'https://arxiv.org/abs/2108.09661'
Paper Reading URL: 'https://mp.weixin.qq.com/s/YtYio-k139cKzCnn3R4YwA'
Code: 'https://github.com/wangyuxin87/VisionLAN'
Supported In MMOCR: N/S
PaperType:
- Algorithm
- Dataset
Abstract: 'In this paper, we abandon the dominant complex language model and
rethink the linguistic learning process in the scene text recognition. Different
from previous methods considering the visual and linguistic information in two
separate structures, we propose a Visual Language Modeling Network (VisionLAN),
which views the visual and linguistic information as a union by directly enduing
the vision model with language capability. Specially, we introduce the text
recognition of character-wise occluded feature maps in the training stage. Such
operation guides the vision model to use not only the visual texture of
characters, but also the linguistic information in visual context for
recognition when the visual cues are confused (e.g. occlusion, noise, etc.).
As the linguistic information is acquired along with visual features without
the need of extra language model, VisionLAN significantly improves the speed
by 39% and adaptively considers the linguistic information to enhance the visual
features for accurate recognition. Furthermore, an Occlusion Scene Text (OST)
dataset is proposed to evaluate the performance on the case of missing
character-wise visual cues. The state-of-the-art results on several benchmarks
prove our effectiveness. Code and dataset are available at
https://github.com/wangyuxin87/VisionLAN.'
MODELS:
Architecture:
- Attention
Learning Method:
- Supervised
Language Modality:
- Explicit Language Model
Network Structure: 'https://user-images.githubusercontent.com/65173622/212230022-65678cf4-fdd9-4828-92ce-2d4e9a19bfac.png'
FPS:
DEVICE: N/A
ITEM: N/A
FLOPS:
DEVICE: N/A
ITEM: N/A
PARAMS: N/A
Experiment:
Training DataSets:
- MJ
- ST
Test DataSets:
Avg.: 90.2
IIIT5K:
WAICS: 95.8
SVT:
WAICS: 91.7
IC13:
WAICS: 95.7
IC15:
WAICS: 83.7
SVTP:
WAICS: 86.0
CUTE:
WAICS: 88.5
Bibtex: '@inproceedings{wang2021two,
title={From two to one: A new scene text recognizer with visual language modeling network},
author={Wang, Yuxin and Xie, Hongtao and Fang, Shancheng and Wang, Jing and Zhu, Shenggao and Zhang, Yongdong},
booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
pages={14194--14203},
year={2021}
}'
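The VisionLAN abstract above hinges on occluding the visual features of one character at training time so the vision model must fall back on linguistic context. The following is a minimal sketch under assumed shapes, not the released VisionLAN code; the per-character attention maps and the soft masking rule are illustrative assumptions.

```python
# Minimal sketch: suppress the features of one randomly chosen character so the
# recognizer has to use linguistic context to still predict the full word.
import torch


def occlude_character(feat_2d, char_attn, char_idx):
    """feat_2d:   (B, C, H, W) visual features
    char_attn:    (B, T, H, W) per-character attention maps (assumed given)
    char_idx:     (B,) index of the character to suppress for each sample
    """
    b = feat_2d.size(0)
    # Attention map of the chosen character for every sample.
    mask = char_attn[torch.arange(b), char_idx]                 # (B, H, W)
    # Soft occlusion mask: 1 keeps features, 0 removes them.
    keep = 1.0 - mask / (mask.amax(dim=(1, 2), keepdim=True) + 1e-6)
    return feat_2d * keep.unsqueeze(1)                          # (B, C, H, W)


if __name__ == "__main__":
    feats = torch.randn(2, 256, 8, 25)
    attn = torch.rand(2, 10, 8, 25)
    idx = torch.randint(0, 10, (2,))
    print(occlude_character(feats, attn, idx).shape)  # torch.Size([2, 256, 8, 25])
```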
@@ -0,0 +1,74 @@
Title: 'GTC: Guided Training of CTC towards Efficient and Accurate Scene Text Recognition'
Abbreviation: GTC
Tasks:
- TextRecog
Venue: AAAI
Year: 2020
Lab/Company:
- Nanyang Technological University
- SenseTime Group Ltd.
URL:
Venue: 'https://ojs.aaai.org/index.php/AAAI/article/view/6735'
Arxiv: 'https://arxiv.org/abs/2002.01276'
Paper Reading URL: N/A
Code: N/A
Supported In MMOCR: N/S
PaperType:
- Algorithm
Abstract: 'Connectionist Temporal Classification (CTC) and attention mechanism
are two main approaches used in recent scene text recognition works. Compared
with attention-based methods, CTC decoder has a much shorter inference time,
yet a lower accuracy. To design an efficient and effective model, we propose
the guided training of CTC (GTC), where CTC model learns a better alignment and
feature representations from a more powerful attentional guidance. With the
benefit of guided training, CTC model achieves robust and accurate prediction
for both regular and irregular scene text while maintaining a fast inference
speed. Moreover, to further leverage the potential of CTC decoder, a graph
convolutional network (GCN) is proposed to learn the local correlations of
extracted features. Extensive experiments on standard benchmarks demonstrate
that our end-to-end model achieves a new state-of-the-art for regular and
irregular scene text recognition and needs 6 times shorter inference time than
attention-based methods.'
MODELS:
Architecture:
- CTC
- Attention
Learning Method:
- Supervised
Language Modality:
- Implicit Language Model
Network Structure: 'https://user-images.githubusercontent.com/65173622/212222112-fc9f3490-003d-409c-874a-551aab414329.png'
FPS:
DEVICE: N/A
ITEM: N/A
FLOPS:
DEVICE: N/A
ITEM: N/A
PARAMS: N/A
Experiment:
Training DataSets:
- MJ
- ST
Test DataSets:
Avg.: 90.6
IIIT5K:
WAICS: 95.5
SVT:
WAICS: 92.9
IC13:
WAICS: 94.3
IC15:
WAICS: 82.5
SVTP:
WAICS: 86.2
CUTE:
WAICS: 92.3
Bibtex: '@inproceedings{hu2020gtc,
title={Gtc: Guided training of ctc towards efficient and accurate scene text recognition},
author={Hu, Wenyang and Cai, Xiaocong and Hou, Jun and Yi, Shuai and Lin, Zhiping},
booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
volume={34},
number={07},
pages={11005--11012},
year={2020}
}'
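GTC, as summarized above, trains a fast CTC head under the guidance of a stronger attention head that shares the same backbone, keeping only the CTC head at inference. A minimal sketch of such a joint training loss is given below; the head shapes, loss weights, and padding conventions are assumptions, and the paper's GCN over CTC features is omitted.

```python
# Minimal sketch: CTC head + attention head share one backbone; both losses
# update the shared features during training, only the CTC head runs at test time.
import torch
import torch.nn as nn


def gtc_loss(ctc_log_probs, attn_logits, targets, input_lens, target_lens,
             ctc_weight=1.0, attn_weight=1.0):
    """ctc_log_probs: (T, B, C) log-probabilities from the CTC head
    attn_logits:      (B, L, C) logits from the attention head (the guide)
    targets:          (B, L) padded label indices (0 = blank / padding, assumed)
    """
    ctc = nn.CTCLoss(blank=0, zero_infinity=True)(
        ctc_log_probs, targets, input_lens, target_lens)
    attn = nn.functional.cross_entropy(
        attn_logits.flatten(0, 1), targets.flatten(), ignore_index=0)
    # The attention branch guides the shared backbone; both terms backpropagate.
    return ctc_weight * ctc + attn_weight * attn


if __name__ == "__main__":
    T, B, C, L = 32, 2, 37, 10
    loss = gtc_loss(torch.randn(T, B, C).log_softmax(-1),
                    torch.randn(B, L, C),
                    torch.randint(1, C, (B, L)),
                    torch.full((B,), T, dtype=torch.long),
                    torch.full((B,), L, dtype=torch.long))
    print(loss.item())
```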
@@ -0,0 +1,76 @@
Title: 'Learn to Augment: Joint Data Augmentation and Network Optimization for Text Recognition'
Abbreviation: Luo et al.
Tasks:
- TextRecog
Venue: CVPR
Year: 2020
Lab/Company:
- South China University of Technology
- Alibaba Group
URL:
Venue: 'http://openaccess.thecvf.com/content_CVPR_2020/html/Luo_Learn_to_Augment_Joint_Data_Augmentation_and_Network_Optimization_for_CVPR_2020_paper.html'
Arxiv: 'https://arxiv.org/abs/2003.06606'
Paper Reading URL: N/A
Code: 'https://github.com/Canjie-Luo/Text-Image-Augmentation'
Supported In MMOCR: N/S
PaperType:
- Algorithm
Abstract: 'Handwritten text and scene text suffer from various shapes and
distorted patterns. Thus training a robust recognition model requires a large
amount of data to cover diversity as much as possible. In contrast to data
collection and annotation, data augmentation is a low cost way. In this paper,
we propose a new method for text image augmentation. Different from traditional
augmentation methods such as rotation, scaling and perspective transformation,
our proposed augmentation method is designed to learn proper and efficient data
augmentation which is more effective and specific for training a robust
recognizer. By using a set of custom fiducial points, the proposed augmentation
method is flexible and controllable. Furthermore, we bridge the gap between the
isolated processes of data augmentation and network optimization by joint
learning. An agent network learns from the output of the recognition network
and controls the fiducial points to generate more proper training samples for
the recognition network. Extensive experiments on various benchmarks, including
regular scene text, irregular scene text and handwritten text, show that the
proposed augmentation and the joint learning methods significantly boost the
performance of the recognition networks. A general toolkit for geometric
augmentation is available.'
MODELS:
Architecture:
- CTC
- Attention
Learning Method:
- Supervised
Language Modality:
- Implicit Language Model
Network Structure: 'https://user-images.githubusercontent.com/65173622/212230430-9b55473b-5cf8-4923-b977-fa14afe820c1.png'
FPS:
DEVICE: N/A
ITEM: N/A
FLOPS:
DEVICE: N/A
ITEM: N/A
PARAMS: N/A
Experiment:
Training DataSets:
- MJ
- ST
Test DataSets:
Avg.: N/A
IIIT5K:
WAICS: N/A
SVT:
WAICS: N/A
IC13:
WAICS: N/A
IC15:
WAICS: N/A
SVTP:
WAICS: N/A
CUTE:
WAICS: N/A
Bibtex: '@inproceedings{luo2020learn,
title={Learn to augment: Joint data augmentation and network optimization for text recognition},
author={Luo, Canjie and Zhu, Yuanzhi and Jin, Lianwen and Wang, Yongpan},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
pages={13746--13755},
year={2020}
}'
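The abstract above describes a learnable augmentation in which an agent moves a set of fiducial points to warp text images jointly with recognizer training. The sketch below is a simplification under stated assumptions: it replaces the paper's moving-least-squares transform with a dense flow obtained by upsampling coarse control-point offsets, and the agent architecture and 3-channel input are invented for illustration.

```python
# Minimal sketch: a tiny agent predicts offsets for a coarse grid of fiducial
# points; the offsets are upsampled to a dense flow and used to warp the image,
# so the augmentation itself is differentiable and trainable with the recognizer.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LearnableWarp(nn.Module):
    def __init__(self, points=(3, 6), max_shift=0.1):
        super().__init__()
        self.points, self.max_shift = points, max_shift
        # Tiny "agent": predicts (dx, dy) for every fiducial point (3-ch input assumed).
        self.agent = nn.Sequential(
            nn.AdaptiveAvgPool2d(points), nn.Flatten(),
            nn.Linear(3 * points[0] * points[1], 2 * points[0] * points[1]),
            nn.Tanh())

    def forward(self, img):
        b, _, h, w = img.shape
        offsets = self.agent(img).view(b, 2, *self.points) * self.max_shift
        # Upsample coarse fiducial-point offsets to a dense flow field.
        flow = F.interpolate(offsets, size=(h, w), mode='bilinear',
                             align_corners=True).permute(0, 2, 3, 1)
        # Identity sampling grid in [-1, 1], shifted by the predicted flow.
        ys, xs = torch.meshgrid(torch.linspace(-1, 1, h),
                                torch.linspace(-1, 1, w), indexing='ij')
        grid = torch.stack((xs, ys), dim=-1).expand(b, -1, -1, -1)
        return F.grid_sample(img, grid + flow, align_corners=True)


if __name__ == "__main__":
    out = LearnableWarp()(torch.rand(2, 3, 32, 100))
    print(out.shape)  # torch.Size([2, 3, 32, 100])
```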
@@ -0,0 +1,76 @@
Title: 'MORAN: A Multi-Object Rectified Attention Network for Scene Text Recognition'
Abbreviation: MORAN
Tasks:
- TextRecog
Venue: PR
Year: 2019
Lab/Company:
- School of Electronic and Information Engineering, South China University of Technology
- SCUT-Zhuhai Institute of Modern Industrial Innovation
URL:
Venue: 'https://www.sciencedirect.com/science/article/pii/S0031320319300263'
Arxiv: 'https://ui.adsabs.harvard.edu/abs/2019PatRe..90..109L/abstract'
Paper Reading URL: N/A
Code: 'https://github.com/Canjie-Luo/MORAN_v2'
Supported In MMOCR: N/S
PaperType:
- Algorithm
Abstract: 'Irregular text is widely used. However, it is considerably difficult
to recognize because of its various shapes and distorted patterns. In this
paper, we thus propose a multi-object rectified attention network (MORAN) for
general scene text recognition. The MORAN consists of a multi-object
rectification network and an attention-based sequence recognition network. The
multi-object rectification network is designed for rectifying images that
contain irregular text. It decreases the difficulty of recognition and enables
the attention-based sequence recognition network to more easily read irregular
text. It is trained in a weak supervision way, thus requiring only images and
corresponding text labels. The attention-based sequence recognition network
focuses on target characters and sequentially outputs the predictions. Moreover,
to improve the sensitivity of the attention-based sequence recognition network,
a fractional pickup method is proposed for an attention-based decoder in the
training phase. With the rectification mechanism, the MORAN can read both
regular and irregular scene text. Extensive experiments on various benchmarks
are conducted, which show that the MORAN achieves state-of-the-art performance.
The source code is available.'
MODELS:
Architecture:
- Attention
Learning Method:
- Supervised
Language Modality:
- Implicit Language Model
Network Structure: 'https://user-images.githubusercontent.com/65173622/212230805-3d927214-c184-4d3d-818d-7072aac8f830.png'
FPS:
DEVICE: N/A
ITEM: N/A
FLOPS:
DEVICE: N/A
ITEM: N/A
PARAMS: N/A
Experiment:
Training DataSets:
- MJ
- ST
Test DataSets:
Avg.: 82.4
IIIT5K:
WAICS: 91.2
SVT:
WAICS: 88.3
IC13:
WAICS: 92.4
IC15:
WAICS: 68.8
SVTP:
WAICS: 76.1
CUTE:
WAICS: 77.4
Bibtex: '@article{luo2019moran,
title={Moran: A multi-object rectified attention network for scene text recognition},
author={Luo, Canjie and Jin, Lianwen and Sun, Zenghui},
journal={Pattern Recognition},
volume={90},
pages={109--118},
year={2019},
publisher={Elsevier}
}'
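MORAN's fractional pickup, mentioned in the abstract above, perturbs the decoder's attention during training by fractionally mixing the weights of a randomly chosen position with its neighbour. A minimal sketch under assumed shapes follows; it is not taken from the MORAN_v2 repository.

```python
# Minimal sketch: fractionally mix the attention weight of a random column with
# its neighbour, which smooths attention and makes the decoder less sensitive
# to small localisation errors. The total attention mass is preserved.
import torch


def fractional_pickup(attn):
    """attn: (B, W) attention weights over W feature columns at one decode step."""
    b, w = attn.shape
    beta = torch.rand(b)                         # random mixing ratio in [0, 1)
    k = torch.randint(0, w - 1, (b,))            # random column per sample
    rows = torch.arange(b)
    out = attn.clone()
    a_k, a_k1 = attn[rows, k], attn[rows, k + 1]
    out[rows, k] = (1 - beta) * a_k + beta * a_k1
    out[rows, k + 1] = (1 - beta) * a_k1 + beta * a_k
    return out


if __name__ == "__main__":
    attn = torch.softmax(torch.randn(2, 26), dim=1)
    print(fractional_pickup(attn).sum(dim=1))  # still sums to 1 per sample
```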