[Paper List-4] Add 10 recog papers #1676

Open · wants to merge 5 commits into base: dev-1.x
@@ -0,0 +1,74 @@
Title: 'A holistic representation guided attention network for scene text recognition'
Abbreviation: Yang et al.
Tasks:
- TextRecog
Venue: Neurocomputing
Year: 2020
Lab/Company:
- School of Computer Science, Northwestern Polytechnical University, Xi'an, China
URL:
Venue: 'https://www.sciencedirect.com/science/article/pii/S0925231220311176'
Arxiv: 'https://arxiv.org/abs/1904.01375'
Paper Reading URL: N/A
Code: N/A
Supported In MMOCR: N/S
PaperType:
- Algorithm
Abstract: 'Reading irregular scene text of arbitrary shape in natural images is
still a challenging problem, despite the progress made recently. Many existing
approaches incorporate sophisticated network structures to handle various shapes,
use extra annotations for stronger supervision, or employ hard-to-train recurrent
neural networks for sequence modeling. In this work, we propose a simple yet
strong approach for scene text recognition. With no need to convert input images
to sequence representations, we directly connect two-dimensional CNN features
to an attention-based sequence decoder which is guided by a holistic representation.
The holistic representation can guide the attention-based decoder to focus on more
accurate areas. As no recurrent module is adopted, our model can be trained in
parallel. It achieves 1.5× to 9.4× acceleration to backward pass and 1.3× to
7.9× acceleration to forward pass, compared with the RNN counterparts. The
proposed model is trained with only word-level annotations. With this simple
design, our method achieves state-of-the-art or competitive recognition
performance on the evaluated regular and irregular scene text benchmark
datasets.'
MODELS:
Architecture:
- Transformer
Learning Method:
- Supervised
Language Modality:
- Implicit Language Model
Network Structure: 'https://user-images.githubusercontent.com/65173622/212223421-55532fde-a4fb-4fd1-ba8c-a1d057b03058.png'
FPS:
DEVICE: N/A
ITEM: N/A
FLOPS:
DEVICE: N/A
ITEM: N/A
PARAMS: N/A
Experiment:
Training DataSets:
- MJ
- ST
Test DataSets:
Avg.: 87.1
IIIT5K:
WAICS: 94.7
SVT:
WAICS: 88.9
IC13:
WAICS: 93.2
IC15:
WAICS: 79.5
SVTP:
WAICS: 80.9
CUTE:
WAICS: 85.4
Bibtex: '@article{yang2020holistic,
title={A holistic representation guided attention network for scene text recognition},
author={Yang, Lu and Wang, Peng and Li, Hui and Li, Zhen and Zhang, Yanning},
journal={Neurocomputing},
volume={414},
pages={67--75},
year={2020},
publisher={Elsevier}
}'
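The entry above describes connecting 2-D CNN features directly to a parallel attention decoder that is guided by a holistic image representation, with no recurrent module. Below is a minimal PyTorch sketch of that idea, not the authors' implementation: the module name, feature sizes, and the use of global average pooling as the holistic vector are assumptions made for illustration.

```python
# Minimal sketch: positional queries over a 2-D feature map, each biased by a
# holistic (globally pooled) vector, so all decoding steps run in parallel.
import torch
import torch.nn as nn


class HolisticGuidedDecoder(nn.Module):
    def __init__(self, feat_dim=512, max_len=25, num_classes=97):
        super().__init__()
        # One learned query embedding per output position (no recurrence).
        self.pos_emb = nn.Embedding(max_len, feat_dim)
        # Projects the holistic vector so it can bias every positional query.
        self.holistic_proj = nn.Linear(feat_dim, feat_dim)
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, feat_2d):
        # feat_2d: (B, C, H, W) CNN feature map, kept two-dimensional.
        b, c, h, w = feat_2d.shape
        keys = feat_2d.flatten(2).transpose(1, 2)                   # (B, H*W, C)
        holistic = feat_2d.mean(dim=(2, 3))                         # (B, C)
        pos = self.pos_emb.weight.unsqueeze(0)                      # (1, T, C)
        # Each positional query is guided by the holistic representation.
        queries = pos + self.holistic_proj(holistic).unsqueeze(1)   # (B, T, C)
        attn = torch.softmax(queries @ keys.transpose(1, 2) / c ** 0.5, dim=-1)
        glimpse = attn @ keys                                        # (B, T, C)
        return self.classifier(glimpse)                              # (B, T, num_classes)


if __name__ == "__main__":
    logits = HolisticGuidedDecoder()(torch.randn(2, 512, 8, 25))
    print(logits.shape)  # torch.Size([2, 25, 97])
```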
@@ -0,0 +1,76 @@
Title: 'From Two to One: A New Scene Text Recognizer with Visual Language Modeling Network'
Abbreviation: VisionLAN
Tasks:
- TextRecog
Venue: ICCV
Year: 2021
Lab/Company:
- University of Science and Technology of China
- Huawei Cloud & AI
URL:
Venue: 'http://openaccess.thecvf.com/content/ICCV2021/html/Wang_From_Two_to_One_A_New_Scene_Text_Recognizer_With_ICCV_2021_paper.html'
Arxiv: 'https://arxiv.org/abs/2108.09661'
Paper Reading URL: 'https://mp.weixin.qq.com/s/YtYio-k139cKzCnn3R4YwA'
Code: 'https://github.com/wangyuxin87/VisionLAN'
Supported In MMOCR: N/S
PaperType:
- Algorithm
- Dataset
Abstract: 'In this paper, we abandon the dominant complex language model and
rethink the linguistic learning process in the scene text recognition. Different
from previous methods considering the visual and linguistic information in two
separate structures, we propose a Visual Language Modeling Network (VisionLAN),
which views the visual and linguistic information as a union by directly enduing
the vision model with language capability. Specially, we introduce the text
recognition of character-wise occluded feature maps in the training stage. Such
operation guides the vision model to use not only the visual texture of
characters, but also the linguistic information in visual context for
recognition when the visual cues are confused (e.g. occlusion, noise, etc.).
As the linguistic information is acquired along with visual features without
the need of extra language model, VisionLAN significantly improves the speed
by 39% and adaptively considers the linguistic information to enhance the visual
features for accurate recognition. Furthermore, an Occlusion Scene Text (OST)
dataset is proposed to evaluate the performance on the case of missing
character-wise visual cues. The state-of-the-art results on several benchmarks
prove our effectiveness. Code and dataset are available at
https://github.com/wangyuxin87/VisionLAN.'
MODELS:
Architecture:
- Attention
Learning Method:
- Supervised
Language Modality:
- Explicit Language Model
Network Structure: 'https://user-images.githubusercontent.com/65173622/212230022-65678cf4-fdd9-4828-92ce-2d4e9a19bfac.png'
FPS:
DEVICE: N/A
ITEM: N/A
FLOPS:
DEVICE: N/A
ITEM: N/A
PARAMS: N/A
Experiment:
Training DataSets:
- MJ
- ST
Test DataSets:
Avg.: 90.2
IIIT5K:
WAICS: 95.8
SVT:
WAICS: 91.7
IC13:
WAICS: 95.7
IC15:
WAICS: 83.7
SVTP:
WAICS: 86.0
CUTE:
WAICS: 88.5
Bibtex: '@inproceedings{wang2021two,
title={From two to one: A new scene text recognizer with visual language modeling network},
author={Wang, Yuxin and Xie, Hongtao and Fang, Shancheng and Wang, Jing and Zhu, Shenggao and Zhang, Yongdong},
booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
pages={14194--14203},
year={2021}
}'
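The VisionLAN abstract above hinges on occluding the visual features of one character at training time so the vision model must fall back on linguistic context. The following is a minimal sketch under assumed shapes, not the released VisionLAN code; the per-character attention maps and the soft masking rule are illustrative assumptions.

```python
# Minimal sketch: suppress the features of one randomly chosen character so the
# recognizer has to use linguistic context to still predict the full word.
import torch


def occlude_character(feat_2d, char_attn, char_idx):
    """feat_2d:   (B, C, H, W) visual features
    char_attn:    (B, T, H, W) per-character attention maps (assumed given)
    char_idx:     (B,) index of the character to suppress for each sample
    """
    b = feat_2d.size(0)
    # Attention map of the chosen character for every sample.
    mask = char_attn[torch.arange(b), char_idx]                 # (B, H, W)
    # Soft occlusion mask: 1 keeps features, 0 removes them.
    keep = 1.0 - mask / (mask.amax(dim=(1, 2), keepdim=True) + 1e-6)
    return feat_2d * keep.unsqueeze(1)                          # (B, C, H, W)


if __name__ == "__main__":
    feats = torch.randn(2, 256, 8, 25)
    attn = torch.rand(2, 10, 8, 25)
    idx = torch.randint(0, 10, (2,))
    print(occlude_character(feats, attn, idx).shape)  # torch.Size([2, 256, 8, 25])
```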
@@ -0,0 +1,74 @@
Title: 'GTC: Guided Training of CTC towards Efficient and Accurate Scene Text Recognition'
Abbreviation: GTC
Tasks:
- TextRecog
Venue: AAAI
Year: 2020
Lab/Company:
- Nanyang Technological University
- SenseTime Group Ltd.
URL:
Venue: 'https://ojs.aaai.org/index.php/AAAI/article/view/6735'
Arxiv: 'https://arxiv.org/abs/2002.01276'
Paper Reading URL: N/A
Code: N/A
Supported In MMOCR: N/S
PaperType:
- Algorithm
Abstract: 'Connectionist Temporal Classification (CTC) and attention mechanism
are two main approaches used in recent scene text recognition works. Compared
with attention-based methods, CTC decoder has a much shorter inference time,
yet a lower accuracy. To design an efficient and effective model, we propose
the guided training of CTC (GTC), where CTC model learns a better alignment and
feature representations from a more powerful attentional guidance. With the
benefit of guided training, CTC model achieves robust and accurate prediction
for both regular and irregular scene text while maintaining a fast inference
speed. Moreover, to further leverage the potential of CTC decoder, a graph
convolutional network (GCN) is proposed to learn the local correlations of
extracted features. Extensive experiments on standard benchmarks demonstrate
that our end-to-end model achieves a new state-of-the-art for regular and
irregular scene text recognition and needs 6 times shorter inference time than
attention-based methods.'
MODELS:
Architecture:
- CTC
- Attention
Learning Method:
- Supervised
Language Modality:
- Implicit Language Model
Network Structure: 'https://user-images.githubusercontent.com/65173622/212222112-fc9f3490-003d-409c-874a-551aab414329.png'
FPS:
DEVICE: N/A
ITEM: N/A
FLOPS:
DEVICE: N/A
ITEM: N/A
PARAMS: N/A
Experiment:
Training DataSets:
- MJ
- ST
Test DataSets:
Avg.: 90.6
IIIT5K:
WAICS: 95.5
SVT:
WAICS: 92.9
IC13:
WAICS: 94.3
IC15:
WAICS: 82.5
SVTP:
WAICS: 86.2
CUTE:
WAICS: 92.3
Bibtex: '@inproceedings{hu2020gtc,
title={Gtc: Guided training of ctc towards efficient and accurate scene text recognition},
author={Hu, Wenyang and Cai, Xiaocong and Hou, Jun and Yi, Shuai and Lin, Zhiping},
booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
volume={34},
number={07},
pages={11005--11012},
year={2020}
}'
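GTC, as summarized above, trains a fast CTC head under the guidance of a stronger attention head that shares the same backbone, keeping only the CTC head at inference. A minimal sketch of such a joint training loss is given below; the head shapes, loss weights, and padding conventions are assumptions, and the paper's GCN over CTC features is omitted.

```python
# Minimal sketch: CTC head + attention head share one backbone; both losses
# update the shared features during training, only the CTC head runs at test time.
import torch
import torch.nn as nn


def gtc_loss(ctc_log_probs, attn_logits, targets, input_lens, target_lens,
             ctc_weight=1.0, attn_weight=1.0):
    """ctc_log_probs: (T, B, C) log-probabilities from the CTC head
    attn_logits:      (B, L, C) logits from the attention head (the guide)
    targets:          (B, L) padded label indices (0 = blank / padding, assumed)
    """
    ctc = nn.CTCLoss(blank=0, zero_infinity=True)(
        ctc_log_probs, targets, input_lens, target_lens)
    attn = nn.functional.cross_entropy(
        attn_logits.flatten(0, 1), targets.flatten(), ignore_index=0)
    # The attention branch guides the shared backbone; both terms backpropagate.
    return ctc_weight * ctc + attn_weight * attn


if __name__ == "__main__":
    T, B, C, L = 32, 2, 37, 10
    loss = gtc_loss(torch.randn(T, B, C).log_softmax(-1),
                    torch.randn(B, L, C),
                    torch.randint(1, C, (B, L)),
                    torch.full((B,), T, dtype=torch.long),
                    torch.full((B,), L, dtype=torch.long))
    print(loss.item())
```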
@@ -0,0 +1,76 @@
Title: 'Learn to Augment: Joint Data Augmentation and Network Optimization for Text Recognition'
Abbreviation: Luo et al.
Tasks:
- TextRecog
Venue: CVPR
Year: 2020
Lab/Company:
- South China University of Technology
- Alibaba Group
URL:
Venue: 'http://openaccess.thecvf.com/content_CVPR_2020/html/Luo_Learn_to_Augment_Joint_Data_Augmentation_and_Network_Optimization_for_CVPR_2020_paper.html'
Arxiv: 'https://arxiv.org/abs/2003.06606'
Paper Reading URL: N/A
Code: 'https://github.com/Canjie-Luo/Text-Image-Augmentation'
Supported In MMOCR: N/S
PaperType:
- Algorithm
Abstract: 'Handwritten text and scene text suffer from various shapes and
distorted patterns. Thus training a robust recognition model requires a large
amount of data to cover diversity as much as possible. In contrast to data
collection and annotation, data augmentation is a low cost way. In this paper,
we propose a new method for text image augmentation. Different from traditional
augmentation methods such as rotation, scaling and perspective transformation,
our proposed augmentation method is designed to learn proper and efficient data
augmentation which is more effective and specific for training a robust
recognizer. By using a set of custom fiducial points, the proposed augmentation
method is flexible and controllable. Furthermore, we bridge the gap between the
isolated processes of data augmentation and network optimization by joint
learning. An agent network learns from the output of the recognition network
and controls the fiducial points to generate more proper training samples for
the recognition network. Extensive experiments on various benchmarks, including
regular scene text, irregular scene text and handwritten text, show that the
proposed augmentation and the joint learning methods significantly boost the
performance of the recognition networks. A general toolkit for geometric
augmentation is available.'
MODELS:
Architecture:
- CTC
- Attention
Learning Method:
- Supervised
Language Modality:
- Implicit Language Model
Network Structure: 'https://user-images.githubusercontent.com/65173622/212230430-9b55473b-5cf8-4923-b977-fa14afe820c1.png'
FPS:
DEVICE: N/A
ITEM: N/A
FLOPS:
DEVICE: N/A
ITEM: N/A
PARAMS: N/A
Experiment:
Training DataSets:
- MJ
- ST
Test DataSets:
Avg.: N/A
IIIT5K:
WAICS: N/A
SVT:
WAICS: N/A
IC13:
WAICS: N/A
IC15:
WAICS: N/A
SVTP:
WAICS: N/A
CUTE:
WAICS: N/A
Bibtex: '@inproceedings{luo2020learn,
title={Learn to augment: Joint data augmentation and network optimization for text recognition},
author={Luo, Canjie and Zhu, Yuanzhi and Jin, Lianwen and Wang, Yongpan},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
pages={13746--13755},
year={2020}
}'
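The abstract above describes a learnable augmentation in which an agent moves a set of fiducial points to warp text images jointly with recognizer training. The sketch below is a simplification under stated assumptions: it replaces the paper's moving-least-squares transform with a dense flow obtained by upsampling coarse control-point offsets, and the agent architecture and 3-channel input are invented for illustration.

```python
# Minimal sketch: a tiny agent predicts offsets for a coarse grid of fiducial
# points; the offsets are upsampled to a dense flow and used to warp the image,
# so the augmentation itself is differentiable and trainable with the recognizer.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LearnableWarp(nn.Module):
    def __init__(self, points=(3, 6), max_shift=0.1):
        super().__init__()
        self.points, self.max_shift = points, max_shift
        # Tiny "agent": predicts (dx, dy) for every fiducial point (3-ch input assumed).
        self.agent = nn.Sequential(
            nn.AdaptiveAvgPool2d(points), nn.Flatten(),
            nn.Linear(3 * points[0] * points[1], 2 * points[0] * points[1]),
            nn.Tanh())

    def forward(self, img):
        b, _, h, w = img.shape
        offsets = self.agent(img).view(b, 2, *self.points) * self.max_shift
        # Upsample coarse fiducial-point offsets to a dense flow field.
        flow = F.interpolate(offsets, size=(h, w), mode='bilinear',
                             align_corners=True).permute(0, 2, 3, 1)
        # Identity sampling grid in [-1, 1], shifted by the predicted flow.
        ys, xs = torch.meshgrid(torch.linspace(-1, 1, h),
                                torch.linspace(-1, 1, w), indexing='ij')
        grid = torch.stack((xs, ys), dim=-1).expand(b, -1, -1, -1)
        return F.grid_sample(img, grid + flow, align_corners=True)


if __name__ == "__main__":
    out = LearnableWarp()(torch.rand(2, 3, 32, 100))
    print(out.shape)  # torch.Size([2, 3, 32, 100])
```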
@@ -0,0 +1,76 @@
Title: 'MORAN: A Multi-Object Rectified Attention Network for Scene Text Recognition'
Abbreviation: MORAN
Tasks:
- TextRecog
Venue: PR
Year: 2019
Lab/Company:
- School of Electronic and Information Engineering, South China University of Technology
- SCUT-Zhuhai Institute of Modern Industrial Innovation
URL:
Venue: 'https://www.sciencedirect.com/science/article/pii/S0031320319300263'
Arxiv: 'https://ui.adsabs.harvard.edu/abs/2019PatRe..90..109L/abstract'
Paper Reading URL: N/A
Code: 'https://github.com/Canjie-Luo/MORAN_v2'
Supported In MMOCR: N/S
PaperType:
- Algorithm
Abstract: 'Irregular text is widely used. However, it is considerably difficult
to recognize because of its various shapes and distorted patterns. In this
paper, we thus propose a multi-object rectified attention network (MORAN) for
general scene text recognition. The MORAN consists of a multi-object
rectification network and an attention-based sequence recognition network. The
multi-object rectification network is designed for rectifying images that
contain irregular text. It decreases the difficulty of recognition and enables
the attention-based sequence recognition network to more easily read irregular
text. It is trained in a weak supervision way, thus requiring only images and
corresponding text labels. The attention-based sequence recognition network
focuses on target characters and sequentially outputs the predictions. Moreover,
to improve the sensitivity of the attention-based sequence recognition network,
a fractional pickup method is proposed for an attention-based decoder in the
training phase. With the rectification mechanism, the MORAN can read both
regular and irregular scene text. Extensive experiments on various benchmarks
are conducted, which show that the MORAN achieves state-of-the-art performance.
The source code is available.'
MODELS:
Architecture:
- Attention
Learning Method:
- Supervised
Language Modality:
- Implicit Language Model
Network Structure: 'https://user-images.githubusercontent.com/65173622/212230805-3d927214-c184-4d3d-818d-7072aac8f830.png'
FPS:
DEVICE: N/A
ITEM: N/A
FLOPS:
DEVICE: N/A
ITEM: N/A
PARAMS: N/A
Experiment:
Training DataSets:
- MJ
- ST
Test DataSets:
Avg.: 82.4
IIIT5K:
WAICS: 91.2
SVT:
WAICS: 88.3
IC13:
WAICS: 92.4
IC15:
WAICS: 68.8
SVTP:
WAICS: 76.1
CUTE:
WAICS: 77.4
Bibtex: '@article{luo2019moran,
title={Moran: A multi-object rectified attention network for scene text recognition},
author={Luo, Canjie and Jin, Lianwen and Sun, Zenghui},
journal={Pattern Recognition},
volume={90},
pages={109--118},
year={2019},
publisher={Elsevier}
}'
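MORAN's fractional pickup, mentioned in the abstract above, perturbs the decoder's attention during training by fractionally mixing the weights of a randomly chosen position with its neighbour. A minimal sketch under assumed shapes follows; it is not taken from the MORAN_v2 repository.

```python
# Minimal sketch: fractionally mix the attention weight of a random column with
# its neighbour, which smooths attention and makes the decoder less sensitive
# to small localisation errors. The total attention mass is preserved.
import torch


def fractional_pickup(attn):
    """attn: (B, W) attention weights over W feature columns at one decode step."""
    b, w = attn.shape
    beta = torch.rand(b)                         # random mixing ratio in [0, 1)
    k = torch.randint(0, w - 1, (b,))            # random column per sample
    rows = torch.arange(b)
    out = attn.clone()
    a_k, a_k1 = attn[rows, k], attn[rows, k + 1]
    out[rows, k] = (1 - beta) * a_k + beta * a_k1
    out[rows, k + 1] = (1 - beta) * a_k1 + beta * a_k
    return out


if __name__ == "__main__":
    attn = torch.softmax(torch.randn(2, 26), dim=1)
    print(fractional_pickup(attn).sum(dim=1))  # still sums to 1 per sample
```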