diff --git a/paper_zoo/textrecog/A holistic representation guided attention network for scene text recognition.yaml b/paper_zoo/textrecog/A holistic representation guided attention network for scene text recognition.yaml new file mode 100644 index 000000000..9becb5efd --- /dev/null +++ b/paper_zoo/textrecog/A holistic representation guided attention network for scene text recognition.yaml @@ -0,0 +1,74 @@ +Title: 'A holistic representation guided attention network for scene text recognition' +Abbreviation: Yang et al. +Tasks: + - TextRecog +Venue: Neurocomputing +Year: 2020 +Lab/Company: + - School of Computer Science, Northwestern Polytechnical University, Xi'an, China +URL: + Venue: 'https://www.sciencedirect.com/science/article/pii/S0925231220311176' + Arxiv: 'https://arxiv.org/abs/1904.01375' +Paper Reading URL: N/A +Code: N/A +Supported In MMOCR: N/S +PaperType: + - Algorithm +Abstract: 'Reading irregular scene text of arbitrary shape in natural images is +still a challenging problem, despite the progress made recently. Many existing +approaches incorporate sophisticated network structures to handle various shapes, +use extra annotations for stronger supervision, or employ hard-to-train recurrent +neural networks for sequence modeling. In this work, we propose a simple yet +strong approach for scene text recognition. With no need to convert input images +to sequence representations, we directly connect two-dimensional CNN features +to an attention-based sequence decoder, which is guided by a holistic representation. +The holistic representation can guide the attention-based decoder to focus on more +accurate areas. As no recurrent module is adopted, our model can be trained in +parallel. It achieves 1.5x to 9.4x acceleration of the backward pass and 1.3x to +7.9x acceleration of the forward pass, compared with the RNN counterparts. The +proposed model is trained with only word-level annotations. With this simple +design, our method achieves state-of-the-art or competitive recognition +performance on the evaluated regular and irregular scene text benchmark +datasets.'
+MODELS: + Architecture: + - Transformer + Learning Method: + - Supervised + Language Modality: + - Implicit Language Model + Network Structure: 'https://user-images.githubusercontent.com/65173622/212223421-55532fde-a4fb-4fd1-ba8c-a1d057b03058.png' + FPS: + DEVICE: N/A + ITEM: N/A + FLOPS: + DEVICE: N/A + ITEM: N/A + PARAMS: N/A + Experiment: + Training DataSets: + - MJ + - ST + Test DataSets: + Avg.: 87.1 + IIIT5K: + WAICS: 94.7 + SVT: + WAICS: 88.9 + IC13: + WAICS: 93.2 + IC15: + WAICS: 79.5 + SVTP: + WAICS: 80.9 + CUTE: + WAICS: 85.4 +Bibtex: '@article{yang2020holistic, + title={A holistic representation guided attention network for scene text recognition}, + author={Yang, Lu and Wang, Peng and Li, Hui and Li, Zhen and Zhang, Yanning}, + journal={Neurocomputing}, + volume={414}, + pages={67--75}, + year={2020}, + publisher={Elsevier} +}' diff --git a/paper_zoo/textrecog/From Two to One: A New Scene Text Recognizer with Visual Language Modeling Network.yaml b/paper_zoo/textrecog/From Two to One: A New Scene Text Recognizer with Visual Language Modeling Network.yaml new file mode 100644 index 000000000..b9751f2b3 --- /dev/null +++ b/paper_zoo/textrecog/From Two to One: A New Scene Text Recognizer with Visual Language Modeling Network.yaml @@ -0,0 +1,76 @@ +Title: 'From Two to One: A New Scene Text Recognizer with Visual Language Modeling Network' +Abbreviation: VisionLAN +Tasks: + - TextRecog +Venue: ICCV +Year: 2021 +Lab/Company: + - University of Science and Technology of China + - Huawei Cloud & AI +URL: + Venue: 'http://openaccess.thecvf.com/content/ICCV2021/html/Wang_From_Two_to_One_A_New_Scene_Text_Recognizer_With_ICCV_2021_paper.html' + Arxiv: 'https://arxiv.org/abs/2108.09661' +Paper Reading URL: 'https://mp.weixin.qq.com/s/YtYio-k139cKzCnn3R4YwA' +Code: 'https://github.com/wangyuxin87/VisionLAN' +Supported In MMOCR: N/S +PaperType: + - Algorithm + - Dataset +Abstract: 'In this paper, we abandon the dominant complex language model and +rethink the linguistic learning process in scene text recognition. Different +from previous methods considering the visual and linguistic information in two +separate structures, we propose a Visual Language Modeling Network (VisionLAN), +which views the visual and linguistic information as a union by directly enduing +the vision model with language capability. Specifically, we introduce the text +recognition of character-wise occluded feature maps in the training stage. Such +operation guides the vision model to use not only the visual texture of +characters, but also the linguistic information in visual context for +recognition when the visual cues are confused (e.g. occlusion, noise, etc.). +As the linguistic information is acquired along with visual features without +the need of an extra language model, VisionLAN significantly improves the speed +by 39% and adaptively considers the linguistic information to enhance the visual +features for accurate recognition. Furthermore, an Occlusion Scene Text (OST) +dataset is proposed to evaluate the performance on the case of missing +character-wise visual cues. The state-of-the-art results on several benchmarks +prove our effectiveness. Code and dataset are available at +https://github.com/wangyuxin87/VisionLAN.'
+MODELS: + Architecture: + - Attention + Learning Method: + - Supervised + Language Modality: + - Explicit Language Model + Network Structure: 'https://user-images.githubusercontent.com/65173622/212230022-65678cf4-fdd9-4828-92ce-2d4e9a19bfac.png' + FPS: + DEVICE: N/A + ITEM: N/A + FLOPS: + DEVICE: N/A + ITEM: N/A + PARAMS: N/A + Experiment: + Training DataSets: + - MJ + - ST + Test DataSets: + Avg.: 90.2 + IIIT5K: + WAICS: 95.8 + SVT: + WAICS: 91.7 + IC13: + WAICS: 95.7 + IC15: + WAICS: 83.7 + SVTP: + WAICS: 86.0 + CUTE: + WAICS: 88.5 +Bibtex: '@inproceedings{wang2021two, + title={From two to one: A new scene text recognizer with visual language modeling network}, + author={Wang, Yuxin and Xie, Hongtao and Fang, Shancheng and Wang, Jing and Zhu, Shenggao and Zhang, Yongdong}, + booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision}, + pages={14194--14203}, + year={2021} +}' diff --git a/paper_zoo/textrecog/GTC: Guided Training of CTC towards Efficient and Accurate Scene Text Recognition.yaml b/paper_zoo/textrecog/GTC: Guided Training of CTC towards Efficient and Accurate Scene Text Recognition.yaml new file mode 100644 index 000000000..f27b2596a --- /dev/null +++ b/paper_zoo/textrecog/GTC: Guided Training of CTC towards Efficient and Accurate Scene Text Recognition.yaml @@ -0,0 +1,74 @@ +Title: 'GTC: Guided Training of CTC towards Efficient and Accurate Scene Text Recognition' +Abbreviation: GTC +Tasks: + - TextRecog +Venue: AAAI +Year: 2020 +Lab/Company: + - Nanyang Technological University + - SenseTime Group Ltd. +URL: + Venue: 'https://ojs.aaai.org/index.php/AAAI/article/view/6735' + Arxiv: 'https://arxiv.org/abs/2002.01276' +Paper Reading URL: N/A +Code: N/A +Supported In MMOCR: N/S +PaperType: + - Algorithm +Abstract: 'Connectionist Temporal Classification (CTC) and attention mechanism +are two main approaches used in recent scene text recognition works. Compared +with attention-based methods, the CTC decoder has a much shorter inference time, +yet a lower accuracy. To design an efficient and effective model, we propose +the guided training of CTC (GTC), where the CTC model learns a better alignment and +feature representations from a more powerful attentional guidance. With the +benefit of guided training, the CTC model achieves robust and accurate prediction +for both regular and irregular scene text while maintaining a fast inference +speed. Moreover, to further leverage the potential of the CTC decoder, a graph +convolutional network (GCN) is proposed to learn the local correlations of +extracted features. Extensive experiments on standard benchmarks demonstrate +that our end-to-end model achieves a new state-of-the-art for regular and +irregular scene text recognition and needs 6 times shorter inference time than +attention-based methods.'
+MODELS: + Architecture: + - CTC + - Attention + Learning Method: + - Supervised + Language Modality: + - Implicit Language Model + Network Structure: 'https://user-images.githubusercontent.com/65173622/212222112-fc9f3490-003d-409c-874a-551aab414329.png' + FPS: + DEVICE: N/A + ITEM: N/A + FLOPS: + DEVICE: N/A + ITEM: N/A + PARAMS: N/A + Experiment: + Training DataSets: + - MJ + - ST + Test DataSets: + Avg.: 90.6 + IIIT5K: + WAICS: 95.5 + SVT: + WAICS: 92.9 + IC13: + WAICS: 94.3 + IC15: + WAICS: 82.5 + SVTP: + WAICS: 86.2 + CUTE: + WAICS: 92.3 +Bibtex: '@inproceedings{hu2020gtc, + title={Gtc: Guided training of ctc towards efficient and accurate scene text recognition}, + author={Hu, Wenyang and Cai, Xiaocong and Hou, Jun and Yi, Shuai and Lin, Zhiping}, + booktitle={Proceedings of the AAAI Conference on Artificial Intelligence}, + volume={34}, + number={07}, + pages={11005--11012}, + year={2020} +}' diff --git a/paper_zoo/textrecog/Learn to Augment: Joint Data Augmentation and Network Optimization for Text Recognition.yaml b/paper_zoo/textrecog/Learn to Augment: Joint Data Augmentation and Network Optimization for Text Recognition.yaml new file mode 100644 index 000000000..447d78f3e --- /dev/null +++ b/paper_zoo/textrecog/Learn to Augment: Joint Data Augmentation and Network Optimization for Text Recognition.yaml @@ -0,0 +1,76 @@ +Title: 'Learn to Augment: Joint Data Augmentation and Network Optimization for Text Recognition' +Abbreviation: Luo et al. +Tasks: + - TextRecog +Venue: CVPR +Year: 2020 +Lab/Company: + - South China University of Technology + - Alibaba Group +URL: + Venue: 'http://openaccess.thecvf.com/content_CVPR_2020/html/Luo_Learn_to_Augment_Joint_Data_Augmentation_and_Network_Optimization_for_CVPR_2020_paper.html' + Arxiv: 'https://arxiv.org/abs/2003.06606' +Paper Reading URL: N/A +Code: 'https://github.com/Canjie-Luo/Text-Image-Augmentation' +Supported In MMOCR: N/S +PaperType: + - Algorithm +Abstract: 'Handwritten text and scene text suffer from various shapes and +distorted patterns. Thus training a robust recognition model requires a large +amount of data to cover diversity as much as possible. In contrast to data +collection and annotation, data augmentation is a low-cost way. In this paper, +we propose a new method for text image augmentation. Different from traditional +augmentation methods such as rotation, scaling and perspective transformation, +our proposed augmentation method is designed to learn proper and efficient data +augmentation which is more effective and specific for training a robust +recognizer. By using a set of custom fiducial points, the proposed augmentation +method is flexible and controllable. Furthermore, we bridge the gap between the +isolated processes of data augmentation and network optimization by joint +learning. An agent network learns from the output of the recognition network +and controls the fiducial points to generate more proper training samples for +the recognition network. Extensive experiments on various benchmarks, including +regular scene text, irregular scene text and handwritten text, show that the +proposed augmentation and the joint learning methods significantly boost the +performance of the recognition networks. A general toolkit for geometric +augmentation is available.'
+MODELS: + Architecture: + - CTC + - Attention + Learning Method: + - Supervised + Language Modality: + - Implicit Language Model + Network Structure: 'https://user-images.githubusercontent.com/65173622/212230430-9b55473b-5cf8-4923-b977-fa14afe820c1.png' + FPS: + DEVICE: N/A + ITEM: N/A + FLOPS: + DEVICE: N/A + ITEM: N/A + PARAMS: N/A + Experiment: + Training DataSets: + - MJ + - ST + Test DataSets: + Avg.: N/A + IIIT5K: + WAICS: N/A + SVT: + WAICS: N/A + IC13: + WAICS: N/A + IC15: + WAICS: N/A + SVTP: + WAICS: N/A + CUTE: + WAICS: N/A +Bibtex: '@inproceedings{luo2020learn, + title={Learn to augment: Joint data augmentation and network optimization for text recognition}, + author={Luo, Canjie and Zhu, Yuanzhi and Jin, Lianwen and Wang, Yongpan}, + booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition}, + pages={13746--13755}, + year={2020} +}' diff --git a/paper_zoo/textrecog/MORAN: A Multi-Object Rectified Attention Network for Scene Text Recognition.yaml b/paper_zoo/textrecog/MORAN: A Multi-Object Rectified Attention Network for Scene Text Recognition.yaml new file mode 100644 index 000000000..38a35813c --- /dev/null +++ b/paper_zoo/textrecog/MORAN: A Multi-Object Rectified Attention Network for Scene Text Recognition.yaml @@ -0,0 +1,76 @@ +Title: 'MORAN: A Multi-Object Rectified Attention Network for Scene Text Recognition' +Abbreviation: MORAN +Tasks: + - TextRecog +Venue: PR +Year: 2019 +Lab/Company: + - School of Electronic and Information Engineering, South China University of Technology + - SCUT-Zhuhai Institute of Modern Industrial Innovation +URL: + Venue: 'https://www.sciencedirect.com/science/article/pii/S0031320319300263' + Arxiv: 'https://ui.adsabs.harvard.edu/abs/2019PatRe..90..109L/abstract' +Paper Reading URL: N/A +Code: 'https://github.com/Canjie-Luo/MORAN_v2' +Supported In MMOCR: N/S +PaperType: + - Algorithm +Abstract: 'Irregular text is widely used. However, it is considerably difficult +to recognize because of its various shapes and distorted patterns. In this +paper, we thus propose a multi-object rectified attention network (MORAN) for +general scene text recognition. The MORAN consists of a multi-object +rectification network and an attention-based sequence recognition network. The +multi-object rectification network is designed for rectifying images that +contain irregular text. It decreases the difficulty of recognition and enables +the attention-based sequence recognition network to more easily read irregular +text. It is trained in a weakly supervised way, thus requiring only images and +corresponding text labels. The attention-based sequence recognition network +focuses on target characters and sequentially outputs the predictions. Moreover, +to improve the sensitivity of the attention-based sequence recognition network, +a fractional pickup method is proposed for an attention-based decoder in the +training phase. With the rectification mechanism, the MORAN can read both +regular and irregular scene text. Extensive experiments on various benchmarks +are conducted, which show that the MORAN achieves state-of-the-art performance. +The source code is available.'
+MODELS: + Architecture: + - Attention + Learning Method: + - Supervised + Language Modality: + - Implicit Language Model + Network Structure: 'https://user-images.githubusercontent.com/65173622/212230805-3d927214-c184-4d3d-818d-7072aac8f830.png' + FPS: + DEVICE: N/A + ITEM: N/A + FLOPS: + DEVICE: N/A + ITEM: N/A + PARAMS: N/A + Experiment: + Training DataSets: + - MJ + - ST + Test DataSets: + Avg.: 82.4 + IIIT5K: + WAICS: 91.2 + SVT: + WAICS: 88.3 + IC13: + WAICS: 92.4 + IC15: + WAICS: 68.8 + SVTP: + WAICS: 76.1 + CUTE: + WAICS: 77.4 +Bibtex: '@article{luo2019moran, + title={Moran: A multi-object rectified attention network for scene text recognition}, + author={Luo, Canjie and Jin, Lianwen and Sun, Zenghui}, + journal={Pattern Recognition}, + volume={90}, + pages={109--118}, + year={2019}, + publisher={Elsevier} +}' diff --git a/paper_zoo/textrecog/Multi-modal Text Recognition Networks: Interactive Enhancements between Visual and Semantic Features.yaml b/paper_zoo/textrecog/Multi-modal Text Recognition Networks: Interactive Enhancements between Visual and Semantic Features.yaml new file mode 100644 index 000000000..de58900a1 --- /dev/null +++ b/paper_zoo/textrecog/Multi-modal Text Recognition Networks: Interactive Enhancements between Visual and Semantic Features.yaml @@ -0,0 +1,75 @@ +Title: 'Multi-modal Text Recognition Networks: Interactive Enhancements between Visual and Semantic Features' +Abbreviation: MATRN +Tasks: + - TextRecog +Venue: ECCV +Year: 2022 +Lab/Company: + - KAIST + - Clova AI Research +URL: + Venue: 'https://link.springer.com/chapter/10.1007/978-3-031-19815-1_26' + Arxiv: 'https://arxiv.org/abs/2111.15263' +Paper Reading URL: N/A +Code: 'https://github.com/byeonghu-na/MATRN' +Supported In MMOCR: N/S +PaperType: + - Algorithm +Abstract: 'Linguistic knowledge has brought great benefits to scene text +recognition by providing semantics to refine character sequences. However, since +linguistic knowledge has been applied individually on the output sequence, +previous methods have not fully utilized the semantics to understand visual +clues for text recognition. This paper introduces a novel method, called +Multi-modAl Text Recognition Network (MATRN), that enables interactions between +visual and semantic features for better recognition performances. Specifically, +MATRN identifies visual and semantic feature pairs and encodes spatial +information into semantic features. Based on the spatial encoding, visual +and semantic features are enhanced by referring to related features in the +other modality. Furthermore, MATRN stimulates combining semantic features into +visual features by hiding visual clues related to the character in the training +phase. Our experiments demonstrate that MATRN achieves state-of-the-art performances +on seven benchmarks with large margins, while naive combinations of two +modalities show marginal improvements. Further ablative studies prove the +effectiveness of our proposed components. Our implementation will be publicly +available.'
+MODELS: + Architecture: + - Transformer + Learning Method: + - Supervised + Language Modality: + - Explicit Language Model + Network Structure: 'https://user-images.githubusercontent.com/65173622/212087554-54ef9393-611e-4107-b40c-0d09568c0bbb.png' + FPS: + DEVICE: N/A + ITEM: N/A + FLOPS: + DEVICE: N/A + ITEM: N/A + PARAMS: N/A + Experiment: + Training DataSets: + - MJ + - ST + Test DataSets: + Avg.: 92.5 + IIIT5K: + WAICS: 96.7 + SVT: + WAICS: 94.9 + IC13: + WAICS: 95.8 + IC15: + WAICS: 82.9 + SVTP: + WAICS: 90.5 + CUTE: + WAICS: 94.1 +Bibtex: '@inproceedings{na2022multi, + title={Multi-modal text recognition networks: Interactive enhancements between visual and semantic features}, + author={Na, Byeonghu and Kim, Yoonsik and Park, Sungrae}, + booktitle={European Conference on Computer Vision}, + pages={446--463}, + year={2022}, + organization={Springer} +}' diff --git a/paper_zoo/textrecog/PIMNet: A Parallel, Iterative and Mimicking Network for Scene Text Recognition.yaml b/paper_zoo/textrecog/PIMNet: A Parallel, Iterative and Mimicking Network for Scene Text Recognition.yaml new file mode 100644 index 000000000..519c5fc1b --- /dev/null +++ b/paper_zoo/textrecog/PIMNet: A Parallel, Iterative and Mimicking Network for Scene Text Recognition.yaml @@ -0,0 +1,77 @@ +Title: 'PIMNet: A Parallel, Iterative and Mimicking Network for Scene Text Recognition' +Abbreviation: PIMNet +Tasks: + - TextRecog +Venue: ACMMM +Year: 2021 +Lab/Company: + - Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China +URL: + Venue: 'https://dl.acm.org/doi/abs/10.1145/3474085.3475238' + Arxiv: 'https://arxiv.org/abs/2109.04145' +Paper Reading URL: N/A +Code: 'https://dl.acm.org/action/downloadSupplement?doi=10.1145%2F3474085.3475238&file=mfp0430aux.zip' +Supported In MMOCR: N/S +PaperType: + - Algorithm +Abstract: 'Nowadays, scene text recognition has attracted more and more attention +due to its various applications. Most state-of-the-art methods adopt an +encoder-decoder framework with attention mechanism, which generates text +autoregressively from left to right. Despite the convincing performance, the +speed is limited because of the one-by-one decoding strategy. As opposed to +autoregressive models, non-autoregressive models predict the results in parallel +with a much shorter inference time, but the accuracy falls behind the +autoregressive counterpart considerably. In this paper, we propose a Parallel, +Iterative and Mimicking Network (PIMNet) to balance accuracy and efficiency. +Specifically, PIMNet adopts a parallel attention mechanism to predict the text +faster and an iterative generation mechanism to make the predictions more +accurate. In each iteration, the context information is fully explored. To +improve learning of the hidden layer, we exploit the mimicking learning in the +training phase, where an additional autoregressive decoder is adopted and the +parallel decoder mimics the autoregressive decoder with fitting outputs of the +hidden layer. With the shared backbone between the two decoders, the proposed +PIMNet can be trained end-to-end without pre-training. During inference, the +branch of the autoregressive decoder is removed for a faster speed. Extensive +experiments on public benchmarks demonstrate the effectiveness and efficiency +of PIMNet. Our code is available in the supplementary material.' 
+MODELS: + Architecture: + - Transformer + Learning Method: + - Supervised + Language Modality: + - Implicit Language Model + Network Structure: 'https://user-images.githubusercontent.com/65173622/212088808-8aaee96d-1505-4ed5-8326-314f36073488.png' + FPS: + DEVICE: 'NVIDIA M40' + ITEM: 35.2 + FLOPS: + DEVICE: N/A + ITEM: N/A + PARAMS: N/A + Experiment: + Training DataSets: + - MJ + - ST + - Real + Test DataSets: + Avg.: 93.8 + IIIT5K: + WAICS: 96.7 + SVT: + WAICS: 94.7 + IC13: + WAICS: 95.4 + IC15: + WAICS: 85.9 + SVTP: + WAICS: 88.2 + CUTE: + WAICS: 92.7 +Bibtex: '@inproceedings{qiao2021pimnet, + title={PIMNet: a parallel, iterative and mimicking network for scene text recognition}, + author={Qiao, Zhi and Zhou, Yu and Wei, Jin and Wang, Wei and Zhang, Yuan and Jiang, Ning and Wang, Hongbin and Wang, Weiping}, + booktitle={Proceedings of the 29th ACM International Conference on Multimedia}, + pages={2046--2055}, + year={2021} +}' diff --git a/paper_zoo/textrecog/SEED: Semantics Enhanced Encoder-Decoder Framework for Scene Text Recognition.yaml b/paper_zoo/textrecog/SEED: Semantics Enhanced Encoder-Decoder Framework for Scene Text Recognition.yaml new file mode 100644 index 000000000..6a2599136 --- /dev/null +++ b/paper_zoo/textrecog/SEED: Semantics Enhanced Encoder-Decoder Framework for Scene Text Recognition.yaml @@ -0,0 +1,71 @@ +Title: 'SEED: Semantics Enhanced Encoder-Decoder Framework for Scene Text Recognition' +Abbreviation: SEED +Tasks: + - TextRecog +Venue: CVPR +Year: 2020 +Lab/Company: + - Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China + - School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China +URL: + Venue: 'http://openaccess.thecvf.com/content_CVPR_2020/html/Qiao_SEED_Semantics_Enhanced_Encoder-Decoder_Framework_for_Scene_Text_Recognition_CVPR_2020_paper.html' + Arxiv: 'https://arxiv.org/abs/2005.10977' +Paper Reading URL: N/A +Code: 'https://github.com/Pay20Y/SEED' +Supported In MMOCR: N/S +PaperType: + - Algorithm +Abstract: 'Scene text recognition is a hot research topic in computer vision. +Recently, many recognition methods based on the encoder-decoder framework have +been proposed, and they can handle scene texts of perspective distortion and +curve shape. Nevertheless, they still face lots of challenges like image blur, +uneven illumination, and incomplete characters. We argue that most encoder-decoder +methods are based on local visual features without explicit global semantic +information. In this work, we propose a semantics enhanced encoder-decoder +framework to robustly recognize low-quality scene texts. The semantic +information is used both in the encoder module for supervision and in the +decoder module for initializing. In particular, the state-of-the-art ASTER +method is integrated into the proposed framework as an exemplar. Extensive +experiments demonstrate that the proposed framework is more robust for +low-quality text images, and achieves state-of-the-art results on several +benchmark datasets. The source code will be available.'
+MODELS: + Architecture: + - Attention + Learning Method: + - Supervised + Language Modality: + - Explicit Language Model + Network Structure: 'https://user-images.githubusercontent.com/65173622/212231212-43636b78-1fa7-40bf-83f2-f0ea281ca55f.png' + FPS: + DEVICE: N/A + ITEM: N/A + FLOPS: + DEVICE: N/A + ITEM: N/A + PARAMS: N/A + Experiment: + Training DataSets: + - MJ + - ST + Test DataSets: + Avg.: 86.9 + IIIT5K: + WAICS: 93.8 + SVT: + WAICS: 89.6 + IC13: + WAICS: 92.8 + IC15: + WAICS: 80.0 + SVTP: + WAICS: 81.4 + CUTE: + WAICS: 83.6 +Bibtex: '@inproceedings{qiao2020seed, + title={Seed: Semantics enhanced encoder-decoder framework for scene text recognition}, + author={Qiao, Zhi and Zhou, Yu and Yang, Dongbao and Zhou, Yucan and Wang, Weiping}, + booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition}, + pages={13528--13537}, + year={2020} +}' diff --git a/paper_zoo/textrecog/SPIN: Structure-Preserving Inner Offset Network for Scene Text Recognition.yaml b/paper_zoo/textrecog/SPIN: Structure-Preserving Inner Offset Network for Scene Text Recognition.yaml new file mode 100644 index 000000000..cc23a03ad --- /dev/null +++ b/paper_zoo/textrecog/SPIN: Structure-Preserving Inner Offset Network for Scene Text Recognition.yaml @@ -0,0 +1,76 @@ +Title: 'SPIN: Structure-Preserving Inner Offset Network for Scene Text Recognition' +Abbreviation: SPIN +Tasks: + - TextRecog +Venue: AAAI +Year: 2021 +Lab/Company: + - Shanghai Jiaotong University, China + - Hikvision Research Institute, China + - Zhejiang University, China +URL: + Venue: 'https://ojs.aaai.org/index.php/AAAI/article/view/16442' + Arxiv: 'https://arxiv.org/abs/2005.13117' +Paper Reading URL: N/A +Code: 'https://github.com/hikopensource/DAVAR-Lab-OCR' +Supported In MMOCR: N/S +PaperType: + - Algorithm +Abstract: 'Arbitrary text appearance poses a great challenge in scene text +recognition tasks. Existing works mostly handle the problem in +consideration of shape distortion, including perspective distortions, line +curvature or other style variations. Rectification (i.e., spatial transformers) +as the preprocessing stage is one popular approach and has been extensively studied. +However, chromatic difficulties in complex scenes have not received much +attention. In this work, we introduce a new learnable geometric-unrelated +rectification, Structure-Preserving Inner Offset Network (SPIN), which allows +the color manipulation of source data within the network. This differentiable +module can be inserted before any recognition architecture to ease the +downstream tasks, giving neural networks the ability to actively transform +input intensity rather than only the spatial rectification. It can also serve + as a complementary module to known spatial transformations and work in both + independent and collaborative ways with them. Extensive experiments show the + proposed transformation outperforms existing rectification networks and has + comparable performance among the state-of-the-arts.'
+MODELS: + Architecture: + - Attention + Learning Method: + - Supervised + Language Modality: + - Implicit Language Model + Network Structure: 'https://user-images.githubusercontent.com/65173622/211321597-997c4f09-fceb-4fe6-89d6-774971d942ed.png' + FPS: + DEVICE: N/A + ITEM: N/A + FLOPS: + DEVICE: N/A + ITEM: 11e7 + PARAMS: 2.31e6 + Experiment: + Training DataSets: + - MJ + - ST + Test DataSets: + Avg.: 88.7 + IIIT5K: + WAICS: 95.2 + SVT: + WAICS: 90.9 + IC13: + WAICS: 94.8 + IC15: + WAICS: 79.5 + SVTP: + WAICS: 83.2 + CUTE: + WAICS: 87.5 +Bibtex: '@inproceedings{zhang2021spin, + title={SPIN: Structure-preserving inner offset network for scene text recognition}, + author={Zhang, Chengwei and Xu, Yunlu and Cheng, Zhanzhan and Pu, Shiliang and Niu, Yi and Wu, Fei and Zou, Futai}, + booktitle={Proceedings of the AAAI Conference on Artificial Intelligence}, + volume={35}, + number={4}, + pages={3305--3314}, + year={2021} +}' diff --git a/paper_zoo/textrecog/What If We Only Use Real Datasets for Scene Text Recognition? Toward Scene Text Recognition With Fewer Labels.yaml b/paper_zoo/textrecog/What If We Only Use Real Datasets for Scene Text Recognition? Toward Scene Text Recognition With Fewer Labels.yaml new file mode 100644 index 000000000..593e66037 --- /dev/null +++ b/paper_zoo/textrecog/What If We Only Use Real Datasets for Scene Text Recognition? Toward Scene Text Recognition With Fewer Labels.yaml @@ -0,0 +1,78 @@ +Title: 'What If We Only Use Real Datasets for Scene Text Recognition? Toward Scene Text Recognition With Fewer Labels' +Abbreviation: Baek et al. +Tasks: + - TextRecog +Venue: CVPR +Year: 2021 +Lab/Company: + - The University of Tokyo +URL: + Venue: 'http://openaccess.thecvf.com/content/CVPR2021/html/Baek_What_if_We_Only_Use_Real_Datasets_for_Scene_Text_CVPR_2021_paper.html' + Arxiv: 'https://arxiv.org/abs/2103.04400' +Paper Reading URL: N/A +Code: 'https://github.com/ku21fan/STR-Fewer-Labels' +Supported In MMOCR: N/S +PaperType: + - Algorithm + - Dataset +Abstract: 'The scene text recognition (STR) task has a common practice: All +state-of-the-art STR models are trained on large synthetic data. In contrast +to this practice, training STR models only on fewer real labels (STR with fewer +labels) is important when we have to train STR models without synthetic data: +for handwritten or artistic texts that are difficult to generate synthetically +and for languages other than English for which we do not always have synthetic +data. However, there has been implicit common knowledge that training STR +models on real data is nearly impossible because real data is insufficient. +We consider that this common knowledge has obstructed the study of STR with +fewer labels. In this work, we would like to reactivate STR with fewer labels +by disproving the common knowledge. We consolidate recently accumulated public +real data and show that we can train STR models satisfactorily only with real +labeled data. Subsequently, we find simple data augmentation to fully exploit +real data. Furthermore, we improve the models by collecting unlabeled data and +introducing semi- and self-supervised methods. As a result, we obtain a +model competitive with state-of-the-art methods. To the best of our knowledge, +this is the first study that 1) shows sufficient performance by only using +real labels and 2) introduces semi- and self-supervised methods into STR with +fewer labels. Our code and data are available: +https://github.com/ku21fan/STR-Fewer-Labels.'
+MODELS: + Architecture: + - CTC + - Attention + Learning Method: + - Supervised + - Self-Supervised + Language Modality: + - Implicit Language Model + Network Structure: 'https://user-images.githubusercontent.com/65173622/212229475-09e37af2-b48d-4977-aafd-9efb63570dff.png' + FPS: + DEVICE: N/A + ITEM: N/A + FLOPS: + DEVICE: N/A + ITEM: N/A + PARAMS: N/A + Experiment: + Training DataSets: + - Real + Test DataSets: + Avg.: 89.3 + IIIT5K: + WAICS: 94.8 + SVT: + WAICS: 91.3 + IC13: + WAICS: 94.0 + IC15: + WAICS: 80.6 + SVTP: + WAICS: 82.7 + CUTE: + WAICS: 88.1 +Bibtex: '@inproceedings{baek2021if, + title={What if we only use real datasets for scene text recognition? toward scene text recognition with fewer labels}, + author={Baek, Jeonghun and Matsui, Yusuke and Aizawa, Kiyoharu}, + booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition}, + pages={3113--3122}, + year={2021} +}'
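
Note (not part of the patch itself): every card added above follows the same schema (Title, Abbreviation, Venue, MODELS, Experiment, Test DataSets with a WAICS score per benchmark and an "Avg." field). The following is a minimal sketch of how such cards might be consumed, assuming PyYAML is installed, the script is run from the repository root, and the files parse with the nesting shown in this patch; the helper name `load_entries` and the exact output format are illustrative, not part of the repository.

```python
# Minimal sketch: load the paper_zoo/textrecog YAML cards added in this patch
# and print a one-line summary per paper. Assumes PyYAML is installed and the
# script runs from the repository root; `load_entries` is a hypothetical helper.
import glob
import yaml


def load_entries(pattern="paper_zoo/textrecog/*.yaml"):
    """Parse every paper card matching `pattern` into a list of dicts."""
    entries = []
    for path in sorted(glob.glob(pattern)):
        with open(path, "r", encoding="utf-8") as f:
            entries.append(yaml.safe_load(f))
    return entries


if __name__ == "__main__":
    for entry in load_entries():
        models = entry.get("MODELS", {}) or {}
        arch = ", ".join(models.get("Architecture", []) or [])
        test = (models.get("Experiment", {}) or {}).get("Test DataSets", {}) or {}
        # "Avg." is the averaged word accuracy over the six benchmarks listed in each card.
        print(f"{entry.get('Abbreviation', '?'):<14} | {arch:<18} | Avg.: {test.get('Avg.', 'N/A')}")
```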