diff --git a/paper_zoo/textrecog/A holistic representation guided attention network for scene text recognition.yaml b/paper_zoo/textrecog/A holistic representation guided attention network for scene text recognition.yaml new file mode 100644 index 000000000..9becb5efd --- /dev/null +++ b/paper_zoo/textrecog/A holistic representation guided attention network for scene text recognition.yaml @@ -0,0 +1,74 @@ +Title: 'A holistic representation guided attention network for scene text recognition' +Abbreviation: Yang et al. +Tasks: + - TextRecog +Venue: Neurocomputing +Year: 2020 +Lab/Company: + - School of Computer Science, Northwestern Polytechnical University, Xi'an, China +URL: + Venue: 'https://www.sciencedirect.com/science/article/pii/S0925231220311176' + Arxiv: 'https://arxiv.org/abs/1904.01375' +Paper Reading URL: N/A +Code: N/A +Supported In MMOCR: N/S +PaperType: + - Algorithm +Abstract: 'Reading irregular scene text of arbitrary shape in natural images is +still a challenging problem, despite the progress made recently. Many existing +approaches incorporate sophisticated network structures to handle various shapes, +use extra annotations for stronger supervision, or employ hard-to-train recurrent +neural networks for sequence modeling. In this work, we propose a simple yet +strong approach for scene text recognition. With no need to convert input images +to sequence representations, we directly connect two-dimensional CNN features +to an attention-based sequence decoder, which is guided by a holistic representation. +The holistic representation can guide the attention-based decoder to focus on more +accurate areas. As no recurrent module is adopted, our model can be trained in +parallel. It achieves 1.5x to 9.4x acceleration of the backward pass and 1.3x to +7.9x acceleration of the forward pass, compared with the RNN counterparts. The +proposed model is trained with only word-level annotations. With this simple +design, our method achieves state-of-the-art or competitive recognition +performance on the evaluated regular and irregular scene text benchmark +datasets.'
+MODELS: + Architecture: + - Transformer + Learning Method: + - Supervised + Language Modality: + - Implicit Language Model + Network Structure: 'https://user-images.githubusercontent.com/65173622/212223421-55532fde-a4fb-4fd1-ba8c-a1d057b03058.png' + FPS: + DEVICE: N/A + ITEM: N/A + FLOPS: + DEVICE: N/A + ITEM: N/A + PARAMS: N/A + Experiment: + Training DataSets: + - MJ + - ST + Test DataSets: + Avg.: 87.1 + IIIT5K: + WAICS: 94.7 + SVT: + WAICS: 88.9 + IC13: + WAICS: 93.2 + IC15: + WAICS: 79.5 + SVTP: + WAICS: 80.9 + CUTE: + WAICS: 85.4 +Bibtex: '@article{yang2020holistic, + title={A holistic representation guided attention network for scene text recognition}, + author={Yang, Lu and Wang, Peng and Li, Hui and Li, Zhen and Zhang, Yanning}, + journal={Neurocomputing}, + volume={414}, + pages={67--75}, + year={2020}, + publisher={Elsevier} +}' diff --git a/paper_zoo/textrecog/From Two to One: A New Scene Text Recognizer with Visual Language Modeling Network.yaml b/paper_zoo/textrecog/From Two to One: A New Scene Text Recognizer with Visual Language Modeling Network.yaml new file mode 100644 index 000000000..b9751f2b3 --- /dev/null +++ b/paper_zoo/textrecog/From Two to One: A New Scene Text Recognizer with Visual Language Modeling Network.yaml @@ -0,0 +1,76 @@ +Title: 'From Two to One: A New Scene Text Recognizer with Visual Language Modeling Network' +Abbreviation: VisionLAN +Tasks: + - TextRecog +Venue: ICCV +Year: 2021 +Lab/Company: + - University of Science and Technology of China + - Huawei Cloud & AI +URL: + Venue: 'http://openaccess.thecvf.com/content/ICCV2021/html/Wang_From_Two_to_One_A_New_Scene_Text_Recognizer_With_ICCV_2021_paper.html' + Arxiv: 'https://arxiv.org/abs/2108.09661' +Paper Reading URL: 'https://mp.weixin.qq.com/s/YtYio-k139cKzCnn3R4YwA' +Code: 'https://github.com/wangyuxin87/VisionLAN' +Supported In MMOCR: N/S +PaperType: + - Algorithm + - Dataset +Abstract: 'In this paper, we abandon the dominant complex language model and +rethink the linguistic learning process in scene text recognition. Different +from previous methods considering the visual and linguistic information in two +separate structures, we propose a Visual Language Modeling Network (VisionLAN), +which views the visual and linguistic information as a union by directly enduing +the vision model with language capability. Specifically, we introduce the text +recognition of character-wise occluded feature maps in the training stage. Such +operation guides the vision model to use not only the visual texture of +characters, but also the linguistic information in visual context for +recognition when the visual cues are confused (e.g. occlusion, noise, etc.). +As the linguistic information is acquired along with visual features without +the need of an extra language model, VisionLAN significantly improves the speed +by 39% and adaptively considers the linguistic information to enhance the visual +features for accurate recognition. Furthermore, an Occlusion Scene Text (OST) +dataset is proposed to evaluate the performance on the case of missing +character-wise visual cues. The state-of-the-art results on several benchmarks +prove our effectiveness. Code and dataset are available at +https://github.com/wangyuxin87/VisionLAN.'
+MODELS: + Architecture: + - Attention + Learning Method: + - Supervised + Language Modality: + - Explicit Language Model + Network Structure: 'https://user-images.githubusercontent.com/65173622/212230022-65678cf4-fdd9-4828-92ce-2d4e9a19bfac.png' + FPS: + DEVICE: N/A + ITEM: N/A + FLOPS: + DEVICE: N/A + ITEM: N/A + PARAMS: N/A + Experiment: + Training DataSets: + - MJ + - ST + Test DataSets: + Avg.: 90.2 + IIIT5K: + WAICS: 95.8 + SVT: + WAICS: 91.7 + IC13: + WAICS: 95.7 + IC15: + WAICS: 83.7 + SVTP: + WAICS: 86.0 + CUTE: + WAICS: 88.5 +Bibtex: '@inproceedings{wang2021two, + title={From two to one: A new scene text recognizer with visual language modeling network}, + author={Wang, Yuxin and Xie, Hongtao and Fang, Shancheng and Wang, Jing and Zhu, Shenggao and Zhang, Yongdong}, + booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision}, + pages={14194--14203}, + year={2021} +}' diff --git a/paper_zoo/textrecog/GTC: Guided Training of CTC towards Efficient and Accurate Scene Text Recognition.yaml b/paper_zoo/textrecog/GTC: Guided Training of CTC towards Efficient and Accurate Scene Text Recognition.yaml new file mode 100644 index 000000000..f27b2596a --- /dev/null +++ b/paper_zoo/textrecog/GTC: Guided Training of CTC towards Efficient and Accurate Scene Text Recognition.yaml @@ -0,0 +1,74 @@ +Title: 'GTC: Guided Training of CTC towards Efficient and Accurate Scene Text Recognition' +Abbreviation: GTC +Tasks: + - TextRecog +Venue: AAAI +Year: 2020 +Lab/Company: + - Nanyang Technological University + - SenseTime Group Ltd. +URL: + Venue: 'https://ojs.aaai.org/index.php/AAAI/article/view/6735' + Arxiv: 'https://arxiv.org/abs/2002.01276' +Paper Reading URL: N/A +Code: N/A +Supported In MMOCR: N/S +PaperType: + - Algorithm +Abstract: 'Connectionist Temporal Classification (CTC) and attention mechanism +are two main approaches used in recent scene text recognition works. Compared +with attention-based methods, the CTC decoder has a much shorter inference time, +yet a lower accuracy. To design an efficient and effective model, we propose +the guided training of CTC (GTC), where the CTC model learns a better alignment and +feature representations from a more powerful attentional guidance. With the +benefit of guided training, the CTC model achieves robust and accurate prediction +for both regular and irregular scene text while maintaining a fast inference +speed. Moreover, to further leverage the potential of the CTC decoder, a graph +convolutional network (GCN) is proposed to learn the local correlations of +extracted features. Extensive experiments on standard benchmarks demonstrate +that our end-to-end model achieves a new state-of-the-art for regular and +irregular scene text recognition and needs 6 times shorter inference time than +attention-based methods.'
+MODELS: + Architecture: + - CTC + - Attention + Learning Method: + - Supervised + Language Modality: + - Implicit Language Model + Network Structure: 'https://user-images.githubusercontent.com/65173622/212222112-fc9f3490-003d-409c-874a-551aab414329.png' + FPS: + DEVICE: N/A + ITEM: N/A + FLOPS: + DEVICE: N/A + ITEM: N/A + PARAMS: N/A + Experiment: + Training DataSets: + - MJ + - ST + Test DataSets: + Avg.: 90.6 + IIIT5K: + WAICS: 95.5 + SVT: + WAICS: 92.9 + IC13: + WAICS: 94.3 + IC15: + WAICS: 82.5 + SVTP: + WAICS: 86.2 + CUTE: + WAICS: 92.3 +Bibtex: '@inproceedings{hu2020gtc, + title={Gtc: Guided training of ctc towards efficient and accurate scene text recognition}, + author={Hu, Wenyang and Cai, Xiaocong and Hou, Jun and Yi, Shuai and Lin, Zhiping}, + booktitle={Proceedings of the AAAI Conference on Artificial Intelligence}, + volume={34}, + number={07}, + pages={11005--11012}, + year={2020} +}' diff --git a/paper_zoo/textrecog/Learn to Augment: Joint Data Augmentation and Network Optimization for Text Recognition.yaml b/paper_zoo/textrecog/Learn to Augment: Joint Data Augmentation and Network Optimization for Text Recognition.yaml new file mode 100644 index 000000000..447d78f3e --- /dev/null +++ b/paper_zoo/textrecog/Learn to Augment: Joint Data Augmentation and Network Optimization for Text Recognition.yaml @@ -0,0 +1,76 @@ +Title: 'Learn to Augment: Joint Data Augmentation and Network Optimization for Text Recognition' +Abbreviation: Luo et al. +Tasks: + - TextRecog +Venue: CVPR +Year: 2020 +Lab/Company: + - South China University of Technology + - Alibaba Group +URL: + Venue: 'http://openaccess.thecvf.com/content_CVPR_2020/html/Luo_Learn_to_Augment_Joint_Data_Augmentation_and_Network_Optimization_for_CVPR_2020_paper.html' + Arxiv: 'https://arxiv.org/abs/2003.06606' +Paper Reading URL: N/A +Code: 'https://github.com/Canjie-Luo/Text-Image-Augmentation' +Supported In MMOCR: N/S +PaperType: + - Algorithm +Abstract: 'Handwritten text and scene text suffer from various shapes and +distorted patterns. Thus training a robust recognition model requires a large +amount of data to cover diversity as much as possible. In contrast to data +collection and annotation, data augmentation is a low-cost way. In this paper, +we propose a new method for text image augmentation. Different from traditional +augmentation methods such as rotation, scaling and perspective transformation, +our proposed augmentation method is designed to learn proper and efficient data +augmentation which is more effective and specific for training a robust +recognizer. By using a set of custom fiducial points, the proposed augmentation +method is flexible and controllable. Furthermore, we bridge the gap between the +isolated processes of data augmentation and network optimization by joint +learning. An agent network learns from the output of the recognition network +and controls the fiducial points to generate more proper training samples for +the recognition network. Extensive experiments on various benchmarks, including +regular scene text, irregular scene text and handwritten text, show that the +proposed augmentation and the joint learning methods significantly boost the +performance of the recognition networks. A general toolkit for geometric +augmentation is available.'
+MODELS: + Architecture: + - CTC + - Attention + Learning Method: + - Supervised + Language Modality: + - Implicit Language Model + Network Structure: 'https://user-images.githubusercontent.com/65173622/212230430-9b55473b-5cf8-4923-b977-fa14afe820c1.png' + FPS: + DEVICE: N/A + ITEM: N/A + FLOPS: + DEVICE: N/A + ITEM: N/A + PARAMS: N/A + Experiment: + Training DataSets: + - MJ + - ST + Test DataSets: + Avg.: N/A + IIIT5K: + WAICS: N/A + SVT: + WAICS: N/A + IC13: + WAICS: N/A + IC15: + WAICS: N/A + SVTP: + WAICS: N/A + CUTE: + WAICS: N/A +Bibtex: '@inproceedings{luo2020learn, + title={Learn to augment: Joint data augmentation and network optimization for text recognition}, + author={Luo, Canjie and Zhu, Yuanzhi and Jin, Lianwen and Wang, Yongpan}, + booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition}, + pages={13746--13755}, + year={2020} +}' diff --git a/paper_zoo/textrecog/MORAN: A Multi-Object Rectified Attention Network for Scene Text Recognition.yaml b/paper_zoo/textrecog/MORAN: A Multi-Object Rectified Attention Network for Scene Text Recognition.yaml new file mode 100644 index 000000000..38a35813c --- /dev/null +++ b/paper_zoo/textrecog/MORAN: A Multi-Object Rectified Attention Network for Scene Text Recognition.yaml @@ -0,0 +1,76 @@ +Title: 'MORAN: A Multi-Object Rectified Attention Network for Scene Text Recognition' +Abbreviation: MORAN +Tasks: + - TextRecog +Venue: PR +Year: 2019 +Lab/Company: + - School of Electronic and Information Engineering, South China University of Technology + - SCUT-Zhuhai Institute of Modern Industrial Innovation +URL: + Venue: 'https://www.sciencedirect.com/science/article/pii/S0031320319300263' + Arxiv: 'https://ui.adsabs.harvard.edu/abs/2019PatRe..90..109L/abstract' +Paper Reading URL: N/A +Code: 'https://github.com/Canjie-Luo/MORAN_v2' +Supported In MMOCR: N/S +PaperType: + - Algorithm +Abstract: 'Irregular text is widely used. However, it is considerably difficult +to recognize because of its various shapes and distorted patterns. In this +paper, we thus propose a multi-object rectified attention network (MORAN) for +general scene text recognition. The MORAN consists of a multi-object +rectification network and an attention-based sequence recognition network. The +multi-object rectification network is designed for rectifying images that +contain irregular text. It decreases the difficulty of recognition and enables +the attention-based sequence recognition network to more easily read irregular +text. It is trained in a weakly supervised way, thus requiring only images and +corresponding text labels. The attention-based sequence recognition network +focuses on target characters and sequentially outputs the predictions. Moreover, +to improve the sensitivity of the attention-based sequence recognition network, +a fractional pickup method is proposed for an attention-based decoder in the +training phase. With the rectification mechanism, the MORAN can read both +regular and irregular scene text. Extensive experiments on various benchmarks +are conducted, which show that the MORAN achieves state-of-the-art performance. +The source code is available.'
+MODELS: + Architecture: + - Attention + Learning Method: + - Supervised + Language Modality: + - Implicit Language Model + Network Structure: 'https://user-images.githubusercontent.com/65173622/212230805-3d927214-c184-4d3d-818d-7072aac8f830.png' + FPS: + DEVICE: N/A + ITEM: N/A + FLOPS: + DEVICE: N/A + ITEM: N/A + PARAMS: N/A + Experiment: + Training DataSets: + - MJ + - ST + Test DataSets: + Avg.: 82.4 + IIIT5K: + WAICS: 91.2 + SVT: + WAICS: 88.3 + IC13: + WAICS: 92.4 + IC15: + WAICS: 68.8 + SVTP: + WAICS: 76.1 + CUTE: + WAICS: 77.4 +Bibtex: '@article{luo2019moran, + title={Moran: A multi-object rectified attention network for scene text recognition}, + author={Luo, Canjie and Jin, Lianwen and Sun, Zenghui}, + journal={Pattern Recognition}, + volume={90}, + pages={109--118}, + year={2019}, + publisher={Elsevier} +}' diff --git a/paper_zoo/textrecog/Multi-modal Text Recognition Networks: Interactive Enhancements between Visual and Semantic Features.yaml b/paper_zoo/textrecog/Multi-modal Text Recognition Networks: Interactive Enhancements between Visual and Semantic Features.yaml new file mode 100644 index 000000000..de58900a1 --- /dev/null +++ b/paper_zoo/textrecog/Multi-modal Text Recognition Networks: Interactive Enhancements between Visual and Semantic Features.yaml @@ -0,0 +1,75 @@ +Title: 'Multi-modal Text Recognition Networks: Interactive Enhancements between Visual and Semantic Features' +Abbreviation: MATRN +Tasks: + - TextRecog +Venue: ECCV +Year: 2022 +Lab/Company: + - KAIST + - Clova AI Research +URL: + Venue: 'https://link.springer.com/chapter/10.1007/978-3-031-19815-1_26' + Arxiv: 'https://arxiv.org/abs/2111.15263' +Paper Reading URL: N/A +Code: 'https://github.com/byeonghu-na/MATRN' +Supported In MMOCR: N/S +PaperType: + - Algorithm +Abstract: 'Linguistic knowledge has brought great benefits to scene text +recognition by providing semantics to refine character sequences. However, since +linguistic knowledge has been applied individually on the output sequence, +previous methods have not fully utilized the semantics to understand visual +clues for text recognition. This paper introduces a novel method, called +Multi-modAl Text Recognition Network (MATRN), that enables interactions between +visual and semantic features for better recognition performances. Specifically, +MATRN identifies visual and semantic feature pairs and encodes spatial +information into semantic features. Based on the spatial encoding, visual +and semantic features are enhanced by referring to related features in the +other modality. Furthermore, MATRN stimulates combining semantic features into +visual features by hiding visual clues related to the character in the training +phase. Our experiments demonstrate that MATRN achieves state-of-the-art performances +on seven benchmarks with large margins, while naive combinations of two +modalities show marginal improvements. Further ablative studies prove the +effectiveness of our proposed components. Our implementation will be publicly +available.'
+MODELS: + Architecture: + - Transformer + Learning Method: + - Supervised + Language Modality: + - Explicit Language Model + Network Structure: 'https://user-images.githubusercontent.com/65173622/212087554-54ef9393-611e-4107-b40c-0d09568c0bbb.png' + FPS: + DEVICE: N/A + ITEM: N/A + FLOPS: + DEVICE: N/A + ITEM: N/A + PARAMS: N/A + Experiment: + Training DataSets: + - MJ + - ST + Test DataSets: + Avg.: 92.5 + IIIT5K: + WAICS: 96.7 + SVT: + WAICS: 94.9 + IC13: + WAICS: 95.8 + IC15: + WAICS: 82.9 + SVTP: + WAICS: 90.5 + CUTE: + WAICS: 94.1 +Bibtex: '@inproceedings{na2022multi, + title={Multi-modal text recognition networks: Interactive enhancements between visual and semantic features}, + author={Na, Byeonghu and Kim, Yoonsik and Park, Sungrae}, + booktitle={European Conference on Computer Vision}, + pages={446--463}, + year={2022}, + organization={Springer} +}' diff --git a/paper_zoo/textrecog/PIMNet: A Parallel, Iterative and Mimicking Network for Scene Text Recognition.yaml b/paper_zoo/textrecog/PIMNet: A Parallel, Iterative and Mimicking Network for Scene Text Recognition.yaml new file mode 100644 index 000000000..519c5fc1b --- /dev/null +++ b/paper_zoo/textrecog/PIMNet: A Parallel, Iterative and Mimicking Network for Scene Text Recognition.yaml @@ -0,0 +1,77 @@ +Title: 'PIMNet: A Parallel, Iterative and Mimicking Network for Scene Text Recognition' +Abbreviation: PIMNet +Tasks: + - TextRecog +Venue: ACMMM +Year: 2021 +Lab/Company: + - Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China +URL: + Venue: 'https://dl.acm.org/doi/abs/10.1145/3474085.3475238' + Arxiv: 'https://arxiv.org/abs/2109.04145' +Paper Reading URL: N/A +Code: 'https://dl.acm.org/action/downloadSupplement?doi=10.1145%2F3474085.3475238&file=mfp0430aux.zip' +Supported In MMOCR: N/S +PaperType: + - Algorithm +Abstract: 'Nowadays, scene text recognition has attracted more and more attention +due to its various applications. Most state-of-the-art methods adopt an +encoder-decoder framework with attention mechanism, which generates text +autoregressively from left to right. Despite the convincing performance, the +speed is limited because of the one-by-one decoding strategy. As opposed to +autoregressive models, non-autoregressive models predict the results in parallel +with a much shorter inference time, but the accuracy falls behind the +autoregressive counterpart considerably. In this paper, we propose a Parallel, +Iterative and Mimicking Network (PIMNet) to balance accuracy and efficiency. +Specifically, PIMNet adopts a parallel attention mechanism to predict the text +faster and an iterative generation mechanism to make the predictions more +accurate. In each iteration, the context information is fully explored. To +improve learning of the hidden layer, we exploit the mimicking learning in the +training phase, where an additional autoregressive decoder is adopted and the +parallel decoder mimics the autoregressive decoder with fitting outputs of the +hidden layer. With the shared backbone between the two decoders, the proposed +PIMNet can be trained end-to-end without pre-training. During inference, the +branch of the autoregressive decoder is removed for a faster speed. Extensive +experiments on public benchmarks demonstrate the effectiveness and efficiency +of PIMNet. Our code is available in the supplementary material.' 
+MODELS: + Architecture: + - Transformer + Learning Method: + - Supervised + Language Modality: + - Implicit Language Model + Network Structure: 'https://user-images.githubusercontent.com/65173622/212088808-8aaee96d-1505-4ed5-8326-314f36073488.png' + FPS: + DEVICE: 'NVIDIA M40' + ITEM: 35.2 + FLOPS: + DEVICE: N/A + ITEM: N/A + PARAMS: N/A + Experiment: + Training DataSets: + - MJ + - ST + - Real + Test DataSets: + Avg.: 93.8 + IIIT5K: + WAICS: 96.7 + SVT: + WAICS: 94.7 + IC13: + WAICS: 95.4 + IC15: + WAICS: 85.9 + SVTP: + WAICS: 88.2 + CUTE: + WAICS: 92.7 +Bibtex: '@inproceedings{qiao2021pimnet, + title={PIMNet: a parallel, iterative and mimicking network for scene text recognition}, + author={Qiao, Zhi and Zhou, Yu and Wei, Jin and Wang, Wei and Zhang, Yuan and Jiang, Ning and Wang, Hongbin and Wang, Weiping}, + booktitle={Proceedings of the 29th ACM International Conference on Multimedia}, + pages={2046--2055}, + year={2021} +}' diff --git a/paper_zoo/textrecog/SEED: Semantics Enhanced Encoder-Decoder Framework for Scene Text Recognition.yaml b/paper_zoo/textrecog/SEED: Semantics Enhanced Encoder-Decoder Framework for Scene Text Recognition.yaml new file mode 100644 index 000000000..6a2599136 --- /dev/null +++ b/paper_zoo/textrecog/SEED: Semantics Enhanced Encoder-Decoder Framework for Scene Text Recognition.yaml @@ -0,0 +1,71 @@ +Title: 'SEED: Semantics Enhanced Encoder-Decoder Framework for Scene Text Recognition' +Abbreviation: SEED +Tasks: + - TextRecog +Venue: CVPR +Year: 2020 +Lab/Company: + - Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China + - School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China +URL: + Venue: 'http://openaccess.thecvf.com/content_CVPR_2020/html/Qiao_SEED_Semantics_Enhanced_Encoder-Decoder_Framework_for_Scene_Text_Recognition_CVPR_2020_paper.html' + Arxiv: 'https://arxiv.org/abs/2005.10977' +Paper Reading URL: N/A +Code: 'https://github.com/Pay20Y/SEED' +Supported In MMOCR: N/S +PaperType: + - Algorithm +Abstract: 'Scene text recognition is a hot research topic in computer vision. +Recently, many recognition methods based on the encoder-decoder framework have +been proposed, and they can handle scene texts of perspective distortion and +curve shape. Nevertheless, they still face lots of challenges like image blur, +uneven illumination, and incomplete characters. We argue that most encoder-decoder +methods are based on local visual features without explicit global semantic +information. In this work, we propose a semantics enhanced encoder-decoder +framework to robustly recognize low-quality scene texts. The semantic +information is used both in the encoder module for supervision and in the +decoder module for initializing. In particular, the state-of-the-art ASTER +method is integrated into the proposed framework as an exemplar. Extensive +experiments demonstrate that the proposed framework is more robust for +low-quality text images, and achieves state-of-the-art results on several +benchmark datasets. The source code will be available.'
+MODELS: + Architecture: + - Attention + Learning Method: + - Supervised + Language Modality: + - Explicit Language Model + Network Structure: 'https://user-images.githubusercontent.com/65173622/212231212-43636b78-1fa7-40bf-83f2-f0ea281ca55f.png' + FPS: + DEVICE: N/A + ITEM: N/A + FLOPS: + DEVICE: N/A + ITEM: N/A + PARAMS: N/A + Experiment: + Training DataSets: + - MJ + - ST + Test DataSets: + Avg.: 86.9 + IIIT5K: + WAICS: 93.8 + SVT: + WAICS: 89.6 + IC13: + WAICS: 92.8 + IC15: + WAICS: 80.0 + SVTP: + WAICS: 81.4 + CUTE: + WAICS: 83.6 +Bibtex: '@inproceedings{qiao2020seed, + title={Seed: Semantics enhanced encoder-decoder framework for scene text recognition}, + author={Qiao, Zhi and Zhou, Yu and Yang, Dongbao and Zhou, Yucan and Wang, Weiping}, + booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition}, + pages={13528--13537}, + year={2020} +}' diff --git a/paper_zoo/textrecog/SPIN: Structure-Preserving Inner Offset Network for Scene Text Recognition.yaml b/paper_zoo/textrecog/SPIN: Structure-Preserving Inner Offset Network for Scene Text Recognition.yaml new file mode 100644 index 000000000..cc23a03ad --- /dev/null +++ b/paper_zoo/textrecog/SPIN: Structure-Preserving Inner Offset Network for Scene Text Recognition.yaml @@ -0,0 +1,76 @@ +Title: 'SPIN: Structure-Preserving Inner Offset Network for Scene Text Recognition' +Abbreviation: SPIN +Tasks: + - TextRecog +Venue: AAAI +Year: 2021 +Lab/Company: + - Shanghai Jiaotong University, China + - Hikvision Research Institute, China + - Zhejiang University, China +URL: + Venue: 'https://ojs.aaai.org/index.php/AAAI/article/view/16442' + Arxiv: 'https://arxiv.org/abs/2005.13117' +Paper Reading URL: N/A +Code: 'https://github.com/hikopensource/DAVAR-Lab-OCR' +Supported In MMOCR: N/S +PaperType: + - Algorithm +Abstract: 'Arbitrary text appearance poses a great challenge in scene text +recognition tasks. Existing works mostly handle the problem in +consideration of shape distortion, including perspective distortions, line +curvature or other style variations. Rectification (i.e., spatial transformers) +as the preprocessing stage is one popular approach and has been extensively studied. +However, chromatic difficulties in complex scenes have not received much +attention. In this work, we introduce a new learnable geometric-unrelated +rectification, Structure-Preserving Inner Offset Network (SPIN), which allows +the color manipulation of source data within the network. This differentiable +module can be inserted before any recognition architecture to ease the +downstream tasks, giving neural networks the ability to actively transform +input intensity rather than only the spatial rectification. It can also serve + as a complementary module to known spatial transformations and work in both + independent and collaborative ways with them. Extensive experiments show the + proposed transformation outperforms existing rectification networks and has + comparable performance among the state-of-the-arts.'
+MODELS: + Architecture: + - Attention + Learning Method: + - Supervised + Language Modality: + - Implicit Language Model + Network Structure: 'https://user-images.githubusercontent.com/65173622/211321597-997c4f09-fceb-4fe6-89d6-774971d942ed.png' + FPS: + DEVICE: N/A + ITEM: N/A + FLOPS: + DEVICE: N/A + ITEM: 11e7 + PARAMS: 2.31e6 + Experiment: + Training DataSets: + - MJ + - ST + Test DataSets: + Avg.: 88.7 + IIIT5K: + WAICS: 95.2 + SVT: + WAICS: 90.9 + IC13: + WAICS: 94.8 + IC15: + WAICS: 79.5 + SVTP: + WAICS: 83.2 + CUTE: + WAICS: 87.5 +Bibtex: '@inproceedings{zhang2021spin, + title={SPIN: Structure-preserving inner offset network for scene text recognition}, + author={Zhang, Chengwei and Xu, Yunlu and Cheng, Zhanzhan and Pu, Shiliang and Niu, Yi and Wu, Fei and Zou, Futai}, + booktitle={Proceedings of the AAAI Conference on Artificial Intelligence}, + volume={35}, + number={4}, + pages={3305--3314}, + year={2021} +}' diff --git a/paper_zoo/textrecog/What If We Only Use Real Datasets for Scene Text Recognition? Toward Scene Text Recognition With Fewer Labels.yaml b/paper_zoo/textrecog/What If We Only Use Real Datasets for Scene Text Recognition? Toward Scene Text Recognition With Fewer Labels.yaml new file mode 100644 index 000000000..593e66037 --- /dev/null +++ b/paper_zoo/textrecog/What If We Only Use Real Datasets for Scene Text Recognition? Toward Scene Text Recognition With Fewer Labels.yaml @@ -0,0 +1,78 @@ +Title: 'What If We Only Use Real Datasets for Scene Text Recognition? Toward Scene Text Recognition With Fewer Labels' +Abbreviation: Baek et al. +Tasks: + - TextRecog +Venue: CVPR +Year: 2021 +Lab/Company: + - The University of Tokyo +URL: + Venue: 'http://openaccess.thecvf.com/content/CVPR2021/html/Baek_What_if_We_Only_Use_Real_Datasets_for_Scene_Text_CVPR_2021_paper.html' + Arxiv: 'https://arxiv.org/abs/2103.04400' +Paper Reading URL: N/A +Code: 'https://github.com/ku21fan/STR-Fewer-Labels' +Supported In MMOCR: N/S +PaperType: + - Algorithm + - Dataset +Abstract: 'The scene text recognition (STR) task has a common practice: All +state-of-the-art STR models are trained on large synthetic data. In contrast +to this practice, training STR models only on fewer real labels (STR with fewer +labels) is important when we have to train STR models without synthetic data: +for handwritten or artistic texts that are difficult to generate synthetically +and for languages other than English for which we do not always have synthetic +data. However, there has been implicit common knowledge that training STR +models on real data is nearly impossible because real data is insufficient. +We consider that this common knowledge has obstructed the study of STR with +fewer labels. In this work, we would like to reactivate STR with fewer labels +by disproving the common knowledge. We consolidate recently accumulated public +real data and show that we can train STR models satisfactorily only with real +labeled data. Subsequently, we find simple data augmentation to fully exploit +real data. Furthermore, we improve the models by collecting unlabeled data and +introducing semi- and self-supervised methods. As a result, we obtain a +model competitive with state-of-the-art methods. To the best of our knowledge, +this is the first study that 1) shows sufficient performance by only using +real labels and 2) introduces semi- and self-supervised methods into STR with +fewer labels. Our code and data are available: +https://github.com/ku21fan/STR-Fewer-Labels.'
+MODELS: + Architecture: + - CTC + - Attention + Learning Method: + - Supervised + - Self-Supervised + Language Modality: + - Implicit Language Model + Network Structure: 'https://user-images.githubusercontent.com/65173622/212229475-09e37af2-b48d-4977-aafd-9efb63570dff.png' + FPS: + DEVICE: N/A + ITEM: N/A + FLOPS: + DEVICE: N/A + ITEM: N/A + PARAMS: N/A + Experiment: + Training DataSets: + - Real + Test DataSets: + Avg.: 89.3 + IIIT5K: + WAICS: 94.8 + SVT: + WAICS: 91.3 + IC13: + WAICS: 94.0 + IC15: + WAICS: 80.6 + SVTP: + WAICS: 82.7 + CUTE: + WAICS: 88.1 +Bibtex: '@inproceedings{baek2021if, + title={What if we only use real datasets for scene text recognition? toward scene text recognition with fewer labels}, + author={Baek, Jeonghun and Matsui, Yusuke and Aizawa, Kiyoharu}, + booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition}, + pages={3113--3122}, + year={2021} +}'
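
Note (not part of the patch itself): every card added above follows the same schema (Title, Abbreviation, Venue, MODELS, Experiment, Test DataSets with a WAICS score per benchmark and an "Avg." field). The following is a minimal sketch of how such cards might be consumed, assuming PyYAML is installed, the script is run from the repository root, and the files parse with the nesting shown in this patch; the helper name `load_entries` and the exact output format are illustrative, not part of the repository.

```python
# Minimal sketch: load the paper_zoo/textrecog YAML cards added in this patch
# and print a one-line summary per paper. Assumes PyYAML is installed and the
# script runs from the repository root; `load_entries` is a hypothetical helper.
import glob
import yaml


def load_entries(pattern="paper_zoo/textrecog/*.yaml"):
    """Parse every paper card matching `pattern` into a list of dicts."""
    entries = []
    for path in sorted(glob.glob(pattern)):
        with open(path, "r", encoding="utf-8") as f:
            entries.append(yaml.safe_load(f))
    return entries


if __name__ == "__main__":
    for entry in load_entries():
        models = entry.get("MODELS", {}) or {}
        arch = ", ".join(models.get("Architecture", []) or [])
        test = (models.get("Experiment", {}) or {}).get("Test DataSets", {}) or {}
        # "Avg." is the averaged word accuracy over the six benchmarks listed in each card.
        print(f"{entry.get('Abbreviation', '?'):<14} | {arch:<18} | Avg.: {test.get('Avg.', 'N/A')}")
```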