[Paper List-1] Add 10 textrecog papers #1644

Open · wants to merge 7 commits into base: dev-1.x
@@ -0,0 +1,76 @@
Title: 'Background-Insensitive Scene Text Recognition with Text Semantic Segmentation'
Abbreviation: BINet
Tasks:
- TextRecog
Venue: ECCV
Year: 2022
Lab/Company:
- University of South Carolina, Columbia, SC 29201, USA
URL:
Venue: 'https://link.springer.com/chapter/10.1007/978-3-031-19806-9_10'
Arxiv: 'https://www.cse.sc.edu/~songwang/document/eccv22c.pdf'
Paper Reading URL: N/A
Code: N/A
Supported In MMOCR: N/S
PaperType:
- Algorithm
Abstract: 'Scene Text Recognition (STR) has many important applications in
computer vision. Complex backgrounds continue to be a big challenge for STR
because they interfere with text feature extraction. Many existing methods
use attentional regions, bounding boxes or polygons to reduce such
interference. However, the text regions located by these methods still contain
much undesirable background interference. In this paper, we propose a
Background-Insensitive approach BINet by explicitly leveraging the text
Semantic Segmentation (SSN) to extract texts more accurately. SSN is trained
on a set of existing segmentation data, whose volume is only 0.03% of STR
training data. This avoids the need for large-scale pixel-level annotation of the
STR training data. To effectively utilize the segmentation cues, we design new
segmentation refinement and embedding blocks for refining text-masks and
reinforcing visual features. Additionally, we propose an efficient pipeline
that utilizes Synthetic Initialization (SI) for STR models trained only on
real data (1.7% of STR training data), instead of on both synthetic and real
data from scratch. Experiments show that the proposed method can recognize
text from complex backgrounds more effectively, achieving state-of-the-art
performance on several public datasets.'
MODELS:
Architecture:
- Transformer
Learning Method:
- Supervised
Language Modality:
- Implicit Language Model
Network Structure: 'https://user-images.githubusercontent.com/65173622/209490055-2aca52d4-9072-4ce4-a256-7fc8b953f59e.png'
FPS:
DEVICE: N/A
ITEM: N/A
FLOPS:
DEVICE: N/A
ITEM: N/A
PARAMS: N/A
Experiment:
Training DataSets:
- ST
- MJ
- Real
Test DataSets:
Avg.: 94.4
IIIT5K:
WAICS: 97.3
SVT:
WAICS: 96.4
IC13:
WAICS: 96.8
IC15:
WAICS: 89.2
SVTP:
WAICS: 89.9
CUTE:
WAICS: 95.8
Bibtex: '@inproceedings{zhao2022background,
title={Background-Insensitive Scene Text Recognition with Text Semantic Segmentation},
author={Zhao, Liang and Wu, Zhenyao and Wu, Xinyi and Wilsbacher, Greg and Wang, Song},
booktitle={European Conference on Computer Vision},
pages={163--182},
year={2022},
organization={Springer}
}'
@@ -0,0 +1,77 @@
Title: 'Context-based Contrastive Learning for Scene Text Recognition'
Abbreviation: ConCLR
Tasks:
- TextRecog
Venue: AAAI
Year: 2022
Lab/Company:
- The Chinese University of Hong Kong
- SmartMore
URL:
Venue: 'https://www.aaai.org/AAAI22Papers/AAAI-10147.ZhangX.pdf'
Arxiv: 'http://www.cse.cuhk.edu.hk/~byu/papers/C139-AAAI2022-ConCLR.pdf'
Paper Reading URL: 'https://mp.weixin.qq.com/s/7ayYKALDc3-nsBgEJG-D2A'
Code: N/A
Supported In MMOCR: N/S
PaperType:
- Algorithm
Abstract: 'Pursuing accurate and robust recognizers has been a long-lasting goal
for scene text recognition (STR) researchers. Recently, attention-based methods
have demonstrated their effectiveness and achieved impressive results on public
benchmarks. The attention mechanism enables models to recognize scene text with
severe visual distortions by leveraging contextual information. However, recent
studies revealed that the implicit over-reliance of context leads to catastrophic
out-of-vocabulary performance. In contrast to the superior accuracy on the
seen text, models are prone to misrecognize unseen text even with good image
quality. We propose a novel framework, Context-based contrastive learning
(ConCLR), to alleviate this issue. Our proposed method first generates
characters with different contexts via simple image concatenation operations
and then optimizes contrastive loss on their embeddings. By pulling together
clusters of identical characters within various contexts and pushing apart
clusters of different characters in embedding space, ConCLR suppresses the
side-effect of overfitting to specific contexts and learns a more robust
representation. Experiments show that ConCLR significantly improves
out-of-vocabulary generalization and achieves state-of-the-art performance on
public benchmarks together with attention-based recognizers.'
MODELS:
Architecture:
- CTC
- Attention
- Transformer
Learning Method:
- Self-Supervised
- Supervised
Language Modality:
- Implicit Language Model
Network Structure: 'https://user-images.githubusercontent.com/65173622/209343799-96428e0e-9a93-4763-be47-a23f575dc2f3.png'
FPS:
DEVICE: N/A
ITEM: N/A
FLOPS:
DEVICE: N/A
ITEM: N/A
PARAMS: N/A
Experiment:
Training DataSets:
- MJ
- ST
Test DataSets:
Avg.: 92.4
IIIT5K:
WAICS: 96.5
SVT:
WAICS: 94.3
IC13:
WAICS: 97.7
IC15:
WAICS: 85.4
SVTP:
WAICS: 89.3
CUTE:
WAICS: 91.3
Bibtex: '@inproceedings{zhang2022context,
title={Context-based Contrastive Learning for Scene Text Recognition},
author={Zhang, Xinyun and Zhu, Binwu and Yao, Xufeng and Sun, Qi and Li, Ruiyu and Yu, Bei},
year={2022},
organization={AAAI}
}'
@@ -0,0 +1,73 @@
Title: 'MaskOCR: Text Recognition with Masked Encoder-Decoder Pretraining'
Abbreviation: MaskOCR
Tasks:
- TextRecog
Venue: arXiv
Year: 2022
Lab/Company:
- Department of Computer Vision Technology (VIS), Baidu Inc.
URL:
Venue: N/A
Arxiv: 'https://arxiv.org/abs/2206.00311'
Paper Reading URL: N/A
Code: N/A
Supported In MMOCR: N/S
PaperType:
- Algorithm
- Dataset
Abstract: 'In this paper, we present a model pretraining technique, named
MaskOCR, for text recognition. Our text recognition architecture is an
encoder-decoder transformer: the encoder extracts the patch-level
representations, and the decoder recognizes the text from the representations.
Our approach pretrains both the encoder and the decoder in a sequential manner.
(i) We pretrain the encoder in a self-supervised manner over a large set of
unlabeled real text images. We adopt the masked image modeling approach, which
has shown its effectiveness for general images, expecting that the representations
take on semantics. (ii) We pretrain the decoder over a large set of synthesized
text images in a supervised manner and enhance the language modeling capability
of the decoder by randomly masking some text image patches occupied by
characters input to the encoder and accordingly the representations input to
the decoder. Experiments show that the proposed MaskOCR approach achieves
superior results on the benchmark datasets, including Chinese and English text
images.'
MODELS:
Architecture:
- Transformer
Learning Method:
- Self-Supervised
- Supervised
Language Modality:
- Implicit Language Model
Network Structure: 'https://user-images.githubusercontent.com/65173622/209494710-489fe94b-d550-4c5e-bdab-24590a3c3fe2.png'
FPS:
DEVICE: N/A
ITEM: N/A
FLOPS:
DEVICE: N/A
ITEM: N/A
PARAMS: 315M
Experiment:
Training DataSets:
- ST
- MJ
- Real
Test DataSets:
Avg.: 93.8
IIIT5K:
WAICS: 96.5
SVT:
WAICS: 94.1
IC13:
WAICS: 97.8
IC15:
WAICS: 88.7
SVTP:
WAICS: 90.2
CUTE:
WAICS: 92.7
Bibtex: '@article{lyu2022maskocr,
title={MaskOCR: Text Recognition with Masked Encoder-Decoder Pretraining},
author={Lyu, Pengyuan and Zhang, Chengquan and Liu, Shanshan and Qiao, Meina and Xu, Yangliu and Wu, Liang and Yao, Kun and Han, Junyu and Ding, Errui and Wang, Jingdong},
journal={arXiv preprint arXiv:2206.00311},
year={2022}
}'
@@ -0,0 +1,80 @@
Title: 'Multimodal Semi-Supervised Learning for Text Recognition'
Abbreviation: SemiMTR
Tasks:
- TextRecog
Venue: arXiv
Year: 2022
Lab/Company:
- AWS AI Labs
URL:
Venue: N/A
Arxiv: 'https://arxiv.org/abs/2205.03873'
Paper Reading URL: N/A
Code: N/A
Supported In MMOCR: N/S
PaperType:
- Algorithm
Abstract: 'Until recently, the number of public real-world text images was
insufficient for training scene text recognizers. Therefore, most modern
training methods rely on synthetic data and operate in a fully supervised
manner. Nevertheless, the amount of public real-world text images has increased
significantly lately, including a great deal of unlabeled data. Leveraging
these resources requires semi-supervised approaches; however, the few existing
methods do not account for the vision-language multimodality structure and are
therefore suboptimal for state-of-the-art multimodal architectures. To bridge
this gap, we present semi-supervised learning for multimodal text recognizers
(SemiMTR) that leverages unlabeled data at each modality training phase.
Notably, our method refrains from extra training stages and maintains the
current three-stage multimodal training procedure. Our algorithm starts by
pretraining the vision model through a single-stage training that unifies
self-supervised learning with supervised training. More specifically, we extend
an existing visual representation learning algorithm and propose the first
contrastive-based method for scene text recognition. After pretraining the
language model on a text corpus, we fine-tune the entire network via a
sequential, character-level, consistency regularization between weakly and
strongly augmented views of text images. In a novel setup, consistency is
enforced on each modality separately. Extensive experiments validate that
our method outperforms the current training schemes and achieves
state-of-the-art results on multiple scene text recognition benchmarks.
Code will be published upon publication.'
MODELS:
Architecture:
- Attention
Learning Method:
- Self-Supervised
- Semi-Supervised
- Supervised
Language Modality:
- Implicit Language Model
Network Structure: 'https://user-images.githubusercontent.com/65173622/209488117-5c6c6ee1-3419-4aec-97f5-1e1b28ae25ff.png'
FPS:
DEVICE: N/A
ITEM: N/A
FLOPS:
DEVICE: N/A
ITEM: N/A
PARAMS: N/A
Experiment:
Training DataSets:
- ST
- MJ
- Real
Test DataSets:
Avg.: 93.3
IIIT5K:
WAICS: 97.3
SVT:
WAICS: 96.6
IC13:
WAICS: 97.0
IC15:
WAICS: 84.7
SVTP:
WAICS: 93.0
CUTE:
WAICS: 93.8
Bibtex: '@article{aberdam2022multimodal,
title={Multimodal Semi-Supervised Learning for Text Recognition},
author={Aberdam, Aviad and Ganz, Roy and Mazor, Shai and Litman, Ron},
journal={arXiv preprint arXiv:2205.03873},
year={2022}
}'
@@ -0,0 +1,87 @@
Title: 'PETR: Rethinking the Capability of Transformer-Based Language Model in Scene Text Recognition'
Abbreviation: PETR
Tasks:
- TextRecog
Venue: TIP
Year: 2022
Lab/Company:
- University of Science and Technology of China
URL:
Venue: N/A
Arxiv: N/A
Paper Reading URL: N/A
Code: N/A
Supported In MMOCR: N/S
PaperType:
- Algorithm
Abstract: 'The exploration of linguistic information promotes the development of
scene text recognition task. Benefiting from the significance in parallel
reasoning and global relationship capture, transformer-based language model
(TLM) has achieved dominant performance recently. As a decoupled structure
from the recognition process, we argue that TLM’s capability is limited by the
input low-quality visual prediction. To be specific: 1) The visual prediction
with low character-wise accuracy increases the correction burden of TLM. 2)
The inconsistent word length between visual prediction and original image
provides a wrong language modeling guidance in TLM. In this paper, we propose
a Progressive scEne Text Recognizer (PETR) to improve the capability of
transformer-based language model by handling the above two problems. Firstly, a
Destruction Learning Module (DLM) is proposed to consider the linguistic
information in the visual context. DLM introduces the recognition of destructed
images with disordered patches in the training stage. Through guiding the
vision model to restore patch orders and make word-level prediction on the
destructed images, visual prediction with high character-wise accuracy is
obtained by exploring inner relationship between the local visual patches.
Secondly, a new Language Rectification Module (LRM) is proposed to optimize
the word length for language guidance rectification. Through progressively
implementing LRM in different language modeling steps, a novel progressive
rectification network is constructed to handle some extremely challenging
cases (e.g. distortion, occlusion, etc.). By utilizing DLM and LRM, PETR
enhances the capability of transformer-based language model from a more
general aspect, that is, focusing on the reduction of correction burden and
rectification of language modeling guidance. Compared with parallel
transformer-based methods, PETR obtains 1.0% and 0.8% improvement on regular
and irregular datasets respectively while introducing only 1.7M additional
parameters. The extensive experiments on both English and Chinese benchmarks
demonstrate that PETR achieves the state-of-the-art results.'
MODELS:
Architecture:
- Transformer
Learning Method:
- Supervised
Language Modality:
- Explicit Language Model
Network Structure: 'https://user-images.githubusercontent.com/65173622/209489701-073cdf37-5990-4bcf-8aa8-434255fd568e.png'
FPS:
DEVICE: N/A
ITEM: N/A
FLOPS:
DEVICE: N/A
ITEM: N/A
PARAMS: N/A
Experiment:
Training DataSets:
- ST
- MJ
Test DataSets:
Avg.: 90.8
IIIT5K:
WAICS: 95.8
SVT:
WAICS: 92.4
IC13:
WAICS: 97.0
IC15:
WAICS: 83.3
SVTP:
WAICS: 86.2
CUTE:
WAICS: 89.9
Bibtex: '@article{wang2022petr,
title={PETR: Rethinking the Capability of Transformer-Based Language Model in Scene Text Recognition},
author={Wang, Yuxin and Xie, Hongtao and Fang, Shancheng and Xing, Mengting and Wang, Jing and Zhu, Shenggao and Zhang, Yongdong},
journal={IEEE Transactions on Image Processing},
volume={31},
pages={5585--5598},
year={2022},
publisher={IEEE}
}'