[Paper List-1] Add 10 textrecog papers #1644

Open · wants to merge 7 commits into base: dev-1.x
@@ -0,0 +1,76 @@
Title: 'Background-Insensitive Scene Text Recognition with Text Semantic Segmentation'
Abbreviation: BINet
Tasks:
- TextRecog
Venue: ECCV
Year: 2022
Lab/Company:
- University of South Carolina, Columbia, SC 29201, USA
URL:
Venue: 'https://link.springer.com/chapter/10.1007/978-3-031-19806-9_10'
Arxiv: 'https://www.cse.sc.edu/~songwang/document/eccv22c.pdf'
Paper Reading URL: N/A
Code: N/A
Supported In MMOCR: N/S
PaperType:
- Algorithm
Abstract: 'Scene Text Recognition (STR) has many important applications in
computer vision. Complex backgrounds continue to be a big challenge for STR
because they interfere with text feature extraction. Many existing methods
use attentional regions, bounding boxes or polygons to reduce such
interference. However, the text regions located by these methods still contain
much undesirable background interference. In this paper, we propose a
Background-Insensitive approach BINet by explicitly leveraging the text
Semantic Segmentation (SSN) to extract texts more accurately. SSN is trained
on a set of existing segmentation data, whose volume is only 0.03% of STR
training data. This avoids the need for large-scale pixel-level annotation of the
STR training data. To effectively utilize the segmentation cues, we design new
segmentation refinement and embedding blocks for refining text-masks and
reinforcing visual features. Additionally, we propose an efficient pipeline
that utilizes Synthetic Initialization (SI) for STR models trained only on
real data (1.7% of STR training data), instead of on both synthetic and real
data from scratch. Experiments show that the proposed method can recognize
text from complex backgrounds more effectively, achieving state-of-the-art
performance on several public datasets.'
MODELS:
Architecture:
- Transformer
Learning Method:
- Supervised
Language Modality:
- Implicit Language Model
Network Structure: 'https://user-images.githubusercontent.com/65173622/209490055-2aca52d4-9072-4ce4-a256-7fc8b953f59e.png'
FPS:
DEVICE: N/A
ITEM: N/A
FLOPS:
DEVICE: N/A
ITEM: N/A
PARAMS: N/A
Experiment:
Training DataSets:
- ST
- MJ
- Real
Test DataSets:
Avg.: 94.4
IIIT5K:
WAICS: 97.3
SVT:
WAICS: 96.4
IC13:
WAICS: 96.8
IC15:
WAICS: 89.2
SVTP:
WAICS: 89.9
CUTE:
WAICS: 95.8
Bibtex: '@inproceedings{zhao2022background,
title={Background-Insensitive Scene Text Recognition with Text Semantic Segmentation},
author={Zhao, Liang and Wu, Zhenyao and Wu, Xinyi and Wilsbacher, Greg and Wang, Song},
booktitle={European Conference on Computer Vision},
pages={163--182},
year={2022},
organization={Springer}
}'
@@ -0,0 +1,77 @@
Title: 'Context-based Contrastive Learning for Scene Text Recognition'
Abbreviation: ConCLR
Tasks:
- TextRecog
Venue: AAAI
Year: 2022
Lab/Company:
- The Chinese University of Hong Kong
- SmartMore
URL:
Venue: 'https://www.aaai.org/AAAI22Papers/AAAI-10147.ZhangX.pdf'
Arxiv: 'http://www.cse.cuhk.edu.hk/~byu/papers/C139-AAAI2022-ConCLR.pdf'
Paper Reading URL: 'https://mp.weixin.qq.com/s/7ayYKALDc3-nsBgEJG-D2A'
Code: N/A
Supported In MMOCR: N/S
PaperType:
- Algorithm
Abstract: 'Pursuing accurate and robust recognizers has been a long-lasting goal
for scene text recognition (STR) researchers. Recently, attention-based methods
have demonstrated their effectiveness and achieved impressive results on public
benchmarks. The attention mechanism enables models to recognize scene text with
severe visual distortions by leveraging contextual information. However, recent
studies revealed that the implicit over-reliance of context leads to catastrophic
out-of-vocabulary performance. In contrast to the superior accuracy on the
seen text, models are prone to misrecognize unseen text even with good image
quality. We propose a novel framework, Context-based contrastive learning
(ConCLR), to alleviate this issue. Our proposed method first generates
characters with different contexts via simple image concatenation operations
and then optimizes contrastive loss on their embeddings. By pulling together
clusters of identical characters within various contexts and pushing apart
clusters of different characters in embedding space, ConCLR suppresses the
side-effect of overfitting to specific contexts and learns a more robust
representation. Experiments show that ConCLR significantly improves
out-of-vocabulary generalization and achieves state-of-the-art performance on
public benchmarks together with attention-based recognizers.'
MODELS:
Architecture:
- CTC
- Attention
- Transformer
Learning Method:
- Self-Supervised
- Supervised
Language Modality:
- Implicit Language Model
Network Structure: 'https://user-images.githubusercontent.com/65173622/209343799-96428e0e-9a93-4763-be47-a23f575dc2f3.png'
FPS:
DEVICE: N/A
ITEM: N/A
FLOPS:
DEVICE: N/A
ITEM: N/A
PARAMS: N/A
Experiment:
Training DataSets:
- MJ
- ST
Test DataSets:
Avg.: 92.4
IIIT5K:
WAICS: 96.5
SVT:
WAICS: 94.3
IC13:
WAICS: 97.7
IC15:
WAICS: 85.4
SVTP:
WAICS: 89.3
CUTE:
WAICS: 91.3
Bibtex: '@inproceedings{zhang2022context,
title={Context-based Contrastive Learning for Scene Text Recognition},
author={Zhang, Xinyun and Zhu, Binwu and Yao, Xufeng and Sun, Qi and Li, Ruiyu and Yu, Bei},
year={2022},
organization={AAAI}
}'
@@ -0,0 +1,73 @@
Title: 'MaskOCR: Text Recognition with Masked Encoder-Decoder Pretraining'
Abbreviation: MaskOCR
Tasks:
- TextRecog
Venue: arXiv
Year: 2022
Lab/Company:
- Department of Computer Vision Technology (VIS), Baidu Inc.
URL:
Venue: N/A
Arxiv: 'https://arxiv.org/abs/2206.00311'
Paper Reading URL: N/A
Code: N/A
Supported In MMOCR: N/S
PaperType:
- Algorithm
- Dataset
Abstract: 'In this paper, we present a model pretraining technique, named
MaskOCR, for text recognition. Our text recognition architecture is an
encoder-decoder transformer: the encoder extracts the patch-level
representations, and the decoder recognizes the text from the representations.
Our approach pretrains both the encoder and the decoder in a sequential manner.
(i) We pretrain the encoder in a self-supervised manner over a large set of
unlabeled real text images. We adopt the masked image modeling approach, which
has shown its effectiveness for general images, expecting that the representations
take on semantics. (ii) We pretrain the decoder over a large set of synthesized
text images in a supervised manner and enhance the language modeling capability
of the decoder by randomly masking some text image patches occupied by
characters input to the encoder and accordingly the representations input to
the decoder. Experiments show that the proposed MaskOCR approach achieves
superior results on the benchmark datasets, including Chinese and English text
images.'
MODELS:
Architecture:
- Transformer
Learning Method:
- Self-Supervised
- Supervised
Language Modality:
- Implicit Language Model
Network Structure: 'https://user-images.githubusercontent.com/65173622/209494710-489fe94b-d550-4c5e-bdab-24590a3c3fe2.png'
FPS:
DEVICE: N/A
ITEM: N/A
FLOPS:
DEVICE: N/A
ITEM: N/A
PARAMS: 315M
Experiment:
Training DataSets:
- ST
- MJ
- Real
Test DataSets:
Avg.: 93.8
IIIT5K:
WAICS: 96.5
SVT:
WAICS: 94.1
IC13:
WAICS: 97.8
IC15:
WAICS: 88.7
SVTP:
WAICS: 90.2
CUTE:
WAICS: 92.7
Bibtex: '@article{lyu2022maskocr,
title={MaskOCR: Text Recognition with Masked Encoder-Decoder Pretraining},
author={Lyu, Pengyuan and Zhang, Chengquan and Liu, Shanshan and Qiao, Meina and Xu, Yangliu and Wu, Liang and Yao, Kun and Han, Junyu and Ding, Errui and Wang, Jingdong},
journal={arXiv preprint arXiv:2206.00311},
year={2022}
}'
@@ -0,0 +1,80 @@
Title: 'Multimodal Semi-Supervised Learning for Text Recognition'
Abbreviation: SemiMTR
Tasks:
- TextRecog
Venue: arXiv
Year: 2022
Lab/Company:
- AWS AI Labs
URL:
Venue: N/A
Arxiv: 'https://arxiv.org/abs/2205.03873'
Paper Reading URL: N/A
Code: N/A
Supported In MMOCR: N/S
PaperType:
- Algorithm
Abstract: 'Until recently, the number of public real-world text images was
insufficient for training scene text recognizers. Therefore, most modern
training methods rely on synthetic data and operate in a fully supervised
manner. Nevertheless, the amount of public real-world text images has increased
significantly lately, including a great deal of unlabeled data. Leveraging
these resources requires semi-supervised approaches; however, the few existing
methods do not account for the vision-language multimodality structure and are
therefore suboptimal for state-of-the-art multimodal architectures. To bridge
this gap, we present semi-supervised learning for multimodal text recognizers
(SemiMTR) that leverages unlabeled data at each modality training phase.
Notably, our method refrains from extra training stages and maintains the
current three-stage multimodal training procedure. Our algorithm starts by
pretraining the vision model through a single-stage training that unifies
self-supervised learning with supervised training. More specifically, we extend
an existing visual representation learning algorithm and propose the first
contrastive-based method for scene text recognition. After pretraining the
language model on a text corpus, we fine-tune the entire network via a
sequential, character-level, consistency regularization between weakly and
strongly augmented views of text images. In a novel setup, consistency is
enforced on each modality separately. Extensive experiments validate that
our method outperforms the current training schemes and achieves
state-of-the-art results on multiple scene text recognition benchmarks.
Code will be published upon publication.'
MODELS:
Architecture:
- Attention
Learning Method:
- Self-Supervised
- Semi-Supervised
- Supervised
Language Modality:
- Implicit Language Model
Network Structure: 'https://user-images.githubusercontent.com/65173622/209488117-5c6c6ee1-3419-4aec-97f5-1e1b28ae25ff.png'
FPS:
DEVICE: N/A
ITEM: N/A
FLOPS:
DEVICE: N/A
ITEM: N/A
PARAMS: N/A
Experiment:
Training DataSets:
- ST
- MJ
- Real
Test DataSets:
Avg.: 93.3
IIIT5K:
WAICS: 97.3
SVT:
WAICS: 96.6
IC13:
WAICS: 97.0
IC15:
WAICS: 84.7
SVTP:
WAICS: 93.0
CUTE:
WAICS: 93.8
Bibtex: '@article{aberdam2022multimodal,
title={Multimodal Semi-Supervised Learning for Text Recognition},
author={Aberdam, Aviad and Ganz, Roy and Mazor, Shai and Litman, Ron},
journal={arXiv preprint arXiv:2205.03873},
year={2022}
}'
@@ -0,0 +1,87 @@
Title: 'PETR: Rethinking the Capability of Transformer-Based Language Model in Scene Text Recognition'
Abbreviation: PETR
Tasks:
- TextRecog
Venue: TIP
Year: 2022
Lab/Company:
- University of Science and Technology of China
URL:
Venue: N/A
Arxiv: N/A
Paper Reading URL: N/A
Code: N/A
Supported In MMOCR: N/S
PaperType:
- Algorithm
Abstract: 'The exploration of linguistic information promotes the development of
scene text recognition task. Benefiting from the significance in parallel
reasoning and global relationship capture, transformer-based language model
(TLM) has achieved dominant performance recently. As a decoupled structure
from the recognition process, we argue that TLM’s capability is limited by the
input low-quality visual prediction. To be specific: 1) The visual prediction
with low character-wise accuracy increases the correction burden of TLM. 2)
The inconsistent word length between visual prediction and original image
provides a wrong language modeling guidance in TLM. In this paper, we propose
a Progressive scEne Text Recognizer (PETR) to improve the capability of
transformer-based language model by handling the above two problems. Firstly, a
Destruction Learning Module (DLM) is proposed to consider the linguistic
information in the visual context. DLM introduces the recognition of destructed
images with disordered patches in the training stage. Through guiding the
vision model to restore patch orders and make word-level prediction on the
destructed images, visual prediction with high character-wise accuracy is
obtained by exploring inner relationship between the local visual patches.
Secondly, a new Language Rectification Module (LRM) is proposed to optimize
the word length for language guidance rectification. Through progressively
implementing LRM in different language modeling steps, a novel progressive
rectification network is constructed to handle some extremely challenging
cases (e.g. distortion, occlusion, etc.). By utilizing DLM and LRM, PETR
enhances the capability of transformer-based language model from a more
general aspect, that is, focusing on the reduction of correction burden and
rectification of language modeling guidance. Compared with parallel
transformer-based methods, PETR obtains 1.0% and 0.8% improvement on regular
and irregular datasets respectively while introducing only 1.7M additional
parameters. The extensive experiments on both English and Chinese benchmarks
demonstrate that PETR achieves the state-of-the-art results.'
MODELS:
Architecture:
- Transformer
Learning Method:
- Supervised
Language Modality:
- Explicit Language Model
Network Structure: 'https://user-images.githubusercontent.com/65173622/209489701-073cdf37-5990-4bcf-8aa8-434255fd568e.png'
FPS:
DEVICE: N/A
ITEM: N/A
FLOPS:
DEVICE: N/A
ITEM: N/A
PARAMS: N/A
Experiment:
Training DataSets:
- ST
- MJ
Test DataSets:
Avg.: 90.8
IIIT5K:
WAICS: 95.8
SVT:
WAICS: 92.4
IC13:
WAICS: 97.0
IC15:
WAICS: 83.3
SVTP:
WAICS: 86.2
CUTE:
WAICS: 89.9
Bibtex: '@article{wang2022petr,
title={PETR: Rethinking the Capability of Transformer-Based Language Model in Scene Text Recognition},
author={Wang, Yuxin and Xie, Hongtao and Fang, Shancheng and Xing, Mengting and Wang, Jing and Zhu, Shenggao and Zhang, Yongdong},
journal={IEEE Transactions on Image Processing},
volume={31},
pages={5585--5598},
year={2022},
publisher={IEEE}
}'