Persian-Text-Engineering-Hub

A convolutional sequence-to-sequence model based on Tachibana et al., with modifications. The repo contains notebooks for training and inference, along with the datasets they require.

Persian_g2p: A seq-to-seq model for Persian G2P mapping

Persian Grapheme-to-Phoneme (G2P) converter

G2P

The G2P algorithm generates the most probable pronunciation for a word that is not in the lexicon dictionary. It can be used as a preprocessing step in a text-to-speech system to generate pronunciations for out-of-vocabulary (OOV) words.

Tihu Persian Dictionary

Tihu-dict is a pronouncing dictionary of Persian.

Word Analyzing

CPIA - Contemporary Persian Inflectional Analyzer

Informal and Formal Persian word analyzer (inflection with FST)

Persian Morphologically Segmented Lexicon 0.5

This dataset includes 45300 Persian word forms which are manually segmented into sequences of morphemes.

Universal Derivations v1.1

Universal Derivations (UDer) is a collection of harmonized lexical networks capturing word-formation, especially derivation, in a cross-linguistically consistent annotation scheme for many languages including Persian (semi-automatically). Consists of 7k families, 43k lexemes and 35k relations. Article. Dataset files.

polyglot

A morpheme extractor for 135 languages, including Persian.

PARSEME Corpus Fa

PARSEME is a verbal multiword expressions (VMWEs) corpus for Farsi. All the annotated data come from a subset of the Farsi section of the MULTEXT-East "1984" annotated corpus 4.0. In addition to the LEMMA, UPOS, XPOS, FEATS, HEAD and DEPREL columns, there is a manually annotated PARSEME:MWE column.
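
As a rough illustration of how the PARSEME:MWE column can be read, here is a minimal Python sketch. It assumes the tab-separated `.cupt` (CoNLL-U Plus) column order, with PARSEME:MWE as the last column, `*` or `_` for tokens outside any VMWE, and tags of the form `id:category` on the first token of an expression. The function name `read_vmwes` and the sample rows (including their dependency labels) are made up for illustration and are not taken from the corpus.

```python
from collections import defaultdict

# Assumed column order: the ten CoNLL-U columns followed by PARSEME:MWE.
COLUMNS = ["ID", "FORM", "LEMMA", "UPOS", "XPOS", "FEATS",
           "HEAD", "DEPREL", "DEPS", "MISC", "PARSEME:MWE"]


def read_vmwes(sentence_lines):
    """Group the token forms of one sentence by VMWE id."""
    members = defaultdict(list)
    categories = {}
    for line in sentence_lines:
        # Skip blank lines and sentence-level comments.
        if not line.strip() or line.startswith("#"):
            continue
        row = dict(zip(COLUMNS, line.rstrip("\n").split("\t")))
        tag = row["PARSEME:MWE"]
        if tag in ("*", "_"):
            continue  # token does not belong to any VMWE
        # A token may belong to several VMWEs, separated by ';'.
        for part in tag.split(";"):
            mwe_id, _, category = part.partition(":")
            members[mwe_id].append(row["FORM"])
            if category:  # only the first token carries the category
                categories[mwe_id] = category
    return {i: (categories.get(i), forms) for i, forms in members.items()}


# Hypothetical sentence containing one light-verb construction.
rows = [
    ["1", "او", "او", "PRON", "_", "_", "3", "nsubj", "_", "_", "*"],
    ["2", "تصمیم", "تصمیم", "NOUN", "_", "_", "3", "compound:lvc", "_", "_", "1:LVC.full"],
    ["3", "گرفت", "گرفت", "VERB", "_", "_", "0", "root", "_", "_", "1"],
]
sample = ["\t".join(r) for r in rows]
vmwes = read_vmwes(sample)
print(vmwes)
```

Grouping by id rather than by adjacency keeps the sketch usable for discontinuous expressions, where other tokens sit between the parts of a VMWE.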

Universal Segmentations

Universal Segmentations (UniSegments) is a collection of lexical resources capturing morphological segmentations harmonised into a cross-linguistically consistent annotation scheme for many languages, including Persian. The annotation scheme consists of simple tab-separated columns that store a word and its morphological segmentation, along with information about the word and the segmented units, e.g., part-of-speech categories and types of morphs/morphemes. It also has a Python library for creating such data from text. The dataset includes 45k Persian words.
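
A tab-separated layout like this is straightforward to read with plain Python. The sketch below is only an illustration: the column order (word, part of speech, segmentation) and the ` + ` morph separator are assumptions made for this example, not the documented UniSegments layout, and the function names are hypothetical.

```python
def parse_segmentation_line(line):
    """Parse one tab-separated line into a small record.

    Assumed (hypothetical) columns: word, POS tag, morphs joined by ' + '.
    """
    word, pos, segmentation = line.rstrip("\n").split("\t")[:3]
    return {"word": word, "pos": pos, "morphs": segmentation.split(" + ")}


def is_consistent(record):
    """Check that concatenating the morphs gives back the word form."""
    return "".join(record["morphs"]) == record["word"]


# Illustrative line, not taken from the dataset.
record = parse_segmentation_line("دانشمند\tNOUN\tدانش + مند")
print(record["morphs"], is_consistent(record))
```

The consistency check is a cheap way to spot lines whose segmentation does not concatenate back to the surface form, which can happen when morphs are stored in a normalized rather than surface shape.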

Perstem

Persian stemmer and morphological analyzer

Persian Stemming Dataset

Consists of two stemming sets: 1) 4k words from Bootstrapping the Development of an HPSG-based Treebank for Persian and 2) 27k words from A syntactic valency lexicon for Persian verbs: The first steps towards Persian dependency treebank.

Persian Stemmer Python

A stemmer for Persian based on A new hybrid stemming method for persian language

Sentiment Analysis

Persian Sentiment Resources

Awesome Persian Sentiment Analysis Resources - منابع مرتبط با تحلیل احساسات در زبان فارسی

Consists of following datasets:
- Deep Neural Networks in Persian Sentiment Analysis
- Sentiment Analysis Challenges
- Sentiment Lexicon
- Sentiment Tagged Corpus (dataset)
- HesNegar: Persian Sentiment WordNet

Persian Sentiment Analyzer

Consists of data (3K) and code (notebook) to create an LSTM model for sentiment analysis.

Sentiment Analysis

Sentiment analysis using ML and DL models on Persian texts

LexiPers

A Sentiment Analysis Lexicon for Persian. Consists of 4k words

Taaghche | طاقچه

Persian book comment ratings dataset. Consists of about 70k comments on about 11k books.

Digikala (comments & products)

The Digikala (comments & products) dataset offers a comprehensive glimpse into the vast online marketplace of Digikala, comprising over 1.2 million products and more than 6 million comments.

Digikala Comments

3k comments with score and ratings.

MirasOpinion

93k manually labeled Digikala product comments.

Persian tweets emotional dataset

20k tweets with emotion identification labels.

Persian Emotion Detection (tweets)

A Dataset of 30,000 emotion labeled Persian Tweets.

Persian Text Emotion

Consists of 5.56K tweets with labels (sadness, anger, happiness, hatred, wonder and fear) describing their emotions.

ArmanEmo

Consists of 7k docs with 6 emotion label types (sadness, anger, happiness, hatred, wonder, fear).

Snappfood

Snappfood (an online food delivery company) user comments containing 70,000 comments with two labels (i.e. polarity classification): Happy, Sad.

NRC Persian Lexicon

The Persian translation of the NRC Emotion Lexicon, which is a list of English words with their associated basic emotions in eight categories (anger, fear, anticipation, trust, surprise, sadness, joy, and disgust).

Pars ABSA

Consists of 10k samples, each of which focuses on one aspect of a comment about a cell phone (e.g. camera, screen resolution). A comment may appear in more than one sample, depending on the number of aspects it covers.

PerSent -- Persian Sentiment Analysis and Opinion Mining Lexicon

Consists of 1500 words with their degrees of polarity.

DeepSentiPers

Utilizes the SentiPers dataset, which consists of 7,400 sentences, and enhances it with various embeddings to develop both LSTM and CNN models. All the original and newly transformed data, along with the notebooks used to create the models, are available in this repository.

ParsBERT

A BERT-based transformer fine-tuned on various sentiment analysis datasets such as Digikala, SnappFood, SentiPers and Taaghche.

ParsiNLU

Persian NLP team trained various mt5 models on their sentiment analysis dataset.

Informal Persian

Shekasteh

Shekasteh is an evaluation dataset for Persian colloquial text. It comes from different genres, including blog posts, movie subtitles, and forum chats.

CPIA

Informal and Formal Persian word analyzer (inflection with FST)

Persian Slang

Persian Slang Words (dataset)

Informal Persian Universal Dependency Treebank (iPerUDT)

Informal Persian Universal Dependency Treebank, consisting of 3000 sentences and 54,904 tokens, is an open source collection of colloquial informal texts from Persian blogs.

Numbers <> Words

NumToPersian

Converts numbers to words.

Convert numbers to Persian words

Read me this number python -- Convert number to Persian

PersianNumberToWord

Convert numbers to Persian words.

DPERN

Describe PERsian Numbers

ParsiNorm

A normalizer with extensive number handling, converting in both directions.

Persian Tools

Handling various number types in Persian text (like National ID, Sheba, etc)

petit

Persian text -> integer, integer -> text converter

num2fawords

Takes a number and converts it to Persian word form
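
To make the task these libraries address concrete, here is a minimal Python sketch of number-to-words conversion for integers from 0 to 999,999. It is not the API or implementation of any library listed in this section; the function names and the choice to render 1000 as "هزار" (rather than "یک هزار") are assumptions made for the example.

```python
ONES = ["", "یک", "دو", "سه", "چهار", "پنج", "شش", "هفت", "هشت", "نه"]
TEENS = ["ده", "یازده", "دوازده", "سیزده", "چهارده",
         "پانزده", "شانزده", "هفده", "هجده", "نوزده"]
TENS = ["", "", "بیست", "سی", "چهل", "پنجاه", "شصت", "هفتاد", "هشتاد", "نود"]
HUNDREDS = ["", "صد", "دویست", "سیصد", "چهارصد",
            "پانصد", "ششصد", "هفتصد", "هشتصد", "نهصد"]
JOINER = " و "  # Persian joins number parts with "and"


def _below_thousand(n):
    """Spell out 1..999 as a list of parts joined by the conjunction."""
    parts = []
    if n >= 100:
        parts.append(HUNDREDS[n // 100])
        n %= 100
    if n >= 20:
        parts.append(TENS[n // 10])
        n %= 10
    elif n >= 10:
        parts.append(TEENS[n - 10])
        n = 0
    if n:
        parts.append(ONES[n])
    return JOINER.join(parts)


def number_to_persian(n):
    """Convert an integer in [0, 999999] to Persian words."""
    if not 0 <= n < 1_000_000:
        raise ValueError("this sketch only supports 0..999999")
    if n == 0:
        return "صفر"
    thousands, rest = divmod(n, 1000)
    parts = []
    if thousands:
        # Assumption: a single thousand is written without a leading "one".
        prefix = "" if thousands == 1 else _below_thousand(thousands) + " "
        parts.append(prefix + "هزار")
    if rest:
        parts.append(_below_thousand(rest))
    return JOINER.join(parts)


print(number_to_persian(1402))
```

Real libraries additionally handle negatives, decimals, ordinals and much larger magnitudes, and the reverse direction (words to digits) needs its own tokenizer.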

Embeddings

FastText

Pre-trained word vectors of 157 languages including Persian, trained on CommonCrawl and Wikipedia using CBOW.

Persian Word Embedding

A tutorial on how to use 3 word embeddings; a) Downloading and using fasttext Persian word embeddings. b) How to get word embeddings of ParsBERT base model itself. c) How to get word embeddings of ParsGPT model.

Persian Word2Vec

A Persian Word2Vec Model trained by Wikipedia articles

Sentence Transformers (ParsBERT)

Three similar models based on fine-tuning the ParsBERT base model on 3 different entailment datasets. Each of these models can be used for semantic search, clustering, summarization, information retrieval and topic modeling tasks.

Benchmark

ParsiNLU

A comprehensive suite of high-level NLP tasks for the Persian language. The dataset consists of the following tasks: text entailment, query paraphrasing, reading comprehension, multiple-choice QA, machine translation and sentiment analysis. The authors have also fine-tuned mt5 models on these datasets, resulting in various Persian models.

ParsBench - pb

ParsBench provides toolkits for benchmarking LLMs based on the Persian language tasks.

ParsiNLU all tasks
Persian NER
Persian Math
ConjNLI Entailment
Persian MMLU (Khayyam Challenge)

Benchmarking ChatGPT for Persian

Benchmarking ChatGPT for Persian: A Preliminary Study

Elementary school
Mathematical problems dataset

QA

PersianQA

Persian (Farsi) Question Answering Dataset. Two models accompany it: bert-base-fa-qa (162M parameters), fine-tuned on this dataset, and xlm-roberta-large-fa-qa (558M parameters), fine-tuned on this dataset and the SQuAD2.0 (English) dataset.

MeDiaPQA: A Question-Answering Dataset on Persian Medical Dialogues

A medical question answering dataset consisting of 15k dialogs across 70 specialties.

Persian-QA-Wikipedia

26k question-answer pairs with related excerpts extracted from Persian Wikipedia. By design, some of the questions cannot be answered from the given excerpt (as in SQuAD2.0).

ParsSQuAD

Persian Question Answering Dataset based on Machine Translation of SQuAD 2.0

Crossword Cheat

Consists of 30K questions and answers of various Persian crossword puzzles.

ParsiNLU

Persian NLP team trained various mt5 and BERT models on their multiple-choice QA dataset.

Persian Conversational Dataset (Legal)

It consists of 266k legal questions, answers and related tags.

Alpaca Persian

Persian translation of 35k records of Stanford Alpaca Instruction dataset (52K records). There is also a version with different formatting.

Dependency Parsing

The Persian Universal Dependency Treebank (Persian UD)

The Persian Universal Dependency Treebank (Seraji) is based on Uppsala Persian Dependency Treebank (UPDT). The conversion of the UPDT to the Universal Dependencies was performed semi-automatically with extensive manual checks and corrections.

The Persian Universal Dependency Treebank (PerUDT) (v1.0)

The Persian Universal Dependency Treebank (PerUDT) is the result of automatic conversion of the Persian Dependency Treebank (PerDT), with extensive manual corrections. Consists of 29k sentences.

PARSEME Corpus Fa

PARSEME is a verbal multiword expressions (VMWEs) corpus for Farsi. All the annotated data come from a subset of the Farsi section of the MULTEXT-East "1984" annotated corpus 4.0. In addition to the LEMMA, UPOS, XPOS, FEATS, HEAD and DEPREL columns, there is a manually annotated PARSEME:MWE column.

UDPipe 2

UDPipe: Trainable pipeline for tokenizing, tagging, lemmatizing and parsing Universal Treebanks and other CoNLL-U files.

Informal Persian Universal Dependency Treebank (iPerUDT)

Informal Persian Universal Dependency Treebank, consisting of 3000 sentences and 54,904 tokens, is an open source collection of colloquial informal texts from Persian blogs.

Entailment

FarsTail: a Persian natural language inference dataset

10k pairs with entailment label.

Sentence Transformers

Utilizes the FarsTail dataset for fine-tuning its ParsBERT model, while also incorporating two other entailment datasets: Wiki Triplet and Wiki D/Similar.

ParsiNLU

Persian NLP team trained various mt5 and BERT models on their entailment dataset.

Datasets (classification)

Virgool Dataset

This could be a nice tool for Persian writers or bloggers to automatically pick the suggested hashtag or even subject for their articles. We could even collect data from google trend for each hashtag or 'label' used in an article. Consists of 11k+ articles.

BBC Persian Archive

The file contains 3780 news articles published by BBC Persian. The articles mostly belong to the year 1399 and 1400, and are published before Aban 18th, 1400. Columns are: title, publish_name, link, related_topics, body, category.

TasnimNews Dataset (Farsi - Persian) | تسنیم

Consists of 63k news articles with the following columns: category, title, abstract, body, time.

Farsnews-1398

Yearly collection of the Farsnews agency (1398). Contains 294k news articles with the following columns: title, abstract, paragraphs, cat, subcat, tags, link.

Digikala Magazine (DigiMag)

A total of 8,515 articles scraped from Digikala Online Magazine. This dataset includes seven different classes: Video Games, Shopping Guide, Health Beauty, Science Technology, General, Art Cinema and Books Literature.

Miras Irony

Contains about 3K tweets, with each one of them labeled as either ironic or not.

Persian Stance Detection

4K records for stance detection in the headlines and bodies of news articles.

A Stance dataset

Consists of 5.5K pairs of tweets in which the stance of the reply tweet is marked as against, support or neither with respect to the main tweet.

A Claim dataset

Consists of 3.8K tweets in which the type of each claim has been identified. ~~But it does not show where the claim is located in the main tweet.~~

NER

Persian Twitter NER (ParsTwiner)

Named Entity Recognition (NER) on the Persian Twitter dataset. Covers 6 entity types, including event, location, nationality, organization and pog (political organizations and historical dynasties). ~~12k Named Entities in 232k tokens~~.

NSURL-2019 task 7: Named Entity Recognition (NER) in Farsi

Extends PEYMA corpus (300k tokens), with another 600k tokens. Consists of 16 entity types including: date, location, percent number, money, time, person and organization. ~~48k NEs in 884k tokens~~.

PersianNER (Arman)

The dataset includes 250,015 tokens and 7,682 Persian sentences in total. Consists of 6 NE types including: facility, organization, location, event, person and proper noun. ~~37K NEs in 749k tokens~~.

Persian-NER

Crowd-sourced NE dataset with 5 NE types. ~~2.2M NEs in 25M tokens.~~

ParsNER

A mixed NER dataset collected from ARMAN, PEYMA, and WikiANN that covers ten entity types: Date, Event, Facility, Location, Money, Organization, Percent, Person, Product and Time. 140K NEs in 40k sentences.

DaMuEL 1.0: A Large Multilingual Dataset for Entity Linking

It is a large Multilingual Dataset for Entity Linking containing data in 53 languages including Persian. DaMuEL consists of two components: a knowledge base that contains language-agnostic information about entities, including their claims from Wikidata and named entity types (PER, ORG, LOC, EVENT, BRAND, WORK_OF_ART, MANUFACTURED); and Wikipedia texts with entity mentions linked to the knowledge base, along with language-specific text from Wikidata such as labels, aliases, and descriptions, stored separately for each language. Paper. For this project UDPipe has been used.

xtreme

XTREME is a benchmark for the evaluation of the cross-lingual generalization ability of pre-trained multilingual models that covers 40 typologically diverse languages and includes nine tasks. But for Persian it only consists of:

Wikiann named entity recognition
Universal dependencies part-of-speech tagging (rasooli et al.)

Unlabeled and Raw

Persian SMS Dataset

Persian real SMS Dataset

Tarjoman (Persian Text) | ترجمان

More than 3k articles crawled from the Tarjoman website.

Large-Scale Colloquial Persian

27M tweets. Although these texts have been labeled or translated using various NLP toolkits, they have never been supervised.

VOA 2003 - 2008

Consists of 8M words with following columns: title, date, url and body.

Ensani-ir Abstracts

219K abstracts collected from Ensani.ir papers.

Toxic text

Persian Abusive Words

We created a dataset of 33338 Persian tweets, of which 10% contained Abusive words and 90% were non-Abusive.

Sansorchi

Remove Persian (Farsi) Swear Words

Persian Swear Words

Persian Swear Dataset - you can use in your production to filter unwanted content. دیتاست کلمات نامناسب و بد فارسی برای فیلتر کردن متن ها

Stop word list

Persian stopwords collection

A collection of Persian stopwords. Consists of:

All combined

Different sources

Persian StopWords

Consists of about 2k stop words.

Spell checking

Persian Spell Checker with Kenlm

Complete instructions for training a Persian spell checker and a language model based on SymSpell and KenLM, using a Wikipedia dataset. Tokens that are not in the vocabulary and have a very low frequency are considered misspelled and are replaced with the vocabulary entry that maximizes the probability of the sentence.
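
The idea described above can be sketched in a few lines of Python. This is a toy stand-in, not the repository's method: candidate generation uses plain edit-distance-1 variants instead of SymSpell, and a unigram frequency model replaces the KenLM sentence score, so maximizing sentence probability reduces to picking the most frequent candidate. The alphabet string, the `min_count` threshold and the word counts are all assumptions made for the example.

```python
import math
from collections import Counter

# Assumed alphabet for generating candidate edits (illustrative only).
PERSIAN_LETTERS = "آابپتثجچحخدذرزژسشصضطظعغفقکگلمنوهی"


def edits1(word):
    """All strings one delete, swap, replace or insert away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    swaps = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in PERSIAN_LETTERS]
    inserts = [a + c + b for a, b in splits for c in PERSIAN_LETTERS]
    return set(deletes + swaps + replaces + inserts)


def correct_sentence(tokens, counts, min_count=2):
    """Replace rare or unknown tokens with the most probable known variant."""
    total = sum(counts.values())
    # Tokens seen fewer than min_count times are treated as suspicious.
    vocab = {w for w, c in counts.items() if c >= min_count}
    corrected = []
    for token in tokens:
        if token in vocab:
            corrected.append(token)
            continue
        candidates = edits1(token) & vocab
        if candidates:
            # Unigram log-probability stands in for a real language model.
            best = max(candidates, key=lambda w: math.log(counts[w] / total))
            corrected.append(best)
        else:
            corrected.append(token)  # no known neighbour: leave unchanged
    return corrected


# Toy frequency table, not derived from any real corpus.
counts = Counter({"کتاب": 50, "کباب": 5, "خوب": 40, "است": 100, "کتلب": 1})
print(correct_sentence(["کتلب", "خوب", "است"], counts))
```

With an n-gram model such as KenLM, the scoring step would instead compare whole-sentence scores for each candidate substitution, which lets context decide between equally plausible corrections.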

FAspell

FASpell dataset was developed for the evaluation of spell checking algorithms. It contains a set of pairs of misspelled Persian words and their corresponding corrected forms similar to the ASpell dataset used for English. The dataset consists of two parts: a) faspell_main: list of 5050 pairs collected from errors made by elementary school pupils and professional typists. b) faspell_ocr: list of 800 pairs collected from the output of a Farsi OCR system.

Lilak, Persian Spell Checking Dictionary

Data created for the Hunspell library, for spell checking and morphological analysis.

Persian Spell Checker

Consists of some lists of misspelled words and some dictionaries of Persian word entries.

PerSpellData

A comprehensive parallel dataset designed for the task of spell checking in Persian. Misspelled sentences together with the correct form are produced using a massive confusion matrix, which is gathered from many sources. This dataset contains informal sentences in addition to the formal sentences, and contains texts from diverse topics. Both non-word and real-word errors are collected in the dataset

HeKasre

Code and data for detecting and correcting one specific kind of cognitive misspelling error in informal Persian.

Normalization

PersianUtils

Standardize your Persian text: Preprocessing, Embedding, and more!

Farsi-Normalizer

Simple Farsi normalizer

virastar

Cleaning up Persian text! (Ruby)

Python version

Virastar (ویراستار)

Virastar is a Persian text cleaner (JS).

Farsi Analyzer

A Persian normalization and tokenization tool, constructed as a plugin for Elasticsearch.

ParsiNorm

A normalizer with extensive number handling, converting in both directions.

Transliteration

Tajik-to-Persian transliteration

Tajik-to-Persian transliteration model

F2F

Farsi to Finglish, a Persian transliterator

Behnevis

24k ASCII transliterated Persian words

Farsi to Tajiki

An attempt to make a transliterator of Farsi (Persian) web page to Tajiki (Cyrillic) with a bookmarklet.

Encyclopedia and Word Set

Vajehdan

Consists of following sets:

Words of Sareh Dictionary (Purified Persian Words)
Farhangestan chosen words for non-Persian equivalents.
Farhange Emlaee (A dictionary of Persian orthography and spelling)
A part of Ganjoor's website poetry repos.
Farhange Motaradef va Motazad (A dictionary of Persian synonyms and antonyms)
Farhange Teyfi (Persian Thesaurus)

persian-names

Persian names dataset

persian-names

A Python package for generating random Persian (Farsi) names.

persian-wordlist

A SQL database that includes a dictionary of 494,286 Persian words.

persianwordjson

A JSON database of meaningful Persian words.

persian-words-category

850k categorized Persian words.

similar-persian-words

pre-calculated list of similar Persian words ordered by rating and best match

an-array-of-persian-words

List of ~240,000 Persian words

persian-databases

Useful Persian dictionary and more. Consists of:

Dehkhoda dictionary (36k)
Synonyms (20k)
Arabic to Persian dictionary (113k)
Persian to Arabic dictionary(32k)
Abjad Persian to Arabic dictionary (42k)
Arabic to Persian dictionary (8k)
Quran Mofradat (1.6k)
Arabic monolingual dictionary (4.6k)
Intermediate Arabic dictionary (41k)
Alamsal - Arabic proverbs dictionary (4.5k)

Iranian job title

The "Iranian Job Title" dataset offers a comprehensive compilation of various job titles prevalent in Iran across diverse industries and sectors.

Moeen_thesaurus

Moeen dictionary based Thesaurus for Persian.

Enhanced Flexicon

An enhanced version of the Flexicon word list with syllables, IPA pronunciation and some refinements to the word list itself.

Poetry and Literature

Hafez Poems

A simple Telegram bot implemented in Python.

Persian Databases

Useful Persian dictionary and more. Consists of:

Persian poetry of Iranian poets:
- Ahmad Shamlou
- Baba-Taher
- Parvin E'tesami
- Hafez
- Khayyam
- Rahi-Moayeri
- Roodaki
- Sa'di
- Sohrab Sepehri
- Shahriar
- Saeb Tabrizi
- Onsori
- Ferdowsi
- Forugh Farrokhzad
- Mehdi Akhavan Sales
- Mowlavi
- Nezami
- Nima Yushij
Quran Database
- Quran Surahs (114)
- Quran Verses (6236)
- Quran Verses Translation by Gomshe'i (6326)
- Quran Translation Word by word (83668)
- Reading voice of Famous Readers (48)

Shereno: A Dataset of Persian Modernist Poetry

Collection of Persian Modernist Poetry from Iranian contemporary poets

Persian Poems Corpus

Crawled Ganjoor for poems of 48 poets.

Persian Poet GPT2

A ParsGPT2 model fine-tuned on the Chronological Persian Poetry Dataset. It generates poems when given the name of a poet.

Chronological Persian Poetry Dataset

Dataset of poetry of 67 Persian poets of different times.

Audio

PSDR

Persian spoken digit recognition

Persian Questions

Simple Persian questions in 4 categories, intended for use in a voice assistant. Named entities in the command utterances are labeled (in text).

Common Voice

About 60 hours of audio recorded by various users reading sentences. Counting duplicate sentences, the total is 500h+.

Persian Speech Corpus

A ~2.5-hour single-speaker speech corpus.

ShEMO: Persian Speech Emotion Detection Database

A semi-natural db which contains emotional speech samples of Persian speakers. The database includes 3000 semi-natural utterances, equivalent to 3 h and 25 min of speech data extracted from online radio plays.

Speech to Text

A deep-learning-based Persian speech recognition system. It uses various ASR platforms to build its models, and various datasets including Mozilla CommonVoice and the authors' own dataset, which consists of 300h+ of audio and transcriptions.

PCVC Speech Dataset

Phoneme based speech dataset.

Vosk

Open-source speech recognition tool for various platforms and OSes, supporting 20 languages including Persian.

Wav2Vec2-Large-XLSR-53-Persian V3

A wav2vec model fine-tuned on the Mozilla CommonVoice Persian dataset. The model and the notebook to recreate it with extra data are available.

Crawl Suite

Persian News Search Engine

A search engine for crawling news from the web, storing in a structured way, and querying through the stored documents for finding the most relevant results using Machine Learning and Information Retrieval techniques.

iranian-news-agencies-crawler

A crawler that fetches the latest news from Iranian (Persian) news agencies.

PersianCrawler

Open source crawler for Persian websites including Asriran, fa-Wikipedia, Tasnim, Isna.

POS Tagging

Persian_POS_Tagger

A Persian POS Tagger trained by The Persian Universal Dependency Treebank (Persian UD) with Tensorflow

PARSEME Corpus Fa

PARSEME is a verbal multiword expressions (VMWEs) corpus for Farsi. All the annotated data come from a subset of the Farsi section of the MULTEXT-East "1984" annotated corpus 4.0. In addition to the LEMMA, UPOS, XPOS, FEATS, HEAD and DEPREL columns, there is a manually annotated PARSEME:MWE column.

Multi-purpose tools with POS Tagging capability

Farsi NLP Tools

Scripts and models developed for POS Tagging and Dependency Parsing Persian based on TurboParser.

RDR POS Tagger

RDRPOSTagger supports pre-trained UPOS, XPOS and morphological tagging models for about 80 languages, including Persian. Java version.

Cross-platform Persian Parts-of-Speech tagger

Another Persian POS tagger.

Various

Perke

A keyphrase extractor for Persian

PREDICT-Persian-Reverse-Dictionary

The first intelligent Persian reverse dictionary. Consists of various models for this task and datasets of Amid, Moeen, Dehkhoda, Persian Wikipedia and Persian Wordnet (Farsnet).

Persian-ATIS (Airline Travel Information System) Dataset

A Persian dataset for Joint Intent Detection and Slot Filling.

ParsiNLU Reading Comprehension

Persian NLP team trained various mt5 models on their reading comprehension dataset.

Base Models

ParsBERT

Family of ParsBERT models including BERT, DistilBERT, ALBERT and RoBERTa. All of them are transformer-based, encoder-only models.

mBERT

Multilingual BERT model covering 104 languages, including Persian.

Shiraz

A BERT-based model trained on the Divan dataset (proprietary). This model has 46.6M parameters. Its evaluation on NER and sentiment analysis is reported.

Tehran

A BERT-based model trained on the Divan dataset (proprietary). This model has 124M parameters. Its evaluation on NER and sentiment analysis is reported.

FaBERT

Is a Persian BERT model trained on various Persian texts.

AriaBERT

Is a Persian BERT model trained on various Persian texts.

TookaBERT

Is a Persian BERT model trained on various Persian texts with 123M parameters. There is also a large version of this model with 353M parameters.

Mocking

PersianFaker

Do you need some fake data?

UI/UX

Persian-Badge

Persian-Badge is a website for having metadata badges in the Persian language

OCR

Handwritten city names in Arabic Persian

This is a dataset of handwritten cities in Iran in Arabic/Persian that has been used in my Master project. This dataset is collected for sorting postal packages.

IranShahr

Hand-written / typed names of different cities of Iran in image format.

PLF Image Dataset

50*50 Images of Persian letters (without dots) with 32 Different Fonts.

Persian Subwords

Consists of about 20k images of Persian subwords in different fonts and sizes, for use in OCR models.

Spam

Persian SMS Spam Word

persian sms spam word

Image Captioning

Coco 2017 Farsi

Coco 2017 translated to Persian language. 91k images with caption in Persian.

Iranis dataset

Dataset of Farsi License Plate Characters (83k).

ParsVQA-Caps

The VQA dataset consists of almost 11k images and 28.5k question and answer pairs with short and long answers usable for both classification and generation VQA.

CLIPfa

A dataset consisting of 16M records of images and their corresponding texts. It also includes a model trained on 400k records of this dataset for searching images by text or by image.

Persian Image Captioning

Consists of about 26K records of images with descriptive captions in Persian.

Translation

Persian movie dataset (English, Persian)

Persian language movies dataset from imvbox.com. 14k movies with storyline translated from Persian to English.

The Holy Quran

Quran ayat with translation in 21 languages.

The Bible

A multilingual parallel corpus created from translations of the Bible. In 100 languages including Persian.

W2C – Web to Corpus

A set of corpora for 120 languages including Persian automatically collected from wikipedia and the web.

ParsiNLU

Persian NLP team trained various mt5 models on their translation dataset.

Knowledge Graph

PERLEX

2.7k Relation of entities with translation and relation type.

DaMuEL 1.0: A Large Multilingual Dataset for Entity Linking

It is a large Multilingual Dataset for Entity Linking containing data in 53 languages including Persian. DaMuEL consists of two components: a knowledge base that contains language-agnostic information about entities, including their claims from Wikidata and named entity types (PER, ORG, LOC, EVENT, BRAND, WORK_OF_ART, MANUFACTURED); and Wikipedia texts with entity mentions linked to the knowledge base, along with language-specific text from Wikidata such as labels, aliases, and descriptions, stored separately for each language. Paper. For this project UDPipe has been used.

FarsBase

It is a knowledge graph platform designed for extracting information from Wikipedia, tables, and unstructured texts. A portion of its data is also available for download.

Baaz

Open information extraction from Persian web.

ParsSimpleQA

The Persian Simple Question Answering Dataset and System over Knowledge Graph. It consists of 36k records.

ParsFEVER

It is a dataset for Persian fact extraction and verification, developed in accordance with FEVER guidelines.

Summary

TasnimNews Dataset (Farsi - Persian) | تسنیم

Consists of 63k news articles with the following columns: category, title, abstract, body, time.

Farsnews-1398

Yearly collection of the Farsnews agency (1398). Contains 294k news articles with the following columns: title, abstract, paragraphs, cat, subcat, tags, link.

Wiki Summary

95k documents with body and summary extracted from Persian Wikipedia articles. There is also a notebook for creating and testing summarization models.

Persian Summarization

Statistical and Semantical Text Summarizer in Persian Language

Persian News Summary

A well-structured summarization dataset for the Persian language, consisting of 93,207 records. It is prepared for abstractive/extractive tasks (like cnn_dailymail for English). It can also be used for other purposes such as text generation, title generation, and news category classification.

Sentence Transformers (ParsBERT)

Consists of similar models fine-tuned from ParsBERT on three different datasets. These models can be used for various applications, including text summarization.

Miras Text

MirasText has more than 2.8 million articles and over 1.4 billion content words. Consists of following columns: content, summary, keywords, title, url.

Paraphrase

ExaPPC

Paraphrase data for Persian. It consists of 2.3M sentence pairs, of which 1M are paraphrases and 1.3M are not paraphrases of each other.

ParsiNLU

Persian NLP team trained various mt5 models on their query paraphrase dataset.

Persian Text Paraphrase

Consists of 800 pairs of Persian sentences which are paraphrases of each other.

WSD

SBU WSD Corpus

SBU-WSD-Corpus: A Sense Annotated Corpus for Persian All-words Word Sense Disambiguation.

Generation

Dorna Llama3 8B Instruct

The Dorna models are a family of decoder-only models, specifically trained/fine-tuned on Persian data. This model is built using the Meta Llama 3 Instruct model. There are also quantized versions of this model.

PersianLLaMA 13B Instruct

With 13 billion parameters, this model has been fine-tuned from Llama 2 using the Persian Alpaca dataset to excel at executing detailed instructions and delivering tailored outputs. There is also PersianLLaMA 13B, which is fine-tuned on Persian Wikipedia.

Thanks

Thanks to Awesome Persian NLP and Awesome Iranian Datasets for providing some elements of this long list.
