Project RBZ

This is the GitHub page for my PhD research, "Sentiment Analysis for Arabizi on Social Media", where we publicise the resources produced by the project. Visit the webpage for more details and content about the project.

Introduction

Arabizi is the name given to a new social transcription of spoken Arabic in Latin script. The term is a portmanteau of Araby (Arabic) and Englizi (English). It is an informal written language in which Arabs transcribe their dialectal mother tongue using Latin letters and numerals instead of Arabic script. For example, حبيبي (Ḥabībī, "my love") could be transcribed as 7abibi in Arabizi.
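To illustrate how numerals stand in for Arabic letters that have no Latin equivalent, here is a minimal sketch. The mapping lists only a few commonly cited conventions and is an assumption for illustration; actual usage varies by writer and region, and the function name is hypothetical.

```python
# A few commonly cited Arabizi numeral conventions.
# Assumption: this is illustrative only; usage varies by writer and region.
NUMERAL_TO_ARABIC = {
    "2": "ء",  # hamza (glottal stop)
    "3": "ع",  # ayn
    "5": "خ",  # kha
    "7": "ح",  # pharyngeal h, as in 7abibi
}


def annotate_numerals(word):
    """Return (numeral, Arabic letter) pairs for each mapped numeral in the word."""
    return [(ch, NUMERAL_TO_ARABIC[ch]) for ch in word if ch in NUMERAL_TO_ARABIC]


print(annotate_numerals("7abibi"))  # [('7', 'ح')]
```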

Arabizi is extremely low-resourced for Natural Language Processing (NLP). In this research we focus on resourcing Lebanese dialect Arabizi for sentiment analysis, the NLP task of automatically classifying text as positive, negative, or neutral. However, the nature of Arabizi script poses many challenges for sentiment classification: it is highly sparse and code-switched, meaning that words can take a large number of orthographic or morphological forms and can be mixed with words from other languages such as French or English.

Arabic is also a phonetically rich language, containing short and long vowels, soft and emphatic consonants, and guttural phonemes. Transcribing it in Latin script, which has comparatively few symbols for these sounds, generates word ambiguities so severe that transliterating back to Arabic becomes difficult: one Arabizi word could easily map to several Arabic words of different meanings. We recognised that Arabizi is a social language and resourced it independently, without attempting to transliterate it to Arabic.
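The ambiguity can be made concrete with a small sketch that enumerates candidate Arabic spellings for one Latin-script word. The candidate table below is a simplified assumption for illustration (for example, a Latin "t" standing for either the soft or the emphatic consonant, and "a" standing for either an unwritten short vowel or a long vowel); it is not the project's transliteration model, and the names are hypothetical.

```python
from itertools import product

# Simplified, assumed candidate table: each Latin character may stand for
# more than one Arabic letter. "" means a short vowel that Arabic script
# normally leaves unwritten.
CANDIDATES = {
    "t": ["ت", "ط"],  # soft vs emphatic t
    "s": ["س", "ص"],  # soft vs emphatic s
    "d": ["د", "ض"],  # soft vs emphatic d
    "h": ["ه", "ح"],  # glottal vs pharyngeal h
    "a": ["", "ا"],   # short (unwritten) vs long vowel
    "b": ["ب"],
}


def candidate_spellings(word):
    """Enumerate every Arabic spelling allowed by the candidate table."""
    options = [CANDIDATES.get(ch, [ch]) for ch in word]
    return ["".join(combo) for combo in product(*options)]


spellings = candidate_spellings("tab")
print(len(spellings))  # 4
```

Even this three-letter toy word yields four candidate spellings, and the count multiplies with every additional ambiguous character.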

Read more in depth about the challenges of Arabizi for sentiment classification and transliteration here.

Resources

Find the following files in the resources directory.

Arabizi Identification in Twitter Data (2016)

In this paper we present a pilot study of the percentage of Arabizi usage on Twitter across Lebanon and Egypt. We also describe our approach to training a classifier that distinguishes Arabizi from other Latin-script languages.

This file contains two annotated datasets of 5K tweets each (Arabizi/Not Arabizi) from Lebanon and Egypt.

SenZi: A Sentiment Analysis Lexicon for the Latinised Arabic (2019)

In this paper we present the outcomes of the work: SenZi, the new Lebanese dialect Arabizi sentiment lexicon; sentiment-annotated datasets; and a Facebook corpus. We then detail our approach to automatically expanding every sentiment word in SenZi to match its inflectional and orthographic variants, using word embeddings jointly with a simple rule-based technique.
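As a rough illustration of combining embedding neighbours with a rule-based filter, the sketch below keeps only those neighbours of a seed word that share its consonant skeleton. This is an assumption-laden toy, not the method from the paper: the skeleton rule, the similarity threshold, the function names, and the neighbour list with its similarity scores are all made up for demonstration.

```python
import re

VOWELS = set("aeiou")


def skeleton(word):
    """Drop vowels and collapse repeated characters (a hypothetical rule)."""
    consonants = "".join(ch for ch in word.lower() if ch not in VOWELS)
    return re.sub(r"(.)\1+", r"\1", consonants)


def expand(seed, neighbours, min_sim=0.6):
    """Keep neighbours that are similar enough and share the seed's skeleton."""
    key = skeleton(seed)
    return [w for w, sim in neighbours if sim >= min_sim and skeleton(w) == key]


# Toy nearest neighbours with invented similarity scores, standing in for
# the output of a real embedding lookup.
TOY_NEIGHBOURS = [
    ("7elou", 0.82),
    ("7ilo", 0.77),
    ("jamil", 0.74),  # similar meaning, different skeleton: filtered out
    ("7eloo", 0.71),
    ("7al", 0.40),    # same skeleton, below the threshold: filtered out
]

print(expand("7elo", TOY_NEIGHBOURS))  # ['7elou', '7ilo', '7eloo']
```

The embedding step proposes candidates that occur in similar contexts, while the rule guards against adding words that are merely related in meaning rather than spelling variants of the seed.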

This file contains:

SenZi: The original sentiment lexicon, consisting of 2K sentiment words.
SenZi Expanded: An orthographically and morphologically rich sentiment lexicon consisting of 25K sentiment words.
Datasets: Annotated datasets of 4.4K tweets (Arabizi/Not-Arabizi) and 1.6K tweets (positive/negative sentiment).
Corpus: 1M Arabizi comments extracted from 47 public Facebook pages.
Embeddings: Word2vec and fastText word embedding spaces trained on a filtered corpus.

Contact

Have any questions? Get in touch.
