Commit

add individual jsons instead of full data
zaidalyafeai committed Dec 20, 2024
1 parent c3f76fe commit dd11107
Showing 706 changed files with 26,984 additions and 0 deletions.
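The file names in this commit appear to follow a simple rule: the record's "Name" is lowercased and spaces become underscores, with other characters such as "-" and "&" left untouched. The sketch below splits a combined list of records into one file per dataset under that inferred rule. The function names, the output directory, and the 4-space indent are assumptions for illustration and are not taken from the repository.

```python
import json
from pathlib import Path


def dataset_filename(name: str) -> str:
    """Build a file name from a dataset name (rule inferred from this commit)."""
    # Lowercase and replace spaces only; "-" and "&" are kept as-is.
    return name.lower().replace(" ", "_") + ".json"


def write_individual_jsons(records: list[dict], out_dir: Path) -> list[Path]:
    """Write each record to its own JSON file and return the written paths."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for record in records:
        path = out_dir / dataset_filename(record["Name"])
        # json.dumps escapes non-ASCII characters by default (e.g. \u2014),
        # which matches the escaped text seen in the files of this commit.
        path.write_text(json.dumps(record, indent=4), encoding="utf-8")
        written.append(path)
    return written
```

For example, `dataset_filename("2006 CoNLL Shared Task - Arabic & Czech")` yields `2006_conll_shared_task_-_arabic_&_czech.json`, the same name used for that record in this commit.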
36 changes: 36 additions & 0 deletions datasets/101_billion_arabic_words_dataset.json
@@ -0,0 +1,36 @@
{
"Name": "101 Billion Arabic Words Dataset",
"Subsets": [],
"HF Link": "https://hf.co/datasets/ClusterlabAi/101_billion_arabic_words_dataset",
"Link": "https://hf.co/datasets/ClusterlabAi/101_billion_arabic_words_dataset",
"License": "Apache-2.0",
"Year": 2024,
"Language": "ar",
"Dialect": "mixed",
"Domain": "web pages",
"Form": "text",
"Collection Style": "crawling",
"Description": "The 101 Billion Arabic Words Dataset is curated by the Clusterlab team and consists of 101 billion words extracted and cleaned from web content, specifically targeting Arabic text. This dataset is intended for use in natural language processing applications, particularly in training and fine-tuning Large Language Models (LLMs) capable of understanding and generating Arabic text.",
"Volume": "101,000,000,000",
"Unit": "tokens",
"Ethical Risks": "High",
"Provider": "Clusterlab",
"Derived From": "Common Crawl",
"Paper Title": "101 Billion Arabic Words Dataset",
"Paper Link": "https://arxiv.org/pdf/2405.01590v1",
"Script": "Arab",
"Tokenized": "No",
"Host": "HuggingFace",
"Access": "Free",
"Cost": "nan",
"Test Split": "No",
"Tasks": "text generation, language modeling",
"Venue Title": "arXiv",
"Citations": "nan",
"Venue Type": "preprint",
"Venue Name": "nan",
"Authors": "Manel Aloui, Hasna Chouikhi, Ghaith Chaabane, Haithem Kchaou, and Chehir Dhaouadi",
"Affiliations": "Clusterlab",
"Abstract": "In recent years, Large Language Models (LLMs) have revolutionized the field of natural language processing, showcasing an impressive rise predominantly in English-centric domains. These advancements have set a global benchmark, inspiring significant efforts toward developing Arabic LLMs capable of understanding and generating the Arabic language with remarkable accuracy. Despite these advancements, a critical challenge persists: the potential bias in Arabic LLMs, primarily attributed to their reliance on datasets comprising English data that has been translated into Arabic. This reliance not only compromises the authenticity of the generated content but also reflects a broader issue\u2014the scarcity of original quality Arabic linguistic data. This study aims to address the data scarcity in the Arab world and to encourage the development of Arabic Language Models that are true to both the linguistic and nuances of the region. We undertook a large-scale data mining project, extracting a substantial volume of text from the Common Crawl WET files, specifically targeting Arabic content. The extracted data underwent a rigorous cleaning and deduplication process, using innovative techniques to ensure the integrity and uniqueness of the dataset. The result is the 101 Billion Arabic Words Dataset, the largest Arabic dataset available to date, which can significantly contribute to the development of authentic Arabic LLMs. This study not only highlights the potential for creating linguistically and culturally accurate Arabic LLMs but also sets a precedent for future research in enhancing the authenticity of Arabic language models.",
"Added By": "Zaid Alyafeai"
}
36 changes: 36 additions & 0 deletions datasets/1993-2007_united_nations_parallel_text.json
@@ -0,0 +1,36 @@
{
"Name": "1993-2007 United Nations Parallel Text",
"Subsets": [],
"HF Link": "nan",
"Link": "https://catalog.ldc.upenn.edu/LDC2013T06",
"License": "LDC User Agreement for Non-Members",
"Year": 2013,
"Language": "multilingual",
"Dialect": "ar-MSA: (Arabic (Modern Standard Arabic))",
"Domain": "other",
"Form": "text",
"Collection Style": "other",
"Description": "The data is presented as raw text and word-aligned text. The raw text is very close to what was extracted from the original word processing documents in UN ODS (e.g., Word, WordPerfect, PDF), converted to UTF-8 encoding.",
"Volume": "520,283",
"Unit": "documents",
"Ethical Risks": "Low",
"Provider": "LDC",
"Derived From": "nan",
"Paper Title": "nan",
"Paper Link": "nan",
"Script": "Arab",
"Tokenized": "No",
"Host": "LDC",
"Access": "With-Fee",
"Cost": "175.00 $",
"Test Split": "No",
"Tasks": "machine translation",
"Venue Title": "nan",
"Citations": "nan",
"Venue Type": "nan",
"Venue Name": "nan",
"Authors": "nan",
"Affiliations": "nan",
"Abstract": "nan",
"Added By": "Zaid Alyafeai"
}
36 changes: 36 additions & 0 deletions datasets/1997_hub5_arabic_evaluation.json
@@ -0,0 +1,36 @@
{
"Name": "1997 HUB5 Arabic Evaluation",
"Subsets": [],
"HF Link": "nan",
"Link": "https://catalog.ldc.upenn.edu/LDC2002S22",
"License": "LDC User Agreement for Non-Members",
"Year": 2002,
"Language": "ar",
"Dialect": "ar-EG: (Arabic (Egypt))",
"Domain": "transcribed audio",
"Form": "spoken",
"Collection Style": "other",
"Description": "This publication contains 20 sphere files encoded in two channel interleaved mulaw with a sampling rate of 8 KHz, for a total of 424,160,000 bytes (405 Mbytes) of sphere data. The sphere headers have been modified from the original Evaluation data by the addition of sample checksums to the CALLHOME data files.",
"Volume": "20",
"Unit": "documents",
"Ethical Risks": "Low",
"Provider": "LDC",
"Derived From": "nan",
"Paper Title": "nan",
"Paper Link": "nan",
"Script": "nan",
"Tokenized": "No",
"Host": "LDC",
"Access": "With-Fee",
"Cost": "1,500.00 $",
"Test Split": "No",
"Tasks": "speech recognition",
"Venue Title": "nan",
"Citations": "nan",
"Venue Type": "nan",
"Venue Name": "nan",
"Authors": "nan",
"Affiliations": "nan",
"Abstract": "nan",
"Added By": "Zaid Alyafeai"
}
36 changes: 36 additions & 0 deletions datasets/1997_hub5_arabic_transcripts.json
@@ -0,0 +1,36 @@
{
"Name": "1997 HUB5 Arabic Transcripts",
"Subsets": [],
"HF Link": "nan",
"Link": "https://catalog.ldc.upenn.edu/LDC2002T39",
"License": "LDC User Agreement for Non-Members",
"Year": 2002,
"Language": "ar",
"Dialect": "ar-EG: (Arabic (Egypt))",
"Domain": "transcribed audio",
"Form": "text",
"Collection Style": "other",
"Description": "There are 40 data files. Each of the 20 calls has transcripts in two formats: .txt and .scr.",
"Volume": "40",
"Unit": "documents",
"Ethical Risks": "Low",
"Provider": "LDC",
"Derived From": "nan",
"Paper Title": "nan",
"Paper Link": "nan",
"Script": "nan",
"Tokenized": "No",
"Host": "LDC",
"Access": "With-Fee",
"Cost": "500.00 $",
"Test Split": "No",
"Tasks": "speech recognition",
"Venue Title": "nan",
"Citations": "nan",
"Venue Type": "nan",
"Venue Name": "nan",
"Authors": "nan",
"Affiliations": "nan",
"Abstract": "nan",
"Added By": "Zaid Alyafeai"
}
36 changes: 36 additions & 0 deletions datasets/2003_nist_language_recognition_evaluation.json
@@ -0,0 +1,36 @@
{
"Name": "2003 NIST Language Recognition Evaluation",
"Subsets": [],
"HF Link": "nan",
"Link": "https://catalog.ldc.upenn.edu/LDC2006S31",
"License": "LDC User Agreement for Non-Members",
"Year": 2006,
"Language": "multilingual",
"Dialect": "ar-EG: (Arabic (Egypt))",
"Domain": "transcribed audio",
"Form": "spoken",
"Collection Style": "other",
"Description": "Each speech file is one side of a \"four wire\" telephone conversation represented as 8-bit, 8-kHz mulaw data. There are 11,830 speech files in SPHERE (.sph) format. The speech data was compiled from LDC's CALLFRIEND, CALLHOME, and Switchboard-2 corpora. Each file contains one test segment. The test segments are divided into three-second, 10-second, and 30-second tests, each in its own directory.",
"Volume": "46",
"Unit": "hours",
"Ethical Risks": "Low",
"Provider": "LDC",
"Derived From": "nan",
"Paper Title": "nan",
"Paper Link": "nan",
"Script": "Arab",
"Tokenized": "No",
"Host": "LDC",
"Access": "With-Fee",
"Cost": "500.00 $",
"Test Split": "No",
"Tasks": "language identification",
"Venue Title": "nan",
"Citations": "nan",
"Venue Type": "nan",
"Venue Name": "nan",
"Authors": "nan",
"Affiliations": "nan",
"Abstract": "nan",
"Added By": "Zaid Alyafeai"
}
36 changes: 36 additions & 0 deletions datasets/2003_nist_rich_transcription_evaluation_data.json
@@ -0,0 +1,36 @@
{
"Name": "2003 NIST Rich Transcription Evaluation Data",
"Subsets": [],
"HF Link": "nan",
"Link": "https://catalog.ldc.upenn.edu/LDC2007S10",
"License": "LDC User Agreement for Non-Members",
"Year": 2007,
"Language": "multilingual",
"Dialect": "mixed",
"Domain": "transcribed audio",
"Form": "spoken",
"Collection Style": "other",
"Description": "The BN datasets were selected from TDT-4 sources collected in February 2001. The evaluation excerpts were transcribed to the nearest story boundary. The English BN dataset is approximately three hours long and is composed of 30-minute excerpts from six different broadcasts. The Mandarin Chinese BN dataset is approximately one hour long, consisting of 12-minute excerpts from five different broadcasts. The Arabic BN dataset is also approximately one hour long and contains 30-minute excerpts from two different broadcasts.",
"Volume": "1",
"Unit": "hours",
"Ethical Risks": "Low",
"Provider": "LDC",
"Derived From": "nan",
"Paper Title": "nan",
"Paper Link": "nan",
"Script": "Arab",
"Tokenized": "No",
"Host": "LDC",
"Access": "With-Fee",
"Cost": "2,000.00 $",
"Test Split": "No",
"Tasks": "speech recognition",
"Venue Title": "nan",
"Citations": "nan",
"Venue Type": "nan",
"Venue Name": "nan",
"Authors": "nan",
"Affiliations": "nan",
"Abstract": "nan",
"Added By": "Zaid Alyafeai"
}
36 changes: 36 additions & 0 deletions datasets/2005_nist_speaker_recognition_evaluation_test_data.json
@@ -0,0 +1,36 @@
{
"Name": "2005 NIST Speaker Recognition Evaluation Test Data",
"Subsets": [],
"HF Link": "nan",
"Link": "https://catalog.ldc.upenn.edu/LDC2011S04",
"License": "LDC User Agreement for Non-Members",
"Year": 2011,
"Language": "multilingual",
"Dialect": "ar-MSA: (Arabic (Modern Standard Arabic))",
"Domain": "transcribed audio",
"Form": "spoken",
"Collection Style": "other",
"Description": "The speech data consists of conversational telephone speech with multi-channel data collected by LDC simultaneously from a number of auxiliary microphones. The files are organized into two segments: 10 second two-channel excerpts (continuous segments from single conversations that are estimated to contain approximately 10 seconds of actual speech in the channel of interest) and five minute two-channel conversations.",
"Volume": "525",
"Unit": "hours",
"Ethical Risks": "Low",
"Provider": "LDC",
"Derived From": "nan",
"Paper Title": "nan",
"Paper Link": "nan",
"Script": "Arab",
"Tokenized": "No",
"Host": "LDC",
"Access": "With-Fee",
"Cost": "400.00 $",
"Test Split": "No",
"Tasks": "speaker identification",
"Venue Title": "nan",
"Citations": "nan",
"Venue Type": "nan",
"Venue Name": "nan",
"Authors": "nan",
"Affiliations": "nan",
"Abstract": "nan",
"Added By": "Zaid Alyafeai"
}
@@ -0,0 +1,36 @@
{
"Name": "2005 NIST Speaker Recognition Evaluation Training Data",
"Subsets": [],
"HF Link": "nan",
"Link": "https://catalog.ldc.upenn.edu/LDC2011S01",
"License": "LDC User Agreement for Non-Members",
"Year": 2011,
"Language": "multilingual",
"Dialect": "ar-MSA: (Arabic (Modern Standard Arabic))",
"Domain": "transcribed audio",
"Form": "spoken",
"Collection Style": "other",
"Description": "The speech data consists of conversational telephone speech with multi-channel data collected simultaneously from a number of auxiliary microphones. The files are organized into two segments: 10 second two-channel excerpts (continuous segments from single conversations that are estimated to contain approximately 10 seconds of actual speech in the channel of interest) and five minute two-channel conversations.",
"Volume": "392",
"Unit": "hours",
"Ethical Risks": "Low",
"Provider": "LDC",
"Derived From": "nan",
"Paper Title": "nan",
"Paper Link": "nan",
"Script": "Arab",
"Tokenized": "No",
"Host": "LDC",
"Access": "With-Fee",
"Cost": "350.00 $",
"Test Split": "No",
"Tasks": "speaker identification",
"Venue Title": "nan",
"Citations": "nan",
"Venue Type": "nan",
"Venue Name": "nan",
"Authors": "nan",
"Affiliations": "nan",
"Abstract": "nan",
"Added By": "Zaid Alyafeai"
}
36 changes: 36 additions & 0 deletions datasets/2006_conll_shared_task_-_arabic_&_czech.json
@@ -0,0 +1,36 @@
{
"Name": "2006 CoNLL Shared Task - Arabic & Czech",
"Subsets": [],
"HF Link": "nan",
"Link": "https://catalog.ldc.upenn.edu/LDC2015T12",
"License": "LDC User Agreement for Non-Members",
"Year": 2006,
"Language": "multilingual",
"Dialect": "ar-MSA: (Arabic (Modern Standard Arabic))",
"Domain": "news articles",
"Form": "text",
"Collection Style": "other",
"Description": "2006 CoNLL Shared Task - Arabic & Czech consists of Arabic and Czech dependency treebanks used as part of the CoNLL 2006 shared task on multi-lingual dependency parsing.",
"Volume": "nan",
"Unit": "tokens",
"Ethical Risks": "Low",
"Provider": "LDC",
"Derived From": "PADT",
"Paper Title": "nan",
"Paper Link": "nan",
"Script": "Arab-Latn",
"Tokenized": "No",
"Host": "LDC",
"Access": "Upon-Request",
"Cost": "nan",
"Test Split": "No",
"Tasks": "syntactic parsing",
"Venue Title": "nan",
"Citations": "nan",
"Venue Type": "nan",
"Venue Name": "nan",
"Authors": "nan",
"Affiliations": "nan",
"Abstract": "nan",
"Added By": "Zaid Alyafeai"
}
@@ -0,0 +1,36 @@
{
"Name": "2006 NIST Speaker Recognition Evaluation Test Set Part 1",
"Subsets": [],
"HF Link": "nan",
"Link": "https://catalog.ldc.upenn.edu/LDC2011S10",
"License": "LDC User Agreement for Non-Members",
"Year": 2011,
"Language": "multilingual",
"Dialect": "ar-MSA: (Arabic (Modern Standard Arabic))",
"Domain": "transcribed audio",
"Form": "spoken",
"Collection Style": "other",
"Description": "The speech data in this release was collected by LDC as part of the Mixer project, in particular Mixer Phases 1, 2, and 3. The Mixer project supports the development of robust speaker recognition technology by providing carefully collected and audited speech from a large pool of speakers recorded simultaneously across numerous microphones and in different communicative situations and/or in multiple languages. The data is mostly English speech, but includes some speech in Arabic, Bengali, Chinese, Farsi, Hindi, Korean, Russian, Spanish, Thai, and Urdu.",
"Volume": "437",
"Unit": "hours",
"Ethical Risks": "Low",
"Provider": "LDC",
"Derived From": "nan",
"Paper Title": "nan",
"Paper Link": "nan",
"Script": "Arab",
"Tokenized": "No",
"Host": "LDC",
"Access": "With-Fee",
"Cost": "300.00 $",
"Test Split": "No",
"Tasks": "speaker identification",
"Venue Title": "nan",
"Citations": "nan",
"Venue Type": "nan",
"Venue Name": "nan",
"Authors": "nan",
"Affiliations": "nan",
"Abstract": "nan",
"Added By": "Zaid Alyafeai"
}
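The records above store missing values as the string "nan" and keep numeric fields such as "Volume" and "Cost" as formatted strings ("101,000,000,000", "1,500.00 $"). A consumer will likely want to normalize these before analysis. The following is a minimal sketch under that assumption; the `normalize` helper and the choice of `float` for both fields are illustrative and not part of the repository.

```python
import json

NAN = "nan"  # sentinel string used for missing values in these records


def normalize(record: dict) -> dict:
    """Return a copy with "nan" mapped to None and Volume/Cost parsed as floats."""
    out = {key: (None if value == NAN else value) for key, value in record.items()}
    if out.get("Volume") is not None:
        # "101,000,000,000" -> 101000000000.0
        out["Volume"] = float(out["Volume"].replace(",", ""))
    if out.get("Cost") is not None:
        # "1,500.00 $" -> 1500.0 (assumes the trailing "$" format seen above)
        out["Cost"] = float(out["Cost"].replace(",", "").replace("$", "").strip())
    return out


# A few fields copied from the "1997 HUB5 Arabic Evaluation" record above.
raw = """
{
  "Name": "1997 HUB5 Arabic Evaluation",
  "Subsets": [],
  "HF Link": "nan",
  "Volume": "20",
  "Unit": "documents",
  "Cost": "1,500.00 $"
}
"""
record = normalize(json.loads(raw))
```

Fields that are not strings, such as "Year" and "Subsets", pass through unchanged because they never compare equal to the sentinel.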