Getting access

There are two ways of accessing OSCAR: through Huma-Num, or through HuggingFace.
Depending on your status, you might not have the choice.
|              | Research/Academic | Individual |
|--------------|-------------------|------------|
| Huma-Num     | ✓                 | ✗          |
| Hugging-Face | ✓                 | ✓          |
You can apply for an access request by sending us an email!

Warning

Carefully follow the instructions below, as incorrect submissions might significantly delay your access.

Danger

Do not create an account by yourself, as it could delay your access by weeks! We will create an account for you.

Send us an email at contact at oscar-project.org, with OSCAR Access Request as the title and the following (completed) as the body:

Warning

Please send your email from your institutional/academic address when possible. Otherwise, your access might be delayed/refused.

```
- First name:
- Last name:
- Affiliation:
- Contact details:
- Corpus version:
- Languages:

+ a short description of your use case.
```

Note

Access requests can take some days to be answered, sometimes more.
We post updates on our Discord server about exceptional delays, and you can always contact us there to inquire about yours.

After some time, you should get an email back from us with access instructions!
Using `datasets`

The following assumes that you have already installed the Python `datasets` library.
After all of this, you should be able to easily use OSCAR data with the `datasets` library:
```python
# example with OSCAR 2201
from datasets import load_dataset

dataset = load_dataset("oscar-corpus/OSCAR-2201",
                       use_auth_token=True,  # required
                       language="ar",
                       streaming=True,       # optional
                       split="train")        # optional

for d in dataset:
    print(d)  # prints documents
```
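Since `streaming=True` makes `load_dataset` return an iterable that fetches documents lazily, a common prototyping pattern is to cap how many documents you pull. A generic sketch (using a stand-in generator instead of the real, gated dataset):

```python
from itertools import islice

# stand-in for the streaming dataset: any iterable of document dicts
stream = ({"id": i, "content": f"doc {i}"} for i in range(1_000_000))

# only the first 3 documents are ever materialized
for doc in islice(stream, 3):
    print(doc["content"])
```

The same `islice` call works unchanged on the `dataset` object from the snippet above, since it is also a plain iterable.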
Using Git LFS

You can also get the raw data from HuggingFace using Git LFS.
The following steps assume you have git and git-lfs installed, and that you are on a UNIX system. The procedure should be roughly the same on Windows, but hasn't been attempted.
This will download the Basque corpus from OSCAR 2109.
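The corresponding commands:

```shell
# skip downloading the large data files at clone time
GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/datasets/oscar-corpus/OSCAR-2109
cd OSCAR-2109  # go inside the directory
git lfs pull --include packaged/eu/eu.txt.gz  # pull the required file(s) (here the Basque corpus). Check the manpage for pull options
```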
The OSCAR project (Open Super-large Crawled Aggregated coRpus) is an Open Source project aiming to provide web-based multilingual resources and datasets for Machine Learning (ML) and Artificial Intelligence (AI) applications. The project focuses specifically on providing large quantities of unannotated raw data, which is commonly used in the pre-training of large deep learning models. The OSCAR project has developed high-performance data pipelines specifically conceived to classify and filter large amounts of web data. The project has also paid special attention to improving the data quality of web-based corpora, as well as to providing data for low-resource languages, so that these new ML/AI technologies are accessible to as many communities as possible.

Getting access · Latest version · Quickstart guide

Info

The new OSCAR 2301 is available!

This website aims to gather information about the corpus from a technical point of view:
What is OSCAR?

OSCAR is a collection of web-based multilingual corpora totalling several terabytes, containing subcorpora in more than 150 languages.
Each OSCAR Corpus has a version name that tells you its approximate generation time, which usually coincides with the source crawl time.
The latest OSCAR Corpus is OSCAR 2301.
We advise you to always use the latest version, as we incrementally add new features that enable new ways of filtering the corpus for your applications.

Basic data layout

OSCAR has been document-oriented since OSCAR 2109, which means that subcorpora are comprised of documents rather than individual lines.
This has important implications for how to preprocess the data: you can (and will) find sentences in languages other than the one you're interested in. For example, it is expected to encounter English sentences in documents from the French subcorpus.

Example

The Wikipedia article about the French anthem, La Marseillaise, contains its lyrics in French.
As such, this article is expected to be present in the English subcorpus with those French lyrics.

The good news is that you can easily remove those sentences if you are not interested in them, thanks to the metadata provided alongside the main content.
OSCAR is distributed in JSONLines files, usually compressed (`gzip` or `zstd`, depending on the version).
Each line of a file is a JSON object representing a single document.
Here is an example from OSCAR 2301:
```
{
    "content":"English sentence\nphrase en français\n????????????", // (1)
    "warc_headers":{ // (2)
        "warc-identified-content-language":"fra,eng",
        "warc-target-uri":"https://fr.wikipedia.org/wiki/...",
        "warc-record-id":"<urn:uuid:29eaa920-d299-4b1d-b687-c72bd8d68116>",
        "warc-type":"conversion",
        "content-length":"35298", // (3)
        "warc-refers-to":"<urn:uuid:39e42055-0d94-4e45-9c6c-9e7056635d64>",
        "warc-block-digest":"sha1:WFH2A5WHCS2H365GIAFYQPI7UOAMFGHB", // (3)
        "warc-date":"2022-11-26T09:45:47Z",
        "content-type":"text/plain"
    },
    "metadata":{
        "identification":{ // (4)
            "label":"fr",
            "prob":0.8938327
        },
        "harmful_pp":4063.1814, // (5)
        "tlsh":"tlsh:T125315FF2B6088901EEA097015DB39B4600B...", // (6)
        "quality_warnings":[ // (7)
            "short_sentences",
            "header",
            "footer"
        ],
        "categories":[ // (8)
            "examen_pix",
            "liste_bu"
        ],
        "sentence_identifications":[ // (9)
            {
                "label":"fr",
                "prob":0.99837273
            },
            {
                "label":"en",
                "prob":0.9992377
            },
            null
        ]
    }
}
```
Notes on the annotated fields:

- (2) `warc_headers` are copied from the original crawl; (3) since the content can be altered by Ungoliant at the generation stage, `content-length` and `warc-block-digest` can differ from the actual values.
- (5) `harmful_pp` will be renamed to `harmful_ppl` in future releases.
- (7) `quality_warnings` (named `annotations` pre-23.01): potential quality warnings, based on content/sentence length. See the OSCAR 22.01 paper for more info.
- (9) A `null` value means no identification with a good enough threshold (>0.8 on 23.01).

There are different ways of getting access to OSCAR depending on your status! Head on to our dedicated page.
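Coming back to the per-sentence metadata above: a sketch of the kind of filtering it enables (the helper below is illustrative, not part of any OSCAR tooling; it assumes the document layout shown in the example):

```python
def keep_target_language(doc, lang):
    """Keep only the lines of doc["content"] whose sentence-level
    identification matches `lang` (drops null/low-confidence lines)."""
    lines = doc["content"].split("\n")
    idents = doc["metadata"]["sentence_identifications"]
    kept = [
        line
        for line, ident in zip(lines, idents)
        if ident is not None and ident["label"] == lang
    ]
    return "\n".join(kept)

doc = {
    "content": "English sentence\nphrase en français",
    "metadata": {
        "sentence_identifications": [
            {"label": "en", "prob": 0.99},
            {"label": "fr", "prob": 0.99},
        ]
    },
}
print(keep_target_language(doc, "en"))  # → English sentence
```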
Using the corpus

TODO
OSCAR Schema v2

OSCAR (Open Super-large Crawled Aggregated coRpus) is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
The new OSCAR schema is a major, breaking change from the previous ones, adding more metadata and laying out documents rather than lines.

Changes

OSCAR Schema v2 groups text and metadata in JSON dictionaries, in JSONLines format.
```
/
├── af
│   ├── af_sha256.txt
│   └── af.jsonl.gz
├── de
│   ├── de_sha256.txt       # Checksum file
│   └── de.jsonl.gz         # Textual content
├── en
│   ├── en_part_1.jsonl.gz  # Multipart example
│   ├── en_part_2.jsonl.gz
│   └── en_sha256.txt
├── yi
│   ├── yi_sha256.txt
│   └── yi.jsonl.gz
└── zh
    ├── zh_sha256.txt
    └── zh.jsonl.gz
```
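Whichever language you pick, each (gzipped) JSONLines file can be streamed line by line with standard tooling; a minimal Python sketch (illustrative, not official OSCAR tooling):

```python
import gzip
import json

def read_documents(path):
    """Yield one JSON document per line of a gzipped JSONLines file."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)
```

For example, `next(read_documents("af/af.jsonl.gz"))` would return the first document of the Afrikaans subcorpus.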
`.jsonl` files

These are the metadata, in JSONLines format.
Each line follows this JSON Schema:

```
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "title": "Document",
  "description": "Serializable version of [Document].",
  "type": "object",
  "required": [
    "content",
    "metadata",
    "warc_headers"
  ],
  "properties": {
    "content": {
      "type": "string"
    },
    "metadata": {
      "$ref": "#/definitions/Metadata"
    },
    "warc_headers": {
      "type": "object",
      "additionalProperties": {
        "type": "string"
      }
    }
  },
  "definitions": {
    "Identification": {
      "type": "object",
      "required": [
        "label",
        "prob"
      ],
      "properties": {
        "label": {
          "$ref": "#/definitions/Lang"
        },
        "prob": {
          "type": "number",
          "format": "float"
        }
      }
    },
    "Lang": {
      "type": "string",
      "enum": [
        "Af",
        "Als",
        "...",
        "Yue",
        "Zh",
        "Multi"
      ]
    },
    "Metadata": {
      "description": "OSCAR-specific metadata",
      "type": "object",
      "required": [
        "identification",
        "sentence_identifications"
      ],
      "properties": {
        "annotation": {
          "type": [
            "array",
            "null"
          ],
          "items": {
            "type": "string"
          }
        },
        "identification": {
          "$ref": "#/definitions/Identification"
        },
        "sentence_identifications": {
          "type": "array",
          "items": {
            "anyOf": [
              {
                "$ref": "#/definitions/Identification"
              },
              {
                "type": "null"
              }
            ]
          }
        }
      }
    }
  }
}
```
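For a quick sanity check without a full JSON Schema validator, the required fields above can be verified structurally; a minimal sketch (illustrative only):

```python
def looks_like_document(obj):
    """Check the schema's required top-level and metadata fields."""
    if not all(k in obj for k in ("content", "metadata", "warc_headers")):
        return False
    meta = obj["metadata"]
    return all(k in meta for k in ("identification", "sentence_identifications"))

doc = {
    "content": "hello",
    "warc_headers": {"warc-type": "conversion"},
    "metadata": {
        "identification": {"label": "en", "prob": 0.9},
        "sentence_identifications": [None],
    },
}
print(looks_like_document(doc))  # → True
```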
Example:

```
{
    // the document text goes here; it is a single string separated by \n characters,
    // shown split across lines here for legibility
    "content": "Adopt-a-user Home • Talk || Adoptee's Area • Resources || Adopter's Area • Resources • List of Adopters || Teahouse || Live Help Chat (IRC)\n
    Shortcuts\n
    WP:AAU\n
    WP:ADOPT\n
    WP:ADOPTION\n
    WP:WIKIADOPT\n
    The Adopt-a-user program is designed to help new and inexperienced users by pairing them with more experienced Wikipedians. These editors (referred to as adopters or mentors) will \"adopt\" newer users, guiding them along the way as they learn about Wikipedia and its various aspects.\n
    The project aims to inform new users about the ins and outs of Wikipedia and steer them away from making less-than-constructive edits or misplaced test edits. Well over a thousand users have been involved in the program at one time or another.\n
    So, if you're new or inexperienced and would like to:\nAsk questions about editing, contributing to Wikipedia and creating your first article\n
    Learn to navigate processes and policies and guidelines\n
    Get help with article creation or image uploads or any other activities on Wikipedia\n
    . . .then an adopter should be able to help you. Adoption lasts as long as the adopter and adoptee want to continue, so you can stop any time if you feel you've learned enough, or you'd like to take a break.\n
    If you are looking to contribute to Wikipedia but do not intend to remain as an active user well after adoption, then this program is not for you. Adoption is for users who intend to be long-term contributors and members of the community, so if you are simply here to create one article, see this page for help and do not request adoption.\n
    Users who don't want adopting – but who do need help with one-off problems – might like to consider whether the Teahouse question forum, the Help desk, or a {{Help me}} request might be better ways to get quick answers.\n
    Participation\n
    Being adopted is easy and fun. Why not select an adopter from the list of adopters and contact them directly to request adoption? If you choose an adopter who shares your interests, they will be more able to assist you while you learn under their tutelage.\n
    View the list of adopters!\n
    ...",

    // WARC headers are extracted and put here untouched.
    // content-length should not be understood as the current document length, but as the original document length.
    "warc_headers": {
        "warc-block-digest": "sha1:U2OJPXXE3JCPSLAB6UPB3TEGBDHKPTAO",
        "warc-record-id": "<urn:uuid:fec8808f-96ef-4ae5-8a57-df5b44e42dcf>",
        "warc-identified-content-language": "eng,nno",
        "content-type": "text/plain",
        "warc-refers-to": "<urn:uuid:2f59440d-3700-418c-aa94-5c63bab316c3>",
        "warc-date": "2021-09-16T12:40:45Z",
        "warc-target-uri": "https://en.wikipedia.org/wiki/Wikipedia:Adopt-a-user",
        "content-length": "5385",
        "warc-type": "conversion"
    },

    // OSCAR metadata
    "metadata": {

        // Document identification
        "identification": {
            "label": "en",
            "prob": 0.6775619
        },

        // Annotations of the document
        "annotation": [
            "short_sentences",
            "header",
            "footer"
        ],

        // Sentence identifications.
        // null: identification confidence too low (<0.8)
        // There is exactly one identification per line.
        "sentence_identifications": [
            null,
            null,
            {
                "label": "en",
                "prob": 0.89475197
            },
            {
                "label": "en",
                "prob": 0.9124037
            },
            {
                "label": "en",
                "prob": 0.8080786
            },
            null,
            {
                "label": "en",
                "prob": 0.9665413
            }
        ]
    }
}
```
`<lang>_sha256.txt` files

These are used to check for possible corruption during download.
They can be used by running `sha256sum -c <lang>_sha256.txt`.
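If `sha256sum` is not available (e.g. on macOS), the same check can be approximated in Python, assuming the usual `<hex>  <filename>` line layout of these files:

```python
import hashlib
from pathlib import Path

def verify(checksum_file):
    """Re-hash each listed file and compare against the expected digest."""
    ok = True
    base = Path(checksum_file).parent
    for line in Path(checksum_file).read_text().splitlines():
        expected, name = line.split(maxsplit=1)
        # reads the whole file into memory; fine as a sketch, stream for multi-GB parts
        actual = hashlib.sha256((base / name).read_bytes()).hexdigest()
        if actual != expected:
            print(f"{name}: FAILED")
            ok = False
    return ok
```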
This is currently preferred to just getting it from cargo install ungoliant
.
git clone https://github.com/oscar-project/ungoliant
compil
node: srun --partition=compil -A <GROUP ID>@cpu --pty bash
module load llvm boost cargo
(boost
and llvm
are necessary for compiling KenLM and FastText)cd ungoliant
cargo b --release --features kenlm
We advise the use of the prepost
partition for downloading the data form Common Crawl. However, please bear in mind that jobs are limited to 20hours in the prepost
partition, meaning that you'll likely run out of time before completing the download of a whole Common Crawl dump.
wet.paths.gz
file for the latest release (likely heregzip -d wet.paths.gz
Create a dl_corpus.slurm
file with the following text inside:
#! /bin/bash\n#SBATCH --partition=prepost\n#SBATCH --job-name=get_cc # create a short name for your job\n#SBATCH --mail-type=BEGIN,END,FAIL # Mail events (NONE, BEGIN, END, FAIL, ALL)\n#SBATCH --mail-user=<YOUR MAIL> # Where to send mail\n#SBATCH --nodes=\"1\" #Combien de n\u0153uds\n#SBATCH --ntasks-per-node=\"1\" # Une t\u00e2che par GPU\n#SBATCH --cpus-per-task=\"64\" # nombre de coeurs \u00e0 r\u00e9server par t\u00e2che\n#SBATCH --time=\"20:00:00\" # temps d'ex\u00e9cution maximum demande (HH:MM:SS)\n#SBATCH -A <GROUP ID>@cpu\nexport CARGO_HOME=<CARGO HOME PATH (in SCRATCH if you can>\nexport PATHS_FILE=<PATH TO wet.PATHS>\nexport DST=<DESTINATION>\n\n./target/release/ungoliant download $PATHS_FILE $DST\n
When the time has run out, you have to ensure that the last downloaded shards weren't corrupted (because of a potential kill while downloading).
Then, after potentially removing faulty shards, run the following slurm job. The only difference with the previous one is the use of the -o n
parameter on ungoliant download
, which will ignore the first n
lines of the wet.paths
. You can/should also use another DESTINATION
folder, and then do the merge by hand.
#! /bin/bash\n#SBATCH --partition=prepost\n#SBATCH --job-name=get_cc # create a short name for your job\n#SBATCH --mail-type=BEGIN,END,FAIL # Mail events (NONE, BEGIN, END, FAIL, ALL)\n#SBATCH --mail-user=<YOUR MAIL> # Where to send mail\n#SBATCH --nodes=\"1\" #Combien de n\u0153uds\n#SBATCH --ntasks-per-node=\"1\" # Une t\u00e2che par GPU\n#SBATCH --cpus-per-task=\"64\" # nombre de coeurs \u00e0 r\u00e9server par t\u00e2che\n#SBATCH --time=\"20:00:00\" # temps d'ex\u00e9cution maximum demande (HH:MM:SS)\n#SBATCH -A <GROUP ID>@cpu\nexport CARGO_HOME=<CARGO HOME PATH (in SCRATCH if you can>\nexport PATHS_FILE=<PATH TO wet.PATHS>\nexport DST=<DESTINATION>\n\n./target/release/ungoliant download -o <NB_DOWNLOADED> $PATHS_FILE $DST\n
You can then check that no shards are missing:
import os\nshards_dir = \"./shards\"\npaths_file = \"wet.paths\"\ncc_rooturl = \"https://data.commoncrawl.org/\"\nmissing_shards = list()\nfor i in range(88000):\nif not os.path.isfile(f\"{shards_dir}/{i}.txt.gz\"):\nmissing_shards.append(i)\nprint(f\"missing {len(missing_shards)} shards\")\nwith open(paths_file) as f:\nshard_paths = f.readlines()\nfor missing_shard_number in missing_shards:\nprint(\nf\"wget -nc {cc_rooturl}{shard_paths[missing_shard_number].strip()} -O {missing_shard_number}.txt.gz\"\n)\n
This will give you the wget
commands to get the missing shards, with a -nc
param to avoid overwriting already existing files.
When you have your shards ready, create a new SLURM file with:
We use a QoS of t4 because since we can only use one node and corpus generation time is likely >20h, we need the 100 mark.
Other strategies could be tested (for example, splitting CC data into 4 buckets and launch 4 ungoliant
jobs. Then, merging back the datasets should be done. Note that in that case, rebuild files will be less efficient (since we'll have 4 of them)
#! /bin/bash\n#SBATCH --partition=cpu_p1\n#SBATCH --job-name=gen_oscar # create a short name for your job\n#SBATCH --mail-type=BEGIN,END,FAIL # Mail events (NONE, BEGIN, END, FAIL, ALL)\n#SBATCH --mail-user=<YOUR MAIL> # Where to send mail\n#SBATCH --nodes=\"1\" #Combien de n\u0153uds\n#SBATCH --ntasks-per-node=\"1\" # Une t\u00e2che par GPU\n#SBATCH --cpus-per-task=\"40\" # nombre de coeurs \u00e0 r\u00e9server par t\u00e2che\n#SBATCH --time=\"100:00:00\" # temps d'ex\u00e9cution maximum demande (HH:MM:SS)\n#SBATCH --qos=qos_cpu-t4\n#SBATCH -A <GROUP ID>@cpu\nexport CARGO_HOME=<CARGO HOME PATH>\nexport CC_FOLDER=<SHARDS PATH>\nexport KENLM_FOLDER=<PATH TO KENLMS MODELS IF APPLICABLE>\nexport CORPUS=<DESTINATION FOLDER>\nexport BLOCKLIST=<BLOCKLIST FOLDER (must contain subfolders with category names..)>\nexport LID_PATH=<PATH TO FASTTEXT LangID>\nexport UNGOLIANT_PATH=<PATH TO UNGOLIANT BINARY>\n\n$UNGOLIANT_PATH pipeline $CC_FOLDER $CORPUS --blocklist-path $BLOCKLIST --kenlms-path $KENLM_FOLDER --lid-path $LID_PATH\n
As of Jan. 2023, using ungoliant 1.3.0 ([c14acc8](https://github.com/oscar-project/ungoliant/tree/c14acc8c6a87913d138a022cf4819024d66b3e06))
, with a 88,000-shard dump of CommonCrawl (November/December 2022, ~9.5TB compressed), this process took around 20 hours and yielded a corpus weighing arount 12TB (uncompressed).
Files in $SCRATCH
are deleted after 30 days if no R/W is operated on them. You should move out files to $STORE
if you plan on keeping them. Unfortunately, due to the file size, you'll need to launch another job to do the copying of the files.
Warning
rsync -n
enables a dry-run, enabling you to see which files would be moved, and where. Remove the -n
parameter when you want to perform the actual copy.
#! /bin/bash\n#SBATCH --partition=prepost\n#SBATCH --job-name=copy_oscar # create a short name for your job\n#SBATCH --mail-type=BEGIN,END,FAIL # Mail events (NONE, BEGIN, END, FAIL, ALL)\n#SBATCH --mail-user=julien.abadji@inria.fr # Where to send mail\n#SBATCH --nodes=\"1\" #Combien de n\u0153uds\n#SBATCH --ntasks-per-node=\"1\" # Une t\u00e2che par GPU\n#SBATCH --cpus-per-task=\"4\" # nombre de coeurs \u00e0 r\u00e9server par t\u00e2che\n#SBATCH --time=\"20:00:00\" # temps d'ex\u00e9cution maximum demande (HH:MM:SS)\n#SBATCH -A <GROUP ID>@cpu\nexport SRC=<CORPUS SOURCE>\nexport DST=<CORPUS DESTINATION>\n\nrsync -anvP $SRC $DST\n
On the same example as before, copying took around 9 hours.
"},{"location":"tools/generation-jeanzay/#preparing-for-release","title":"Preparing for release","text":""},{"location":"tools/generation-jeanzay/#splitting","title":"Splitting","text":"We use oscar-tools
to split the corpus.
Note
At the time of writing, oscar-tools
is not available via crates.io/cargo install
, so you have to compile it from source. Luckily, it's easy.
oscar-tools
git clone https://github.com/oscar-project/oscar-tools
compil
node: srun --partition=compil -A <GROUP ID>@cpu --pty bash
cd oscar-tools
CARGO_HOME=<Somewhere not in your ~, like $SCRATCH/.cargo> cargo b --features zstd --release
.target/release/oscar-tools
.#! /bin/bash\n#SBATCH --partition=prepost\n#SBATCH --job-name=split_oscar # create a short name for your job\n#SBATCH --mail-type=BEGIN,END,FAIL # Mail events (NONE, BEGIN, END, FAIL, ALL)\n#SBATCH --mail-user=<Your email address> # Where to send mail\n#SBATCH --nodes=\"1\" # How many nodes\n#SBATCH --ntasks-per-node=\"1\" # One task per node\n#SBATCH --cpus-per-task=\"10\" # Number of cores to reserve per task\n#SBATCH --time=\"20:00:00\" # Maximum requested run time (HH:MM:SS)\n#SBATCH -A <group id>@cpu\nexport OSCAR_TOOLS_BIN=<path to oscar-tools binary>\nexport CORPUS=<path to corpus>\nexport DST=<where the split corpus will be put>\n\n$OSCAR_TOOLS_BIN v2 split $CORPUS $DST -s 2000\n
This step took around 3 hours (assuming both CORPUS
and DST
are on $SCRATCH
).
#! /bin/bash\n#SBATCH --partition=prepost\n#SBATCH --job-name=compress_oscar # create a short name for your job\n#SBATCH --mail-type=BEGIN,END,FAIL # Mail events (NONE, BEGIN, END, FAIL, ALL)\n#SBATCH --mail-user=<email address> # Where to send mail\n#SBATCH --nodes=\"1\" # How many nodes\n#SBATCH --ntasks-per-node=\"1\" # One task per node\n#SBATCH --cpus-per-task=\"48\" # Number of cores to reserve per task\n#SBATCH --time=\"20:00:00\" # Maximum requested run time (HH:MM:SS)\n#SBATCH -A <group id>@cpu\nexport OSCAR_TOOLS_BIN=<path to oscar-tools binary>\nexport CORPUS=<path to split corpus>\nexport DST=<where the compressed corpus will be saved>\n\n$OSCAR_TOOLS_BIN v2 compress $CORPUS $DST\n
This step took around 2 hours, bringing the corpus from 12TB down to 3.3TB.
"},{"location":"tools/generation-jeanzay/#checksuming","title":"Checksuming","text":"The last step is to create checksum
files for each language, so that users can verify that their downloads were successful. The checksum files also act as split lists for download-oscar.
#! /bin/bash\n#SBATCH --partition=prepost\n#SBATCH --job-name=checksum_oscar # create a short name for your job\n#SBATCH --mail-type=BEGIN,END,FAIL # Mail events (NONE, BEGIN, END, FAIL, ALL)\n#SBATCH --mail-user=<email address> # Where to send mail\n#SBATCH --nodes=\"1\" # How many nodes\n#SBATCH --ntasks-per-node=\"1\" # One task per node\n#SBATCH --cpus-per-task=\"48\" # Number of cores to reserve per task\n#SBATCH --time=\"20:00:00\" # Maximum requested run time (HH:MM:SS)\n#SBATCH -A <group id>@cpu\nexport OSCAR_TOOLS_BIN=<path to oscar-tools binary>\nexport CORPUS=<path to split corpus>\n\n$OSCAR_TOOLS_BIN v2 checksum $CORPUS\n
The process took around 2 hours.
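On the user side, a checksum file in the standard `sha256sum` format can be verified directly with coreutils. A minimal sketch (file names and contents below are illustrative, not real OSCAR shards):

```shell
# Simulate a downloaded shard plus its published checksum file.
mkdir -p downloads/fr
echo "example shard content" > downloads/fr/fr_part_1.txt.gz
(cd downloads/fr && sha256sum fr_part_1.txt.gz > fr_sha256.txt)
# Verify the download against the checksum file; prints "fr_part_1.txt.gz: OK" on success.
(cd downloads/fr && sha256sum -c fr_sha256.txt)
```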
"},{"location":"tools/oscar-tools/","title":"oscar-tools","text":"oscar-tools
is a toolkit that was created along with OSCAR-2201 to make operations on the corpus easy and fast.
At its core, oscar-tools
provides a set of operations targeted at a given OSCAR version. As such, you shouldn't expect to have all operations available on all OSCAR versions. For example, at the time of writing, deduplicate
is not available for OSCAR 22.01-like corpora.
The CLI of oscar-tools
is still a bit messy and can be confusing, because we are actively working on it and on implementing essential features.
releases
","text":"Note
Binaries are not available yet.
"},{"location":"tools/oscar-tools/#from-cargo","title":"Fromcargo
","text":"Note
cargo install oscar-tools
is not available yet.
Note
This could evolve rapidly.
Right now the latest version sits on the dev-oscario
branch, where we're slowly replacing inline IO blocks by our Corpus IO library, oscar-io
.
$> git clone https://github.com/oscar-corpus/oscar-tools #Clone the repository\n$> cd oscar-tools\n$> git checkout dev-oscario #Change branch\n$> cargo b --release #Build the project. \n$> # Building might take some time because of \n$> # the parquet dependency that will soon be optional.\n$> touch target/release/oscar-tools #Binary is here and self-sufficient.\n
"},{"location":"tools/oscar-tools/#usage","title":"Usage","text":"oscar-tools --help
might help you find the parameters/operations you're looking for.
Note
In the tool, v1
corresponds to 2019-like corpora, whereas v2
corresponds to 22.01-like corpora.
Each operation has different parameters.
"},{"location":"tools/oscar-tools/#v1-oscar-2019","title":"v1 / OSCAR 2019","text":"At the time of writing, the only operation available is dedup
. It uses runiq
to deduplicate corpora.
oscar-tools-v1-dedup \nline deduplication\n\nUSAGE:\n oscar-tools v1 dedup [ARGS]\n\nARGS:\n <SOURCE> Corpus source file.\n <DESTINATION> Corpus destination file. Should not exist.\n\nOPTIONS:\n -h, --help Print help information\n
"},{"location":"tools/oscar-tools/#v2-oscar-2201","title":"v2 / OSCAR 22.01","text":"There is a lot more operations implemented on OSCAR 22.01-like corpora.
"},{"location":"tools/oscar-tools/#extract-tags","title":"extract-tags
","text":"extract-tags
extracts documents that meet certain annotation constraints.
oscar-tools-v2-extract-tags \nExtracts a OSCAR v2 corpus restricting tags. Included tags must be present and excluded ones must be\nabsent. Use --clean to extract documents with no annotation only\n\nUSAGE:\n oscar-tools v2 extract-tags [OPTIONS] [--] [ARGS]\n\nARGS:\n <SOURCE> Corpus source file/folder. If folder, splits corpus files in provided\n folder\n <DESTINATION> Corpus source file/folder. If folder, splits corpus files in provided\n folder\n\nOPTIONS:\n --clean only return documents with no tags. include and exclude will be\n ignored\n -e, --exclude <tags>... space separated tags to exclude.\n -h, --help Print help information\n -i, --include <tags>... space separated tags to include.\n
"},{"location":"tools/oscar-tools/#extract-text","title":"extract-text
","text":"extract-text
\"converts\" a 2201-like corpus into a 2019-like corpus, by removing all metadata and only storing sentences. Keep in mind that while the format will be similar to 2109-like corpora, the filtering is a bit different and lines from other languages won't be stripped.
Extract text from documents. The output will be a OSCAR v1 (2019)-compatible corpus.\n\nUSAGE:\n oscar-tools v2 extract-text [OPTIONS] <SOURCE> <DESTINATION>\n\nARGS:\n <SOURCE> Corpus source file.\n <DESTINATION> Corpus destination file (OSCAR v1 (2019)-like)\n\nOPTIONS:\n --del_src If set, deletes source files as they are being extracted.\n -h, --help Print help information\n
"},{"location":"versions/mOSCAR/","title":"mOSCAR","text":"mOSCAR, to the best of our knowledge the first large-scale multilingual and multimodal document corpus crawled from the web. It covers 163 languages, 315M documents, 214B tokens and 1.2B images. We carefully conduct a set of filtering and evaluation steps to make sure mOSCAR is sufficiently safe, diverse and of good quality.
"},{"location":"versions/mOSCAR/#access","title":"Access","text":"Access to the mOSCAR is granted via the Hugging Face Hub.
All data is avaialble at https://huggingface.co/datasets/oscar-corpus/mOSCAR.
"},{"location":"versions/mOSCAR/#layout","title":"Layout","text":"To Come ...
"},{"location":"versions/mOSCAR/#language-table","title":"Language table","text":"Lang. name Code Family Script # documents # images # tokens Acehnese ace_Latn Latin 7,803 32,461 2,889,134 Mesopotamian Arabic acm_Arab Arabic 2,274 10,620 1,047,748 Tunisian Arabic aeb_Arab Arabic 7,640 41,570 2,715,187 Afrikaans afr_Latn Latin 54,895 247,774 39,956,585 South Levantine Arabic ajp_Arab Arabic 12,098 87,837 5,167,813 Tosk Albanian als_Latn Latin 861,678 2,569,164 452,737,251 Amharic amh_Ethi Ge'ez 39,588 152,646 35,089,019 North Levantine Arabic apc_Arab Arabic 19,904 128,966 9,560,701 Modern Standard Arabic arb_Arab Arabic 3,936,851 15,126,931 3,401,919,964 Najdi Arabic ars_Arab Arabic 60,229 296,741 43,610,873 Moroccan Arabic ary_Arab Arabic 142,386 698,051 204,723,454 Egyptian Arabic arz_Arab Arabic 835,529 4,054,632 653,626,387 Assamese asm_Beng Bengali 3,948 9,210 640,390 Asturian ast_Latn Latin 165,745 962,723 37,547,944 Awadhi awa_Deva Devanagari 29,324 107,483 4,961,635 Central Aymara ayr_Latn Latin 27,384 151,889 5,148,970 South Azerbaijani azb_Arab Arabic 8,274 38,233 5,256,693 North Azerbaijani azj_Latn Latin 516,021 1,808,060 257,825,849 Bashkir bak_Cyrl Cyrillic 4,532 17,174 3,038,766 Bambara bam_Latn Latin 7,674 39,190 1,243,332 Balinese ban_Latn Latin 1,886 11,266 542,015 Belarusian bel_Cyrl Cyrillic 63,309 287,539 72,976,520 Bemba bem_Latn Latin 1,096 7,479 1,340,471 Bengali ben_Beng Bengali 270,406 947,035 35,858,814 Bhojpuri bho_Deva Devanagari 6,366 28,131 875,463 Banjar bjn_Latn Latin 5,427 27,803 1,898,526 Bosnian bos_Latn Latin 1,960,599 7,633,049 1,255,000,505 Buginese bug_Latn Latin 3,312 18,648 588,678 Bulgarian bul_Cyrl Cyrillic 2,591,998 11,670,028 1,760,971,620 Catalan cat_Latn Latin 1,153,864 4,736,634 606,447,390 Cebuano ceb_Latn Latin 16,990 91,234 10,748,818 Czech ces_Latn Latin 3,918,837 13,291,309 2,823,172,996 Central Kurdish ckb_Arab Arabic 36,725 136,566 22,322,689 Crimean Tatar crh_Latn Latin 6,376 24,124 1,742,727 Welsh cym_Latn 
Latin 40,408 165,897 27,748,345 Danish dan_Latn Latin 2,076,298 9,559,600 1,238,277,499 German deu_Latn Latin 20,662,696 87,976,200 8,544,986,218 Southwestern Dinka dik_Latn Latin 1,712 6,635 1,319,943 Greek ell_Grek Greek 4,916,081 15,209,058 2,923,201,041 English eng_Latn Latin 52,215,013 207,904,315 33,570,108,782 Esperanto epo_Latn Latin 25,157 124,996 28,586,195 Estonian est_Latn Latin 1,040,368 5,217,366 619,215,048 Basque eus_Latn Latin 849,043 3,445,539 277,145,498 Faroese fao_Latn Latin 15,411 60,340 6,691,327 Fijian fij_Latn Latin 1,528 8,776 487,388 Finnish fin_Latn Latin 2,396,033 10,365,333 1,781,044,864 French fra_Latn Latin 20,305,739 78,179,601 14,362,579,829 Friulian fur_Latn Latin 37,290 256,456 5,949,600 Nigerian Fulfulde fuv_Latn Latin 1,568 7,124 401,852 West Central Oromo gaz_Latn Latin 4,058 11,763 1,786,093 Scottish Gaelic gla_Latn Latin 29,710 153,249 14,605,090 Irish gle_Latn Latin 68,858 315,132 47,438,400 Galician glg_Latn Latin 518,973 2,381,475 217,063,180 Guarani grn_Latn Latin 490,945 2,416,633 89,921,114 Gujarati guj_Gujr Gujarati 23,062 91,320 3,324,866 Haitian Creole hat_Latn Latin 257,745 1,570,699 62,847,106 Hausa hau_Latn Latin 25,364 104,934 13,089,932 Hebrew heb_Hebr Hebrew 1,109,591 4,766,483 893,327,320 Hindi hin_Deva Devanagari 579,430 1,830,667 122,558,353 Chhattisgarhi hne_Deva Devanagari 1,581 7,263 273,174 Croatian hrv_Latn Latin 1,719,617 8,425,510 1,010,674,096 Hungarian hun_Latn Latin 3,534,506 15,390,083 2,831,715,050 Armenian hye_Armn Armenian 339,962 1,141,885 205,635,952 Igbo ibo_Latn Latin 11,529 68,049 8,701,070 Ilocano ilo_Latn Latin 78,872 523,195 8,116,113 Indonesian ind_Latn Latin 7,016,291 17,324,777 3,981,843,468 Icelandic isl_Latn Latin 244,676 1,027,465 137,015,973 Italian ita_Latn Latin 12,937,153 47,476,971 8,311,790,842 Javanese jav_Latn Latin 24,785 135,583 16,908,805 Japanese jpn_Jpan Kanji 14,415,292 23,893,768 8,923,348,944 Kabyle kab_Latn Latin 18,508 106,730 4,079,553 Kannada kan_Knda Brahmic 
Kannada 12,978 42,621 1,442,776 Kashmiri kas_Arab Arabic 3,109 11,408 5,731,910 Georgian kat_Geor Caucasian Georgian 354,436 1,304,281 275,223,026 Kazakh kaz_Cyrl Cyrillic 252,242 732,648 140,049,214 Halh Mongolian khk_Cyrl Cyrillic 124,412 508,217 84,535,241 Khmer khm_Khmr Austroasiatic 24,495 122,243 3,043,925 Kinyarwanda kin_Latn Latin 30,401 172,201 12,049,616 Kyrgyz kir_Cyrl Cyrillic 53,010 199,713 34,404,281 Northern Kurdish kmr_Latn Latin 39,262 164,666 23,834,960 Korean kor_Hang Hanja 2,614,089 13,563,283 2,006,080,705 Lao lao_Laoo 50,611 208,768 31,029,380 Ligurian lij_Latn Latin 8,751 56,266 2,958,179 Limburgish lim_Latn Latin 189,547 1,076,047 42,534,327 Lingala lin_Latn Latin 24,614 152,132 4,053,459 Lithuanian lit_Latn Latin 1,688,811 8,869,443 1,161,476,040 Lombard lmo_Latn Latin 30,506 151,855 9,058,614 Latgalian ltg_Latn Latin 11,948 61,624 4,148,492 Luxembourgish ltz_Latn Latin 44,987 246,346 16,676,872 Ganda lug_Latn Latin 1,878 7,215 789,917 Mizo lus_Latn Latin 7,880 26,817 4,978,472 Standard Latvian lvs_Latn Latin 896,243 4,141,648 587,653,855 Magahi mag_Deva Devanagari 1,097 3,847 205,763 Malayalam mal_Mlym 14,140 52,679 1,689,010 Marathi mar_Deva Devanagari 50,391 163,868 6,689,250 Minangkabau min_Latn Latin 9,341 35,309 1,256,931 Macedonian mkd_Cyrl Cyrillic 542,250 1,853,070 307,232,151 Maltese mlt_Latn Latin 120,888 709,242 36,097,957 Maori mri_Latn Latin 24,322 130,137 24,957,914 Burmese mya_Mymr 8,144 44,188 539,527 Dutch nld_Latn Latin 17,096,727 65,606,013 9,670,041,731 Norwegian Nynorsk nno_Latn Latin 199,355 1,012,313 67,799,774 Norwegian Bokmal nob_Latn Latin 2,229,702 9,698,128 1,294,178,095 Nepali npi_Deva Devanagari 31,239 127,193 3,138,539 Nyanja nya_Latn Latin 12,047 67,192 8,596,769 Occitan oci_Latn Latin 164,852 671,881 59,309,549 Odia ory_Orya 4,319 15,574 378,635 Pangasinan pag_Latn Latin 4,214 32,287 546,071 Eastern Panjabi pan_Guru 11,497 46,168 1,887,991 Papiamento pap_Latn Latin 55,224 363,015 10,002,655 Southern Pasto 
pbt_Arab Arabic 32,604 110,807 29,170,322 Western Persian pes_Arab Arabic 7,048,946 25,200,571 6,210,479,015 Plateau Malgasy plt_Latn Latin 32,521 120,673 29,263,848 Polish pol_Latn Latin 14,549,605 60,639,244 11,104,144,109 Portuguese por_Latn Latin 8,145,664 26,530,423 4,760,063,083 Dari prs_Arab Arabic 515,041 2,589,859 517,053,967 Ayacucho Quechua quy_Latn Latin 1,578 11,817 362,690 Romanian ron_Latn Latin 5,180,171 17,964,048 3,548,291,261 Rundi run_Latn Latin 20,001 67,096 8,686,054 Russian rus_Cyrl Cyrillic 15,913,845 69,542,828 18,909,213,208 Sango sag_Latn Latin 2,124 13,556 454,455 Sicilian scn_Latn Latin 73,199 424,362 27,110,743 Sinhala sin_Sinh 58,767 221,183 14,270,972 Slovak slk_Latn Latin 3,008,599 15,067,234 1,963,804,563 Slovenian slv_Latn Latin 1,472,025 7,210,285 935,834,754 Samoan smo_Latn Latin 12,346 71,359 14,954,824 Shona sna_Latn Latin 12,698 68,782 6,112,600 Sindhi snd_Arab Arabic 21,095 74,289 17,647,825 Somali som_Latn Latin 77,343 301,429 34,554,975 Southern Sotho sot_Latn Latin 7,718 43,146 6,156,450 Spanish spa_Latn Latin 22,713,366 78,361,087 14,616,773,475 Sardinian srd_Latn Latin 675,539 4,059,493 106,159,957 Serbian srp_Cyrl Cyrillic 604,557 2,286,171 401,223,741 Sundanese sun_Latn Latin 44,310 236,025 13,627,832 Swedish swe_Latn Latin 3,302,730 10,860,518 1,779,284,152 Swahili swh_Latn Latin 137,134 593,418 59,454,896 Silesian szl_Latn Latin 23,535 132,459 5,996,972 Tamil tam_Taml Dravidian Tamil 36,196 167,669 4,834,946 Tatar tat_Cyrl Cyrillic 37,188 143,842 22,831,350 Telugu tel_Telu Brahmic Telugu 22,974 81,033 2,273,772 Tajik tgk_Cyrl Cyrillic 125,236 417,591 90,503,778 Tagalog tgl_Latn Latin 151,437 673,814 97,708,639 Thai tha_Thai Thai 2,983,837 11,621,786 2,839,211,104 Tigrinya tir_Ethi Ge'ez 2,657 8,707 1,725,422 Tok Pisin tpi_Latn Latin 5,063 35,169 460,853 Turkmen tuk_Latn Latin 13,024 57,354 9,766,999 Turkish tur_Latn Latin 4,478,700 12,401,091 2,394,669,068 Twi twi_Latn Latin 3,305 13,634 495,220 Uyghur uig_Arab 
Arabic 10,713 41,709 6,785,318 Ukrainian ukr_Cyrl Cyrillic 2,721,424 10,929,796 1,928,351,595 Urdu urd_Arab Arabic 407,098 1,239,125 242,007,283 Northern Uzbek uzn_Latn Latin 156,632 798,155 89,022,562 Venetian vec_Latn Latin 330,611 1,830,777 71,077,531 Vietnamese vie_Latn Latin 12,621,521 47,411,488 11,616,191,199 Wolof wol_Latn Latin 4,658 20,380 1,596,432 Xhosa xho_Latn Latin 25,950 142,387 15,809,823 Eastern Yiddish ydd_Hebr 12,486 57,510 17,369,727 Yoruba yor_Latn Latin 56,700 286,933 32,614,558 Yue Chinese yue_Hant 33,671 203,513 24,172,441 Chinese (Simplified) zho_Hans Hanzi 9,861,262 36,152,754 8,078,842,701 Chinese (Traditional) zho_Hant Hant 3,967,966 16,307,258 2,962,854,441 Standard Malay zsm_Latn Latin 1,179,744 5,488,632 432,667,199 Zulu zul_Latn Latin 30,717 156,639 11,345,288"},{"location":"versions/oscar-2019/","title":"OSCAR 2019","text":"OSCAR 2019 is the original 2019 release of the OSCAR corpus. It has been generated from Common Crawl corpus using the goclassy architecture.
"},{"location":"versions/oscar-2019/#features","title":"Features","text":"OSCAR 2019 is shuffled at line level and no metadata is provided. Thus it is mainly intended to be used in the training of unsupervised language models for NLP.
Data is distributed by language in both original and deduplicated form.
If you need the unshuffled version of OSCAR, please contact us using the contact form. Please include your name, affiliation, contact details, which languages do you need and a brief description of how you intend to use OSCAR. You can also download it using HuggingFace\u2019s datasets library.
Even though OSCAR is not Postcardware, we do appreciate when our users send us a postcard. If you want to send us one, you can find the address in the contact section down below.
"},{"location":"versions/oscar-2019/#citing-oscar","title":"Citing OSCAR","text":"If you use OSCAR to train a language model, text generation model or any other ML model in general please consider citing our latest paper:
@inproceedings{ortiz-suarez-etal-2020-monolingual,\n title = \"A Monolingual Approach to Contextualized Word Embeddings for Mid-Resource Languages\",\n author = \"Ortiz Su{\\'a}rez, Pedro Javier and\n Romary, Laurent and\n Sagot, Beno{\\^\\i}t\",\n booktitle = \"Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics\",\n month = jul,\n year = \"2020\",\n address = \"Online\",\n publisher = \"Association for Computational Linguistics\",\n url = \"https://www.aclweb.org/anthology/2020.acl-main.156\",\n pages = \"1703--1714\",\n abstract = \"We use the multilingual OSCAR corpus, extracted from Common Crawl via language classification, filtering and cleaning, to train monolingual contextualized word embeddings (ELMo) for five mid-resource languages. We then compare the performance of OSCAR-based and Wikipedia-based ELMo embeddings for these languages on the part-of-speech tagging and parsing tasks. We show that, despite the noise in the Common-Crawl-based OSCAR data, embeddings trained on OSCAR perform much better than monolingual embeddings trained on Wikipedia. They actually equal or improve the current state of the art in tagging and parsing for all five languages. In particular, they also improve over multilingual Wikipedia-based contextual embeddings (multilingual BERT), which almost always constitutes the previous state of the art, thereby showing that the benefit of a larger, more diverse corpus surpasses the cross-lingual benefit of multilingual embedding architectures.\",\n}\n
"},{"location":"versions/oscar-2019/#the-unshuffled-oscar","title":"The Unshuffled OSCAR","text":"If you need a copy of any of the unshuffled sub-corpora, please contact us using the contact form down below. Please include your name, affiliation, contact details, which languages do you need and a brief description of how you intend to use OSCAR. We will evaluate your request and answer accordingly.
{{% callout note %}} The unshuffled OSCAR is now available in HuggingFace\u2019s datasets library {{% /callout %}} They have obtained our permission to redistribute the unshuffled OSCAR and they allow users to download a corpus all at once as opposed to file by file. You can get more information about how to download OSCAR using their library by visiting OSCAR's dataset card.
"},{"location":"versions/oscar-2019/#downloading-oscar","title":"Downloading OSCAR","text":"All the data is distributed by language, both the original and the deduplicated versions of the data are available. To download a file just click the desired link on the table below. Languages are split in shards of around 700MB, these shards are standalone. A plain text file with checksums is also provided.
The OSCAR corpus is yet to be filtered, so please be careful when using it, specially for text generation tasks! To see which sub-corpora have been audited, please refer to the list of publications above for more information.
You'll be asked to create an HumanID account in order to download a corpus. This is intended, and we do it in order to limit traffic and reduce abuse of the infrastructure. The OSCAR corpus is hosted by Huma-Num, you can read more about them on their website.
All sizes are for the uncompressed files.
Language Words original Size original File original Words deduplicated Size deduplicated File deduplicated Afrikaans 43,482,801 241M af 29,533,437 163M af Albanian 374,196,110 2.3G sq 186,856,699 1.2G sq Alemannic 841,750 5.0M als 459,001 2.8M als Amharic 28,301,601 360M am 16,086,628 206M am Arabic 8,117,162,828 82G ar 3,171,221,354 32G ar Aragonese 52,896 1.3M an 45,669 801K an Armenian 273,919,388 3.7G hy 110,196,043 1.5G hy Assamese 6,956,663 113M as 4,366,570 71M as Asturian 381,005 2.4M ast 325,237 2.0M ast Avaric 24,720 409K av 19,478 324K av Azerbaijani 322,641,710 2.8G az 167,742,296 1.5G az Bashkir 9,796,764 128M ba 6,922,589 90M ba Basque 120,456,652 848M eu 45,359,710 342M eu Bavarian 399 503 bar 399 503 bar Belarusian 144,579,630 1.8G be 83,499,037 1.1G be Bengali 623,575,733 11G bn 363,766,143 5.8G bn Bihari 8,848 110K bh 2,875 34K bh Bishnupriya 198,286 4.1M bpy 96,940 1.7M bpy Bosnian 106,448 447K bs 20,485 116K bs Breton 5,013,241 29M br 2,890,384 16M br Bulgarian 2,947,648,106 32G bg 1,268,114,977 14G bg Burmese 56,111,184 1.9G my 30,102,173 1.1G my Catalan 1,360,212,450 8.0G ca 729,333,440 4.3G ca Cebuano 6,603,567 39M ceb 3,675,024 24M ceb Central Bikol 312 885 bcl 312 885 bcl Central Khmer 20,690,610 1.1G km 10,082,245 581M km Central Kurdish 48,478,334 487M ckb 18,726,721 226M ckb Chavacano 130 520 cbk 130 520 cbk Chechen 711,051 8.3M ce 568,146 6.7M ce Chinese 14,986,424,850 508G zh 6,350,215,113 249G zh Chuvash 3,041,614 39M cv 2,054,810 26M cv Cornish 8,329 44K kw 2,704 14K kw Croatian 34,232,765 226M hr 16,727,640 110M hr Czech 7,715,977,441 53G cs 3,540,997,509 24G cs Danish 2,637,463,889 16G da 1,620,091,317 9.5G da Dhivehi 7,559,472 126M dv 4,726,660 79M dv Dimli 19 146 diq 19 146 diq Dutch 13,020,136,373 78G nl 6,598,786,137 39G nl Eastern Mari 565,992 7.2M mhr 469,297 6.0M mhr Egyptian Arabic 7,305,151 66M arz 3,659,419 33M arz Emilian-Romagnol 6,376 25K eml 6,121 24K eml English 418,187,793,408 2.3T en 215,841,256,971 1.2T en Erzya 
90 1.4K myv 78 1.2K myv Esperanto 48,486,161 299M eo 37,324,446 228M eo Estonian 643,163,730 4.8G et 309,931,463 2.3G et Finnish 3,196,666,419 27G fi 1,597,855,468 13G fi French 46,896,036,417 282G fr 23,206,776,649 138G fr Galician 102,011,291 620M gl 63,600,602 384M gl Georgian 171,950,621 3.6G ka 91,569,739 1.9G ka German 44,878,908,446 308G de 21,529,164,172 145G de Goan Konkani 124,277 2.2M gom 102,306 1.8M gom Guarani 7,382 36K gn 4,680 24K gn Gujarati 72,045,701 1.1G gu 50,023,432 722M gu Haitian 1,014 3.9K ht 832 3.3K ht Hebrew 2,067,753,528 20G he 1,032,018,056 9.8G he Hindi 1,372,234,782 17G hi 745,774,934 8.9G hi Hungarian 5,163,936,345 40G hu 2,339,127,555 18G hu Icelandic 219,900,094 1.5G is 129,818,331 846M is Ido 25,702 147K io 22,773 130K io Iloko 142,942 874K ilo 105,564 636K ilo Indonesian 4,574,692,265 30G id 2,394,957,629 16G id Interlingua 180,231 662K ia 100,019 360K ia Interlingue 5,352 24K ie 602 1.6K ie Irish 14,483,593 88M ga 10,017,303 60M ga Italian 22,248,707,341 137G it 11,250,012,896 69G it Japanese 4,962,979,182 216G ja 1,123,067,063 106G ja Javanese 104,896 659K jv 86,654 583K jv Kalmyk 10,277 113K xal 10,155 112K xal Kannada 81,186,863 1.7G kn 49,343,462 1.1G kn Karachay-Balkar 185,436 2.6M krc 166,496 2.3M krc Kazakh 191,126,469 2.7G kk 108,388,743 1.5G kk Kirghiz 44,194,823 600M ky 28,982,620 388M ky Komi 201,404 2.3M kv 95,243 1.2M kv Korean 2,368,765,142 24G ko 1,120,375,149 12G ko Kurdish 15,561,003 94M ku 9,946,440 60M ku Lao 4,133,311 174M lo 2,583,342 114M lo Latin 4,122,201 26M la 1,328,038 8.3M la Latvian 520,761,977 4.0G lv 236,428,905 1.8G lv Lezghian 247,646 3.3M lez 224,871 3.0M lez Limburgan 4,730 29K li 4,283 27K li Lithuanian 1,159,661,742 8.8G lt 516,183,525 3.9G lt Lojban 154,330 736K jbo 141,973 678K jbo Lombard 75,229 443K lmo 73,665 433K lmo Low German 2,906,347 18M nds 2,146,417 13M nds Lower Sorbian 1,787 13K dsb 966 7.1K dsb Luxembourgish 4,403,577 29M lb 3,087,650 21M lb Macedonian 189,289,873 2.1G mk 
102,849,595 1.2G mk Maithili 69,161 317K mai 874 11K mai Malagasy 3,068,360 21M mg 1,872,044 13M mg Malay 16,696,882 111M ms 6,045,753 42M ms Malayalam 189,534,472 4.9G ml 95,892,551 2.5G ml Maltese 2,995,654 24M mt 2,163,358 17M mt Marathi 162,609,404 2.7G mr 82,130,803 1.4G mr Mazanderani 73,870 691K mzn 64,481 602K mzn Minangkabau 5,682 608K min 4,825 310K min Mingrelian 299,098 5.8M xmf 228,629 4.4M xmf Mirandese 171 1.2K mwl 152 1.1K mwl Modern Greek 5,479,180,137 62G el 2,412,419,435 27G el Mongolian 181,307,167 2.2G mn 68,362,013 838M mn Nahuatl languages 1,234 12K nah 1,193 11K nah Neapolitan 5,282 17K nap 4,147 13K nap Nepali 107,448,208 1.8G ne 71,628,317 1.2G ne Newari 564,697 5.5M new 288,995 4.1M new Northern Frisian 1,516 4.4K frr 1,516 4.4K frr Northern Luri 8,022 76K lrc 6,740 63K lrc Norwegian 1,344,326,388 8.0G no 804,894,377 4.7G no Norwegian Nynorsk 14,764,980 85M nn 9,435,139 54M nn Occitan 750,301 5.8M oc 512,678 3.7M oc Oriya 14,938,567 248M or 11,321,740 188M or Ossetian 1,031,268 13M os 878,765 11M os Pampanga 130 760 pam 52 304 pam Panjabi 61,847,806 763M pa 37,555,835 460M pa Persian 9,096,554,121 79G fa 4,363,505,319 38G fa Piemontese 362,013 2.1M pms 337,246 1.9M pms Polish 15,277,255,137 109G pl 6,708,709,674 47G pl Portuguese 20,641,903,898 124G pt 10,751,156,918 64G pt Pushto 46,559,441 361M ps 31,347,348 242M ps Quechua 10,186 78K qu 8,691 67K qu Romanian 3,984,317,058 25G ro 1,741,794,069 11G ro Romansh 1,093 7.4K rm 960 6.5K rm Russia Buriat 963 13K bxr 809 11K bxr Russian 92,522,407,837 1.2T ru 46,692,691,520 568G ru Sanskrit 4,331,569 93M sa 1,713,930 37M sa Scottish Gaelic 310,689 1.9M gd 207,110 1.3M gd Serbian 364,395,411 3.9G sr 207,561,168 2.2G sr Serbo-Croatian 5,292,184 25M sh 1,040,573 5.8M sh Sicilian 554 3.3K scn 468 2.8K scn Sindhi 43,530,158 347M sd 33,028,015 263M sd Sinhala 93,053,465 1.4G si 50,864,857 802M si Slovak 1,322,247,763 9.1G sk 656,346,179 4.5G sk Slovenian 387,399,700 2.5G sl 193,926,684 1.3G sl Somali 
1,202 61K so 472 16K so South Azerbaijani 2,175,054 27M azb 1,528,709 19M azb Spanish 47,545,122,279 278G es 25,928,290,729 149G es Sundanese 30,321 211K su 20,278 141K su Swahili 2,211,927 13M sw 1,376,963 8.1M sw Swedish 7,155,994,312 44G sv 4,106,120,608 25G sv Tagalog 98,949,299 573M tl 70,121,601 407M tl Tajik 31,758,142 379M tg 21,029,893 249M tg Tamil 420,537,132 9.3G ta 226,013,330 5.1G ta Tatar 51,034,893 670M tt 23,825,695 305M tt Telugu 123,711,517 2.5G te 79,094,167 1.6G te Thai 951,743,087 36G th 368,965,202 16G th Tibetan 1,483,589 187M bo 936,556 138M bo Turkish 7,577,388,700 60G tr 3,365,734,289 27G tr Turkmen 1,113,869 11M tk 752,326 6.8M tk Tuvinian 759 12K tyv 540 7.9K tyv Uighur 8,657,141 122M ug 5,852,225 83M ug Ukrainian 4,204,381,276 53G uk 2,252,380,351 28G uk Upper Sorbian 545,351 4.2M hsb 236,867 1.8M hsb Urdu 331,817,982 2.7G ur 218,030,228 1.7G ur Uzbek 2,450,256 21M uz 1,381,644 12M uz Venetian 3,492 18K vec 3,199 17K vec Vietnamese 12,036,845,359 68G vi 5,577,159,843 32G vi Volap\u00fck 321,121 2.0M vo 318,568 2.0M vo Walloon 50,720 273K wa 37,543 203K wa Waray 397,315 2.5M war 336,311 2.2M war Welsh 37,422,441 213M cy 23,574,673 133M cy Western Frisian 5,691,077 35M fy 4,223,816 26M fy Western Mari 93,338 1.2M mrj 87,780 1.1M mrj Western Panjabi 1,426,986 12M pnb 1,111,112 9.0M pnb Wu Chinese 11,189 109K wuu 4,333 32K wuu Yakut 2,547,623 42M sah 1,789,174 26M sah Yiddish 13,834,320 141M yi 8,212,970 84M yi Yoruba 8,906 55K yo 3,518 27K yo Yue Chinese 186 3.7K yue 128 2.2K yue"},{"location":"versions/oscar-2019/#license","title":"License","text":"These data are released under this licensing scheme:
"},{"location":"versions/oscar-2019/#notice-and-take-down-policy","title":"Notice and take down policy","text":"
Notice: Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
Take down: We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
"},{"location":"versions/oscar-2019/#models","title":"Models","text":"Here is a list of some language models that have been trained using the OSCAR corpus or that are part of the OSCAR project:
Model Language Corpus Authors Paper Files License ELMo Bulgarian OSCAR Pedro J. Ortiz, Beno\u00eet Sagot and Laurent Romary ACL 2020 bg.zip MIT ELMo Bulgarian Wikipedia Pedro J. Ortiz, Beno\u00eet Sagot and Laurent Romary ACL 2020 bg.zip MIT ELMo Catalan OSCAR Pedro J. Ortiz, Beno\u00eet Sagot and Laurent Romary ACL 2020 ca.zip MIT ELMo Catalan Wikipedia Pedro J. Ortiz, Beno\u00eet Sagot and Laurent Romary ACL 2020 ca.zip MIT ELMo Danish OSCAR Pedro J. Ortiz, Beno\u00eet Sagot and Laurent Romary ACL 2020 da.zip MIT ELMo Danish Wikipedia Pedro J. Ortiz, Beno\u00eet Sagot and Laurent Romary ACL 2020 da.zip MIT ELMo French OSCAR Pedro J. Ortiz, Yoann Dupont, Benjamin Muller, Laurent Romary and Beno\u00eet Sagot LREC 2020 fr.zip MIT ELMo Finnish OSCAR Pedro J. Ortiz, Beno\u00eet Sagot and Laurent Romary ACL 2020 fi.zip MIT ELMo Finnish Wikipedia Pedro J. Ortiz, Beno\u00eet Sagot and Laurent Romary ACL 2020 fi.zip MIT ELMo Indonesian OSCAR Pedro J. Ortiz, Beno\u00eet Sagot and Laurent Romary ACL 2020 id.zip MIT ELMo Indonesian Wikipedia Pedro J. Ortiz, Beno\u00eet Sagot and Laurent Romary ACL 2020 id.zip MIT"},{"location":"versions/oscar-2019/#featured-models","title":"Featured Models","text":"Here is a list of Language models trained by the community:
Model Language Cased Corpus Authors Paper Website Files License AraBERT Arabic Cased OSCAR, Wikipedia, 1.5B words Arabic Corpus, OSIAN, Assafir Wissam Antoun, Fady Baly and Hazem Hajj ACL Anthology GitHub Hugging Face N/A Arabic-BERT Arabic Cased OSCAR and Wikipedia Ali Safaya ArXiv GitHub Hugging Face MIT AraELECTRA Arabic Cased OSCAR, Wikipedia, 1.5B words Arabic Corpus, OSIAN, Assafir Wissam Antoun, Fady Baly and Hazem Hajj ArXiV GitHub Hugging Face N/A AraGPT2 Arabic Cased OSCAR, Wikipedia, 1.5B words Arabic Corpus, OSIAN, Assafir Wissam Antoun, Fady Baly and Hazem Hajj ArXiv GitHub Hugging Face N/A CamemBERT French Cased OSCAR Louis Martin, Benjamin Muller, Pedro Javier Ortiz Su\u00e1rez, Yoann Dupont, Laurent Romary, \u00c9ric Villemonte de la Clergerie, Djam\u00e9 Seddah and Beno\u00eet Sagot ACL 2020 camembert-model.fr camembert-base.tar.gz MIT CamemBERT French Cased Subsample of OSCAR (4 GB of text) Louis Martin, Benjamin Muller, Pedro Javier Ortiz Su\u00e1rez, Yoann Dupont, Laurent Romary, \u00c9ric Villemonte de la Clergerie, Djam\u00e9 Seddah and Beno\u00eet Sagot ACL 2020 camembert-model.fr camembert-base-oscar-4gb.tar.gz MIT LePetit French Cased Subsample of OSCAR (2 GB of text) Vincent Micheli, Martin d'Hoffschmidt, Quentin Heinrich Medium blog illuin.tech Hugging Face MIT GigaBERT Arabic Cased and Uncased OSCAR, Wikipedia, Gigaword Wuwei Lan, Yang Chen, Wei Xu, Alan Ritter EMNLP 2020 GitHub Hugging Face MIT ELECTRA Norwegian Cased OSCAR and OPUS Viktor Alm N/A Hugging Face Hugging Face N/A BERT Romanian Cased OSCAR, Wikipedia and OPUS Dumitrescu Stefan and Andrei Avram SOON GitHub Hugging Face MIT BERT Romanian Uncased OSCAR, Wikipedia and OPUS Dumitrescu Stefan and Andrei Avram SOON GitHub Hugging Face MIT RoBERTa Sinhala N/A OSCAR Keshan Sodimana N/A Hugging Face Hugging Face N/A BERT Turkish Cased and Uncased OSCAR, Wikipedia and OPUS Stefan Schweter Zenodo GitHub Hugging Face MIT ELECTRA Turkish Cased OSCAR, Wikipedia and OPUS Stefan Schweter 
Zenodo GitHub Hugging Face MIT XLMIndic Hindi, Bengali, Gujarati, Panjabi, Marathi, Oriya, Assamese, Sinhala, Nepali, Bihari, Bishnupriya, Maithili, Goan Konkani, Sanskrit Cased OSCAR Ibraheem Muhammad Moosa, Mahmud Shimul and Ashfia Binte Habib Arxiv GitHub Hugging Face MITIf you have trained a model using the OSCAR corpus and would like to have it featured here, please open a pull request in our GitHub repo. Help us grow the community!
"},{"location":"versions/oscar-2109/","title":"OSCAR 21.09","text":""},{"location":"versions/oscar-2109/#features","title":"Features","text":"These are the versions of the tooling, schemas and data used in this release.
The new OSCAR schema incorporates backward-compatible changes.
"},{"location":"versions/oscar-2109/#changes_1","title":"Changes","text":"The old OSCAR Schema v1.0 featured the following file hierarchy, in an uncompressed form:
/\n\u251c\u2500\u2500 af\n\u2502 \u251c\u2500\u2500 af_sha256.txt\n\u2502 \u2514\u2500\u2500 af.txt.gz\n\u251c\u2500\u2500 de\n\u2502 \u251c\u2500\u2500 de_sha256.txt # Checksum file \n\u2502 \u2514\u2500\u2500 de.txt.gz # Textual content\n\u251c\u2500\u2500 en\n\u2502 \u251c\u2500\u2500 en_part_1.txt.gz # Multipart example\n\u2502 \u251c\u2500\u2500 en_part_2.txt.gz\n\u2502 \u2514\u2500\u2500 en_sha256.txt\n\u251c\u2500\u2500 yi\n\u2502 \u251c\u2500\u2500 yi_sha256.txt\n\u2502 \u2514\u2500\u2500 yi.txt.gz\n\u2514\u2500\u2500 zh\n \u251c\u2500\u2500 zh_sha256.txt\n \u2514\u2500\u2500 zh.txt.gz\n
The new OSCAR Schema v1.1 features the following file hierarchy (some languages omitted):
/\n\u251c\u2500\u2500 af\n\u2502 \u251c\u2500\u2500 af_meta.jsonl.gz\n\u2502 \u251c\u2500\u2500 af_sha256.txt\n\u2502 \u2514\u2500\u2500 af.txt.gz\n\u251c\u2500\u2500 de\n\u2502 \u251c\u2500\u2500 de_meta.jsonl.gz # Metadata, in JSONLines format\n\u2502 \u251c\u2500\u2500 de_sha256.txt # Checksum file \n\u2502 \u2514\u2500\u2500 de.txt.gz # Textual content\n\u251c\u2500\u2500 en\n\u2502 \u251c\u2500\u2500 en_meta_part_1.jsonl.gz # Multipart example\n\u2502 \u251c\u2500\u2500 en_meta_part_2.jsonl.gz # Each part is independent,\n\u2502 \u251c\u2500\u2500 en_part_1.txt.gz # Ex: en_part_2.txt.gz and en_meta_part_2.jsonl.gz\n\u2502 \u251c\u2500\u2500 en_part_2.txt.gz\n\u2502 \u2514\u2500\u2500 en_sha256.txt\n\u251c\u2500\u2500 yi\n\u2502 \u251c\u2500\u2500 yi_meta.jsonl.gz\n\u2502 \u251c\u2500\u2500 yi_sha256.txt\n\u2502 \u2514\u2500\u2500 yi.txt.gz\n\u2514\u2500\u2500 zh\n \u251c\u2500\u2500 zh_meta.jsonl.gz\n \u251c\u2500\u2500 zh_sha256.txt\n \u2514\u2500\u2500 zh.txt.gz\n
"},{"location":"versions/oscar-2109/#file-formats","title":"File formats","text":""},{"location":"versions/oscar-2109/#txt-files","title":".txt
files","text":"Lines are separated by a single newline, and documents are separated by a double newline. In other words, there is a blank line between each document.
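This format is easy to parse with plain Python. Below is a minimal sketch (the sample text and function name are illustrative, not part of the official tooling):

```python
# Sketch: split OSCAR Schema v1.1 .txt content into documents.
# Documents are separated by a blank line; each document is a list of lines.
def split_documents(text: str) -> list:
    docs = []
    for block in text.split("\n\n"):
        lines = [line for line in block.split("\n") if line]
        if lines:
            docs.append(lines)
    return docs

raw = "First doc, line 1\nFirst doc, line 2\n\nSecond doc, only line\n"
docs = split_documents(raw)
print(len(docs))   # 2
print(docs[0][1])  # First doc, line 2
```

In practice you would read the text from the gzipped `<lang>.txt.gz` file rather than an in-memory string.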
"},{"location":"versions/oscar-2109/#jsonl-files","title":".jsonl
files","text":"These are the metadata, in JSONLines format.
Each line conforms to the following JSON Schema:
{\n\"$schema\": \"http://json-schema.org/draft-07/schema#\",\n\"title\": \"Metadata\",\n\"description\": \"Holds record headers.\\n\\nEach metadata is linked to a specific paragraph/text zone\",\n\"type\": \"object\",\n\"required\": [\n\"headers\",\n\"nb_sentences\",\n\"offset\"\n],\n\"properties\": {\n\"headers\": {\n\"type\": \"object\",\n\"additionalProperties\": {\n\"type\": \"string\"\n}\n},\n\"nb_sentences\": {\n\"type\": \"integer\",\n\"format\": \"uint\",\n\"minimum\": 0.0\n},\n\"offset\": {\n\"type\": \"integer\",\n\"format\": \"uint\",\n\"minimum\": 0.0\n}\n}\n}\n
Example:
{\n\"headers\":{ // these headers keys are *almost* always present.\n\"content-length\":\"11062\", // the content length is not changed and reflects the \n// length before filtering and eventual deduplication.\n\"warc-target-uri\":\"...\",\n\"warc-type\":\"conversion\",\n\"content-type\":\"text/plain\",\n\"warc-date\":\"2021-02-24T17:55:29Z\", // Following WARC specification, it is the crawl date.\n\"warc-identified-content-language\":\"eng,zho\",\n\"warc-refers-to\":\"<urn:uuid:c649de0e-42a3-4e69-b675-98e28e084698>\",\n\"warc-block-digest\":\"sha1:V4PYYGYA6ZYA2WACDKSNL6NXGDN6XK6X\",\n\"warc-record-id\":\"<urn:uuid:121a822f-5362-4559-8891-d085415cdd90>\"\n},\n\"offset\":0, // Related text is in the text file, from lines offset+1 to lines offset+nb_sentences.\n\"nb_sentences\":9\n}\n
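The offset and nb_sentences fields map each metadata record to a slice of the corresponding text file. A minimal sketch of that pairing (the sample lines and records are illustrative):

```python
# Sketch: pair OSCAR 21.09 metadata records with their text lines.
# A record covers text-file lines offset+1 to offset+nb_sentences (1-indexed),
# i.e. the 0-indexed slice [offset : offset + nb_sentences].
text_lines = [
    "doc1 line1",
    "doc1 line2",
    "",            # blank separator between documents
    "doc2 line1",
]
metadata = [
    {"offset": 0, "nb_sentences": 2},
    {"offset": 3, "nb_sentences": 1},
]

def document_text(meta: dict, lines: list) -> list:
    start = meta["offset"]
    return lines[start : start + meta["nb_sentences"]]

print(document_text(metadata[0], text_lines))  # ['doc1 line1', 'doc1 line2']
print(document_text(metadata[1], text_lines))  # ['doc2 line1']
```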
"},{"location":"versions/oscar-2109/#lang_sha256txt-files","title":"<lang>_sha256.txt
files","text":"These are used to check for possible corruption during download. They can be used by running sha256sum -c <lang>_sha256.txt
.
[^1]: gsw is the ISO 639-2 code for Alemannic German. It was identified as als in previous OSCAR versions, due to a bug in fastText.
[^2]: The eml identification tag is deprecated; it corresponds to the rgn and egl tags in ISO 639-3.
OSCAR 22.01 is the OSCAR version from January 2022, based on the November/December 2021 dump of Common Crawl. It features a different file layout, which makes it not backward compatible with code designed to run with previous OSCAR versions.
Request access \ud83e\udd17 Datasets Read the paper
"},{"location":"versions/oscar-2201/#summary","title":"Summary","text":"OSCAR 22.01 is document-oriented: rather than extracting lines and sorting them into language subcorpora, we classify documents as a whole. The main difference is that sentences in a document are contiguous and should make sense one after another, but individual sentences are not guaranteed to be in the subcorpus' language.
Note
As an example, the English Wikipedia page about La Marseillaise contains sentences in French (the anthem's lyrics). In line-oriented corpora, these sentences would have been put in the French subcorpus. In OSCAR 22.01, they stay with the rest of the article, in a document classified as English.
"},{"location":"versions/oscar-2201/#layout","title":"Layout","text":"As in previous corpora, there is one subcorpus per language, plus one new subcorpus for multilingual documents. Subcorpora are distributed in JSONLines format, split into 1GB chunks, then gzipped.
Note
Splits are completely independent and self-contained: It is possible to only download en_meta_134.jsonl.gz
and to do processing on it.
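Since each split is a gzipped JSONLines file, it can be streamed with the Python standard library alone. A minimal, self-contained sketch (the sample documents are illustrative; a real split such as en_meta_134.jsonl.gz is read the same way):

```python
import gzip
import json
import os
import tempfile

def read_split(path):
    """Yield one JSON document per line of a gzipped JSONLines split."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)

# Self-contained demo: write a tiny two-document sample split, then stream it.
sample_docs = [{"content": "hello"}, {"content": "world"}]
with tempfile.NamedTemporaryFile(suffix=".jsonl.gz", delete=False) as tmp:
    path = tmp.name
with gzip.open(path, "wt", encoding="utf-8") as f:
    for doc in sample_docs:
        f.write(json.dumps(doc) + "\n")

contents = [doc["content"] for doc in read_split(path)]
print(contents)  # ['hello', 'world']
os.remove(path)
```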
{\n\"content\":\"newline\\nseparaaaaaaaaaaated\\ncontent\", // (1)\n\"warc_headers\":{ // (2) \n\"warc-refers-to\":\"<urn:uuid:83f2e1d4-5ed3-41db-86ff-f7826c4c20f9>\", \"warc-date\":\"2021-09-16T11:07:14Z\",\n\"warc-block-digest\":\"sha1:X3OWP47FG2O5LBNMFSNB44FJF2SSRC26\",\n\"warc-type\":\"conversion\",\n\"warc-identified-content-language\":\"eng\",\n\"content-length\":\"1694\",\n\"warc-target-uri\":\"https://foo.bar\",\n\"warc-record-id\":\"<urn:uuid:3304bc27-17d0-4ffd-a692-340381478a5f>\",\n\"content-type\":\"text/plain\"\n},\n\"metadata\":{\n// (3)\n\"identification\":{\n\"label\":\"en\",\n\"prob\":0.6268374\n},\n// (4)\n\"annotation\":[\n\"short_sentences\",\n\"footer\"\n],\n// (5)\n\"sentence_identifications\":[\n{\n\"label\":\"en\",\n\"prob\":0.93925816\n},\nnull,\n{\n\"label\":\"en\",\n\"prob\":0.9606543\n}\n]\n}\n}\n
The numbered comments in the example above are:

1. content: newline-separated content, with lines joined by \n.
2. warc_headers: the headers of the crawled WARC record.
3. identification: document-level language identification, where prob is the weighted average of the confidence of the identified lines.
4. annotation: quality annotations, null if the document has none.
5. sentence_identifications: line-level language identifications, null for each line that has no identification.

Annotations can take the following values:

- tiny: The document has a low (<5) number of lines.
- short_sentences: The document has a high number (>50%) of short lines (<400 bytes).
- header: The document has a high number of short lines at its head, suggesting the presence of low-quality content.
- footer: The document has a high number of short lines at its tail, suggesting the presence of low-quality content.
- noisy: The document has a high percentage of punctuation (>50%).
- adult: The document contains adult content. This annotation uses a blocklist and labels a tiny part of the corpus: it does not catch most of the adult content.
"},{"location":"versions/oscar-2201/#filtering","title":"Filtering","text":"Tip
Filtering can be done using oscar-tools
, a high performance toolkit that provides rapid and efficient ways of transforming corpora into what you need. More info here.
Filtering can be done using classic Python tools, such as ujson
. While we don't supply a Python library enabling easy filtering/transformation for OSCAR 22.01, we provide some filtering examples that you can change to better suit your needs.
Using filters on warc_headers.warc-target-uri
makes filtering on URLs easy.
TODO\n
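Until an official example is published, here is a minimal sketch of such a URL filter in plain Python (the sample documents and allowed-host set are illustrative; only the warc_headers.warc-target-uri field comes from the 22.01 layout):

```python
from urllib.parse import urlparse

def keep(doc, allowed_hosts):
    """Keep a document if its warc-target-uri points to an allowed host."""
    uri = doc["warc_headers"]["warc-target-uri"]
    return urlparse(uri).netloc in allowed_hosts

docs = [  # illustrative documents
    {"warc_headers": {"warc-target-uri": "https://foo.bar/page"}},
    {"warc_headers": {"warc-target-uri": "https://example.org/other"}},
]
kept = [d for d in docs if keep(d, {"foo.bar"})]
print(len(kept))  # 1
```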
"},{"location":"versions/oscar-2201/#extracting-lines-from-non-annotated-documents","title":"Extracting lines from non-annotated documents","text":"Non-annotated documents are presumed to be cleaner than annotated ones, so extracting their content can be useful. We extract lines from documents where metadata.annotation is null
.
TODO\n
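A minimal sketch of this extraction (the sample documents are illustrative; the annotation field name follows the 22.01 example shown above):

```python
def extract_clean_lines(docs):
    """Yield lines from documents that carry no annotation."""
    for doc in docs:
        if doc["metadata"]["annotation"] is None:
            yield from doc["content"].split("\n")

docs = [  # illustrative documents
    {"content": "clean line 1\nclean line 2", "metadata": {"annotation": None}},
    {"content": "noisy line", "metadata": {"annotation": ["noisy"]}},
]
lines = list(extract_clean_lines(docs))
print(lines)  # ['clean line 1', 'clean line 2']
```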
"},{"location":"versions/oscar-2201/#getting-alemannic-lines-from-the-german-corpus","title":"Getting Alemannic lines from the German corpus","text":"As detailed in our paper, we found that the German corpus contains a significant amount of Alemannic (relative to the size of the Alemannic corpus). We use a filter on metadata.sentence_identifications
to extract those sentences.
TODO\n
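A minimal sketch of such a per-line filter (the sample document is illustrative; it assumes Alemannic lines carry the als label, which OSCAR releases up to 22.01 used for Alemannic):

```python
def lines_with_label(doc, label):
    """Return the lines of a document whose sentence identification matches label."""
    lines = doc["content"].split("\n")
    ids = doc["metadata"]["sentence_identifications"]
    return [line for line, ident in zip(lines, ids)
            if ident is not None and ident["label"] == label]

doc = {  # illustrative German-corpus document containing one Alemannic line
    "content": "Ein deutscher Satz\nE alemannischi Zyle",
    "metadata": {"sentence_identifications": [
        {"label": "de", "prob": 0.99},
        {"label": "als", "prob": 0.87},
    ]},
}
print(lines_with_label(doc, "als"))  # ['E alemannischi Zyle']
```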
"},{"location":"versions/oscar-2201/#languages","title":"Languages","text":"OSCAR 22.01 has subcorpora for 142 languages (counting the Multilingual corpus). The following table exhibits the size, number of documents and number of words for each of them.
Note that the size accounts for the raw uncompressed file size, counting metadata.
Language table Language Size # Documents # Words Multilingual 12.1 GB 1,210,685 936,187,711 Afrikaans 47.0 MB 12,393 6,227,310 Albanian 3.0 GB 437,287 326,325,149 Alemannic / Swiss German 363.6 kB 139 37,381 Amharic 461.0 MB 37,513 30,481,153 Arabic 84.2 GB 8,718,929 6,103,711,887 Aragonese 10.6 kB 12 51 Armenian 4.7 GB 379,267 268,031,270 Assamese 221.2 MB 17,084 11,109,557 Asturian 73.6 kB 77 3,919 Avaric 18.6 kB 14 582 Azerbaijani 3.5 GB 491,847 291,927,692 Bangla 15.1 GB 1,171,501 751,877,226 Bashkir 95.5 MB 11,198 5,418,474 Basque 1.1 GB 233,658 97,092,942 Belarusian 1.8 GB 180,046 107,227,860 Bihari languages 24.2 kB 27 569 Bishnupriya 2.0 MB 271 98,419 Bosnian 10.3 kB 10 422 Breton 33.7 MB 16,119 3,111,619 Bulgarian 35.1 GB 2,887,115 2,405,981,285 Burmese 1.9 GB 158,733 44,835,970 Catalan 13.9 GB 2,627,307 1,508,919,864 Cebuano 44.6 MB 5,742 5,253,785 Central Kurdish 716.4 MB 84,950 43,913,025 Chechen 14.0 MB 4,086 798,766 Chinese 900.9 GB 56,524,518 23,149,203,886 Chuvash 41.8 MB 4,750 2,465,782 Cornish 1.4 kB 2 55 Croatian 11.2 MB 11,462 505,369 Czech 58.6 GB 10,381,916 5,452,724,456 Danish 12.6 GB 2,265,479 1,454,439,292 Dimli (individual language) 706 Bytes 1 19 Divehi 217.2 MB 24,067 10,112,205 Dutch 114.0 GB 20,206,532 12,329,127,151 Eastern Mari 11.3 MB 1,612 641,525 Egyptian Arabic 2.8 MB 1,256 176,096 English 3.2 TB 431,992,659 377,376,402,775 Esperanto 558.3 MB 111,932 58,416,628 Estonian 9.2 GB 1,362,524 820,975,443 Filipino 646.5 MB 70,394 81,881,278 Finnish 37.8 GB 4,948,961 2,900,615,928 French 382.2 GB 52,037,098 41,713,990,658 Galician 255.2 MB 88,803 27,051,212 Georgian 7.1 GB 488,588 281,430,479 German 496.7 GB 70,075,424 46,826,676,844 Goan Konkani 787.2 kB 46 38,831 Greek 78.3 GB 6,738,546 5,031,242,803 Guarani 9.0 kB 10 374 Gujarati 4.8 GB 136,467 301,170,777 Hebrew 30.3 GB 3,132,396 2,249,377,984 Hindi 23.3 GB 1,529,907 1,534,799,198 Hungarian 53.9 GB 6,866,062 4,598,787,907 Icelandic 2.0 GB 396,183 210,365,124 Ido 77.3 kB 105 2,690 
Iloko 97.9 kB 75 8,592 Indonesian 17.4 GB 2,244,622 1,984,195,207 Interlingua 40.2 kB 6 10,125 Irish 45.6 MB 12,233 4,877,850 Italian 229.3 GB 28,502,092 24,294,684,830 Japanese 258.7 GB 36,328,931 5,592,948,356 Javanese 152.7 kB 70 10,441 Kalmyk 9.3 kB 9 250 Kannada 2.6 GB 150,850 108,450,571 Karachay-Balkar 119.6 kB 91 4,089 Kazakh 2.9 GB 261,085 157,267,307 Khmer 1.9 GB 121,910 30,564,131 Komi 119.9 kB 127 3,335 Korean 51.8 GB 5,881,481 3,854,968,649 Kurdish 150.3 MB 29,906 17,390,759 Kyrgyz 518.6 MB 62,244 28,028,986 Lao 337.1 MB 28,914 6,682,982 Latin 4.1 MB 4,397 187,446 Latvian 8.2 GB 1,032,987 707,361,898 Lezghian 375.5 kB 124 19,250 Limburgish 1.4 kB 2 41 Lithuanian 20.0 GB 2,303,070 1,712,802,056 Lojban 1.9 MB 570 260,542 Lombard 2.6 kB 2 225 Low German 9.0 MB 1,938 1,012,561 Lower Sorbian 707 Bytes 1 17 Luxembourgish 15.8 MB 5,108 1,545,946 Macedonian 3.6 GB 341,775 244,058,579 Maithili 21.6 kB 23 483 Malagasy 57.3 MB 3,028 7,279,056 Malay 5.3 MB 5,228 217,818 Malayalam 4.1 GB 250,972 137,831,247 Maltese 2.5 MB 2,208 118,190 Marathi 3.3 GB 250,376 160,179,233 Mazanderani 128.2 kB 76 7,337 Minangkabau 6.0 MB 585 614,613 Mingrelian 7.6 MB 2,550 253,333 Mongolian 2.8 GB 237,719 176,405,432 Nahuatl languages 8.7 kB 12 179 Nepali 3.7 GB 391,947 177,885,116 Newari 5.7 MB 1,134 273,837 Norwegian 2.8 GB 973,188 279,182,902 Norwegian Nynorsk 6.8 MB 5,835 459,183 Occitan 2.1 MB 373 31,061 Odia 487.9 MB 52,942 23,755,902 Ossetic 13.9 MB 3,560 800,430 Pashto 490.3 MB 50,312 46,293,249 Persian 77.4 GB 7,665,871 6,430,164,396 Piedmontese 1.7 MB 698 188,270 Polish 139.0 GB 19,301,137 12,584,498,906 Portuguese 170.3 GB 23,735,707 18,441,864,893 Punjabi 1.1 GB 68,094 70,068,604 Quechua 744 Bytes 1 14 Romanian 49.2 GB 4,624,764 5,261,803,995 Russia Buriat 32.9 kB 39 785 Russian 1.1 TB 76,060,844 62,811,122,663 Sakha 65.6 MB 6,284 3,473,813 Sanskrit 136.0 MB 4,472 5,671,369 Scottish Gaelic 137.7 kB 136 7,769 Serbian 6.9 GB 577,472 482,932,670 Serbian (Latin) 931.8 kB 738 
92,875 Sicilian 1.5 kB 2 50 Sindhi 117.1 MB 15,516 10,685,611 Sinhala 2.0 GB 108,593 113,179,741 Slovak 16.5 GB 2,409,555 1,619,121,944 Slovenian 1.2 GB 351,894 118,400,246 Somali 2.1 kB 3 109 South Azerbaijani 14.1 MB 5,381 693,746 Spanish 381.9 GB 51,386,247 42,829,835,316 Sundanese 5.0 MB 263 547,145 Swahili 1.3 MB 462 123,050 Swedish 48.0 GB 7,541,278 5,078,331,128 Tajik 870.9 MB 46,366 56,627,727 Tamil 11.4 GB 556,772 452,343,748 Tatar 915.3 MB 76,398 51,875,265 Telugu 3.4 GB 249,756 137,752,065 Thai 66.1 GB 5,030,254 1,626,779,846 Tibetan 234.5 MB 18,683 2,286,269 Turkish 75.1 GB 10,826,031 6,421,221,358 Turkmen 4.4 MB 2,485 276,632 Ukrainian 48.8 GB 4,558,214 2,879,585,992 Emiliano-Romagnolo[eml] 901 Bytes 1 53 Upper Sorbian 132.8 kB 110 8,825 Urdu 3.4 GB 336,994 332,816,354 Uyghur 201.9 MB 18,556 11,240,889 Uzbek 19.9 MB 9,526 1,370,842 Vietnamese 98.9 GB 9,587,233 12,283,185,482 Volap\u00fck 825.9 kB 661 57,039 Walloon 105.7 kB 138 4,386 Waray 7.6 MB 933 830,872 Welsh 409.3 MB 90,378 49,488,495 Western Frisian 75.3 MB 21,946 6,357,929 Western Mari 743.5 kB 155 43,916 Western Panjabi 46.7 MB 6,790 4,060,419 Wu Chinese 137.2 kB 88 3,056 Yiddish 232.5 MB 23,418 15,809,780 Yoruba 24.7 kB 26 1,042 Multilingual 12.1 GB 1,210,685 936,187,711"},{"location":"versions/oscar-2301/","title":"OSCAR 23.01","text":"OSCAR 23.01 is the January 2023 version of the OSCAR Corpus based on the November/December 2022 dump of Common Crawl. While being quite similar to OSCAR 22.01, it contains several new features, including KenLM-based adult content detection, precomputed Locality-Sensitive Hashes for near deduplication, and blocklist-based categories. OSCAR 23.01 has also moved from gzip to Zstandard compression. You might already have zstd
installed on your system, but if not, please check the Zstandard website for installation instructions.
Tip
OSCAR 23.01 is similar to OSCAR 22.01. As such, please also check out the documentation for OSCAR 22.01 if you need detailed information about metadata.
"},{"location":"versions/oscar-2301/#access","title":"Access","text":"Note
If you already have access to the corpus, there's nothing to do! Go up in the file hierarchy on the link you've been given, and you should find the new corpus.
Access to the OSCAR Corpus changes depending on your status. More info on our dedicated page.
Getting access
"},{"location":"versions/oscar-2301/#new-features","title":"New Features","text":""},{"location":"versions/oscar-2301/#categories","title":"Categories","text":"OSCAR 22.01 leveraged the UT1 Blocklists project to attempt to classify some adult content present in OSCAR. The OSCAR 23.01 pipeline iterated on this to include all of the blocklists provided by UT1.
Warning
The UT1 Blocklists page lists all the categories along with a short description. We strongly encourage you to read the descriptions if you plan on using them. Please also note that these descriptions are in French. We're working on an English translation of them.
Note
A document can belong to multiple categories.
These categories are in a field that is at this path: metadata.categories
.
Example
{\n\"content\":\"foo\",\n\"metadata\": {\n// ...\n\"categories\": [\"blog\", \"news\"],\n// ...\n}\n// ...\n}\n
"},{"location":"versions/oscar-2301/#kenlm-based-adult-content-filtering","title":"KenLM-based Adult Content Filtering","text":"For a select number of subcorpora, a measure of perplexity has been added. This perplexity comes from a KenLM model trained on harmful content, previously gathered by using the adult
annotation in OSCAR 22.01. In other terms, the lower it is, the more likely a given document contains harmful/adult content.
Danger
This feature can be considered as unstable/unsafe, since we also want to evaluate its impact on particular issues.
As such, we do not provide a boolean value indicating if a given document can be harmful/adult content, but rather the raw perplexity. We have found a threshold that works well in English, but encourage you to experiment with it and to report back your findings.
"},{"location":"versions/oscar-2301/#locality-sensitive-hashing","title":"Locality Sensitive Hashing","text":"We use TLSH to compute a hash for each document.
Locality sensitive hashing is a hashing method that computes similar hashes for similar documents.
This can be used to do both exact- and near- deduplication. Same documents have same hashes (the reverse might not be true). So you only need to check for identity amongst documents with identical hashes. TLSH hashes can be compared to yield a distance metric. According to the original paper, a cutoff of < 40 yields a false positive rate of 0.07% and a detect rate of 49.6%, while a cutoff of < 100 yields a FP rate of 6.43% and detect rate of 94.5%. You should choose a value that meets your purposes.
The above is true for the default version of TLSH which is used in packages such as py-tlsh
. OSCAR 23.01 uses a TLSH with a hyperparameter of 256 buckets (Full hash), and 3 byte checksums (collision rate : 1 in 5800) instead of 1 byte checksums (collision rate : 1 in 24).
If you would like to use py-tlsh
, follow these instructions (You need CMake
installed to perform the necessary modifications and build):
# download py-tlsh source package\npip download python-tlsh\n# unpack the source tar.gz and enter the directory\ntar -xvf python-tlsh-4.5.0.tar.gz && cd python-tlsh-4.5.0\n# run the following command to implement the changes\n# alternatively, you can use vi or a text editor\n# change TLSH_BUCKETS_128 into TLSH_BUCKETS_256 and change TLSH_CHECKSUM_1B into TLSH_CHECKSUM_3B\nsed -i 's/set(TLSH_BUCKETS_128 1)/set(TLSH_BUCKETS_256 1)/g; s/set(TLSH_CHECKSUM_1B 1)/set(TLSH_CHECKSUM_3B 1)/g' CMakeLists.txt\n\n# build and activate pip venv if not already done\n# python3 -m venv ~/.venv\nsource ~/.venv/bin/activate\n# build and install the new py-tlsh\npython3 setup.py install\n
Hashes are at metadata.tlsh
.
metadata.annotations
has been renamed metadata.quality_warnings
, and only contains length based quality warnings (see the OSCAR 2201 documentation for details).als
has become gsw
. Previously, als
was erroneously used as the tag for Alemannic/Swiss German, whereas it is the tag for Tosk Albanian.eml
has become x-eml
. The eml
tag is deprecated and as such has been replaced by a private tag (x-eml
).{\n\"content\":\"English sentence\\nphrase en fran\u00e7ais\\n????????????\", // (1)\n\"warc_headers\":{ // (2)\n\"warc-identified-content-language\":\"fra,eng\",\n\"warc-target-uri\":\"https://fr.wikipedia.org/wiki/...\",\n\"warc-record-id\":\"<urn:uuid:29eaa920-d299-4b1d-b687-c72bd8d68116>\",\n\"warc-type\":\"conversion\",\n\"content-length\":\"35298\", // (3)\n\"warc-refers-to\":\"<urn:uuid:39e42055-0d94-4e45-9c6c-9e7056635d64>\",\n\"warc-block-digest\":\"sha1:WFH2A5WHCS2H365GIAFYQPI7UOAMFGHB\", // (3)\n\"warc-date\":\"2022-11-26T09:45:47Z\",\n\"content-type\":\"text/plain\"\n},\n\"metadata\":{\n\"identification\":{ // (4)\n\"label\":\"fr\",\n\"prob\":0.8938327\n},\n\"harmful_pp\":4063.1814, // (5)\n\"tlsh\":\"tlsh:T125315FF2B6088901EEA097015DB39B4600B...\", // (6)\n\"quality_warnings\":[ // (7)\n\"short_sentences\",\n\"header\",\n\"footer\"\n],\n\"categories\":[ // (8)\n\"examen_pix\",\n\"liste_bu\"\n],\n\"sentence_identifications\":[ // (9)\n{\n\"label\":\"fr\",\n\"prob\":0.99837273\n},\n{\n\"label\":\"en\",\n\"prob\":0.9992377\n},\nnull\n]\n}\n}\n
Some important notes:
warc_headers
are copied and content can be altered by Ungoliant at generation stage, content-length
and warc-block-digest
can be different from actual values.harmful_pp
to harmful_ppl
in future releases.annotations
pre-23.01) Potential quality warnings. Based on content/sentence length. See [OSCAR 22.01 paper for more info.null
value means no identification with a good enough threshold (>0.8 on 23.01).This is currently preferred to just getting it from cargo install ungoliant
.
git clone https://github.com/oscar-project/ungoliant
compil
node: srun --partition=compil -A <GROUP ID>@cpu --pty bash
module load llvm boost cargo
(boost
and llvm
are necessary for compiling KenLM and FastText)cd ungoliant
cargo b --release --features kenlm
We advise the use of the prepost
partition for downloading the data form Common Crawl. However, please bear in mind that jobs are limited to 20hours in the prepost
partition, meaning that you'll likely run out of time before completing the download of a whole Common Crawl dump.
wet.paths.gz
file for the latest release (likely heregzip -d wet.paths.gz
Create a dl_corpus.slurm
file with the following text inside:
#! /bin/bash
+
+#SBATCH --partition=prepost
+#SBATCH --job-name=get_cc # create a short name for your job
+#SBATCH --mail-type=BEGIN,END,FAIL # Mail events (NONE, BEGIN, END, FAIL, ALL)
+#SBATCH --mail-user=<YOUR MAIL> # Where to send mail
+#SBATCH --nodes="1" #Combien de nœuds
+#SBATCH --ntasks-per-node="1" # Une tâche par GPU
+#SBATCH --cpus-per-task="64" # nombre de coeurs à réserver par tâche
+#SBATCH --time="20:00:00" # temps d'exécution maximum demande (HH:MM:SS)
+#SBATCH -A <GROUP ID>@cpu
+
+export CARGO_HOME=<CARGO HOME PATH (in SCRATCH if you can>
+export PATHS_FILE=<PATH TO wet.PATHS>
+export DST=<DESTINATION>
+
+
+./target/release/ungoliant download $PATHS_FILE $DST
+
When the time has run out, you have to ensure that the last downloaded shards weren't corrupted (because of a potential kill while downloading).
+Then, after potentially removing faulty shards, run the following slurm job.
+The only difference with the previous one is the use of the -o n
parameter on ungoliant download
, which will ignore the first n
lines of the wet.paths
.
+You can/should also use another DESTINATION
folder, and then do the merge by hand.
#! /bin/bash
+
+#SBATCH --partition=prepost
+#SBATCH --job-name=get_cc # create a short name for your job
+#SBATCH --mail-type=BEGIN,END,FAIL # Mail events (NONE, BEGIN, END, FAIL, ALL)
+#SBATCH --mail-user=<YOUR MAIL> # Where to send mail
+#SBATCH --nodes="1" # Number of nodes
+#SBATCH --ntasks-per-node="1" # One task per GPU
+#SBATCH --cpus-per-task="64" # Number of cores to reserve per task
+#SBATCH --time="20:00:00" # Maximum requested runtime (HH:MM:SS)
+#SBATCH -A <GROUP ID>@cpu
+
+export CARGO_HOME=<CARGO HOME PATH (in SCRATCH if you can>
+export PATHS_FILE=<PATH TO wet.PATHS>
+export DST=<DESTINATION>
+
+
+./target/release/ungoliant download -o <NB_DOWNLOADED> $PATHS_FILE $DST
+
You can then check that no shards are missing:
+import os
+
+shards_dir = "./shards"
+paths_file = "wet.paths"
+cc_rooturl = "https://data.commoncrawl.org/"
+
+missing_shards = list()
+for i in range(88000):
+ if not os.path.isfile(f"{shards_dir}/{i}.txt.gz"):
+ missing_shards.append(i)
+print(f"missing {len(missing_shards)} shards")
+
+with open(paths_file) as f:
+ shard_paths = f.readlines()
+ for missing_shard_number in missing_shards:
+ print(
+ f"wget -nc {cc_rooturl}{shard_paths[missing_shard_number].strip()} -O {missing_shard_number}.txt.gz"
+ )
+
This will give you the wget
commands to get the missing shards, with a -nc
param to avoid overwriting already existing files.
When you have your shards ready, create a new SLURM file with:
+We use a QoS of t4 because we can only use one node and corpus generation time is likely >20h, so we need the 100-hour limit.
+Other strategies could be tested (for example, splitting the CC data into 4 buckets and launching 4
jobs.
+Then, merging back the datasets should be done.
+Note that in that case, rebuild files will be less efficient (since we'll have 4 of them)
#! /bin/bash
+
+#SBATCH --partition=cpu_p1
+#SBATCH --job-name=gen_oscar # create a short name for your job
+#SBATCH --mail-type=BEGIN,END,FAIL # Mail events (NONE, BEGIN, END, FAIL, ALL)
+#SBATCH --mail-user=<YOUR MAIL> # Where to send mail
+#SBATCH --nodes="1" # Number of nodes
+#SBATCH --ntasks-per-node="1" # One task per GPU
+#SBATCH --cpus-per-task="40" # Number of cores to reserve per task
+#SBATCH --time="100:00:00" # Maximum requested runtime (HH:MM:SS)
+#SBATCH --qos=qos_cpu-t4
+#SBATCH -A <GROUP ID>@cpu
+
+export CARGO_HOME=<CARGO HOME PATH>
+export CC_FOLDER=<SHARDS PATH>
+export KENLM_FOLDER=<PATH TO KENLMS MODELS IF APPLICABLE>
+export CORPUS=<DESTINATION FOLDER>
+export BLOCKLIST=<BLOCKLIST FOLDER (must contain subfolders with category names..)>
+export LID_PATH=<PATH TO FASTTEXT LangID>
+export UNGOLIANT_PATH=<PATH TO UNGOLIANT BINARY>
+
+$UNGOLIANT_PATH pipeline $CC_FOLDER $CORPUS --blocklist-path $BLOCKLIST --kenlms-path $KENLM_FOLDER --lid-path $LID_PATH
+
As of Jan. 2023, using ungoliant 1.3.0 ([c14acc8](https://github.com/oscar-project/ungoliant/tree/c14acc8c6a87913d138a022cf4819024d66b3e06))
, with an 88,000-shard dump of Common Crawl (November/December 2022, ~9.5TB compressed), this process took around 20 hours and yielded a corpus weighing around 12TB (uncompressed).
Files in $SCRATCH
are deleted after 30 days if no R/W is operated on them. You should move out files to $STORE
if you plan on keeping them.
+Unfortunately, due to the file size, you'll need to launch another job to do the copying of the files.
Warning
+rsync -n
enables a dry-run, enabling you to see which files would be moved, and where. Remove the -n
parameter when you want to perform the actual copy.
#! /bin/bash
+
+#SBATCH --partition=prepost
+#SBATCH --job-name=copy_oscar # create a short name for your job
+#SBATCH --mail-type=BEGIN,END,FAIL # Mail events (NONE, BEGIN, END, FAIL, ALL)
+#SBATCH --mail-user=julien.abadji@inria.fr # Where to send mail
+#SBATCH --nodes="1" # Number of nodes
+#SBATCH --ntasks-per-node="1" # One task per GPU
+#SBATCH --cpus-per-task="4" # Number of cores to reserve per task
+#SBATCH --time="20:00:00" # Maximum requested runtime (HH:MM:SS)
+#SBATCH -A <GROUP ID>@cpu
+
+export SRC=<CORPUS SOURCE>
+export DST=<CORPUS DESTINATION>
+
+rsync -anvP $SRC $DST
+
On the same example as before, copying took around 9 hours.
+We use oscar-tools
to split the corpus.
Note
+At the time of writing, oscar-tools
is not available via crates.io/cargo install
, so you have to compile it from source. Luckily, it's easy.
oscar-tools
git clone https://github.com/oscar-project/oscar-tools
compil
node: srun --partition=compil -A <GROUP ID>@cpu --pty bash
cd oscar-tools
CARGO_HOME=<Somewhere not in your ~, like $SCRATCH/.cargo> cargo b --features zstd --release
.target/release/oscar-tools
.#! /bin/bash
+
+#SBATCH --partition=prepost
+#SBATCH --job-name=split_oscar # create a short name for your job
+#SBATCH --mail-type=BEGIN,END,FAIL # Mail events (NONE, BEGIN, END, FAIL, ALL)
+#SBATCH --mail-user=<Your email address> # Where to send mail
+#SBATCH --nodes="1" # Number of nodes
+#SBATCH --ntasks-per-node="1" # One task per GPU
+#SBATCH --cpus-per-task="10" # Number of cores to reserve per task
+#SBATCH --time="20:00:00" # Maximum requested runtime (HH:MM:SS)
+#SBATCH -A <group id>@cpu
+
+export OSCAR_TOOLS_BIN=<path to oscar-tools binary>
+export CORPUS=<path to corpus>
+export DST=<where the split corpus will be put>
+
+$OSCAR_TOOLS_BIN v2 split $CORPUS $DST -s 2000
+
This step took around 3 hours (assuming both CORPUS
and DST
are on $SCRATCH
).
#! /bin/bash
+
+#SBATCH --partition=prepost
+#SBATCH --job-name=compress_oscar # create a short name for your job
+#SBATCH --mail-type=BEGIN,END,FAIL # Mail events (NONE, BEGIN, END, FAIL, ALL)
+#SBATCH --mail-user=<email address> # Where to send mail
+#SBATCH --nodes="1" # Number of nodes
+#SBATCH --ntasks-per-node="1" # One task per GPU
+#SBATCH --cpus-per-task="48" # Number of cores to reserve per task
+#SBATCH --time="20:00:00" # Maximum requested runtime (HH:MM:SS)
+#SBATCH -A <group id>@cpu
+
+export OSCAR_TOOLS_BIN=<link to oscar-tools binary>
+export CORPUS=<path to split corpus>
+export DST=<where the compressed corpus will be saved>
+
+$OSCAR_TOOLS_BIN v2 compress $CORPUS $DST
+
This step took around 2 hours, going from 12TB to 3.3TB.
+The last step is to create checksum
files for each language, so that people can check that their downloads have been successful.
+Also, it acts as a split list for download-oscar.
#! /bin/bash
+
+#SBATCH --partition=prepost
+#SBATCH --job-name=checksum_oscar # create a short name for your job
+#SBATCH --mail-type=BEGIN,END,FAIL # Mail events (NONE, BEGIN, END, FAIL, ALL)
+#SBATCH --mail-user=<email address> # Where to send mail
+#SBATCH --nodes="1" # Number of nodes
+#SBATCH --ntasks-per-node="1" # One task per GPU
+#SBATCH --cpus-per-task="48" # Number of cores to reserve per task
+#SBATCH --time="20:00:00" # Maximum requested runtime (HH:MM:SS)
+#SBATCH -A <group id>@cpu
+
+export OSCAR_TOOLS_BIN=<link to oscar-tools binary>
+export CORPUS=<path to split corpus>
+
+$OSCAR_TOOLS_BIN v2 checksum $CORPUS
+
The process took around 2 hours.
+ + + + + + +oscar-tools
is a toolkit that was created along with OSCAR-2201 to make operations on the corpus easy and fast.
At its core, oscar-tools
provides a set of operations targeted at a given OSCAR version. As such, you shoudn't expect to have all operations available on all OSCAR versions. For example, at the time of writing, deduplicate
is not available for OSCAR 22.01-like corpora.
The CLI of oscar-tools
is still a bit messy and can be confusing, because we are actively working on it and on implementing essential features.
releases
Note
+Binaries are not available yet.
+cargo
Note
+cargo install oscar-tools
is not available yet.
Note
+This could evolve rapidly.
+Right now the latest version sits on the dev-oscario
branch, where we're slowly replacing inline IO blocks by our Corpus IO library, oscar-io
.
$> git clone https://github.com/oscar-corpus/oscar-tools #Clone the repository
+$> cd oscar-tools
+$> git checkout dev-oscario #Change branch
+$> cargo b --release #Build the project.
+$> # Building might take some time because of
+$> # the parquet dependency that will soon be optional.
+$> touch target/release/oscar-tools #Binary is here and self-sufficient.
+
oscar-tools --help
might help you find the parameters/operations you're looking for.
Note
+In the tool, v1
corresponds to 2019-like corpora, whereas v2
corresponds to 22.01-like corpora.
Each operation has different parameters.
+At the time of writing, the only operation available is dedup
. It uses runiq
to deduplicate corpora.
oscar-tools-v1-dedup
+line deduplication
+
+USAGE:
+ oscar-tools v1 dedup [ARGS]
+
+ARGS:
+ <SOURCE> Corpus source file.
+ <DESTINATION> Corpus destination file. Should not exist.
+
+OPTIONS:
+ -h, --help Print help information
+
There are many more operations implemented for OSCAR 22.01-like corpora.
+extract-tags
extract-tags
extracts documents that meet certain annotation constraints.
oscar-tools-v2-extract-tags
Extracts an OSCAR v2 corpus restricting tags. Included tags must be present and excluded ones must be
absent. Use --clean to extract documents with no annotation only

USAGE:
    oscar-tools v2 extract-tags [OPTIONS] [--] [ARGS]

ARGS:
    <SOURCE>         Corpus source file/folder. If folder, splits corpus files in provided
                     folder
    <DESTINATION>    Corpus destination file/folder. If folder, splits corpus files in provided
                     folder

OPTIONS:
        --clean                only return documents with no tags. include and exclude will be
                               ignored
    -e, --exclude <tags>...    space separated tags to exclude.
    -h, --help                 Print help information
    -i, --include <tags>...    space separated tags to include.
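To make the include/exclude/--clean semantics concrete, here is an illustrative sketch of the filtering rule applied to each document's annotation list (the helper name is hypothetical; this is not the tool's actual code):

```python
def keep_document(annotations, include=(), exclude=(), clean=False):
    """Decide whether a document passes the tag constraints."""
    if clean:
        # --clean: keep only documents with no annotations at all
        return not annotations
    tags = set(annotations)
    # every included tag must be present, and no excluded tag may appear
    return set(include) <= tags and not (set(exclude) & tags)


# e.g. keep short-sentence documents that are not flagged as adult
print(keep_document(["tiny"], include=["tiny"], exclude=["adult"]))           # True
print(keep_document(["tiny", "adult"], include=["tiny"], exclude=["adult"]))  # False
```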
extract-text
extract-text
"converts" a 2201-like corpus into a 2019-like corpus, by removing all metadata and only storing sentences. Keep in mind that while the format will be similar to 2109-like corpora, the filtering is a bit different and lines from other languages won't be stripped.
Extract text from documents. The output will be an OSCAR v1 (2019)-compatible corpus.

USAGE:
    oscar-tools v2 extract-text [OPTIONS] <SOURCE> <DESTINATION>

ARGS:
    <SOURCE>         Corpus source file.
    <DESTINATION>    Corpus destination file (OSCAR v1 (2019)-like)

OPTIONS:
        --del_src    If set, deletes source files as they are being extracted.
    -h, --help       Print help information
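In terms of data flow, extract-text reads JSONL documents and keeps only their text. An illustrative sketch, assuming the OSCAR 22.01 schema where the text lives in a content field (this is a simplified model, not the tool's implementation):

```python
import json


def extract_text(jsonl_lines):
    """Yield only the text of each JSONL document, dropping metadata."""
    for line in jsonl_lines:
        document = json.loads(line)
        yield document["content"]


docs = [
    '{"content": "First document.", "warc_headers": {}, "metadata": {}}',
    '{"content": "Second document.", "warc_headers": {}, "metadata": {}}',
]
print("\n".join(extract_text(docs)))  # each document's text on its own line
```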
mOSCAR is, to the best of our knowledge, the first large-scale multilingual and multimodal document corpus crawled from the web. It covers 163 languages, 315M documents, 214B tokens and 1.2B images. We carefully conducted a set of filtering and evaluation steps to make sure mOSCAR is sufficiently safe, diverse and of good quality.
Access to mOSCAR is granted via the Hugging Face Hub.
All data is available at https://huggingface.co/datasets/oscar-corpus/mOSCAR.
To come...
+Lang. name | +Code | +Family | +Script | +# documents | +# images | +# tokens | +
---|---|---|---|---|---|---|
Acehnese | +ace_Latn | ++ | Latin | +7,803 | +32,461 | +2,889,134 | +
Mesopotamian Arabic | +acm_Arab | ++ | Arabic | +2,274 | +10,620 | +1,047,748 | +
Tunisian Arabic | +aeb_Arab | ++ | Arabic | +7,640 | +41,570 | +2,715,187 | +
Afrikaans | +afr_Latn | ++ | Latin | +54,895 | +247,774 | +39,956,585 | +
South Levantine Arabic | +ajp_Arab | ++ | Arabic | +12,098 | +87,837 | +5,167,813 | +
Tosk Albanian | +als_Latn | ++ | Latin | +861,678 | +2,569,164 | +452,737,251 | +
Amharic | +amh_Ethi | ++ | Ge'ez | +39,588 | +152,646 | +35,089,019 | +
North Levantine Arabic | +apc_Arab | ++ | Arabic | +19,904 | +128,966 | +9,560,701 | +
Modern Standard Arabic | +arb_Arab | ++ | Arabic | +3,936,851 | +15,126,931 | +3,401,919,964 | +
Najdi Arabic | +ars_Arab | ++ | Arabic | +60,229 | +296,741 | +43,610,873 | +
Moroccan Arabic | +ary_Arab | ++ | Arabic | +142,386 | +698,051 | +204,723,454 | +
Egyptian Arabic | +arz_Arab | ++ | Arabic | +835,529 | +4,054,632 | +653,626,387 | +
Assamese | +asm_Beng | ++ | Bengali | +3,948 | +9,210 | +640,390 | +
Asturian | +ast_Latn | ++ | Latin | +165,745 | +962,723 | +37,547,944 | +
Awadhi | +awa_Deva | ++ | Devanagari | +29,324 | +107,483 | +4,961,635 | +
Central Aymara | +ayr_Latn | ++ | Latin | +27,384 | +151,889 | +5,148,970 | +
South Azerbaijani | +azb_Arab | ++ | Arabic | +8,274 | +38,233 | +5,256,693 | +
North Azerbaijani | +azj_Latn | ++ | Latin | +516,021 | +1,808,060 | +257,825,849 | +
Bashkir | +bak_Cyrl | ++ | Cyrillic | +4,532 | +17,174 | +3,038,766 | +
Bambara | +bam_Latn | ++ | Latin | +7,674 | +39,190 | +1,243,332 | +
Balinese | +ban_Latn | ++ | Latin | +1,886 | +11,266 | +542,015 | +
Belarusian | +bel_Cyrl | ++ | Cyrillic | +63,309 | +287,539 | +72,976,520 | +
Bemba | +bem_Latn | ++ | Latin | +1,096 | +7,479 | +1,340,471 | +
Bengali | +ben_Beng | ++ | Bengali | +270,406 | +947,035 | +35,858,814 | +
Bhojpuri | +bho_Deva | ++ | Devanagari | +6,366 | +28,131 | +875,463 | +
Banjar | +bjn_Latn | ++ | Latin | +5,427 | +27,803 | +1,898,526 | +
Bosnian | +bos_Latn | ++ | Latin | +1,960,599 | +7,633,049 | +1,255,000,505 | +
Buginese | +bug_Latn | ++ | Latin | +3,312 | +18,648 | +588,678 | +
Bulgarian | +bul_Cyrl | ++ | Cyrillic | +2,591,998 | +11,670,028 | +1,760,971,620 | +
Catalan | +cat_Latn | ++ | Latin | +1,153,864 | +4,736,634 | +606,447,390 | +
Cebuano | +ceb_Latn | ++ | Latin | +16,990 | +91,234 | +10,748,818 | +
Czech | +ces_Latn | ++ | Latin | +3,918,837 | +13,291,309 | +2,823,172,996 | +
Central Kurdish | +ckb_Arab | ++ | Arabic | +36,725 | +136,566 | +22,322,689 | +
Crimean Tatar | +crh_Latn | ++ | Latin | +6,376 | +24,124 | +1,742,727 | +
Welsh | +cym_Latn | ++ | Latin | +40,408 | +165,897 | +27,748,345 | +
Danish | +dan_Latn | ++ | Latin | +2,076,298 | +9,559,600 | +1,238,277,499 | +
German | +deu_Latn | ++ | Latin | +20,662,696 | +87,976,200 | +8,544,986,218 | +
Southwestern Dinka | +dik_Latn | ++ | Latin | +1,712 | +6,635 | +1,319,943 | +
Greek | +ell_Grek | ++ | Greek | +4,916,081 | +15,209,058 | +2,923,201,041 | +
English | +eng_Latn | ++ | Latin | +52,215,013 | +207,904,315 | +33,570,108,782 | +
Esperanto | +epo_Latn | ++ | Latin | +25,157 | +124,996 | +28,586,195 | +
Estonian | +est_Latn | ++ | Latin | +1,040,368 | +5,217,366 | +619,215,048 | +
Basque | +eus_Latn | ++ | Latin | +849,043 | +3,445,539 | +277,145,498 | +
Faroese | +fao_Latn | ++ | Latin | +15,411 | +60,340 | +6,691,327 | +
Fijian | +fij_Latn | ++ | Latin | +1,528 | +8,776 | +487,388 | +
Finnish | +fin_Latn | ++ | Latin | +2,396,033 | +10,365,333 | +1,781,044,864 | +
French | +fra_Latn | ++ | Latin | +20,305,739 | +78,179,601 | +14,362,579,829 | +
Friulian | +fur_Latn | ++ | Latin | +37,290 | +256,456 | +5,949,600 | +
Nigerian Fulfulde | +fuv_Latn | ++ | Latin | +1,568 | +7,124 | +401,852 | +
West Central Oromo | +gaz_Latn | ++ | Latin | +4,058 | +11,763 | +1,786,093 | +
Scottish Gaelic | +gla_Latn | ++ | Latin | +29,710 | +153,249 | +14,605,090 | +
Irish | +gle_Latn | ++ | Latin | +68,858 | +315,132 | +47,438,400 | +
Galician | +glg_Latn | ++ | Latin | +518,973 | +2,381,475 | +217,063,180 | +
Guarani | +grn_Latn | ++ | Latin | +490,945 | +2,416,633 | +89,921,114 | +
Gujarati | +guj_Gujr | ++ | Gujarati | +23,062 | +91,320 | +3,324,866 | +
Haitian Creole | +hat_Latn | ++ | Latin | +257,745 | +1,570,699 | +62,847,106 | +
Hausa | +hau_Latn | ++ | Latin | +25,364 | +104,934 | +13,089,932 | +
Hebrew | +heb_Hebr | ++ | Hebrew | +1,109,591 | +4,766,483 | +893,327,320 | +
Hindi | +hin_Deva | ++ | Devanagari | +579,430 | +1,830,667 | +122,558,353 | +
Chhattisgarhi | +hne_Deva | ++ | Devanagari | +1,581 | +7,263 | +273,174 | +
Croatian | +hrv_Latn | ++ | Latin | +1,719,617 | +8,425,510 | +1,010,674,096 | +
Hungarian | +hun_Latn | ++ | Latin | +3,534,506 | +15,390,083 | +2,831,715,050 | +
Armenian | +hye_Armn | ++ | Armenian | +339,962 | +1,141,885 | +205,635,952 | +
Igbo | +ibo_Latn | ++ | Latin | +11,529 | +68,049 | +8,701,070 | +
Ilocano | +ilo_Latn | ++ | Latin | +78,872 | +523,195 | +8,116,113 | +
Indonesian | +ind_Latn | ++ | Latin | +7,016,291 | +17,324,777 | +3,981,843,468 | +
Icelandic | +isl_Latn | ++ | Latin | +244,676 | +1,027,465 | +137,015,973 | +
Italian | +ita_Latn | ++ | Latin | +12,937,153 | +47,476,971 | +8,311,790,842 | +
Javanese | +jav_Latn | ++ | Latin | +24,785 | +135,583 | +16,908,805 | +
Japanese | +jpn_Jpan | ++ | Kanji | +14,415,292 | +23,893,768 | +8,923,348,944 | +
Kabyle | +kab_Latn | ++ | Latin | +18,508 | +106,730 | +4,079,553 | +
Kannada | +kan_Knda | +Brahmic | +Kannada | +12,978 | +42,621 | +1,442,776 | +
Kashmiri | +kas_Arab | ++ | Arabic | +3,109 | +11,408 | +5,731,910 | +
Georgian | +kat_Geor | +Caucasian | +Georgian | +354,436 | +1,304,281 | +275,223,026 | +
Kazakh | +kaz_Cyrl | ++ | Cyrillic | +252,242 | +732,648 | +140,049,214 | +
Halh Mongolian | +khk_Cyrl | ++ | Cyrillic | +124,412 | +508,217 | +84,535,241 | +
Khmer | +khm_Khmr | +Austroasiatic | ++ | 24,495 | +122,243 | +3,043,925 | +
Kinyarwanda | +kin_Latn | ++ | Latin | +30,401 | +172,201 | +12,049,616 | +
Kyrgyz | +kir_Cyrl | ++ | Cyrillic | +53,010 | +199,713 | +34,404,281 | +
Northern Kurdish | +kmr_Latn | ++ | Latin | +39,262 | +164,666 | +23,834,960 | +
Korean | +kor_Hang | ++ | Hangul | +2,614,089 | +13,563,283 | +2,006,080,705 | +
Lao | +lao_Laoo | ++ | + | 50,611 | +208,768 | +31,029,380 | +
Ligurian | +lij_Latn | ++ | Latin | +8,751 | +56,266 | +2,958,179 | +
Limburgish | +lim_Latn | ++ | Latin | +189,547 | +1,076,047 | +42,534,327 | +
Lingala | +lin_Latn | ++ | Latin | +24,614 | +152,132 | +4,053,459 | +
Lithuanian | +lit_Latn | ++ | Latin | +1,688,811 | +8,869,443 | +1,161,476,040 | +
Lombard | +lmo_Latn | ++ | Latin | +30,506 | +151,855 | +9,058,614 | +
Latgalian | +ltg_Latn | ++ | Latin | +11,948 | +61,624 | +4,148,492 | +
Luxembourgish | +ltz_Latn | ++ | Latin | +44,987 | +246,346 | +16,676,872 | +
Ganda | +lug_Latn | ++ | Latin | +1,878 | +7,215 | +789,917 | +
Mizo | +lus_Latn | ++ | Latin | +7,880 | +26,817 | +4,978,472 | +
Standard Latvian | +lvs_Latn | ++ | Latin | +896,243 | +4,141,648 | +587,653,855 | +
Magahi | +mag_Deva | ++ | Devanagari | +1,097 | +3,847 | +205,763 | +
Malayalam | +mal_Mlym | ++ | + | 14,140 | +52,679 | +1,689,010 | +
Marathi | +mar_Deva | ++ | Devanagari | +50,391 | +163,868 | +6,689,250 | +
Minangkabau | +min_Latn | ++ | Latin | +9,341 | +35,309 | +1,256,931 | +
Macedonian | +mkd_Cyrl | ++ | Cyrillic | +542,250 | +1,853,070 | +307,232,151 | +
Maltese | +mlt_Latn | ++ | Latin | +120,888 | +709,242 | +36,097,957 | +
Maori | +mri_Latn | ++ | Latin | +24,322 | +130,137 | +24,957,914 | +
Burmese | +mya_Mymr | ++ | + | 8,144 | +44,188 | +539,527 | +
Dutch | +nld_Latn | ++ | Latin | +17,096,727 | +65,606,013 | +9,670,041,731 | +
Norwegian Nynorsk | +nno_Latn | ++ | Latin | +199,355 | +1,012,313 | +67,799,774 | +
Norwegian Bokmal | +nob_Latn | ++ | Latin | +2,229,702 | +9,698,128 | +1,294,178,095 | +
Nepali | +npi_Deva | ++ | Devanagari | +31,239 | +127,193 | +3,138,539 | +
Nyanja | +nya_Latn | ++ | Latin | +12,047 | +67,192 | +8,596,769 | +
Occitan | +oci_Latn | ++ | Latin | +164,852 | +671,881 | +59,309,549 | +
Odia | +ory_Orya | ++ | + | 4,319 | +15,574 | +378,635 | +
Pangasinan | +pag_Latn | ++ | Latin | +4,214 | +32,287 | +546,071 | +
Eastern Panjabi | +pan_Guru | ++ | + | 11,497 | +46,168 | +1,887,991 | +
Papiamento | +pap_Latn | ++ | Latin | +55,224 | +363,015 | +10,002,655 | +
Southern Pashto | +pbt_Arab | ++ | Arabic | +32,604 | +110,807 | +29,170,322 | +
Western Persian | +pes_Arab | ++ | Arabic | +7,048,946 | +25,200,571 | +6,210,479,015 | +
Plateau Malagasy | +plt_Latn | ++ | Latin | +32,521 | +120,673 | +29,263,848 | +
Polish | +pol_Latn | ++ | Latin | +14,549,605 | +60,639,244 | +11,104,144,109 | +
Portuguese | +por_Latn | ++ | Latin | +8,145,664 | +26,530,423 | +4,760,063,083 | +
Dari | +prs_Arab | ++ | Arabic | +515,041 | +2,589,859 | +517,053,967 | +
Ayacucho Quechua | +quy_Latn | ++ | Latin | +1,578 | +11,817 | +362,690 | +
Romanian | +ron_Latn | ++ | Latin | +5,180,171 | +17,964,048 | +3,548,291,261 | +
Rundi | +run_Latn | ++ | Latin | +20,001 | +67,096 | +8,686,054 | +
Russian | +rus_Cyrl | ++ | Cyrillic | +15,913,845 | +69,542,828 | +18,909,213,208 | +
Sango | +sag_Latn | ++ | Latin | +2,124 | +13,556 | +454,455 | +
Sicilian | +scn_Latn | ++ | Latin | +73,199 | +424,362 | +27,110,743 | +
Sinhala | +sin_Sinh | ++ | + | 58,767 | +221,183 | +14,270,972 | +
Slovak | +slk_Latn | ++ | Latin | +3,008,599 | +15,067,234 | +1,963,804,563 | +
Slovenian | +slv_Latn | ++ | Latin | +1,472,025 | +7,210,285 | +935,834,754 | +
Samoan | +smo_Latn | ++ | Latin | +12,346 | +71,359 | +14,954,824 | +
Shona | +sna_Latn | ++ | Latin | +12,698 | +68,782 | +6,112,600 | +
Sindhi | +snd_Arab | ++ | Arabic | +21,095 | +74,289 | +17,647,825 | +
Somali | +som_Latn | ++ | Latin | +77,343 | +301,429 | +34,554,975 | +
Southern Sotho | +sot_Latn | ++ | Latin | +7,718 | +43,146 | +6,156,450 | +
Spanish | +spa_Latn | ++ | Latin | +22,713,366 | +78,361,087 | +14,616,773,475 | +
Sardinian | +srd_Latn | ++ | Latin | +675,539 | +4,059,493 | +106,159,957 | +
Serbian | +srp_Cyrl | ++ | Cyrillic | +604,557 | +2,286,171 | +401,223,741 | +
Sundanese | +sun_Latn | ++ | Latin | +44,310 | +236,025 | +13,627,832 | +
Swedish | +swe_Latn | ++ | Latin | +3,302,730 | +10,860,518 | +1,779,284,152 | +
Swahili | +swh_Latn | ++ | Latin | +137,134 | +593,418 | +59,454,896 | +
Silesian | +szl_Latn | ++ | Latin | +23,535 | +132,459 | +5,996,972 | +
Tamil | +tam_Taml | +Dravidian | +Tamil | +36,196 | +167,669 | +4,834,946 | +
Tatar | +tat_Cyrl | ++ | Cyrillic | +37,188 | +143,842 | +22,831,350 | +
Telugu | +tel_Telu | +Brahmic | +Telugu | +22,974 | +81,033 | +2,273,772 | +
Tajik | +tgk_Cyrl | ++ | Cyrillic | +125,236 | +417,591 | +90,503,778 | +
Tagalog | +tgl_Latn | ++ | Latin | +151,437 | +673,814 | +97,708,639 | +
Thai | +tha_Thai | ++ | Thai | +2,983,837 | +11,621,786 | +2,839,211,104 | +
Tigrinya | +tir_Ethi | ++ | Ge'ez | +2,657 | +8,707 | +1,725,422 | +
Tok Pisin | +tpi_Latn | ++ | Latin | +5,063 | +35,169 | +460,853 | +
Turkmen | +tuk_Latn | ++ | Latin | +13,024 | +57,354 | +9,766,999 | +
Turkish | +tur_Latn | ++ | Latin | +4,478,700 | +12,401,091 | +2,394,669,068 | +
Twi | +twi_Latn | ++ | Latin | +3,305 | +13,634 | +495,220 | +
Uyghur | +uig_Arab | ++ | Arabic | +10,713 | +41,709 | +6,785,318 | +
Ukrainian | +ukr_Cyrl | ++ | Cyrillic | +2,721,424 | +10,929,796 | +1,928,351,595 | +
Urdu | +urd_Arab | ++ | Arabic | +407,098 | +1,239,125 | +242,007,283 | +
Northern Uzbek | +uzn_Latn | ++ | Latin | +156,632 | +798,155 | +89,022,562 | +
Venetian | +vec_Latn | ++ | Latin | +330,611 | +1,830,777 | +71,077,531 | +
Vietnamese | +vie_Latn | ++ | Latin | +12,621,521 | +47,411,488 | +11,616,191,199 | +
Wolof | +wol_Latn | ++ | Latin | +4,658 | +20,380 | +1,596,432 | +
Xhosa | +xho_Latn | ++ | Latin | +25,950 | +142,387 | +15,809,823 | +
Eastern Yiddish | +ydd_Hebr | ++ | + | 12,486 | +57,510 | +17,369,727 | +
Yoruba | +yor_Latn | ++ | Latin | +56,700 | +286,933 | +32,614,558 | +
Yue Chinese | +yue_Hant | ++ | + | 33,671 | +203,513 | +24,172,441 | +
Chinese (Simplified) | +zho_Hans | ++ | Hanzi | +9,861,262 | +36,152,754 | +8,078,842,701 | +
Chinese (Traditional) | +zho_Hant | ++ | Hant | +3,967,966 | +16,307,258 | +2,962,854,441 | +
Standard Malay | +zsm_Latn | ++ | Latin | +1,179,744 | +5,488,632 | +432,667,199 | +
Zulu | +zul_Latn | ++ | Latin | +30,717 | +156,639 | +11,345,288 | +
OSCAR 2019 is the original 2019 release of the OSCAR corpus. It has been generated from the Common Crawl corpus using the goclassy architecture.
OSCAR 2019 is shuffled at line level and no metadata is provided; it is thus mainly intended for training unsupervised language models for NLP.
Data is distributed by language in both original and deduplicated form.
If you need the unshuffled version of OSCAR, please contact us using the contact form. Please include your name, affiliation, contact details, which languages you need and a brief description of how you intend to use OSCAR. You can also download it using HuggingFace’s datasets library.
Even though OSCAR is not Postcardware, we do appreciate it when our users send us a postcard. If you want to send us one, you can find the address in the contact section down below.
If you use OSCAR to train a language model, text generation model or any other ML model in general, please consider citing our latest paper:
@inproceedings{ortiz-suarez-etal-2020-monolingual,
    title = "A Monolingual Approach to Contextualized Word Embeddings for Mid-Resource Languages",
    author = "Ortiz Su{\'a}rez, Pedro Javier  and
      Romary, Laurent  and
      Sagot, Beno{\^\i}t",
    booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
    month = jul,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.acl-main.156",
    pages = "1703--1714",
    abstract = "We use the multilingual OSCAR corpus, extracted from Common Crawl via language classification, filtering and cleaning, to train monolingual contextualized word embeddings (ELMo) for five mid-resource languages. We then compare the performance of OSCAR-based and Wikipedia-based ELMo embeddings for these languages on the part-of-speech tagging and parsing tasks. We show that, despite the noise in the Common-Crawl-based OSCAR data, embeddings trained on OSCAR perform much better than monolingual embeddings trained on Wikipedia. They actually equal or improve the current state of the art in tagging and parsing for all five languages. In particular, they also improve over multilingual Wikipedia-based contextual embeddings (multilingual BERT), which almost always constitutes the previous state of the art, thereby showing that the benefit of a larger, more diverse corpus surpasses the cross-lingual benefit of multilingual embedding architectures.",
}
If you need a copy of any of the unshuffled sub-corpora, please contact us using the contact form down below. Please include your name, affiliation, contact details, which languages you need and a brief description of how you intend to use OSCAR. We will evaluate your request and answer accordingly.
Note
The unshuffled OSCAR is now available in HuggingFace’s datasets library.
They have obtained our permission to redistribute the unshuffled OSCAR, and they allow users to download a corpus all at once as opposed to file by file. You can get more information about how to download OSCAR using their library by visiting OSCAR's dataset card.
All the data is distributed by language; both the original and the deduplicated versions are available. To download a file, just click the desired link in the table below. Languages are split into standalone shards of around 700MB each. A plain-text file with checksums is also provided.
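Since a checksum file ships alongside the shards, you can verify a download before using it. A minimal sketch, assuming SHA-256 checksums (the helper name is ours; check the provided checksum file for the actual algorithm and format):

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA-256 hex digest of a file, reading it in 1 MB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Compare the result against the corresponding line of the checksum file, e.g.:
# sha256_of("downloaded_shard.txt.gz") == "<expected hex digest>"
```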
The OSCAR corpus has not been filtered yet, so please be careful when using it, especially for text generation tasks! To see which sub-corpora have been audited, please refer to the list of publications above for more information.
You'll be asked to create a HumanID account in order to download a corpus. This is intentional: we do it to limit traffic and reduce abuse of the infrastructure. The OSCAR corpus is hosted by Huma-Num; you can read more about them on their website.
All sizes are for the uncompressed files.
+Language | +Words original | +Size original | +File original | +Words deduplicated | +Size deduplicated | +File deduplicated | +
---|---|---|---|---|---|---|
Afrikaans | +43,482,801 | +241M | +af | +29,533,437 | +163M | +af | +
Albanian | +374,196,110 | +2.3G | +sq | +186,856,699 | +1.2G | +sq | +
Alemannic | +841,750 | +5.0M | +als | +459,001 | +2.8M | +als | +
Amharic | +28,301,601 | +360M | +am | +16,086,628 | +206M | +am | +
Arabic | +8,117,162,828 | +82G | +ar | +3,171,221,354 | +32G | +ar | +
Aragonese | +52,896 | +1.3M | +an | +45,669 | +801K | +an | +
Armenian | +273,919,388 | +3.7G | +hy | +110,196,043 | +1.5G | +hy | +
Assamese | +6,956,663 | +113M | +as | +4,366,570 | +71M | +as | +
Asturian | +381,005 | +2.4M | +ast | +325,237 | +2.0M | +ast | +
Avaric | +24,720 | +409K | +av | +19,478 | +324K | +av | +
Azerbaijani | +322,641,710 | +2.8G | +az | +167,742,296 | +1.5G | +az | +
Bashkir | +9,796,764 | +128M | +ba | +6,922,589 | +90M | +ba | +
Basque | +120,456,652 | +848M | +eu | +45,359,710 | +342M | +eu | +
Bavarian | +399 | +503 | +bar | +399 | +503 | +bar | +
Belarusian | +144,579,630 | +1.8G | +be | +83,499,037 | +1.1G | +be | +
Bengali | +623,575,733 | +11G | +bn | +363,766,143 | +5.8G | +bn | +
Bihari | +8,848 | +110K | +bh | +2,875 | +34K | +bh | +
Bishnupriya | +198,286 | +4.1M | +bpy | +96,940 | +1.7M | +bpy | +
Bosnian | +106,448 | +447K | +bs | +20,485 | +116K | +bs | +
Breton | +5,013,241 | +29M | +br | +2,890,384 | +16M | +br | +
Bulgarian | +2,947,648,106 | +32G | +bg | +1,268,114,977 | +14G | +bg | +
Burmese | +56,111,184 | +1.9G | +my | +30,102,173 | +1.1G | +my | +
Catalan | +1,360,212,450 | +8.0G | +ca | +729,333,440 | +4.3G | +ca | +
Cebuano | +6,603,567 | +39M | +ceb | +3,675,024 | +24M | +ceb | +
Central Bikol | +312 | +885 | +bcl | +312 | +885 | +bcl | +
Central Khmer | +20,690,610 | +1.1G | +km | +10,082,245 | +581M | +km | +
Central Kurdish | +48,478,334 | +487M | +ckb | +18,726,721 | +226M | +ckb | +
Chavacano | +130 | +520 | +cbk | +130 | +520 | +cbk | +
Chechen | +711,051 | +8.3M | +ce | +568,146 | +6.7M | +ce | +
Chinese | +14,986,424,850 | +508G | +zh | +6,350,215,113 | +249G | +zh | +
Chuvash | +3,041,614 | +39M | +cv | +2,054,810 | +26M | +cv | +
Cornish | +8,329 | +44K | +kw | +2,704 | +14K | +kw | +
Croatian | +34,232,765 | +226M | +hr | +16,727,640 | +110M | +hr | +
Czech | +7,715,977,441 | +53G | +cs | +3,540,997,509 | +24G | +cs | +
Danish | +2,637,463,889 | +16G | +da | +1,620,091,317 | +9.5G | +da | +
Dhivehi | +7,559,472 | +126M | +dv | +4,726,660 | +79M | +dv | +
Dimli | +19 | +146 | +diq | +19 | +146 | +diq | +
Dutch | +13,020,136,373 | +78G | +nl | +6,598,786,137 | +39G | +nl | +
Eastern Mari | +565,992 | +7.2M | +mhr | +469,297 | +6.0M | +mhr | +
Egyptian Arabic | +7,305,151 | +66M | +arz | +3,659,419 | +33M | +arz | +
Emilian-Romagnol | +6,376 | +25K | +eml | +6,121 | +24K | +eml | +
English | +418,187,793,408 | +2.3T | +en | +215,841,256,971 | +1.2T | +en | +
Erzya | +90 | +1.4K | +myv | +78 | +1.2K | +myv | +
Esperanto | +48,486,161 | +299M | +eo | +37,324,446 | +228M | +eo | +
Estonian | +643,163,730 | +4.8G | +et | +309,931,463 | +2.3G | +et | +
Finnish | +3,196,666,419 | +27G | +fi | +1,597,855,468 | +13G | +fi | +
French | +46,896,036,417 | +282G | +fr | +23,206,776,649 | +138G | +fr | +
Galician | +102,011,291 | +620M | +gl | +63,600,602 | +384M | +gl | +
Georgian | +171,950,621 | +3.6G | +ka | +91,569,739 | +1.9G | +ka | +
German | +44,878,908,446 | +308G | +de | +21,529,164,172 | +145G | +de | +
Goan Konkani | +124,277 | +2.2M | +gom | +102,306 | +1.8M | +gom | +
Guarani | +7,382 | +36K | +gn | +4,680 | +24K | +gn | +
Gujarati | +72,045,701 | +1.1G | +gu | +50,023,432 | +722M | +gu | +
Haitian | +1,014 | +3.9K | +ht | +832 | +3.3K | +ht | +
Hebrew | +2,067,753,528 | +20G | +he | +1,032,018,056 | +9.8G | +he | +
Hindi | +1,372,234,782 | +17G | +hi | +745,774,934 | +8.9G | +hi | +
Hungarian | +5,163,936,345 | +40G | +hu | +2,339,127,555 | +18G | +hu | +
Icelandic | +219,900,094 | +1.5G | +is | +129,818,331 | +846M | +is | +
Ido | +25,702 | +147K | +io | +22,773 | +130K | +io | +
Iloko | +142,942 | +874K | +ilo | +105,564 | +636K | +ilo | +
Indonesian | +4,574,692,265 | +30G | +id | +2,394,957,629 | +16G | +id | +
Interlingua | +180,231 | +662K | +ia | +100,019 | +360K | +ia | +
Interlingue | +5,352 | +24K | +ie | +602 | +1.6K | +ie | +
Irish | +14,483,593 | +88M | +ga | +10,017,303 | +60M | +ga | +
Italian | +22,248,707,341 | +137G | +it | +11,250,012,896 | +69G | +it | +
Japanese | +4,962,979,182 | +216G | +ja | +1,123,067,063 | +106G | +ja | +
Javanese | +104,896 | +659K | +jv | +86,654 | +583K | +jv | +
Kalmyk | +10,277 | +113K | +xal | +10,155 | +112K | +xal | +
Kannada | +81,186,863 | +1.7G | +kn | +49,343,462 | +1.1G | +kn | +
Karachay-Balkar | +185,436 | +2.6M | +krc | +166,496 | +2.3M | +krc | +
Kazakh | +191,126,469 | +2.7G | +kk | +108,388,743 | +1.5G | +kk | +
Kirghiz | +44,194,823 | +600M | +ky | +28,982,620 | +388M | +ky | +
Komi | +201,404 | +2.3M | +kv | +95,243 | +1.2M | +kv | +
Korean | +2,368,765,142 | +24G | +ko | +1,120,375,149 | +12G | +ko | +
Kurdish | +15,561,003 | +94M | +ku | +9,946,440 | +60M | +ku | +
Lao | +4,133,311 | +174M | +lo | +2,583,342 | +114M | +lo | +
Latin | +4,122,201 | +26M | +la | +1,328,038 | +8.3M | +la | +
Latvian | +520,761,977 | +4.0G | +lv | +236,428,905 | +1.8G | +lv | +
Lezghian | +247,646 | +3.3M | +lez | +224,871 | +3.0M | +lez | +
Limburgan | +4,730 | +29K | +li | +4,283 | +27K | +li | +
Lithuanian | +1,159,661,742 | +8.8G | +lt | +516,183,525 | +3.9G | +lt | +
Lojban | +154,330 | +736K | +jbo | +141,973 | +678K | +jbo | +
Lombard | +75,229 | +443K | +lmo | +73,665 | +433K | +lmo | +
Low German | +2,906,347 | +18M | +nds | +2,146,417 | +13M | +nds | +
Lower Sorbian | +1,787 | +13K | +dsb | +966 | +7.1K | +dsb | +
Luxembourgish | +4,403,577 | +29M | +lb | +3,087,650 | +21M | +lb | +
Macedonian | +189,289,873 | +2.1G | +mk | +102,849,595 | +1.2G | +mk | +
Maithili | +69,161 | +317K | +mai | +874 | +11K | +mai | +
Malagasy | +3,068,360 | +21M | +mg | +1,872,044 | +13M | +mg | +
Malay | +16,696,882 | +111M | +ms | +6,045,753 | +42M | +ms | +
Malayalam | +189,534,472 | +4.9G | +ml | +95,892,551 | +2.5G | +ml | +
Maltese | +2,995,654 | +24M | +mt | +2,163,358 | +17M | +mt | +
Marathi | +162,609,404 | +2.7G | +mr | +82,130,803 | +1.4G | +mr | +
Mazanderani | +73,870 | +691K | +mzn | +64,481 | +602K | +mzn | +
Minangkabau | +5,682 | +608K | +min | +4,825 | +310K | +min | +
Mingrelian | +299,098 | +5.8M | +xmf | +228,629 | +4.4M | +xmf | +
Mirandese | +171 | +1.2K | +mwl | +152 | +1.1K | +mwl | +
Modern Greek | +5,479,180,137 | +62G | +el | +2,412,419,435 | +27G | +el | +
Mongolian | +181,307,167 | +2.2G | +mn | +68,362,013 | +838M | +mn | +
Nahuatl languages | +1,234 | +12K | +nah | +1,193 | +11K | +nah | +
Neapolitan | +5,282 | +17K | +nap | +4,147 | +13K | +nap | +
Nepali | +107,448,208 | +1.8G | +ne | +71,628,317 | +1.2G | +ne | +
Newari | +564,697 | +5.5M | +new | +288,995 | +4.1M | +new | +
Northern Frisian | +1,516 | +4.4K | +frr | +1,516 | +4.4K | +frr | +
Northern Luri | +8,022 | +76K | +lrc | +6,740 | +63K | +lrc | +
Norwegian | +1,344,326,388 | +8.0G | +no | +804,894,377 | +4.7G | +no | +
Norwegian Nynorsk | +14,764,980 | +85M | +nn | +9,435,139 | +54M | +nn | +
Occitan | +750,301 | +5.8M | +oc | +512,678 | +3.7M | +oc | +
Oriya | +14,938,567 | +248M | +or | +11,321,740 | +188M | +or | +
Ossetian | +1,031,268 | +13M | +os | +878,765 | +11M | +os | +
Pampanga | +130 | +760 | +pam | +52 | +304 | +pam | +
Panjabi | +61,847,806 | +763M | +pa | +37,555,835 | +460M | +pa | +
Persian | +9,096,554,121 | +79G | +fa | +4,363,505,319 | +38G | +fa | +
Piemontese | +362,013 | +2.1M | +pms | +337,246 | +1.9M | +pms | +
Polish | +15,277,255,137 | +109G | +pl | +6,708,709,674 | +47G | +pl | +
Portuguese | +20,641,903,898 | +124G | +pt | +10,751,156,918 | +64G | +pt | +
Pushto | +46,559,441 | +361M | +ps | +31,347,348 | +242M | +ps | +
Quechua | +10,186 | +78K | +qu | +8,691 | +67K | +qu | +
Romanian | +3,984,317,058 | +25G | +ro | +1,741,794,069 | +11G | +ro | +
Romansh | +1,093 | +7.4K | +rm | +960 | +6.5K | +rm | +
Russia Buriat | +963 | +13K | +bxr | +809 | +11K | +bxr | +
Russian | +92,522,407,837 | +1.2T | +ru | +46,692,691,520 | +568G | +ru | +
Sanskrit | +4,331,569 | +93M | +sa | +1,713,930 | +37M | +sa | +
Scottish Gaelic | +310,689 | +1.9M | +gd | +207,110 | +1.3M | +gd | +
Serbian | +364,395,411 | +3.9G | +sr | +207,561,168 | +2.2G | +sr | +
Serbo-Croatian | +5,292,184 | +25M | +sh | +1,040,573 | +5.8M | +sh | +
Sicilian | +554 | +3.3K | +scn | +468 | +2.8K | +scn | +
Sindhi | +43,530,158 | +347M | +sd | +33,028,015 | +263M | +sd | +
Sinhala | +93,053,465 | +1.4G | +si | +50,864,857 | +802M | +si | +
Slovak | +1,322,247,763 | +9.1G | +sk | +656,346,179 | +4.5G | +sk | +
Slovenian | +387,399,700 | +2.5G | +sl | +193,926,684 | +1.3G | +sl | +
Somali | +1,202 | +61K | +so | +472 | +16K | +so | +
South Azerbaijani | +2,175,054 | +27M | +azb | +1,528,709 | +19M | +azb | +
Spanish | +47,545,122,279 | +278G | +es | +25,928,290,729 | +149G | +es | +
Sundanese | +30,321 | +211K | +su | +20,278 | +141K | +su | +
Swahili | +2,211,927 | +13M | +sw | +1,376,963 | +8.1M | +sw | +
Swedish | +7,155,994,312 | +44G | +sv | +4,106,120,608 | +25G | +sv | +
Tagalog | +98,949,299 | +573M | +tl | +70,121,601 | +407M | +tl | +
Tajik | +31,758,142 | +379M | +tg | +21,029,893 | +249M | +tg | +
Tamil | +420,537,132 | +9.3G | +ta | +226,013,330 | +5.1G | +ta | +
Tatar | +51,034,893 | +670M | +tt | +23,825,695 | +305M | +tt | +
Telugu | +123,711,517 | +2.5G | +te | +79,094,167 | +1.6G | +te | +
Thai | +951,743,087 | +36G | +th | +368,965,202 | +16G | +th | +
Tibetan | +1,483,589 | +187M | +bo | +936,556 | +138M | +bo | +
Turkish | +7,577,388,700 | +60G | +tr | +3,365,734,289 | +27G | +tr | +
Turkmen | +1,113,869 | +11M | +tk | +752,326 | +6.8M | +tk | +
Tuvinian | +759 | +12K | +tyv | +540 | +7.9K | +tyv | +
Uighur | +8,657,141 | +122M | +ug | +5,852,225 | +83M | +ug | +
Ukrainian | +4,204,381,276 | +53G | +uk | +2,252,380,351 | +28G | +uk | +
Upper Sorbian | +545,351 | +4.2M | +hsb | +236,867 | +1.8M | +hsb | +
Urdu | +331,817,982 | +2.7G | +ur | +218,030,228 | +1.7G | +ur | +
Uzbek | +2,450,256 | +21M | +uz | +1,381,644 | +12M | +uz | +
Venetian | +3,492 | +18K | +vec | +3,199 | +17K | +vec | +
Vietnamese | +12,036,845,359 | +68G | +vi | +5,577,159,843 | +32G | +vi | +
Volapük | +321,121 | +2.0M | +vo | +318,568 | +2.0M | +vo | +
Walloon | +50,720 | +273K | +wa | +37,543 | +203K | +wa | +
Waray | +397,315 | +2.5M | +war | +336,311 | +2.2M | +war | +
Welsh | +37,422,441 | +213M | +cy | +23,574,673 | +133M | +cy | +
Western Frisian | +5,691,077 | +35M | +fy | +4,223,816 | +26M | +fy | +
Western Mari | +93,338 | +1.2M | +mrj | +87,780 | +1.1M | +mrj | +
Western Panjabi | +1,426,986 | +12M | +pnb | +1,111,112 | +9.0M | +pnb | +
Wu Chinese | +11,189 | +109K | +wuu | +4,333 | +32K | +wuu | +
Yakut | +2,547,623 | +42M | +sah | +1,789,174 | +26M | +sah | +
Yiddish | +13,834,320 | +141M | +yi | +8,212,970 | +84M | +yi | +
Yoruba | +8,906 | +55K | +yo | +3,518 | +27K | +yo | +
Yue Chinese | +186 | +3.7K | +yue | +128 | +2.2K | +yue | +
These data are released under this licensing scheme:
Notice: Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
Take down: We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Here is a list of some language models that have been trained using the OSCAR corpus or that are part of the OSCAR project:
Here is a list of language models trained by the community:
+Model | +Language | +Cased | +Corpus | +Authors | +Paper | +Website | +Files | +License | +
---|---|---|---|---|---|---|---|---|
AraBERT | +Arabic | +Cased | +OSCAR, Wikipedia, 1.5B words Arabic Corpus, OSIAN, Assafir | +Wissam Antoun, Fady Baly and Hazem Hajj | +ACL Anthology | +GitHub | +Hugging Face | +N/A | +
Arabic-BERT | +Arabic | +Cased | +OSCAR and Wikipedia | +Ali Safaya | +ArXiv | +GitHub | +Hugging Face | +MIT | +
AraELECTRA | +Arabic | +Cased | +OSCAR, Wikipedia, 1.5B words Arabic Corpus, OSIAN, Assafir | +Wissam Antoun, Fady Baly and Hazem Hajj | +ArXiv | +GitHub | +Hugging Face | +N/A | +
AraGPT2 | +Arabic | +Cased | +OSCAR, Wikipedia, 1.5B words Arabic Corpus, OSIAN, Assafir | +Wissam Antoun, Fady Baly and Hazem Hajj | +ArXiv | +GitHub | +Hugging Face | +N/A | +
CamemBERT | +French | +Cased | +OSCAR | +Louis Martin, Benjamin Muller, Pedro Javier Ortiz Suárez, Yoann Dupont, Laurent Romary, Éric Villemonte de la Clergerie, Djamé Seddah and Benoît Sagot | +ACL 2020 | +camembert-model.fr | +camembert-base.tar.gz | +MIT | +
CamemBERT | +French | +Cased | +Subsample of OSCAR (4 GB of text) | +Louis Martin, Benjamin Muller, Pedro Javier Ortiz Suárez, Yoann Dupont, Laurent Romary, Éric Villemonte de la Clergerie, Djamé Seddah and Benoît Sagot | +ACL 2020 | +camembert-model.fr | +camembert-base-oscar-4gb.tar.gz | +MIT | +
LePetit | +French | +Cased | +Subsample of OSCAR (2 GB of text) | +Vincent Micheli, Martin d'Hoffschmidt, Quentin Heinrich | +Medium blog | +illuin.tech | +Hugging Face | +MIT | +
GigaBERT | +Arabic | +Cased and Uncased | +OSCAR, Wikipedia, Gigaword | +Wuwei Lan, Yang Chen, Wei Xu, Alan Ritter | +EMNLP 2020 | +GitHub | +Hugging Face | +MIT | +
ELECTRA | +Norwegian | +Cased | +OSCAR and OPUS | +Viktor Alm | +N/A | +Hugging Face | +Hugging Face | +N/A | +
BERT | +Romanian | +Cased | +OSCAR, Wikipedia and OPUS | +Dumitrescu Stefan and Andrei Avram | +SOON | +GitHub | +Hugging Face | +MIT | +
BERT | +Romanian | +Uncased | +OSCAR, Wikipedia and OPUS | +Dumitrescu Stefan and Andrei Avram | +SOON | +GitHub | +Hugging Face | +MIT | +
RoBERTa | +Sinhala | +N/A | +OSCAR | +Keshan Sodimana | +N/A | +Hugging Face | +Hugging Face | +N/A | +
BERT | +Turkish | +Cased and Uncased | +OSCAR, Wikipedia and OPUS | +Stefan Schweter | +Zenodo | +GitHub | +Hugging Face | +MIT | +
ELECTRA | +Turkish | +Cased | +OSCAR, Wikipedia and OPUS | +Stefan Schweter | +Zenodo | +GitHub | +Hugging Face | +MIT | +
XLMIndic | +Hindi, Bengali, Gujarati, Panjabi, Marathi, Oriya, Assamese, Sinhala, Nepali, Bihari, Bishnupriya, Maithili, Goan Konkani, Sanskrit | +Cased | +OSCAR | +Ibraheem Muhammad Moosa, Mahmud Shimul and Ashfia Binte Habib | +ArXiv | +GitHub | +Hugging Face | +MIT | +
If you have trained a model using the OSCAR corpus and would like to have it featured here, please open a pull request in our GitHub repo. Help us grow the community!
These are the versions of the tooling, schemas, and data:
++ | Language | +OSCAR 2019 | +OSCAR 2019 deduplicated | +OSCAR 21.09 | +OSCAR 21.09 deduplicated | +Issues | +
---|---|---|---|---|---|---|
af | +Afrikaans | +251MB | +170MB | +258MB | +157MB | ++ |
sq | +Albanian | +2GB | +1GB | +3GB | +1GB | ++ |
am | +Amharic | +377MB | +215MB | +405MB | +241MB | ++ |
ar | +Arabic | +87GB | +33GB | +69GB | +35GB | ++ |
an | +Aragonese | +1MB | +822KB | +1MB | +608KB | ++ |
hy | +Armenian | +3GB | +1GB | +4GB | +1GB | ++ |
as | +Assamese | +117MB | +73MB | +135MB | +95MB | ++ |
ast | +Asturian | +2MB | +2MB | +7MB | +4MB | ++ |
av | +Avaric | +418KB | +331KB | +421KB | +325KB | ++ |
az | +Azerbaijani | +2GB | +1GB | +3GB | +1GB | ++ |
bn | +Bangla | +10GB | +6GB | +14GB | +7GB | ++ |
ba | +Bashkir | +133MB | +93MB | +110MB | +77MB | ++ |
eu | +Basque | +889MB | +358MB | +900MB | +503MB | ++ |
bar | +Bavarian | +507B | +507B | +2KB | +1KB | ++ |
be | +Belarusian | +1GB | +1GB | +2GB | +1GB | ++ |
bh | +Bihari languages | +112KB | +34KB | +579KB | +120KB | ++ |
bpy | +Bishnupriya | +4MB | +1MB | +11MB | +4MB | ++ |
bs | +Bosnian | +459KB | +120KB | +310KB | +175KB | ++ |
br | +Breton | +29MB | +16MB | +49MB | +23MB | ++ |
bg | +Bulgarian | +33GB | +14GB | +34GB | +15GB | ++ |
my | +Burmese | +2GB | +1GB | +2GB | +1GB | ++ |
yue | +Cantonese | +3KB | +2KB | +- | +- | ++ |
ca | +Catalan | +8GB | +4GB | +13GB | +6GB | ++ |
ceb | +Cebuano | +40MB | +24MB | +81MB | +58MB | ++ |
bcl | +Central Bikol | +886B | +886B | +- | +- | ++ |
ckb | +Central Kurdish | +509MB | +236MB | +784MB | +367MB | ++ |
cbk | +Chavacano | +521B | +521B | +168B | +168B | +{{< issue cbk >}} | +
ce | +Chechen | +8MB | +6MB | +29MB | +20MB | ++ |
zh | +Chinese | +544GB | +267GB | +500GB | +266GB | ++ |
cv | +Chuvash | +40MB | +27MB | +60MB | +41MB | ++ |
kw | +Cornish | +44KB | +14KB | +119KB | +72KB | ++ |
hr | +Croatian | +237MB | +115MB | +361MB | +169MB | ++ |
cs | +Czech | +56GB | +25GB | +72GB | +33GB | ++ |
da | +Danish | +16GB | +10GB | +18GB | +10GB | ++ |
diq | +Dimli (individual language) | +147B | +147B | +294B | +147B | ++ |
dv | +Divehi | +131MB | +81MB | +143MB | +111MB | ++ |
nl | +Dutch | +82GB | +41GB | +97GB | +47GB | ++ |
mhr | +Eastern Mari | +7MB | +6MB | +15MB | +10MB | ++ |
arz | +Egyptian Arabic | +68MB | +34MB | +48MB | +21MB | ++ |
en | +English | +2520GB | +1294GB | +2936GB | +1342GB | ++ |
myv | +Erzya | +1KB | +1KB | +29KB | +2KB | ++ |
eo | +Esperanto | +312MB | +238MB | +560MB | +390MB | ++ |
et | +Estonian | +5GB | +2GB | +7GB | +3GB | ++ |
tl | +Filipino | +601MB | +426MB | +699MB | +383MB | ++ |
fi | +Finnish | +28GB | +13GB | +35GB | +20GB | ++ |
fr | +French | +302GB | +147GB | +340GB | +161GB | ++ |
gl | +Galician | +650MB | +402MB | +989MB | +549MB | ++ |
ka | +Georgian | +3GB | +1GB | +6GB | +2GB | ++ |
de | +German | +330GB | +155GB | +433GB | +184GB | ++ |
gom | +Goan Konkani | +2MB | +1MB | +3MB | +2MB | ++ |
el | +Greek | +66GB | +28GB | +72GB | +30GB | ++ |
gn | +Guarani | +36KB | +23KB | +32KB | +25KB | ++ |
gu | +Gujarati | +1GB | +756MB | +1GB | +950MB | ++ |
ht | +Haitian Creole | +3KB | +3KB | +2KB | +1KB | ++ |
he | +Hebrew | +21GB | +10GB | +29GB | +11GB | ++ |
hi | +Hindi | +17GB | +9GB | +26GB | +13GB | ++ |
hu | +Hungarian | +42GB | +18GB | +60GB | +29GB | ++ |
is | +Icelandic | +1GB | +887MB | +2GB | +1GB | ++ |
io | +Ido | +151KB | +133KB | +276KB | +221KB | ++ |
ilo | +Iloko | +896KB | +653KB | +1MB | +857KB | ++ |
id | +Indonesian | +32GB | +16GB | +40GB | +22GB | ++ |
ia | +Interlingua | +678KB | +368KB | +291KB | +172KB | ++ |
ie | +Interlingue | +24KB | +1KB | +7KB | +2KB | ++ |
ga | +Irish | +91MB | +62MB | +131MB | +69MB | ++ |
it | +Italian | +146GB | +73GB | +192GB | +94GB | ++ |
ja | +Japanese | +231GB | +112GB | +208GB | +96GB | ++ |
jv | +Javanese | +675KB | +598KB | +858KB | +728KB | ++ |
xal | +Kalmyk | +115KB | +114KB | +62KB | +62KB | ++ |
kn | +Kannada | +1GB | +1GB | +2GB | +1GB | ++ |
krc | +Karachay-Balkar | +2MB | +2MB | +2MB | +2MB | ++ |
kk | +Kazakh | +2GB | +1GB | +3GB | +1GB | ++ |
km | +Khmer | +1GB | +608MB | +1GB | +860MB | ++ |
kv | +Komi | +2MB | +1MB | +1MB | +588KB | ++ |
ko | +Korean | +25GB | +11GB | +35GB | +15GB | ++ |
ku | +Kurdish | +98MB | +62MB | +152MB | +108MB | ++ |
ky | +Kyrgyz | +629MB | +406MB | +485MB | +334MB | ++ |
lo | +Lao | +181MB | +118MB | +287MB | +163MB | ++ |
la | +Latin | +26MB | +8MB | +103MB | +9MB | ++ |
lv | +Latvian | +4GB | +1GB | +6GB | +2GB | ++ |
lez | +Lezghian | +3MB | +3MB | +2MB | +2MB | ++ |
li | +Limburgish | +29KB | +27KB | +76KB | +54KB | ++ |
lt | +Lithuanian | +9GB | +4GB | +12GB | +5GB | ++ |
jbo | +Lojban | +753KB | +694KB | +929KB | +731KB | ++ |
lmo | +Lombard | +454KB | +444KB | +1MB | +1MB | ++ |
nds | +Low German | +18MB | +13MB | +25MB | +17MB | ++ |
dsb | +Lower Sorbian | +13KB | +7KB | +31KB | +14KB | ++ |
lb | +Luxembourgish | +30MB | +21MB | +54MB | +37MB | ++ |
mk | +Macedonian | +2GB | +1GB | +3GB | +1GB | ++ |
mai | +Maithili | +324KB | +10KB | +685KB | +24KB | ++ |
mg | +Malagasy | +21MB | +13MB | +59MB | +38MB | ++ |
ms | +Malay | +116MB | +43MB | +146MB | +60MB | ++ |
ml | +Malayalam | +5GB | +2GB | +4GB | +2GB | ++ |
mt | +Maltese | +24MB | +17MB | +51MB | +26MB | ++ |
gv | +Manx | +- | +- | +1KB | +907B | ++ |
mr | +Marathi | +2GB | +1GB | +3GB | +1GB | ++ |
mzn | +Mazanderani | +708KB | +617KB | +1MB | +1MB | ++ |
min | +Minangkabau | +622KB | +317KB | +8MB | +1MB | ++ |
xmf | +Mingrelian | +6MB | +4MB | +16MB | +10MB | ++ |
mwl | +Mirandese | +1KB | +1KB | +3KB | +2KB | ++ |
mn | +Mongolian | +2GB | +879MB | +1GB | +912MB | ++ |
nah | +Nahuatl languages | +11KB | +10KB | +34KB | +21KB | ++ |
nap | +Neapolitan | +17KB | +13KB | +1KB | +1KB | +{{< issue nap >}} | +
ne | +Nepali | +1GB | +1GB | +3GB | +2GB | ++ |
new | +Newari | +5MB | +4MB | +6MB | +4MB | ++ |
frr | +Northern Frisian | +4KB | +4KB | +7KB | +5KB | +{{< issue frr >}} | +
lrc | +Northern Luri | +77KB | +64KB | +183B | +183B | ++ |
no | +Norwegian Bokmål | +8GB | +5GB | +9GB | +4GB | ++ |
nn | +Norwegian Nynorsk | +88MB | +56MB | +123MB | +66MB | ++ |
oc | +Occitan | +6MB | +3MB | +12MB | +5MB | ++ |
or | +Odia | +259MB | +196MB | +538MB | +357MB | ++ |
os | +Ossetic | +12MB | +10MB | +11MB | +6MB | ++ |
pam | +Pampanga | +763B | +307B | +3KB | +3KB | ++ |
ps | +Pashto | +378MB | +253MB | +404MB | +286MB | ++ |
fa | +Persian | +84GB | +39GB | +79GB | +35GB | ++ |
pms | +Piedmontese | +2MB | +1MB | +4MB | +3MB | ++ |
pl | +Polish | +116GB | +50GB | +122GB | +48GB | ++ |
pt | +Portuguese | +132GB | +67GB | +159GB | +71GB | ++ |
pa | +Punjabi | +799MB | +481MB | +769MB | +430MB | ++ |
qu | +Quechua | +80KB | +68KB | +322KB | +230KB | ++ |
ro | +Romanian | +26GB | +11GB | +37GB | +15GB | ++ |
rm | +Romansh | +7KB | +6KB | +3KB | +3KB | ++ |
bxr | +Russia Buriat | +12KB | +10KB | +22KB | +18KB | ++ |
ru | +Russian | +1239GB | +609GB | +1201GB | +542GB | ++ |
rue | +Rusyn | +- | +- | +247B | +247B | ++ |
sah | +Sakha | +43MB | +27MB | +57MB | +39MB | ++ |
sa | +Sanskrit | +96MB | +38MB | +72MB | +43MB | ++ |
sco | +Scots | +- | +- | +1KB | +1KB | +{{< issue sco >}} | +
gd | +Scottish Gaelic | +1MB | +1MB | +2MB | +1MB | ++ |
sr | +Serbian | +4GB | +2GB | +6GB | +3GB | ++ |
sh | +Serbian (Latin) | +25MB | +6MB | +13MB | +9MB | ++ |
scn | +Sicilian | +3KB | +2KB | +4KB | +3KB | ++ |
sd | +Sindhi | +363MB | +274MB | +75MB | +50MB | ++ |
si | +Sinhala | +1GB | +840MB | +1GB | +791MB | ++ |
sk | +Slovak | +9GB | +4GB | +14GB | +6GB | ++ |
sl | +Slovenian | +2GB | +1GB | +4GB | +1GB | ++ |
so | +Somali | +62KB | +15KB | +15KB | +13KB | +{{< issue so >}} | +
azb | +South Azerbaijani | +28MB | +19MB | +47MB | +29MB | ++ |
es | +Spanish | +297GB | +159GB | +342GB | +160GB | ++ |
su | +Sundanese | +216KB | +145KB | +397KB | +274KB | ++ |
sw | +Swahili | +13MB | +8MB | +11MB | +7MB | ++ |
sv | +Swedish | +46GB | +26GB | +43GB | +19GB | ++ |
tg | +Tajik | +396MB | +260MB | +985MB | +321MB | +{{< issue tg >}} | +
ta | +Tamil | +9GB | +5GB | +10GB | +5GB | ++ |
tt | +Tatar | +701MB | +319MB | +947MB | +424MB | ++ |
te | +Telugu | +2GB | +1GB | +3GB | +1GB | ++ |
th | +Thai | +38GB | +17GB | +62GB | +26GB | ++ |
bo | +Tibetan | +195MB | +144MB | +439MB | +358MB | ++ |
gsw[^1] | +Alemannic German | +5MB | +2MB | +7MB | +5MB | ++ |
tr | +Turkish | +63GB | +28GB | +73GB | +33GB | +{{< issue tr >}} | +
tk | +Turkmen | +10MB | +7MB | +25MB | +20MB | ++ |
tyv | +Tuvinian | +11KB | +8KB | +9KB | +7KB | ++ |
uk | +Ukrainian | +56GB | +29GB | +53GB | +28GB | ++ |
eml | +Emiliano-Romagnolo[^2] | +25KB | +23KB | +22KB | +20KB | ++ |
hsb | +Upper Sorbian | +4MB | +1MB | +2MB | +1MB | ++ |
ur | +Urdu | +2GB | +1GB | +2GB | +1GB | ++ |
ug | +Uyghur | +127MB | +86MB | +187MB | +123MB | ++ |
uz | +Uzbek | +21MB | +11MB | +56MB | +28MB | ++ |
vec | +Venetian | +18KB | +16KB | +37KB | +28KB | ++ |
vi | +Vietnamese | +72GB | +33GB | +87GB | +42GB | ++ |
vo | +Volapük | +2MB | +2MB | +2MB | +2MB | ++ |
wa | +Walloon | +280KB | +207KB | +511KB | +329KB | ++ |
war | +Waray | +2MB | +2MB | +4MB | +4MB | ++ |
cy | +Welsh | +223MB | +139MB | +307MB | +180MB | ++ |
vls | +West Flemish | +- | +- | +134B | +134B | +{{< issue vls >}} | +
fy | +Western Frisian | +35MB | +26MB | +82MB | +57MB | ++ |
mrj | +Western Mari | +1MB | +1MB | +645KB | +521KB | ++ |
pnb | +Western Panjabi | +11MB | +9MB | +68MB | +45MB | ++ |
wuu | +Wu Chinese | +111KB | +32KB | +145KB | +69KB | +{{< issue wuu >}} | +
yi | +Yiddish | +146MB | +87MB | +199MB | +93MB | ++ |
yo | +Yoruba | +56KB | +26KB | +229KB | +120KB | ++ |
The new OSCAR schema incorporates backward-compatible changes.
+The old OSCAR Schema v1.0 featured the following file hierarchy, in an uncompressed form:
+/
+├── af
+│ ├── af_sha256.txt
+│ └── af.txt.gz
+├── de
+│ ├── de_sha256.txt # Checksum file
+│ └── de.txt.gz # Textual content
+├── en
+│ ├── en_part_1.txt.gz # Multipart example
+│ ├── en_part_2.txt.gz
+│ └── en_sha256.txt
+├── yi
+│ ├── yi_sha256.txt
+│ └── yi.txt.gz
+└── zh
+ ├── zh_sha256.txt
+ └── zh.txt.gz
+
The new OSCAR Schema v1.1 features the following file hierarchy (some languages omitted):
+/
+├── af
+│ ├── af_meta.jsonl.gz
+│ ├── af_sha256.txt
+│ └── af.txt.gz
+├── de
+│ ├── de_meta.jsonl.gz # Metadata, in JSONLines format
+│ ├── de_sha256.txt # Checksum file
+│ └── de.txt.gz # Textual content
+├── en
+│ ├── en_meta_part_1.jsonl.gz # Multipart example
+│ ├── en_meta_part_2.jsonl.gz # Each part is independent,
+│ ├── en_part_1.txt.gz # Ex: en_part_2.txt.gz and en_meta_part_2.jsonl.gz
+│ ├── en_part_2.txt.gz
+│ └── en_sha256.txt
+├── yi
+│ ├── yi_meta.jsonl.gz
+│ ├── yi_sha256.txt
+│ └── yi.txt.gz
+└── zh
+ ├── zh_meta.jsonl.gz
+ ├── zh_sha256.txt
+ └── zh.txt.gz
+
.txt
filesLines are newline-separated, and documents are double-newline separated. +In other words, there is a blank line between consecutive documents.
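Under these conventions, a split can be parsed with nothing but the standard library. A minimal sketch (the filename is illustrative):

```python
import gzip

def read_documents(path):
    """Yield documents from a <lang>.txt.gz split, one string per document."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        doc_lines = []
        for line in f:
            line = line.rstrip("\n")
            if line:
                doc_lines.append(line)
            elif doc_lines:
                # blank line: the current document is complete
                yield "\n".join(doc_lines)
                doc_lines = []
        if doc_lines:  # the file may not end with a blank line
            yield "\n".join(doc_lines)
```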
+.jsonl
filesThese are the metadata, in JSONLines format.
+Each line conforms to the following JSON Schema:
+{
+ "$schema": "http://json-schema.org/draft-07/schema#",
+ "title": "Metadata",
+ "description": "Holds record headers.\n\nEach metadata is linked to a specific paragraph/text zone",
+ "type": "object",
+ "required": [
+ "headers",
+ "nb_sentences",
+ "offset"
+ ],
+ "properties": {
+ "headers": {
+ "type": "object",
+ "additionalProperties": {
+ "type": "string"
+ }
+ },
+ "nb_sentences": {
+ "type": "integer",
+ "format": "uint",
+ "minimum": 0.0
+ },
+ "offset": {
+ "type": "integer",
+ "format": "uint",
+ "minimum": 0.0
+ }
+ }
+}
+
Example: +
{
+ "headers":{ // these headers keys are *almost* always present.
+ "content-length":"11062", // the content length is not changed and reflects the
+ // length before filtering and eventual deduplication.
+ "warc-target-uri":"...",
+ "warc-type":"conversion",
+ "content-type":"text/plain",
+ "warc-date":"2021-02-24T17:55:29Z", // Following WARC specification, it is the crawl date.
+ "warc-identified-content-language":"eng,zho",
+ "warc-refers-to":"<urn:uuid:c649de0e-42a3-4e69-b675-98e28e084698>",
+ "warc-block-digest":"sha1:V4PYYGYA6ZYA2WACDKSNL6NXGDN6XK6X",
+ "warc-record-id":"<urn:uuid:121a822f-5362-4559-8891-d085415cdd90>"
+ },
+ "offset":0, // Related text is in the text file, from lines offset+1 to lines offset+nb_sentences.
+ "nb_sentences":9
+}
+
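The offset/nb_sentences pairing above can be used to join metadata back to its text. A minimal sketch, assuming a local <lang>_meta.jsonl.gz / <lang>.txt.gz pair (the 1-indexed line range becomes the 0-indexed slice lines[offset:offset+nb_sentences]):

```python
import gzip
import json

def records_with_text(meta_path, txt_path):
    """Yield (headers, text) pairs for one language's v1.1 file pair."""
    with gzip.open(txt_path, "rt", encoding="utf-8") as f:
        lines = f.read().splitlines()
    with gzip.open(meta_path, "rt", encoding="utf-8") as f:
        for raw in f:
            meta = json.loads(raw)
            start = meta["offset"]  # 0-indexed first line of this record
            end = start + meta["nb_sentences"]
            yield meta["headers"], "\n".join(lines[start:end])
```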
<lang>_sha256.txt
filesThese are used to check for possible corruption during download.
+They can be used by running sha256sum -c <lang>_sha256.txt
.
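The same check can be done in Python with hashlib; a sketch assuming the usual sha256sum line format ("<digest>  <filename>"):

```python
import hashlib

def verify(checksum_file):
    """Check files against a <lang>_sha256.txt, like `sha256sum -c`."""
    ok = True
    with open(checksum_file, encoding="utf-8") as f:
        for line in f:
            expected, name = line.split()
            h = hashlib.sha256()
            with open(name, "rb") as data:
                for chunk in iter(lambda: data.read(1 << 20), b""):
                    h.update(chunk)
            if h.hexdigest() != expected:
                print(f"{name}: FAILED")
                ok = False
    return ok
```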
[^1]: gsw
is ISO 639-2 for Alemannic German. It was identified as als
in previous OSCAR versions, due to a bug in fastText.
+[^2]: eml
is a deprecated identification tag that corresponds to the rgn
and egl
tags in ISO 639-3.
OSCAR 22.01 is the January 2022 version of the OSCAR Corpus, based on the November/December 2021 dump of Common Crawl. +It features a different file layout that is not backward compatible with code designed for previous OSCAR versions.
+Request access +🤗 Datasets +Read the paper
+OSCAR 22.01 is document-oriented: rather than extracting lines and sorting them into language subcorpora, we identify documents as a whole. The main difference is that the sentences of a document are contiguous and should make sense one after another, but individual sentences are no longer guaranteed to be in the subcorpus' language.
+Note
+As an example, the English Wikipedia page about La Marseillaise contains sentences in French (The anthem's lyrics). In line-oriented corpora, these sentences would have been put in the French subcorpus. In OSCAR 22.01, they should be along with the article, in a document classified as English.
+As in previous corpora, there is one subcorpus per language, plus a new subcorpus for multilingual documents. +Subcorpora are distributed in JSONLines format, split into 1GB chunks and gzipped.
+Note
+Splits are completely independent and self-contained: It is possible to only download en_meta_134.jsonl.gz
and to do processing on it.
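A minimal sketch of processing one split on its own, using only the standard library (the filename mirrors the example above and is assumed to be local):

```python
import gzip
import json

def stream_documents(path):
    """Yield one decoded document (a dict) per line of a .jsonl.gz split."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)
```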
{
+ "content":"newline\nseparaaaaaaaaaaated\ncontent", // (1)
+ "warc_headers":{ // (2)
+ "warc-refers-to":"<urn:uuid:83f2e1d4-5ed3-41db-86ff-f7826c4c20f9>",
+ "warc-date":"2021-09-16T11:07:14Z",
+ "warc-block-digest":"sha1:X3OWP47FG2O5LBNMFSNB44FJF2SSRC26",
+ "warc-type":"conversion",
+ "warc-identified-content-language":"eng",
+ "content-length":"1694",
+ "warc-target-uri":"https://foo.bar",
+ "warc-record-id":"<urn:uuid:3304bc27-17d0-4ffd-a692-340381478a5f>",
+ "content-type":"text/plain"
+ },
+ "metadata":{
+ // (3)
+ "identification":{
+ "label":"en",
+ "prob":0.6268374
+ },
+
+ // (4)
+ "annotation":[
+ "short_sentences",
+ "footer"
+ ],
+
+ // (5)
+ "sentence_identifications":[
+ {
+ "label":"en",
+ "prob":0.93925816
+ },
+ null,
+ {
+ "label":"en",
+ "prob":0.9606543
+ }
+ ]
+ }
+}
+
\n
.prob
is the weighted average of the confidence of identified lines.null
if no annotation.null
for each line that has no identification.tiny
: The document has a low (<5) number of lines.short_sentences
: The document has a high number (>50%) of short lines (<400 bytes)header
: The document has a high number of short lines at its head, suggesting the presence of low quality content.footer
: The document has a high number of short lines at its tail, suggesting the presence of low quality content.noisy
: The document has a high percentage of punctuation (>50%)adult
: The document contains adult content. This annotation uses a blocklist and labels a tiny part of the corpus: It does not catch most of the adult content.More information about the thresholds and annotators are present in our paper.
+Tip
+Filtering can be done using oscar-tools
, a high performance toolkit that provides rapid and efficient ways of transforming corpora into what you need. More info here.
Filtering can be done using classic Python tools, such as ujson
.
+While we don't supply a Python library enabling easy filtering/transformation for OSCAR 22.01, we provide some filtering examples that you can change to better suit your needs.
Using filters on warc_headers.warc-target-uri
makes filtering on URLs easy.
Non-annotated documents are suspected to be cleaner than annotated ones, so extracting their content is a useful first step. We extract lines from documents where metadata.annotations == null
.
As detailed in our paper, we found that the German corpus has a (relative to the Alemannic corpus size) important amount of Alemannic. We use a filter on metadata.sentence_identifications
to extract those sentences.
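The three examples above can be sketched as plain functions over already-parsed documents. Field paths follow the layout example earlier on this page (note the example uses the singular key annotation), and the URL pattern is illustrative:

```python
def keep_wikipedia(doc):
    # filter on warc_headers.warc-target-uri (illustrative URL pattern)
    return "wikipedia.org" in doc["warc_headers"]["warc-target-uri"]

def is_clean(doc):
    # non-annotated document: the annotation field is null
    return doc["metadata"]["annotation"] is None

def alemannic_lines(doc):
    # pair each content line with its sentence identification
    lines = doc["content"].split("\n")
    idents = doc["metadata"]["sentence_identifications"]
    return [line for line, ident in zip(lines, idents)
            if ident is not None and ident["label"] == "gsw"]
```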
OSCAR 22.01 has subcorpora for 142 languages (counting the Multilingual corpus). +The following table exhibits the size, number of documents and number of words for each of them.
Note that the size is the raw uncompressed file size, including metadata.
+Language | +Size | +# Documents | +# Words | +
---|---|---|---|
Multilingual | +12.1 GB | +1,210,685 | +936,187,711 | +
Afrikaans | +47.0 MB | +12,393 | +6,227,310 | +
Albanian | +3.0 GB | +437,287 | +326,325,149 | +
Alemannic / Swiss German | +363.6 kB | +139 | +37,381 | +
Amharic | +461.0 MB | +37,513 | +30,481,153 | +
Arabic | +84.2 GB | +8,718,929 | +6,103,711,887 | +
Aragonese | +10.6 kB | +12 | +51 | +
Armenian | +4.7 GB | +379,267 | +268,031,270 | +
Assamese | +221.2 MB | +17,084 | +11,109,557 | +
Asturian | +73.6 kB | +77 | +3,919 | +
Avaric | +18.6 kB | +14 | +582 | +
Azerbaijani | +3.5 GB | +491,847 | +291,927,692 | +
Bangla | +15.1 GB | +1,171,501 | +751,877,226 | +
Bashkir | +95.5 MB | +11,198 | +5,418,474 | +
Basque | +1.1 GB | +233,658 | +97,092,942 | +
Belarusian | +1.8 GB | +180,046 | +107,227,860 | +
Bihari languages | +24.2 kB | +27 | +569 | +
Bishnupriya | +2.0 MB | +271 | +98,419 | +
Bosnian | +10.3 kB | +10 | +422 | +
Breton | +33.7 MB | +16,119 | +3,111,619 | +
Bulgarian | +35.1 GB | +2,887,115 | +2,405,981,285 | +
Burmese | +1.9 GB | +158,733 | +44,835,970 | +
Catalan | +13.9 GB | +2,627,307 | +1,508,919,864 | +
Cebuano | +44.6 MB | +5,742 | +5,253,785 | +
Central Kurdish | +716.4 MB | +84,950 | +43,913,025 | +
Chechen | +14.0 MB | +4,086 | +798,766 | +
Chinese | +900.9 GB | +56,524,518 | +23,149,203,886 | +
Chuvash | +41.8 MB | +4,750 | +2,465,782 | +
Cornish | +1.4 kB | +2 | +55 | +
Croatian | +11.2 MB | +11,462 | +505,369 | +
Czech | +58.6 GB | +10,381,916 | +5,452,724,456 | +
Danish | +12.6 GB | +2,265,479 | +1,454,439,292 | +
Dimli (individual language) | +706 Bytes | +1 | +19 | +
Divehi | +217.2 MB | +24,067 | +10,112,205 | +
Dutch | +114.0 GB | +20,206,532 | +12,329,127,151 | +
Eastern Mari | +11.3 MB | +1,612 | +641,525 | +
Egyptian Arabic | +2.8 MB | +1,256 | +176,096 | +
English | +3.2 TB | +431,992,659 | +377,376,402,775 | +
Esperanto | +558.3 MB | +111,932 | +58,416,628 | +
Estonian | +9.2 GB | +1,362,524 | +820,975,443 | +
Filipino | +646.5 MB | +70,394 | +81,881,278 | +
Finnish | +37.8 GB | +4,948,961 | +2,900,615,928 | +
French | +382.2 GB | +52,037,098 | +41,713,990,658 | +
Galician | +255.2 MB | +88,803 | +27,051,212 | +
Georgian | +7.1 GB | +488,588 | +281,430,479 | +
German | +496.7 GB | +70,075,424 | +46,826,676,844 | +
Goan Konkani | +787.2 kB | +46 | +38,831 | +
Greek | +78.3 GB | +6,738,546 | +5,031,242,803 | +
Guarani | +9.0 kB | +10 | +374 | +
Gujarati | +4.8 GB | +136,467 | +301,170,777 | +
Hebrew | +30.3 GB | +3,132,396 | +2,249,377,984 | +
Hindi | +23.3 GB | +1,529,907 | +1,534,799,198 | +
Hungarian | +53.9 GB | +6,866,062 | +4,598,787,907 | +
Icelandic | +2.0 GB | +396,183 | +210,365,124 | +
Ido | +77.3 kB | +105 | +2,690 | +
Iloko | +97.9 kB | +75 | +8,592 | +
Indonesian | +17.4 GB | +2,244,622 | +1,984,195,207 | +
Interlingua | +40.2 kB | +6 | +10,125 | +
Irish | +45.6 MB | +12,233 | +4,877,850 | +
Italian | +229.3 GB | +28,502,092 | +24,294,684,830 | +
Japanese | +258.7 GB | +36,328,931 | +5,592,948,356 | +
Javanese | +152.7 kB | +70 | +10,441 | +
Kalmyk | +9.3 kB | +9 | +250 | +
Kannada | +2.6 GB | +150,850 | +108,450,571 | +
Karachay-Balkar | +119.6 kB | +91 | +4,089 | +
Kazakh | +2.9 GB | +261,085 | +157,267,307 | +
Khmer | +1.9 GB | +121,910 | +30,564,131 | +
Komi | +119.9 kB | +127 | +3,335 | +
Korean | +51.8 GB | +5,881,481 | +3,854,968,649 | +
Kurdish | +150.3 MB | +29,906 | +17,390,759 | +
Kyrgyz | +518.6 MB | +62,244 | +28,028,986 | +
Lao | +337.1 MB | +28,914 | +6,682,982 | +
Latin | +4.1 MB | +4,397 | +187,446 | +
Latvian | +8.2 GB | +1,032,987 | +707,361,898 | +
Lezghian | +375.5 kB | +124 | +19,250 | +
Limburgish | +1.4 kB | +2 | +41 | +
Lithuanian | +20.0 GB | +2,303,070 | +1,712,802,056 | +
Lojban | +1.9 MB | +570 | +260,542 | +
Lombard | +2.6 kB | +2 | +225 | +
Low German | +9.0 MB | +1,938 | +1,012,561 | +
Lower Sorbian | +707 Bytes | +1 | +17 | +
Luxembourgish | +15.8 MB | +5,108 | +1,545,946 | +
Macedonian | +3.6 GB | +341,775 | +244,058,579 | +
Maithili | +21.6 kB | +23 | +483 | +
Malagasy | +57.3 MB | +3,028 | +7,279,056 | +
Malay | +5.3 MB | +5,228 | +217,818 | +
Malayalam | +4.1 GB | +250,972 | +137,831,247 | +
Maltese | +2.5 MB | +2,208 | +118,190 | +
Marathi | +3.3 GB | +250,376 | +160,179,233 | +
Mazanderani | +128.2 kB | +76 | +7,337 | +
Minangkabau | +6.0 MB | +585 | +614,613 | +
Mingrelian | +7.6 MB | +2,550 | +253,333 | +
Mongolian | +2.8 GB | +237,719 | +176,405,432 | +
Nahuatl languages | +8.7 kB | +12 | +179 | +
Nepali | +3.7 GB | +391,947 | +177,885,116 | +
Newari | +5.7 MB | +1,134 | +273,837 | +
Norwegian | +2.8 GB | +973,188 | +279,182,902 | +
Norwegian Nynorsk | +6.8 MB | +5,835 | +459,183 | +
Occitan | +2.1 MB | +373 | +31,061 | +
Odia | +487.9 MB | +52,942 | +23,755,902 | +
Ossetic | +13.9 MB | +3,560 | +800,430 | +
Pashto | +490.3 MB | +50,312 | +46,293,249 | +
Persian | +77.4 GB | +7,665,871 | +6,430,164,396 | +
Piedmontese | +1.7 MB | +698 | +188,270 | +
Polish | +139.0 GB | +19,301,137 | +12,584,498,906 | +
Portuguese | +170.3 GB | +23,735,707 | +18,441,864,893 | +
Punjabi | +1.1 GB | +68,094 | +70,068,604 | +
Quechua | +744 Bytes | +1 | +14 | +
Romanian | +49.2 GB | +4,624,764 | +5,261,803,995 | +
Russia Buriat | +32.9 kB | +39 | +785 | +
Russian | +1.1 TB | +76,060,844 | +62,811,122,663 | +
Sakha | +65.6 MB | +6,284 | +3,473,813 | +
Sanskrit | +136.0 MB | +4,472 | +5,671,369 | +
Scottish Gaelic | +137.7 kB | +136 | +7,769 | +
Serbian | +6.9 GB | +577,472 | +482,932,670 | +
Serbian (Latin) | +931.8 kB | +738 | +92,875 | +
Sicilian | +1.5 kB | +2 | +50 | +
Sindhi | +117.1 MB | +15,516 | +10,685,611 | +
Sinhala | +2.0 GB | +108,593 | +113,179,741 | +
Slovak | +16.5 GB | +2,409,555 | +1,619,121,944 | +
Slovenian | +1.2 GB | +351,894 | +118,400,246 | +
Somali | +2.1 kB | +3 | +109 | +
South Azerbaijani | +14.1 MB | +5,381 | +693,746 | +
Spanish | +381.9 GB | +51,386,247 | +42,829,835,316 | +
Sundanese | +5.0 MB | +263 | +547,145 | +
Swahili | +1.3 MB | +462 | +123,050 | +
Swedish | +48.0 GB | +7,541,278 | +5,078,331,128 | +
Tajik | +870.9 MB | +46,366 | +56,627,727 | +
Tamil | +11.4 GB | +556,772 | +452,343,748 | +
Tatar | +915.3 MB | +76,398 | +51,875,265 | +
Telugu | +3.4 GB | +249,756 | +137,752,065 | +
Thai | +66.1 GB | +5,030,254 | +1,626,779,846 | +
Tibetan | +234.5 MB | +18,683 | +2,286,269 | +
Turkish | +75.1 GB | +10,826,031 | +6,421,221,358 | +
Turkmen | +4.4 MB | +2,485 | +276,632 | +
Ukrainian | +48.8 GB | +4,558,214 | +2,879,585,992 | +
Emiliano-Romagnolo[eml] | +901 Bytes | +1 | +53 | +
Upper Sorbian | +132.8 kB | +110 | +8,825 | +
Urdu | +3.4 GB | +336,994 | +332,816,354 | +
Uyghur | +201.9 MB | +18,556 | +11,240,889 | +
Uzbek | +19.9 MB | +9,526 | +1,370,842 | +
Vietnamese | +98.9 GB | +9,587,233 | +12,283,185,482 | +
Volapük | +825.9 kB | +661 | +57,039 | +
Walloon | +105.7 kB | +138 | +4,386 | +
Waray | +7.6 MB | +933 | +830,872 | +
Welsh | +409.3 MB | +90,378 | +49,488,495 | +
Western Frisian | +75.3 MB | +21,946 | +6,357,929 | +
Western Mari | +743.5 kB | +155 | +43,916 | +
Western Panjabi | +46.7 MB | +6,790 | +4,060,419 | +
Wu Chinese | +137.2 kB | +88 | +3,056 | +
Yiddish | +232.5 MB | +23,418 | +15,809,780 | +
Yoruba | +24.7 kB | +26 | +1,042 | +
OSCAR 23.01 is the January 2023 version of the OSCAR Corpus based on the November/December 2022 dump of Common Crawl. While being quite similar to OSCAR 22.01, it contains several new features, including KenLM-based adult content detection, precomputed Locality-Sensitive Hashes for near deduplication, and blocklist-based categories. OSCAR 23.01 has also moved from gzip to Zstandard compression. You might already have zstd
installed on your system, but if not, please check the Zstandard website for installation instructions.
Tip
+OSCAR 23.01 is similar to OSCAR 22.01. As such, please also check out the documentation for OSCAR 22.01 if you need detailed information about metadata.
+Note
+If you already have access to the corpus, there's nothing to do! +Go up in the file hierarchy on the link you've been given, and you should find the new corpus.
+Access to the OSCAR Corpus changes depending on your status. More info on our dedicated page.
+ +OSCAR 22.01 leveraged the UT1 Blocklists project to attempt to classify some adult content present in OSCAR. +The OSCAR 23.01 pipeline iterated on this to include all of the blocklists provided by UT1.
+Warning
+The UT1 Blocklists page lists all the categories along with a short description. +We strongly encourage you to read the descriptions if you plan on using them. Please also note that these descriptions are in French. We're working on an English translation of them.
+Note
+A document can belong to multiple categories.
+These categories are stored in the field metadata.categories
.
Example
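As an illustration, a document filter over metadata.categories might look like this; the blocked category names below are illustrative picks from the UT1 lists, not a recommendation:

```python
BLOCKED = {"adult", "dating"}  # illustrative UT1 category names

def allowed(doc):
    """Keep a document only if none of its categories are blocked."""
    categories = doc["metadata"].get("categories") or []
    return not BLOCKED.intersection(categories)
```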
+ +For a select number of subcorpora, a measure of perplexity has been added. This perplexity comes from a KenLM model trained on harmful content, previously gathered by using the adult
annotation in OSCAR 22.01.
+In other words, the lower the perplexity, the more likely the document contains harmful/adult content.
Danger
+This feature should be considered unstable/experimental, since we are still evaluating its impact on particular issues.
+As such, we do not provide a boolean value indicating if a given document can be harmful/adult content, but rather the raw perplexity. +We have found a threshold that works well in English, but encourage you to experiment with it and to report back your findings.
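A sketch of such an experiment; the cutoff below is a placeholder to tune on your own data, not the threshold found by the OSCAR team:

```python
HARMFUL_PPL_CUTOFF = 1000.0  # placeholder value, tune and report back

def looks_harmful(doc):
    # lower perplexity under the harmful-content LM = more likely harmful
    ppl = doc["metadata"].get("harmful_pp")
    return ppl is not None and ppl < HARMFUL_PPL_CUTOFF
```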
+We use TLSH to compute a hash for each document.
+Locality-sensitive hashing is a hashing method that computes similar hashes for similar documents.
+This can be used for both exact and near deduplication. +Identical documents have identical hashes (the reverse might not be true), so you only need to check for equality among documents that share a hash. +TLSH hashes can also be compared to yield a distance metric. According to the original paper, a cutoff of < 40 yields a false positive rate of 0.07% and a detection rate of 49.6%, while a cutoff of < 100 yields a false positive rate of 6.43% and a detection rate of 94.5%. You should choose a value that suits your needs.
+The above is true for the default version of TLSH which is used in packages such as py-tlsh
. OSCAR 23.01 uses TLSH with 256 buckets (full hash) and 3-byte checksums (collision rate: 1 in 5800) instead of 1-byte checksums (collision rate: 1 in 24).
If you would like to use py-tlsh
, follow these instructions (You need CMake
installed to perform the necessary modifications and build):
+
# download py-tlsh source package
+pip download python-tlsh
+# unpack the source tar.gz and enter the directory
+tar -xvf python-tlsh-4.5.0.tar.gz && cd python-tlsh-4.5.0
+# run the following command to implement the changes
+# alternatively, you can use vi or a text editor
+# change TLSH_BUCKETS_128 into TLSH_BUCKETS_256 and change TLSH_CHECKSUM_1B into TLSH_CHECKSUM_3B
+sed -i 's/set(TLSH_BUCKETS_128 1)/set(TLSH_BUCKETS_256 1)/g; s/set(TLSH_CHECKSUM_1B 1)/set(TLSH_CHECKSUM_3B 1)/g' CMakeLists.txt
+
+# build and activate pip venv if not already done
+# python3 -m venv ~/.venv
+source ~/.venv/bin/activate
+# build and install the new py-tlsh
+python3 setup.py install
+
Hashes are at metadata.tlsh
.
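Exact deduplication with these hashes can be sketched as follows: group documents by hash, then compare contents only within a group. (Near-deduplication would additionally compare hashes with a TLSH distance, e.g. py-tlsh's tlsh.diff, against a cutoff such as < 40.)

```python
def dedup_exact(docs):
    """Keep one document per distinct content, comparing contents only
    among documents that share a TLSH hash."""
    by_hash = {}
    kept = []
    for doc in docs:
        bucket = by_hash.setdefault(doc["metadata"]["tlsh"], [])
        if all(doc["content"] != other["content"] for other in bucket):
            bucket.append(doc)
            kept.append(doc)
    return kept
```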
metadata.annotations
has been renamed metadata.quality_warnings
, and only contains length-based quality warnings (see the OSCAR 22.01 documentation for details).als
has become gsw
. Previously, als
was erroneously used as the tag for Alemannic/Swiss German, whereas it is the tag for Tosk Albanian.eml
has become x-eml
. The eml
tag is deprecated and as such has been replaced by a private tag (x-eml
).{
+ "content":"English sentence\nphrase en français\n????????????", // (1)
+ "warc_headers":{ // (2)
+ "warc-identified-content-language":"fra,eng",
+ "warc-target-uri":"https://fr.wikipedia.org/wiki/...",
+ "warc-record-id":"<urn:uuid:29eaa920-d299-4b1d-b687-c72bd8d68116>",
+ "warc-type":"conversion",
+ "content-length":"35298", // (3)
+ "warc-refers-to":"<urn:uuid:39e42055-0d94-4e45-9c6c-9e7056635d64>",
+ "warc-block-digest":"sha1:WFH2A5WHCS2H365GIAFYQPI7UOAMFGHB", // (3)
+ "warc-date":"2022-11-26T09:45:47Z",
+ "content-type":"text/plain"
+ },
+ "metadata":{
+ "identification":{ // (4)
+ "label":"fr",
+ "prob":0.8938327
+ },
+ "harmful_pp":4063.1814, // (5)
+ "tlsh":"tlsh:T125315FF2B6088901EEA097015DB39B4600B...", // (6)
+ "quality_warnings":[ // (7)
+ "short_sentences",
+ "header",
+ "footer"
+ ],
+ "categories":[ // (8)
+ "examen_pix",
+ "liste_bu"
+ ],
+ "sentence_identifications":[ // (9)
+ {
+ "label":"fr",
+ "prob":0.99837273
+ },
+ {
+ "label":"en",
+ "prob":0.9992377
+ },
+ null
+ ]
+ }
+}
+
Some important notes:
+warc_headers
are copied as-is while the content can be altered by Ungoliant at generation stage, so content-length
and warc-block-digest
can be different from actual values.harmful_pp
to harmful_ppl
in future releases.annotations
pre-23.01) Potential quality warnings, based on content/sentence length. See the OSCAR 22.01 paper for more info.null
value means no identification with a good enough threshold (>0.8 on 23.01).+ | Code | +Language | +# docs | +# words | +Content Length : | +
---|---|---|---|---|---|
| 0 | af | Afrikaans | 23,994 | 6,217,024 | 37.2 MB |
| 1 | sq | Albanian | 1,342,790 | 462,694,599 | 3.2 GB |
| 2 | am | Amharic | 119,434 | 40,262,809 | 512.9 MB |
| 3 | ar | Arabic | 25,012,116 | 10,081,452,882 | 110.7 GB |
| 4 | an | Aragonese | 34 | 264 | 11.0 kB |
| 5 | hy | Armenian | 1,056,974 | 336,045,041 | 4.9 GB |
| 6 | as | Assamese | 89,542 | 24,395,215 | 412.1 MB |
| 7 | ast | Asturian | 440 | 10,917 | 74.1 kB |
| 8 | av | Avaric | 44 | 1,073 | 18.6 kB |
| 9 | az | Azerbaijani | 1,159,994 | 316,850,330 | 3.0 GB |
| 10 | bn | Bangla | 3,474,086 | 1,092,983,765 | 19.1 GB |
| 11 | ba | Bashkir | 128,248 | 26,036,637 | 363.7 MB |
| 12 | eu | Basque | 678,474 | 136,672,615 | 1.2 GB |
| 13 | be | Belarusian | 445,612 | 164,729,607 | 2.3 GB |
| 14 | bh | Bihari languages | 48 | 507 | 6.8 kB |
| 15 | bpy | Bishnupriya | 2,346 | 346,947 | 5.4 MB |
| 16 | bs | Bosnian | 20 | 395 | 3.0 kB |
| 17 | br | Breton | 36,338 | 4,759,407 | 31.4 MB |
| 18 | bg | Bulgarian | 8,933,998 | 3,635,273,738 | 44.1 GB |
| 19 | my | Burmese | 430,276 | 82,433,836 | 3.0 GB |
| 20 | ca | Catalan | 6,953,898 | 2,240,460,836 | 15.3 GB |
| 21 | ceb | Cebuano | 16,174 | 6,263,404 | 41.1 MB |
| 22 | ckb | Central Kurdish | 182,508 | 61,334,746 | 772.9 MB |
| 23 | ce | Chechen | 11,686 | 1,051,752 | 13.9 MB |
| 24 | zh | Chinese | 138,478,270 | 44,378,380,161 | 1.4 TB |
| 25 | cv | Chuvash | 16,652 | 3,039,925 | 42.3 MB |
| 26 | kw | Cornish | 8 | 80 | 432 Bytes |
| 27 | hr | Croatian | 31,808 | 3,542,961 | 26.5 MB |
| 28 | cs | Czech | 34,859,632 | 9,717,378,559 | 77.0 GB |
| 29 | da | Danish | 7,214,338 | 2,217,634,340 | 14.8 GB |
| 30 | dv | Divehi | 77,060 | 10,655,359 | 200.1 MB |
| 31 | nl | Dutch | 72,552,688 | 19,564,553,306 | 135.0 GB |
| 32 | mhr | Eastern Mari | 9,502 | 1,615,215 | 22.9 MB |
| 33 | arz | Egyptian Arabic | 3,958 | 385,511 | 3.7 MB |
| 34 | en | English | 1,235,510,986 | 523,869,288,690 | 3.4 TB |
| 35 | eo | Esperanto | 226,924 | 67,774,923 | 474.8 MB |
| 36 | et | Estonian | 3,601,904 | 938,296,892 | 8.0 GB |
| 37 | tl | Filipino | 250,558 | 110,560,444 | 719.2 MB |
| 38 | fi | Finnish | 14,471,710 | 4,198,143,883 | 41.1 GB |
| 39 | fr | French | 158,334,998 | 62,127,088,294 | 430.5 GB |
| 40 | gl | Galician | 248,762 | 38,345,625 | 255.7 MB |
| 41 | ka | Georgian | 1,343,036 | 373,935,158 | 8.4 GB |
| 42 | de | German | 206,598,430 | 73,848,586,648 | 594.7 GB |
| 43 | gom | Goan Konkani | 398 | 121,035 | 2.3 MB |
| 44 | el | Greek | 20,282,864 | 7,691,622,692 | 95.7 GB |
| 45 | gn | Guarani | 14 | 260 | 2.2 kB |
| 46 | gu | Gujarati | 425,552 | 417,001,705 | 5.6 GB |
| 47 | ht | Haitian Creole | 2 | 20,671 | 93.1 kB |
| 48 | he | Hebrew | 3,997,888 | 1,697,158,891 | 18.0 GB |
| 49 | hi | Hindi | 5,514,454 | 2,475,605,444 | 32.6 GB |
| 50 | hu | Hungarian | 21,349,372 | 16,013,364,289 | 150.1 GB |
| 51 | is | Icelandic | 1,210,232 | 294,471,539 | 2.2 GB |
| 52 | io | Ido | 224 | 2,598 | 16.1 kB |
| 53 | ilo | Iloko | 144 | 4,411 | 28.0 kB |
| 54 | id | Indonesian | 7,109,778 | 3,228,020,221 | 23.4 GB |
| 55 | ia | Interlingua | 34 | 9,384 | 33.5 kB |
| 56 | ie | Interlingue | 2 | 0 | 881 Bytes |
| 57 | ga | Irish | 29,894 | 9,054,923 | 63.2 MB |
| 58 | it | Italian | 89,021,606 | 36,327,274,203 | 259.4 GB |
| 59 | ja | Japanese | 94,236,404 | 4,401,059,165 | 181.2 GB |
| 60 | jv | Javanese | 172 | 3,286 | 25.7 kB |
| 61 | xal | Kalmyk | 2 | 27 | 315 Bytes |
| 62 | kn | Kannada | 448,500 | 124,924,350 | 2.6 GB |
| 63 | krc | Karachay-Balkar | 496 | 8,385 | 122.4 kB |
| 64 | kk | Kazakh | 677,622 | 214,679,857 | 3.3 GB |
| 65 | km | Khmer | 450,660 | 59,880,231 | 3.2 GB |
| 66 | kv | Komi | 460 | 5,909 | 70.3 kB |
| 67 | ko | Korean | 15,147,698 | 3,435,866,935 | 38.1 GB |
| 68 | ku | Kurdish | 80,338 | 25,921,607 | 174.1 MB |
| 69 | ky | Kyrgyz | 144,288 | 32,062,783 | 489.3 MB |
| 70 | lo | Lao | 118,374 | 10,659,203 | 472.1 MB |
| 71 | la | Latin | 14,384 | 307,865 | 2.0 MB |
| 72 | lv | Latvian | 2,435,882 | 845,459,899 | 7.4 GB |
| 73 | lez | Lezghian | 676 | 60,634 | 856.6 kB |
| 74 | li | Limburgish | 6 | 169 | 1.4 kB |
| 75 | lt | Lithuanian | 5,182,028 | 1,674,362,574 | 14.5 GB |
| 76 | jbo | Lojban | 572 | 312,315 | 1.5 MB |
| 77 | lmo | Lombard | 112 | 3,269 | 21.0 kB |
| 78 | nds | Low German | 5,248 | 1,612,175 | 10.7 MB |
| 79 | dsb | Lower Sorbian | 8 | 84 | 664 Bytes |
| 80 | lb | Luxembourgish | 18,090 | 2,514,838 | 18.4 MB |
| 81 | mk | Macedonian | 1,063,298 | 389,344,425 | 4.7 GB |
| 82 | mai | Maithili | 46 | 467 | 6.8 kB |
| 83 | mg | Malagasy | 10,830 | 1,416,430 | 11.2 MB |
| 84 | ms | Malay | 11,500 | 238,477 | 2.6 MB |
| 85 | ml | Malayalam | 800,936 | 236,597,838 | 5.8 GB |
| 86 | mt | Maltese | 5,180 | 149,886 | 1.3 MB |
| 87 | mr | Marathi | 729,578 | 252,706,331 | 4.5 GB |
| 88 | mzn | Mazanderani | 384 | 16,115 | 169.2 kB |
| 89 | min | Minangkabau | 2,436 | 305,589 | 3.8 MB |
| 90 | xmf | Mingrelian | 7,318 | 283,316 | 6.1 MB |
| 91 | mwl | Mirandese | 4 | 54 | 423 Bytes |
| 92 | mn | Mongolian | 1,061,710 | 454,350,415 | 5.8 GB |
| 93 | multi | Multilingual | 2,948,202 | 1,251,676,406 | 11.9 GB |
| 94 | nah | Nahuatl languages | 38 | 279 | 2.4 kB |
| 95 | ne | Nepali | 1,152,156 | 278,901,036 | 4.9 GB |
| 96 | new | Newari | 1,996 | 229,703 | 4.0 MB |
| 97 | no | Norwegian | 2,797,378 | 373,160,033 | 2.6 GB |
| 98 | nn | Norwegian Nynorsk | 19,470 | 575,518 | 3.7 MB |
| 99 | oc | Occitan | 920 | 34,701 | 405.0 kB |
| 100 | or | Odia | 158,426 | 31,963,340 | 543.1 MB |
| 101 | os | Ossetic | 8,628 | 3,935,964 | 50.7 MB |
| 102 | ps | Pashto | 87,408 | 30,196,179 | 261.6 MB |
| 103 | fa | Persian | 23,813,882 | 9,609,206,698 | 93.2 GB |
| 104 | pms | Piedmontese | 2,524 | 510,087 | 3.1 MB |
| 105 | pl | Polish | 57,184,826 | 18,073,705,588 | 147.1 GB |
| 106 | pt | Portuguese | 36,062,800 | 15,172,557,311 | 105.0 GB |
| 107 | pa | Punjabi | 222,058 | 104,235,418 | 1.4 GB |
| 108 | qu | Quechua | 2 | 13 | 143 Bytes |
| 109 | ro | Romanian | 11,985,668 | 6,302,600,833 | 45.6 GB |
| 110 | bxr | Russia Buriat | 72 | 698 | 8.2 kB |
| 111 | ru | Russian | 194,143,422 | 78,032,029,344 | 1.1 TB |
| 112 | sah | Sakha | 17,566 | 4,288,051 | 68.8 MB |
| 113 | sa | Sanskrit | 16,802 | 2,479,345 | 56.3 MB |
| 114 | gd | Scottish Gaelic | 776 | 18,458 | 146.1 kB |
| 115 | sr | Serbian | 1,677,896 | 632,781,822 | 7.7 GB |
| 116 | sh | Serbian (Latin) | 3,214 | 166,517 | 816.4 kB |
| 117 | sd | Sindhi | 48,566 | 14,667,207 | 131.6 MB |
| 118 | si | Sinhala | 301,066 | 172,755,385 | 2.6 GB |
| 119 | sk | Slovak | 8,931,784 | 2,704,716,280 | 21.5 GB |
| 120 | sl | Slovenian | 1,112,560 | 192,816,743 | 1.4 GB |
| 121 | so | Somali | 6 | 51 | 503 Bytes |
| 122 | azb | South Azerbaijani | 26,364 | 2,029,729 | 28.4 MB |
| 123 | es | Spanish | 153,574,556 | 63,388,237,965 | 429.9 GB |
| 124 | su | Sundanese | 18 | 258 | 2.0 kB |
| 125 | sw | Swahili | 1,664 | 164,459 | 1.0 MB |
| 126 | sv | Swedish | 21,891,348 | 6,993,719,601 | 50.0 GB |
| 127 | gsw | Swiss German | 342 | 34,328 | 232.7 kB |
| 128 | tg | Tajik | 144,932 | 76,987,285 | 1.0 GB |
| 129 | ta | Tamil | 1,638,238 | 738,824,392 | 15.8 GB |
| 130 | tt | Tatar | 262,654 | 59,253,765 | 833.8 MB |
| 131 | te | Telugu | 644,712 | 201,575,815 | 3.9 GB |
| 132 | th | Thai | 14,845,900 | 2,224,483,018 | 92.0 GB |
| 133 | bo | Tibetan | 62,352 | 6,062,558 | 531.6 MB |
| 134 | tr | Turkish | 26,654,330 | 8,290,890,087 | 73.7 GB |
| 135 | tk | Turkmen | 4,576 | 325,786 | 3.3 MB |
| 136 | uk | Ukrainian | 10,059,992 | 3,183,842,018 | 44.7 GB |
| 137 | x-eml | Emiliano-Romagnol | 4 | 329 | 1.8 kB |
| 138 | hsb | Upper Sorbian | 402 | 15,827 | 123.2 kB |
| 139 | ur | Urdu | 887,004 | 434,023,273 | 3.8 GB |
| 140 | ug | Uyghur | 51,304 | 14,659,554 | 219.8 MB |
| 141 | uz | Uzbek | 15,806 | 1,665,960 | 15.3 MB |
| 142 | vi | Vietnamese | 33,933,994 | 22,424,984,210 | 140.8 GB |
| 143 | vo | Volapük | 896 | 49,968 | 371.9 kB |
| 144 | wa | Walloon | 390 | 6,347 | 34.3 kB |
| 145 | war | Waray | 1,494 | 19,665 | 126.8 kB |
| 146 | cy | Welsh | 151,512 | 52,250,043 | 333.0 MB |
| 147 | fy | Western Frisian | 45,458 | 9,885,788 | 70.4 MB |
| 148 | mrj | Western Mari | 496 | 60,180 | 765.8 kB |
| 149 | pnb | Western Panjabi | 12,904 | 11,844,695 | 105.8 MB |
| 150 | wuu | Wu Chinese | 136 | 1,199 | 26.8 kB |
| 151 | yi | Yiddish | 47,438 | 14,287,370 | 171.7 MB |
| 152 | yo | Yoruba | 128 | 2,396 | 16.6 kB |