
Commit d52bf78

committed Feb 3, 2023
begin quickstart guide
1 parent fa51cec commit d52bf78

File tree

2 files changed: +91 −2 lines changed

 

docs/index.md

Lines changed: 9 additions & 2 deletions
@@ -1,6 +1,13 @@
-## OSCAR Corpus Documentation
+## OSCAR Project
+
+Welcome to the OSCAR Project documentation!
+
+The OSCAR project (Open Super-large Crawled Aggregated coRpus) is an Open Source project aiming to provide web-based multilingual resources and datasets for Machine Learning (ML) and Artificial Intelligence (AI) applications. The project focuses specifically on providing large quantities of unannotated raw data that is commonly used in the pre-training of large deep learning models. The OSCAR project has developed [high-performance data pipelines](https://github.com/oscar-project/ungoliant) specifically conceived to classify and filter large amounts of [web data](https://commoncrawl.org). The project has also paid special attention to improving the data quality of web-based corpora, as well as to providing data for low-resource languages, so that these new ML/AI technologies are accessible to as many communities as possible.
+
+This documentation aims to provide a global view of the project, from getting the data to contributing.
+
 
-Welcome to the OSCAR Corpus Documentation!
 This website aims to gather information about the corpus from a technical point of view:
 
 - Corpus versions and their respective file formats.

docs/quickstart.md

Lines changed: 82 additions & 0 deletions
@@ -0,0 +1,82 @@
# OSCAR Quickstart

## What is OSCAR?
OSCAR is a web-based multilingual corpus of several terabytes, containing subcorpora in more than 150 languages.

Each OSCAR Corpus has a version name that tells you its approximate generation time, which usually coincides with the source crawl time.
The latest OSCAR Corpus is **OSCAR 2301**.
We advise you to always use the latest version, as we incrementally include new features that enable new ways of filtering the corpus for your applications.

## Basic data layout

OSCAR is, since **OSCAR 2109**, **document-oriented**, which means that subcorpora are composed of documents rather than individual lines.

**This has important implications for how to preprocess the data:**

You can (and will) find sentences in languages other than the one you're interested in. For example, you should expect to encounter **English** sentences in documents from the **French** subcorpus.

!!! example
    The Wikipedia article about the French anthem, [La Marseillaise](https://en.wikipedia.org/wiki/La_Marseillaise), contains its lyrics in French.
    As such, this article is expected to be present in the **English** subcorpus with those **French** lyrics.

The good news is that you can easily remove those sentences if you are not interested in them, thanks to the metadata provided alongside the main content (a small filtering sketch follows the example below).

OSCAR is distributed as [JSONLines](https://jsonlines.org/) files, usually compressed with [`gzip`](https://www.gnu.org/software/gzip/) or [`zstd`](https://facebook.github.io/zstd/), depending on the version.

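A gzip-compressed part can be streamed line by line with nothing more than the Python standard library. This is a minimal sketch: the file name is hypothetical, and a zstd-compressed release would need a zstd reader (for example the `zstandard` package) instead of `gzip`.

```python
import gzip
import json

# Hypothetical local file name; OSCAR parts are JSONLines, one document per line.
path = "fr_meta_part_1.jsonl.gz"

with gzip.open(path, "rt", encoding="utf-8") as f:
    for line in f:
        doc = json.loads(line)
        # Each document carries its raw text plus WARC headers and metadata.
        print(doc["warc_headers"]["warc-target-uri"])
        print(doc["content"][:100])
        break  # just peek at the first document
```
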
Each line of a file is a JSON object representing a single document.
Here is an example from OSCAR 2301:

```js
{
    "content":"English sentence\nphrase en français\n????????????", // (1)
    "warc_headers":{ // (2)
        "warc-identified-content-language":"fra,eng",
        "warc-target-uri":"https://fr.wikipedia.org/wiki/...",
        "warc-record-id":"<urn:uuid:29eaa920-d299-4b1d-b687-c72bd8d68116>",
        "warc-type":"conversion",
        "content-length":"35298", // (3)
        "warc-refers-to":"<urn:uuid:39e42055-0d94-4e45-9c6c-9e7056635d64>",
        "warc-block-digest":"sha1:WFH2A5WHCS2H365GIAFYQPI7UOAMFGHB", // (3)
        "warc-date":"2022-11-26T09:45:47Z",
        "content-type":"text/plain"
    },
    "metadata":{
        "identification":{ // (4)
            "label":"fr",
            "prob":0.8938327
        },
        "harmful_pp":4063.1814, // (5)
        "tlsh":"tlsh:T125315FF2B6088901EEA097015DB39B4600B...", // (6)
        "quality_warnings":[ // (7)
            "short_sentences",
            "header",
            "footer"
        ],
        "categories":[ // (8)
            "examen_pix",
            "liste_bu"
        ],
        "sentence_identifications":[ // (9)
            {
                "label":"fr",
                "prob":0.99837273
            },
            {
                "label":"en",
                "prob":0.9992377
            },
            null
        ]
    }
}
```

1. Newline-separated content.
2. Headers from the crawled dumps, untouched. See the [WARC specification](https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/#named-fields) for more info.
3. Since `warc_headers` are copied and content can be altered by [Ungoliant](TODO), `content-length` and `warc-block-digest` can differ from the actual values.
4. Document-level identification. Computation details can be found [here](todo).
5. TODO
6. Locality Sensitive Hash of the document's content, using [TLSH](https://tlsh.org/). Useful for both exact and near deduplication.
7. _(Corresponds to `annotations` pre-2301)_ Potential quality warnings. Based on content/sentence length. See [here]() for more info.
8. Blocklist-based categories. Uses the [UT1 Blocklist](https://dsi.ut-capitole.fr/blacklists/index_en.php), plus custom additions (TODO). Please refer to the UT1 website for a description of the categories.
9. Sentence-level identifications. A `null` value means no identification with a high enough confidence (the threshold is >0.8 for 2301).
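
To make the metadata-based filtering mentioned earlier concrete, here is a minimal Python sketch that keeps only the sentences of a document identified as French and skips documents carrying a given quality warning. The field names follow the OSCAR 2301 example above; the probability threshold and the choice of warning to drop are arbitrary illustration choices, not project recommendations, and the inline document is a truncated stand-in rather than the exact record shown above.

```python
import json

def keep_sentences(doc, lang="fr", min_prob=0.8):
    """Keep only the lines of `content` identified as `lang`.

    Relies on `metadata.sentence_identifications` being aligned
    line-by-line with the newline-separated `content` field.
    """
    lines = doc["content"].split("\n")
    idents = doc["metadata"]["sentence_identifications"]
    kept = [
        line
        for line, ident in zip(lines, idents)
        if ident is not None and ident["label"] == lang and ident["prob"] >= min_prob
    ]
    return "\n".join(kept)

# Truncated stand-in for a document like the one shown above.
doc = json.loads("""
{
  "content": "English sentence\\nphrase en français",
  "metadata": {
    "quality_warnings": ["short_sentences", "header", "footer"],
    "sentence_identifications": [
      {"label": "en", "prob": 0.9992377},
      {"label": "fr", "prob": 0.99837273}
    ]
  }
}
""")

warnings = doc["metadata"].get("quality_warnings") or []
if "footer" in warnings:
    print("document flagged, skipping")  # arbitrary policy for this sketch
else:
    print(keep_sentences(doc))
```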
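
Annotation 6 notes that the TLSH digest can drive exact and near deduplication. As a rough sketch of comparing two documents' digests, assuming the `py-tlsh` bindings (whose exact API may differ between versions) and noting that OSCAR stores the digest with a leading `tlsh:` prefix:

```python
import tlsh  # from the py-tlsh package; an assumption, not an OSCAR dependency

def tlsh_distance(a: str, b: str) -> int:
    """Distance between two `metadata.tlsh` values (lower = more similar)."""
    # Strip the "tlsh:" prefix used in OSCAR before handing the hex digests to TLSH.
    return tlsh.diff(a.removeprefix("tlsh:"), b.removeprefix("tlsh:"))

# Hypothetical usage with two parsed documents:
# if tlsh_distance(doc1["metadata"]["tlsh"], doc2["metadata"]["tlsh"]) < 50:
#     ...  # treat as near-duplicates; the threshold is application-dependent
```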
