| 1 | +# OSCAR Quickstart |
| 2 | + |
| 3 | +## What is OSCAR? |
| 4 | +OSCAR is a collection of web-based multilingual corpora totaling several terabytes, with subcorpora in more than 150 languages. |
| 5 | + |
| 6 | +Each OSCAR Corpus has a version name that tells you its approximate generation time, which usually coincides with the source crawl time. |
| 7 | +The latest OSCAR Corpus is **OSCAR 2301**. |
| 8 | +We advise you to always use the latest version, as we incrementally add new features that enable new ways of filtering the corpus for your applications. |
| 9 | + |
| 10 | +## Basic data layout |
| 11 | + |
| 12 | +Since **OSCAR 2109**, OSCAR has been **document-oriented**, which means that subcorpora are composed of documents rather than individual lines. |
| 13 | + |
| 14 | +**This has important implications for how you preprocess the data:** |
| 15 | + |
| 16 | +You can (and will) find sentences in languages other than the one you're interested in. For example, you should expect to encounter **English** sentences in documents from the **French** subcorpus. |
| 17 | + |
| 18 | +!!! example |
| 19 | + The Wikipedia article about the French anthem, [La Marseillaise](https://en.wikipedia.org/wiki/La_Marseillaise), contains its lyrics in French. |
| 20 | + As such, this article is expected to be present in the **English** subcorpus with those **French** lyrics. |
| 21 | + |
| 22 | + The good news is that you can easily remove those sentences if you are not interested in them, thanks to the metadata provided alongside the main content. |
| 23 | + |
| 24 | +OSCAR is distributed as [JSONLines](https://jsonlines.org/) files, usually compressed with [`gzip`](https://www.gnu.org/software/gzip/) or [`zstd`](https://facebook.github.io/zstd/), depending on the version. |
| 25 | + |
| 26 | +Each line of a file is a JSON Object representing a single document. |
| 27 | +Here is an example from OSCAR 2301: |
| 28 | + |
| 29 | +```js |
| 30 | +{ |
| 31 | + "content":"phrase en français\nEnglish sentence\n????????????", // (1) |
| 32 | + "warc_headers":{ // (2) |
| 33 | + "warc-identified-content-language":"fra,eng", |
| 34 | + "warc-target-uri":"https://fr.wikipedia.org/wiki/...", |
| 35 | + "warc-record-id":"<urn:uuid:29eaa920-d299-4b1d-b687-c72bd8d68116>", |
| 36 | + "warc-type":"conversion", |
| 37 | + "content-length":"35298", // (3) |
| 38 | + "warc-refers-to":"<urn:uuid:39e42055-0d94-4e45-9c6c-9e7056635d64>", |
| 39 | + "warc-block-digest":"sha1:WFH2A5WHCS2H365GIAFYQPI7UOAMFGHB", // (3) |
| 40 | + "warc-date":"2022-11-26T09:45:47Z", |
| 41 | + "content-type":"text/plain" |
| 42 | + }, |
| 43 | + "metadata":{ |
| 44 | + "identification":{ // (4) |
| 45 | + "label":"fr", |
| 46 | + "prob":0.8938327 |
| 47 | + }, |
| 48 | + "harmful_pp":4063.1814, // (5) |
| 49 | + "tlsh":"tlsh:T125315FF2B6088901EEA097015DB39B4600B...", // (6) |
| 50 | + "quality_warnings":[ // (7) |
| 51 | + "short_sentences", |
| 52 | + "header", |
| 53 | + "footer" |
| 54 | + ], |
| 55 | + "categories":[ // (8) |
| 56 | + "examen_pix", |
| 57 | + "liste_bu" |
| 58 | + ], |
| 59 | + "sentence_identifications":[ // (9) |
| 60 | + { |
| 61 | + "label":"fr", |
| 62 | + "prob":0.99837273 |
| 63 | + }, |
| 64 | + { |
| 65 | + "label":"en", |
| 66 | + "prob":0.9992377 |
| 67 | + }, |
| 68 | + null |
| 69 | + ] |
| 70 | + } |
| 71 | +} |
| 72 | +``` |
| 73 | + |
| 74 | +1. Newline-separated content. |
| 75 | +2. Headers from the crawled dumps, untouched. See the [WARC specification](https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/#named-fields) for more info. |
| 76 | +3. Since `warc_headers` are copied as-is and the content can be altered by [Ungoliant](TODO), `content-length` and `warc-block-digest` may differ from the actual values. |
| 77 | +4. Document-level identification. Computation details can be found [here](todo). |
| 78 | +5. TODO |
| 79 | +6. Locality-sensitive hash of the document's content, computed with [TLSH](https://tlsh.org/). Useful for both exact and near-deduplication. |
| 80 | +7. _(Corresponds to `annotations` pre-2301)_ Potential quality warnings. Based on content/sentence length. See [here]() for more info. |
| 81 | +8. Blocklist-based categories. Uses the [UT1 Blocklist](https://dsi.ut-capitole.fr/blacklists/index_en.php), plus custom additions (TODO). Please refer to the UT1 website for category descriptions. |
| 82 | +9. Sentence-level identifications. A `null` value means that no identification reached the confidence threshold (>0.8 in 2301). |
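
The layout described above can be consumed with the Python standard library alone. The following is a minimal sketch, not an official OSCAR tool: the names `read_documents` and `keep_language` are hypothetical, the file is assumed to be `gzip`-compressed (a `zstd` file would need a third-party decompressor), and the entries of `sentence_identifications` are assumed to line up one-to-one with the newline-separated lines of `content`.

```python
import gzip
import json
from typing import Iterator


def read_documents(path: str) -> Iterator[dict]:
    """Yield one document (a dict) per non-empty line of a gzip-compressed JSONLines file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield json.loads(line)


def keep_language(document: dict, language: str, min_prob: float = 0.8) -> str:
    """Return only the lines of `content` identified as `language`.

    Assumption: the i-th entry of `sentence_identifications` describes
    the i-th newline-separated line of `content`.
    """
    lines = document["content"].split("\n")
    identifications = document["metadata"]["sentence_identifications"]
    kept = [
        line
        for line, ident in zip(lines, identifications)
        # A JSON `null` is loaded as None: no confident identification.
        if ident is not None
        and ident["label"] == language
        and ident["prob"] >= min_prob
    ]
    return "\n".join(kept)
```

A possible use is to stream a subcorpus file and keep only the French lines of each document, for example `for doc in read_documents("fr_part_1.jsonl.gz"): text = keep_language(doc, "fr")`, where the file name is a placeholder. The same pattern extends to document-level filters, such as skipping documents whose `quality_warnings` or `categories` lists contain unwanted values.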