OSCAR Statistics

This is an experimental package for computing statistics on OSCAR corpus releases. For the moment, it computes statistics for a single snapshot only, which you must specify as an argument. It computes the following statistics per language:

Number of documents
Number of tokens
Number of bytes
Number of characters

The output is a parquet file.
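
To make the four per-language statistics concrete, here is a minimal sketch of how they could be accumulated. It is not the package's actual implementation: the names `LangStats`, `add_document`, and `compute_stats` are hypothetical, the input is assumed to be plain (language, text) pairs, and a token is assumed to be a whitespace-separated chunk, which may differ from how the package tokenizes.

```rust
use std::collections::BTreeMap;

/// Hypothetical per-language counters mirroring the four statistics listed above.
#[derive(Debug, Default, Clone, PartialEq, Eq)]
struct LangStats {
    documents: u64,
    tokens: u64,
    bytes: u64,
    characters: u64,
}

impl LangStats {
    /// Fold the text of one document into the counters.
    fn add_document(&mut self, text: &str) {
        self.documents += 1;
        // Assumption: a token is a whitespace-separated chunk of text.
        self.tokens += text.split_whitespace().count() as u64;
        // Length of the UTF-8 encoding, in bytes.
        self.bytes += text.len() as u64;
        // Number of Unicode scalar values, not grapheme clusters.
        self.characters += text.chars().count() as u64;
    }
}

/// Aggregate (language, text) pairs into per-language statistics.
fn compute_stats<'a, I>(docs: I) -> BTreeMap<String, LangStats>
where
    I: IntoIterator<Item = (&'a str, &'a str)>,
{
    let mut stats: BTreeMap<String, LangStats> = BTreeMap::new();
    for (lang, text) in docs {
        stats.entry(lang.to_string()).or_default().add_document(text);
    }
    stats
}

fn main() {
    // Made-up sample documents, for illustration only.
    let docs = [
        ("en", "hello world"),
        ("fr", "été chaud"),
        ("en", "one more document"),
    ];
    for (lang, s) in compute_stats(docs) {
        println!("{lang}: {s:?}");
    }
}
```

Bytes and characters are counted separately because they diverge for any non-ASCII text: in the French sample above, each `é` is one character but two bytes in UTF-8.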

Usage

➜ ./target/release/oscar-statistics -h
Compute statistics of an OSCAR release

Usage: oscar-statistics [OPTIONS] <INPUT FOLDER> <DESTINATION FILE> <SNAPSHOT>

Arguments:
  <INPUT FOLDER>      Folder containing the indices
  <DESTINATION FILE>  Parquet file to write
  <SNAPSHOT>          Name of the snapshot

Options:
  -t, --threads <NUMBER OF THREADS>  Number of threads to use [default: 10]
  -h, --help                         Print help
  -V, --version                      Print version
