Dataset Stats

Code to analyze a dataset, save its statistics, and plot the relevant histograms. The statistics are saved in a .tex file for easy import into your LaTeX project.

Statistics Computed:

Number of samples
Minimum number of words per sample
Average number of words per sample
Maximum number of words per sample
Minimum number of characters per word
Average number of characters per word
Maximum number of characters per word
Number of unique words (vocabulary size)
Percentage of characters that are punctuation
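
As a rough illustration of what these statistics mean, the sketch below computes them for a list of text samples. It is not the code in stats.py: the function name compute_stats, the dictionary keys, whitespace tokenization, and the use of string.punctuation (ASCII punctuation only) are all assumptions made for the example.

```python
import string


def compute_stats(samples):
    """Compute the listed statistics for an iterable of text samples.

    Assumptions (not taken from the repository): each sample is one string,
    words are split on whitespace, and punctuation means the ASCII
    characters in string.punctuation.
    """
    words_per_sample = []
    chars_per_word = []
    vocabulary = set()
    punctuation_chars = 0

    for sample in samples:
        words = sample.split()
        words_per_sample.append(len(words))
        for word in words:
            chars_per_word.append(len(word))
            vocabulary.add(word)
            punctuation_chars += sum(ch in string.punctuation for ch in word)

    if not chars_per_word:
        raise ValueError("dataset contains no words")

    total_chars = sum(chars_per_word)  # whitespace is not counted
    return {
        "num_samples": len(words_per_sample),
        "min_words": min(words_per_sample),
        "avg_words": sum(words_per_sample) / len(words_per_sample),
        "max_words": max(words_per_sample),
        "min_chars": min(chars_per_word),
        "avg_chars": total_chars / len(chars_per_word),
        "max_chars": max(chars_per_word),
        "vocab_size": len(vocabulary),
        "punct_pct": 100.0 * punctuation_chars / total_chars,
    }
```

Note that with whitespace tokenization a standalone punctuation mark counts as a word, so the vocabulary size and per-word character counts depend on how the dataset was tokenized beforehand.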

Other statistics, such as the total number of words and the total number of characters, are also computed but are not saved to disk because they are less informative. To save them, extend the Stats class.
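
To give an idea of what a saved .tex table could look like, here is a minimal sketch that renders (label, value) pairs as a LaTeX table. The helper names latex_escape and stats_to_latex, the table layout, and the two-decimal formatting are assumptions for illustration; the layout actually produced by the Stats class may differ.

```python
def latex_escape(text):
    """Escape a few characters that are special in LaTeX text mode."""
    for ch in "&%_#":
        text = text.replace(ch, "\\" + ch)
    return text


def stats_to_latex(rows, caption="Dataset statistics"):
    """Render (label, value) pairs as a LaTeX table (hypothetical layout)."""
    lines = [
        r"\begin{table}[h]",
        r"\centering",
        r"\begin{tabular}{lr}",
        r"\hline",
        r"Statistic & Value \\",
        r"\hline",
    ]
    for label, value in rows:
        # Floats are rounded to two decimals; everything else is printed as-is.
        text = f"{value:.2f}" if isinstance(value, float) else str(value)
        lines.append(f"{latex_escape(label)} & {text} \\\\")
    lines += [
        r"\hline",
        r"\end{tabular}",
        f"\\caption{{{latex_escape(caption)}}}",
        r"\end{table}",
    ]
    return "\n".join(lines) + "\n"
```

Writing the returned string to a file lets a LaTeX document pull the table in with \input{...}, and extra rows (for example the total number of words) can be added by appending more pairs to rows.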

Histograms saved:

Histogram of number of words per sample
Histogram of number of characters per word
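
The data behind these two histograms can be sketched with the standard library alone. The function below is an assumption-laden illustration (whitespace tokenization, one bin per integer length, hypothetical name length_histograms); it only counts and does not draw or save any figure.

```python
from collections import Counter


def length_histograms(samples):
    """Return two Counters: words per sample and characters per word.

    Each Counter maps a length to the number of times that length occurs,
    i.e. a histogram with one bin per integer value.
    """
    words_per_sample = Counter()
    chars_per_word = Counter()
    for sample in samples:
        words = sample.split()  # assumption: whitespace tokenization
        words_per_sample[len(words)] += 1
        for word in words:
            chars_per_word[len(word)] += 1
    return words_per_sample, chars_per_word
```

The keys and values of each Counter can then be handed to a plotting library as bar positions and bar heights.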

To analyze a dataset, run:

python analyze.py --dataset_path <path_to_dataset_file_or_folder> --output_folder <path_to_output_folder>

The --dataset_path argument takes either a file path or a folder path. If a file is passed, the statistics for that file are computed and saved as a .tex file, and the histograms are plotted and saved. Pass a folder only if it contains the two files of the same dataset, i.e. the source and target files of a sequence-to-sequence task. In that case, the statistics of both files are saved in the same table and the histograms are plotted in the same figure. All outputs are saved to the path given via --output_folder; if no path is given, they are saved in the current directory.
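
The behaviour described above can be pictured with the following sketch. It is not the contents of analyze.py: only the two flag names and the current-directory default come from this README, while build_parser, resolve_dataset_files, and the error handling are hypothetical.

```python
import argparse
import os


def build_parser():
    """Command-line parser with the two flags described in this README."""
    parser = argparse.ArgumentParser(description="Analyze a dataset.")
    parser.add_argument("--dataset_path", required=True,
                        help="dataset file, or folder holding source and target files")
    parser.add_argument("--output_folder", default=".",
                        help="where outputs are saved (default: current directory)")
    return parser


def resolve_dataset_files(dataset_path):
    """Return one file for a file path, or the two files inside a folder."""
    if os.path.isfile(dataset_path):
        return [dataset_path]
    if os.path.isdir(dataset_path):
        files = sorted(
            os.path.join(dataset_path, name)
            for name in os.listdir(dataset_path)
            if os.path.isfile(os.path.join(dataset_path, name))
        )
        # Assumption: anything other than exactly two files is an error.
        if len(files) != 2:
            raise ValueError(f"expected 2 files in {dataset_path}, found {len(files)}")
        return files
    raise FileNotFoundError(dataset_path)
```

A single returned file would lead to one table and one set of histograms; two returned files would be reported side by side in a shared table and shared figures.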

Input dataset files should carry the source/target information in their extensions. For example, for the WMT machine translation dataset, the files containing the German and English sentences should be named:

wmt.german
wmt.english
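
One way to read this naming convention in code is to split each file name into a dataset name and a source/target label. The helper below is a hypothetical sketch (the name split_dataset_name is not from the repository) built on os.path.splitext.

```python
import os


def split_dataset_name(path):
    """Split 'wmt.german' into ('wmt', 'german').

    The part before the last dot is treated as the dataset name and the
    extension as the source/target label (an assumed interpretation).
    """
    name, ext = os.path.splitext(os.path.basename(path))
    if not ext:
        raise ValueError(f"{path} has no extension to use as a label")
    return name, ext[1:]  # drop the leading dot
```

Under this reading, wmt.german and wmt.english share the dataset name wmt and are labelled german and english respectively.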
