Merge pull request #1 from andhus/dirhash_standard
Implementation based on the Dirhash Standard
andhus authored Apr 20, 2020
2 parents c3362a7 + aa4cd7f commit 51ec5af
Showing 10 changed files with 1,630 additions and 865 deletions.
32 changes: 32 additions & 0 deletions CHANGELOG.md
@@ -0,0 +1,32 @@
# Changelog

All notable changes to this project will be documented in this file.

The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [Unreleased]
NIL

## [0.2.0] - 2020-04-20
Complies with [Dirhash Standard](https://github.com/andhus/dirhash) Version [0.1.0](https://github.com/andhus/dirhash/releases/v0.1.0)

### Added
- A first implementation based on the formalized [Dirhash Standard](https://github.com/andhus/dirhash).
- This changelog.
- Results from a new benchmark run after the changes. The `benchmark/run.py` script now outputs result files whose names include the `dirhash.__version__`.

### Changed
- **Significant breaking changes** from version 0.1.1 - both regarding API and the
underlying method/protocol for computing the hash. This means that **hashes
computed with this version will differ from hashes computed with version < 0.2.0 for
the same directory**.
- This Python implementation of dirhash has moved to
[github.com/andhus/dirhash-python](https://github.com/andhus/dirhash-python) from
the previous repository
[github.com/andhus/dirhash](https://github.com/andhus/dirhash),
which now contains the formal description of the Dirhash Standard.

### Removed
- All support for the `.dirhashignore` file. This seemed superfluous; please file an
issue if you need this feature.
33 changes: 17 additions & 16 deletions README.md
@@ -1,22 +1,23 @@
[![Build Status](https://travis-ci.com/andhus/dirhash.svg?branch=master)](https://travis-ci.com/andhus/dirhash)
[![codecov](https://codecov.io/gh/andhus/dirhash/branch/master/graph/badge.svg)](https://codecov.io/gh/andhus/dirhash)
[![Build Status](https://travis-ci.com/andhus/dirhash-python.svg?branch=master)](https://travis-ci.com/andhus/dirhash-python)
[![codecov](https://codecov.io/gh/andhus/dirhash-python/branch/master/graph/badge.svg)](https://codecov.io/gh/andhus/dirhash-python)

# dirhash
A lightweight python module and tool for computing the hash of any
A lightweight python module and CLI for computing the hash of any
directory based on its files' structure and content.
- Supports any hashing algorithm of Python's built-in `hashlib` module
- `.gitignore` style "wildmatch" patterns for expressive filtering of files to
include/exclude.
- Supports all hashing algorithms of Python's built-in `hashlib` module.
- Glob/wildcard (".gitignore style") path matching for expressive filtering of files to include/exclude.
- Multiprocessing for up to [6x speed-up](#performance)

The hash is computed according to the [Dirhash Standard](https://github.com/andhus/dirhash), which is designed to allow for consistent and collision resistant generation/verification of directory hashes across implementations.

## Installation
From PyPI:
```commandline
pip install dirhash
```
Or directly from source:
```commandline
git clone git@github.com:andhus/dirhash.git
git clone git@github.com:andhus/dirhash-python.git
pip install dirhash/
```

@@ -25,16 +26,16 @@ Python module:
```python
from dirhash import dirhash

dirpath = 'path/to/directory'
dir_md5 = dirhash(dirpath, 'md5')
filtered_sha1 = dirhash(dirpath, 'sha1', ignore=['.*', '.*/', '*.pyc'])
pyfiles_sha3_512 = dirhash(dirpath, 'sha3_512', match=['*.py'])
dirpath = "path/to/directory"
dir_md5 = dirhash(dirpath, "md5")
pyfiles_md5 = dirhash(dirpath, "md5", match=["*.py"])
no_hidden_sha1 = dirhash(dirpath, "sha1", ignore=[".*", ".*/"])
```
CLI:
```commandline
dirhash path/to/directory -a md5
dirhash path/to/directory -a sha1 -i ".* .*/ *.pyc"
dirhash path/to/directory -a sha3_512 -m "*.py"
dirhash path/to/directory -a md5 --match "*.py"
dirhash path/to/directory -a sha1 --ignore ".*" ".*/"
```

## Why?
@@ -66,7 +67,7 @@ and executing `hashlib` code.
The main effort to boost performance is support for multiprocessing, where the
reading and hashing is parallelized over individual files.
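
To make the idea of parallelizing over individual files concrete, here is a minimal sketch using only the standard library. It is not the `dirhash` implementation and does not follow the Dirhash Standard: the helper names (`hash_file`, `list_files`, `naive_parallel_dirhash`) and the way per-file digests are combined are assumptions made purely for illustration.

```python
import hashlib
import os
from multiprocessing import Pool


def hash_file(path, algorithm="md5", chunk_size=2**20):
    """Return the hex digest of a single file, read in chunks."""
    hasher = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def list_files(root):
    """Return all file paths under root, sorted for a deterministic order."""
    paths = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            paths.append(os.path.join(dirpath, name))
    return sorted(paths)


def naive_parallel_dirhash(root, algorithm="md5", workers=4):
    """Hash each file (in parallel if workers > 1), then hash the
    sorted "relative path:digest" lines into one combined digest.

    Illustrative only: this is NOT the Dirhash Standard protocol.
    """
    paths = list_files(root)
    args = [(p, algorithm) for p in paths]
    if workers > 1:
        # Each worker process reads and hashes whole files independently.
        with Pool(workers) as pool:
            digests = pool.starmap(hash_file, args)
    else:
        digests = [hash_file(*a) for a in args]
    combined = hashlib.new(algorithm)
    for path, digest in zip(paths, digests):
        rel = os.path.relpath(path, root)
        combined.update(f"{rel}:{digest}\n".encode("utf-8"))
    return combined.hexdigest()
```

Calling `naive_parallel_dirhash("path/to/directory", workers=8)` from a script should be done under an `if __name__ == "__main__":` guard, since `multiprocessing` may re-import the main module in worker processes on some platforms.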

As a reference, let's compare the performance of the `dirhash` [CLI](https://github.com/andhus/dirhash/blob/master/dirhash/cli.py)
As a reference, let's compare the performance of the `dirhash` [CLI](https://github.com/andhus/dirhash-python/cli.py)
with the shell command:

`find path/to/folder -type f -print0 | sort -z | xargs -0 md5 | md5`
@@ -87,7 +88,7 @@ shell reference | nested_32k_32kB | 6.82 | -> 1.0
`dirhash` | nested_32k_32kB | 3.43 | 2.00
`dirhash`(8 workers)| nested_32k_32kB | 1.14 | **6.00**

The benchmark was run on a MacBook Pro (2018), further details and source code [here](https://github.com/andhus/dirhash/tree/master/benchmark).
The benchmark was run on a MacBook Pro (2018), further details and source code [here](https://github.com/andhus/dirhash-python/benchmark).

## Documentation
Please refer to `dirhash -h` and the python [source code](https://github.com/andhus/dirhash/blob/master/dirhash/__init__.py).
Please refer to `dirhash -h`, the python [source code](https://github.com/andhus/dirhash/dirhash-python/__init__.py) and the [Dirhash Standard](https://github.com/andhus/dirhash).
51 changes: 51 additions & 0 deletions benchmark/results_v0.2.0.csv
@@ -0,0 +1,51 @@
,test_case,implementation,algorithm,workers,t_best,t_median,speed-up (median)
0,flat_8_128MB,shell reference,md5,1,2.079,2.083,1.0
1,flat_8_128MB,dirhash_impl,md5,1,1.734,1.945,1.0709511568123393
2,flat_8_128MB,dirhash_impl,md5,2,0.999,1.183,1.760777683854607
3,flat_8_128MB,dirhash_impl,md5,4,0.711,0.728,2.8612637362637368
4,flat_8_128MB,dirhash_impl,md5,8,0.504,0.518,4.021235521235521
5,flat_1k_1MB,shell reference,md5,1,3.383,3.679,1.0
6,flat_1k_1MB,dirhash_impl,md5,1,1.846,1.921,1.9151483602290473
7,flat_1k_1MB,dirhash_impl,md5,2,1.137,1.158,3.1770293609671847
8,flat_1k_1MB,dirhash_impl,md5,4,0.74,0.749,4.911882510013351
9,flat_1k_1MB,dirhash_impl,md5,8,0.53,0.534,6.889513108614231
10,flat_32k_32kB,shell reference,md5,1,13.827,18.213,1.0
11,flat_32k_32kB,dirhash_impl,md5,1,13.655,13.808,1.3190179606025494
12,flat_32k_32kB,dirhash_impl,md5,2,3.276,3.33,5.469369369369369
13,flat_32k_32kB,dirhash_impl,md5,4,2.409,2.421,7.522924411400249
14,flat_32k_32kB,dirhash_impl,md5,8,2.045,2.086,8.731064237775648
15,nested_1k_1MB,shell reference,md5,1,3.284,3.332,1.0
16,nested_1k_1MB,dirhash_impl,md5,1,1.717,1.725,1.9315942028985504
17,nested_1k_1MB,dirhash_impl,md5,2,1.026,1.034,3.222437137330754
18,nested_1k_1MB,dirhash_impl,md5,4,0.622,0.633,5.263823064770932
19,nested_1k_1MB,dirhash_impl,md5,8,0.522,0.529,6.29867674858223
20,nested_32k_32kB,shell reference,md5,1,11.898,12.125,1.0
21,nested_32k_32kB,dirhash_impl,md5,1,13.858,14.146,0.8571327583769263
22,nested_32k_32kB,dirhash_impl,md5,2,2.781,2.987,4.059256779377302
23,nested_32k_32kB,dirhash_impl,md5,4,1.894,1.92,6.315104166666667
24,nested_32k_32kB,dirhash_impl,md5,8,1.55,1.568,7.732780612244897
25,flat_8_128MB,shell reference,sha1,1,2.042,2.05,1.0
26,flat_8_128MB,dirhash_impl,sha1,1,1.338,1.354,1.5140324963072376
27,flat_8_128MB,dirhash_impl,sha1,2,0.79,0.794,2.5818639798488663
28,flat_8_128MB,dirhash_impl,sha1,4,0.583,0.593,3.456998313659359
29,flat_8_128MB,dirhash_impl,sha1,8,0.483,0.487,4.209445585215605
30,flat_1k_1MB,shell reference,sha1,1,2.118,2.129,1.0
31,flat_1k_1MB,dirhash_impl,sha1,1,1.39,1.531,1.3905943827563685
32,flat_1k_1MB,dirhash_impl,sha1,2,0.925,0.932,2.2843347639484977
33,flat_1k_1MB,dirhash_impl,sha1,4,0.614,0.629,3.384737678855326
34,flat_1k_1MB,dirhash_impl,sha1,8,0.511,0.52,4.094230769230769
35,flat_32k_32kB,shell reference,sha1,1,10.551,10.97,1.0
36,flat_32k_32kB,dirhash_impl,sha1,1,4.663,4.76,2.304621848739496
37,flat_32k_32kB,dirhash_impl,sha1,2,3.108,3.235,3.3910355486862445
38,flat_32k_32kB,dirhash_impl,sha1,4,2.342,2.361,4.6463362981787375
39,flat_32k_32kB,dirhash_impl,sha1,8,2.071,2.094,5.2387774594078325
40,nested_1k_1MB,shell reference,sha1,1,2.11,2.159,1.0
41,nested_1k_1MB,dirhash_impl,sha1,1,1.436,1.47,1.4687074829931972
42,nested_1k_1MB,dirhash_impl,sha1,2,0.925,0.937,2.3041622198505864
43,nested_1k_1MB,dirhash_impl,sha1,4,0.627,0.643,3.357698289269051
44,nested_1k_1MB,dirhash_impl,sha1,8,0.516,0.527,4.096774193548386
45,nested_32k_32kB,shell reference,sha1,1,3.982,7.147,1.0
46,nested_32k_32kB,dirhash_impl,sha1,1,4.114,4.156,1.7196823869104911
47,nested_32k_32kB,dirhash_impl,sha1,2,2.598,2.616,2.7320336391437308
48,nested_32k_32kB,dirhash_impl,sha1,4,1.809,1.831,3.9033315128345167
49,nested_32k_32kB,dirhash_impl,sha1,8,1.552,1.58,4.523417721518987
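
Judging from the rows above, the `speed-up (median)` column appears to be the shell reference's `t_median` divided by each row's `t_median` for the same test case and algorithm. The sketch below checks this on the first five rows; `recompute_speedups` is a hypothetical helper and not part of `benchmark/run.py`.

```python
import csv
import io

# First rows copied from benchmark/results_v0.2.0.csv.
SAMPLE = """\
,test_case,implementation,algorithm,workers,t_best,t_median,speed-up (median)
0,flat_8_128MB,shell reference,md5,1,2.079,2.083,1.0
1,flat_8_128MB,dirhash_impl,md5,1,1.734,1.945,1.0709511568123393
2,flat_8_128MB,dirhash_impl,md5,2,0.999,1.183,1.760777683854607
3,flat_8_128MB,dirhash_impl,md5,4,0.711,0.728,2.8612637362637368
4,flat_8_128MB,dirhash_impl,md5,8,0.504,0.518,4.021235521235521
"""


def recompute_speedups(csv_text):
    """Return (reported, recomputed) speed-up pairs, one per CSV row."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    # Median time of the shell reference per (test case, algorithm).
    reference = {
        (r["test_case"], r["algorithm"]): float(r["t_median"])
        for r in rows
        if r["implementation"] == "shell reference"
    }
    pairs = []
    for r in rows:
        ref = reference[(r["test_case"], r["algorithm"])]
        recomputed = ref / float(r["t_median"])
        pairs.append((float(r["speed-up (median)"]), recomputed))
    return pairs


for reported, recomputed in recompute_speedups(SAMPLE):
    print(f"reported={reported:.4f} recomputed={recomputed:.4f}")
```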