editops

Character- and word-level alignment analysis of Unicode string pairs, based on Levenshtein edit distance and edit operations.

A lazily evaluated object-oriented interface defers the full alignment calculation until it is needed, while cheaply providing common error measures (word and character error rates) as well as less common ones (e.g. word information lost, match error rate).

Computes saliency-weighted analogs of word-level error measures.

Aggregates word/n-gram level statistics (e.g. precision, recall).

Prints visually useful representations.
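To make the precision/recall statistics concrete, here is a minimal sketch of one common definition: clipped bag-of-n-grams overlap between hypothesis and reference. The function names `ngrams` and `ngram_precision_recall` are hypothetical and not part of editops, and editops may well derive these statistics from the alignment rather than from a bag-of-n-grams comparison.

```python
from collections import Counter


def ngrams(words, n):
    """Return the list of n-grams (as tuples) in a sequence of words."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def ngram_precision_recall(hyp, ref, n=1):
    """Clipped n-gram precision and recall of hyp against ref.

    Illustrative only: whitespace tokenization and bag-of-n-grams
    matching are assumptions, not the editops implementation.
    """
    hyp_counts = Counter(ngrams(hyp.split(), n))
    ref_counts = Counter(ngrams(ref.split(), n))
    # Multiset intersection: an n-gram counts as matched at most as
    # often as it occurs in both strings.
    overlap = sum((hyp_counts & ref_counts).values())
    n_hyp = sum(hyp_counts.values())
    n_ref = sum(ref_counts.values())
    precision = overlap / n_hyp if n_hyp else 0.0
    recall = overlap / n_ref if n_ref else 0.0
    return precision, recall


precision, recall = ngram_precision_recall(
    "i'm just a string!", "i'm also just a string?"
)
print(precision, recall)
```

With unigrams, three of the four hypothesis words appear among the five reference words, so precision is 3/4 and recall is 3/5.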

Installation

Install from source with pip:

git clone https://github.com/ctogle/editops
cd editops
pip install .

Usage

CLI

Prints weighted average of WER/CER over all samples
Input can be text or files containing text (--hyp and --ref)
Handles 1+ pairs of strings to compare (--lines option)
Optionally displays full alignment/WER/CER for each sample (--verbose)
Optionally stores full analysis in jsonl output file (--output)

python -m editops.analyze --help
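How the CLI weights its average is not spelled out here. One plausible reading, sketched below, weights each sample's error rate by its reference length, which is the same as dividing the total number of errors by the total reference length. The function `weighted_average_rate` and its input format are hypothetical and not part of editops.

```python
def weighted_average_rate(samples):
    """Length-weighted average error rate over several samples.

    `samples` is an iterable of (error_count, reference_length) pairs.
    Weighting each per-sample rate by its reference length reduces to
    total errors / total reference length. This weighting is an
    assumption, not confirmed editops behavior.
    """
    total_errors = 0
    total_length = 0
    for errors, length in samples:
        total_errors += errors
        total_length += length
    if total_length == 0:
        raise ValueError("total reference length must be positive")
    return total_errors / total_length


# Two samples: 2 errors over 5 reference words, 1 error over 10.
samples = [(2, 5), (1, 10)]
print(weighted_average_rate(samples))
```

For these two samples the length-weighted rate is 3/15 = 0.2, whereas the unweighted mean of the per-sample rates (0.4 and 0.1) would be 0.25; the weighted form keeps short samples from dominating the result.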

Python

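from editops import Alignment

hyp = "i'm just a string!"
ref = "i'm also just a string?"

# Build the (lazily evaluated) alignment of hypothesis against reference
a = Alignment(hyp, ref, color=True)
print(a)
# Error measures are exposed as attributes
print(a.CER, a.WER, a.NWER, a.MER, a.WIL)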
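For orientation, the following sketch shows how WER, MER and WIL are conventionally defined in terms of hit, substitution, deletion and insertion counts. The helper `error_measures` is hypothetical; whether editops uses exactly these definitions is an assumption, and NWER is left out because its definition is not given here. CER follows the WER formula with characters in place of words.

```python
def error_measures(hits, subs, dels, ins):
    """Conventional word-level measures from alignment counts.

    hits/subs/dels/ins are the numbers of matched, substituted,
    deleted and inserted words. These are the textbook formulas,
    assumed (not confirmed) to match what editops reports.
    """
    n_ref = hits + subs + dels   # words in the reference
    n_hyp = hits + subs + ins    # words in the hypothesis
    errors = subs + dels + ins
    if n_ref == 0:
        raise ValueError("reference must contain at least one word")
    wer = errors / n_ref
    mer = errors / (hits + errors)
    # With an empty hypothesis there are no hits, so all information is lost.
    wil = 1.0 - (hits * hits) / (n_ref * n_hyp) if n_hyp else 1.0
    return {"WER": wer, "MER": mer, "WIL": wil}


# "i'm just a string!" vs. "i'm also just a string?":
# 3 hits, 1 substitution (string! / string?), 1 deletion (also).
print(error_measures(hits=3, subs=1, dels=1, ins=0))
```

Unlike WER, which can exceed 1 when the hypothesis contains many insertions, MER and WIL are bounded between 0 and 1.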

Performance

The speed of the edit distance calculation is compared below with that of other available Python libraries.

a = 'fsffvfdsbbdfvvdavavavavavava'
b = 'fvdaabavvvvvadvdvavavadfsfsdafvvav'

from edit_distance import SequenceMatcher
%timeit SequenceMatcher(a=a, b=b).distance()
# 1000 loops, best of 3: 1.18 ms per loop

from pylev import levenshtein as pylev_distance
%timeit pylev_distance(a, b)
# 1000 loops, best of 3: 260 µs per loop

from pyxdameraulevenshtein import damerau_levenshtein_distance as pyxdameraulevenshtein_distance
%timeit pyxdameraulevenshtein_distance(a, b)
# 10000 loops, best of 3: 73.6 µs per loop

from Levenshtein import distance as levenshtein_distance
%timeit levenshtein_distance(a, b)
# 100000 loops, best of 3: 2.02 µs per loop

from editdistance import eval as editdistance_distance
%timeit editdistance_distance(a, b)
# 1000000 loops, best of 3: 1.79 µs per loop

from editops import editdistance as editops_distance
%timeit editops_distance(a, b)
# 100000 loops, best of 3: 6.38 µs per loop

The edit distance calculation of editops is faster than that of every library above except python-Levenshtein and editdistance. Unlike editdistance, editops also exposes the set of edit operations, via the function editops. python-Levenshtein and edit_distance expose this information as well, but editops is significantly faster than edit_distance and more liberally licensed than python-Levenshtein.

a = 'fsffvfdsbbdfvvdavavavavavava'
b = 'fvdaabavvvvvadvdvavavadfsfsdafvvav'

from edit_distance import SequenceMatcher
%timeit SequenceMatcher(a=a, b=b).get_opcodes()
# 1000 loops, best of 3: 1.64 ms per loop

from Levenshtein import editops as levenshtein_editops
%timeit levenshtein_editops(a, b)
# 100000 loops, best of 3: 3.22 µs per loop

from editops import editops as editops_editops
%timeit editops_editops(a, b)
# 100000 loops, best of 3: 7.56 µs per loop
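As a reference point for what these benchmarks compute, here is a minimal pure-Python sketch of the standard dynamic-programming (Wagner-Fischer) algorithm with a backtrace that recovers one optimal set of edit operations. It is far slower than any of the libraries above and is not the editops implementation; the function name `reference_editops` and the `(tag, i, j)` tuple format are chosen for illustration only.

```python
def reference_editops(a, b):
    """Return (distance, ops) turning sequence a into sequence b.

    ops is a list of (tag, i, j) tuples with tag in
    {"replace", "delete", "insert"}, i an index into a and j an
    index into b. Illustrative format, not the editops API.
    """
    m, n = len(a), len(b)
    # dist[i][j] = edit distance between a[:i] and b[:j]
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dist[i][0] = i
    for j in range(1, n + 1):
        dist[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # delete a[i-1]
                dist[i][j - 1] + 1,         # insert b[j-1]
                dist[i - 1][j - 1] + cost,  # match or replace
            )
    # Walk back from the bottom-right corner, collecting the edits.
    ops = []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and a[i - 1] == b[j - 1] \
                and dist[i][j] == dist[i - 1][j - 1]:
            i, j = i - 1, j - 1  # match: no operation recorded
        elif i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + 1:
            ops.append(("replace", i - 1, j - 1))
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            ops.append(("delete", i - 1, j))
            i -= 1
        else:
            ops.append(("insert", i, j - 1))
            j -= 1
    ops.reverse()
    return dist[m][n], ops


distance, ops = reference_editops("kitten", "sitting")
print(distance, ops)
```

Because the function only indexes and compares elements, it works on lists of words as well as on strings, which is how a word-level alignment can reuse the same algorithm.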
