Fast Edit Distance Using Big Data Prefix Trees

Build prefix trees on a big-data platform using a lightning-fast method, then use those prefix trees to implement fast edit-distance algorithms.

There are three forms of the code included.

The first form is self-contained and runs on an HPCC Systems Thor using the in-line dataset in the example. Use this one to understand what is going on with prefix trees and how to query them in Thor using an edit-distance algorithm.
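To make the idea concrete outside of Thor, here is a minimal Python sketch of a prefix tree queried with a bounded edit distance. It is not a translation of the ECL in this repository. It uses a common technique (one Levenshtein row per tree node, abandoning a branch once every cell in the row exceeds the limit), and the names `TrieNode`, `build_trie`, and `search` are made up for this illustration.

```python
from typing import Dict, List, Optional, Tuple


class TrieNode:
    """One node of the prefix tree; `word` is set only where a full word ends."""

    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.word: Optional[str] = None


def build_trie(words: List[str]) -> TrieNode:
    """Insert every word character by character, sharing common prefixes."""
    root = TrieNode()
    for w in words:
        node = root
        for ch in w:
            node = node.children.setdefault(ch, TrieNode())
        node.word = w
    return root


def search(root: TrieNode, query: str, max_dist: int) -> List[Tuple[str, int]]:
    """Return (word, distance) for every stored word within max_dist of query."""
    results: List[Tuple[str, int]] = []
    # Row for the empty prefix: turning "" into query[:i] costs i edits.
    first_row = list(range(len(query) + 1))
    if root.word is not None and first_row[-1] <= max_dist:
        results.append((root.word, first_row[-1]))
    for ch, child in root.children.items():
        _walk(child, ch, query, first_row, max_dist, results)
    return sorted(results)


def _walk(node: TrieNode, ch: str, query: str, prev_row: List[int],
          max_dist: int, results: List[Tuple[str, int]]) -> None:
    # Compute the Levenshtein row for the prefix that ends at this node.
    row = [prev_row[0] + 1]
    for i in range(1, len(query) + 1):
        row.append(min(
            row[i - 1] + 1,                           # skip a query character
            prev_row[i] + 1,                          # skip this tree character
            prev_row[i - 1] + (query[i - 1] != ch),   # match or substitute
        ))
    if node.word is not None and row[-1] <= max_dist:
        results.append((node.word, row[-1]))
    # Prune: if every cell already exceeds the limit, no descendant can match.
    if min(row) <= max_dist:
        for next_ch, child in node.children.items():
            _walk(child, next_ch, query, row, max_dist, results)


if __name__ == "__main__":
    tree = build_trie(["cat", "cart", "car", "coat", "dog", "dot"])
    print(search(tree, "cat", 1))
```

Because words with a common prefix share nodes, the row for that prefix is computed once rather than once per word, and the pruning check skips whole subtrees. That sharing is where the savings over a pairwise comparison come from in this sketch.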

The second form is mostly the same, but it takes a data file from a Thor, builds the prefix tree, and then uses the same data file to query the prefix tree. You will need to edit this code for your own data file. I tested this on a 1.7 million-record dataset with a 21-node Thor using slower spindle storage drives. The prefix tree builds in about 5 seconds. Walking the prefix tree with the same 1.7 million records takes about 45 minutes. That is the equivalent of taking two 1.7 million-record datasets and finding all edit-distance candidates between them (think Cartesian join) in 45 minutes. The naive approach would have to churn through almost 3 trillion candidate pairs (1.7 million × 1.7 million ≈ 2.9 trillion). This approach is orders of magnitude faster. You can easily edit this example to use different datasets to build and separately query your prefix tree.
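The pair count above is simple arithmetic, worked here as a short Python sketch. The inputs are the rounded figures quoted in this README (1.7 million records, about 45 minutes), so the results are approximations, and `naive_pair_stats` is a name invented for this example.

```python
def naive_pair_stats(records: int, walk_minutes: float) -> tuple:
    """Return (candidate pairs, pair-equivalents covered per second)."""
    # A Cartesian join of the dataset with itself compares every record
    # against every record.
    pairs = records * records
    walk_seconds = walk_minutes * 60
    return pairs, pairs / walk_seconds


if __name__ == "__main__":
    pairs, per_second = naive_pair_stats(1_700_000, 45)
    print(f"{pairs:,} candidate pairs")
    print(f"{per_second:,.0f} pair-equivalents per second")
```

With the rounded inputs, the tree walk covers the equivalent of roughly a billion candidate pairs per second across the cluster. It gets there by skipping most of those pairs, not by evaluating them.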

The final form of the code is similar to the other two examples, but I broke the code out into an example Thor job and Roxie query. Like the second form above, you will need to edit this code for your own data file. The Thor job builds the prefix tree. The Roxie query queries the prefix tree interactively and reports back performance data. Also included is a very simple Python script I wrote to query the prefix tree and collect performance data. I ran a thousand queries on each of three separate runs using a single-node Roxie with spindle disk drives. On the final run, the average query time was 24.7 milliseconds, with a standard deviation of 7.2 milliseconds. You can see the performance in the second blog post below.
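The kind of measurement described above can be sketched in a few lines of Python. This is not the contents of 3_LoadTest_PythonScript.py. It is a hypothetical harness in which `run_query` stands in for whatever function sends one word to the Roxie query, and it assumes the sample standard deviation, which may not be the variant used for the figures above.

```python
import statistics
import time
from typing import Callable, List, Sequence, Tuple


def time_queries(run_query: Callable[[str], object],
                 words: Sequence[str]) -> List[float]:
    """Call run_query once per word and return each elapsed time in milliseconds."""
    timings_ms: List[float] = []
    for word in words:
        start = time.perf_counter()
        run_query(word)  # hypothetical: one round trip to the query service
        timings_ms.append((time.perf_counter() - start) * 1000.0)
    return timings_ms


def summarize(timings_ms: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, sample standard deviation); needs at least two timings."""
    return statistics.mean(timings_ms), statistics.stdev(timings_ms)


if __name__ == "__main__":
    # Stand-in workload so the sketch runs without a Roxie to talk to.
    timings = time_queries(lambda w: w.upper(), ["smith", "smyth", "smithe"])
    mean_ms, stdev_ms = summarize(timings)
    print(f"mean {mean_ms:.3f} ms, stdev {stdev_ms:.3f} ms")
```

Timing on the client side like this includes network and serialization overhead, so it will read higher than any timing the query reports about itself.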

Finally, drop me a line if you've found the blog posts or the code useful.

Folders and files:

PruneUsingMaxMinChildWordLength
1_ThorInlineDataset.ecl
2_ThorFile.ecl
3_LoadTest_PythonScript.py
3_RoxieQuery.ecl
3_ThorIndexBuild.ecl
README.md
tree_graphic.html

Charles-Kaminski/FastEditDistanceUsingBigDataPrefixTrees
