Abimbojolo/string_grouper

 
 

String Grouper


[Image: graph visualization of one group of strings found by string_grouper]

The image displayed above is a visualization of the graph-structure of one of the groups of strings found by string_grouper. Each circle (node) represents a string, and each connecting arc (edge) represents a match between a pair of strings with a similarity score above a given threshold score (here 0.8).

The centroid of the group, as determined by string_grouper (see tutorials/group_representatives.md for an explanation), is the largest node, also with the most edges originating from it. A thick line in the image denotes a strong similarity between the nodes at its ends, while a faint thin line denotes weak similarity.

The image illustrates the strength of string_grouper: in large datasets it can often resolve indirect associations between strings even when memory constraints make it impractical to compute direct matches between those strings by conventional means at a lower similarity threshold.
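The idea of an indirect association can be sketched with a small graph example. This is not string_grouper's implementation: it assumes a hypothetical list of matched pairs with made-up similarity scores, and merges strings into groups by following chains of matches with a union-find structure. The `most_connected` helper is a simplified stand-in for a centroid (the member with the most matched neighbours); the library's actual rule is the one described in tutorials/group_representatives.md.

```python
from collections import defaultdict


def group_by_matches(strings, matched_pairs):
    """Group strings that are linked by a chain of pairwise matches."""
    parent = {s: s for s in strings}

    def find(s):
        # Follow parent links to the root, shortening the path on the way.
        while parent[s] != s:
            parent[s] = parent[parent[s]]
            s = parent[s]
        return s

    for left, right, _score in matched_pairs:
        root_left, root_right = find(left), find(right)
        if root_left != root_right:
            parent[root_right] = root_left

    groups = defaultdict(list)
    for s in strings:
        groups[find(s)].append(s)
    return sorted(groups.values())


def most_connected(group, matched_pairs):
    """Return the group member with the most matched neighbours (hypothetical centroid rule)."""
    degree = defaultdict(int)
    for left, right, _score in matched_pairs:
        if left in group and right in group:
            degree[left] += 1
            degree[right] += 1
    return max(group, key=lambda s: degree[s])


# Hypothetical data: the scores are illustrative, not computed by string_grouper.
strings = ["A Corp", "A Corporation", "A Corporation Ltd", "Zeta Foods"]
matched_pairs = [
    ("A Corp", "A Corporation", 0.85),
    ("A Corporation", "A Corporation Ltd", 0.90),
]

groups = group_by_matches(strings, matched_pairs)
print(groups)
print(most_connected(groups[0], matched_pairs))
```

Here "A Corp" and "A Corporation Ltd" are never matched directly, yet they land in the same group because both match "A Corporation".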

———

This image was designed using the graph-visualization software Gephi 0.9.2 with data generated by string_grouper operating on the sec__edgar_company_info.csv sample data file.


string_grouper is a library that makes finding groups of similar strings within one or more lists of strings easy and fast. string_grouper uses tf-idf to calculate cosine similarities within a single list or between two lists of strings. The full process is described in the blog post Super Fast String Matching in Python.
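To make the tf-idf and cosine-similarity idea concrete, here is a minimal pure-Python sketch. It is an illustration under stated assumptions, not the library's code: the use of character trigrams, the smoothed idf formula, and the helper names `char_ngrams`, `tfidf_vectors`, and `cosine` are all choices made for this example.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Split a string into overlapping character n-grams."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def tfidf_vectors(strings, n=3):
    """Return one L2-normalised tf-idf vector (as a dict) per string."""
    counts = [Counter(char_ngrams(s, n)) for s in strings]
    # Document frequency: in how many strings each n-gram occurs.
    doc_freq = Counter(gram for c in counts for gram in c)
    total = len(strings)
    vectors = []
    for c in counts:
        # Smoothed idf (an assumption of this sketch), so common n-grams keep some weight.
        weighted = {
            g: tf * (math.log((1 + total) / (1 + doc_freq[g])) + 1)
            for g, tf in c.items()
        }
        norm = math.sqrt(sum(w * w for w in weighted.values()))
        vectors.append({g: w / norm for g, w in weighted.items()} if norm else {})
    return vectors


def cosine(u, v):
    """For unit-length sparse vectors, cosine similarity is the dot product."""
    return sum(w * v.get(g, 0.0) for g, w in u.items())


names = ["Green Valley Grocers", "Green Valley Grocery", "Quick Stop Convenience"]
vecs = tfidf_vectors(names)
print(cosine(vecs[0], vecs[1]))  # near-duplicates share most trigrams
print(cosine(vecs[0], vecs[2]))  # unrelated names share few or none
```

Because every vector is normalised to unit length, comparing two whole lists reduces to a sparse matrix product, and a fast sparse product is what makes this approach scale.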

Installing

pip install string-grouper

Speed

string_grouper leverages the blazingly fast sparse_dot_topn library to calculate cosine similarities.

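import datetime

from string_grouper import match_strings

# `names` is assumed to be a DataFrame with a 'Company Name' column, loaded beforehand
s = datetime.datetime.now()
matches = match_strings(names['Company Name'], number_of_processes=4)
e = datetime.datetime.now()

# Elapsed wall-clock time
diff = (e - s)
str(diff)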

Results in:

00:05:34.65 on an Intel i7-6500U CPU @ 2.50GHz, where len(names) = 663 000.

In other words, the library performs fuzzy matching of 663 000 names in about five and a half minutes on a 2015 consumer CPU using 4 processes.
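As a quick sanity check on the figures above, the reported run time can be converted into an average throughput. This is only an average over this single run; it says nothing about how run time scales with list size.

```python
from datetime import timedelta

# Figures quoted above: 00:05:34.65 for 663 000 names.
elapsed = timedelta(minutes=5, seconds=34.65)
n_names = 663_000

names_per_second = n_names / elapsed.total_seconds()
print(f"{names_per_second:.0f} names per second")
```

That works out to a little under 2 000 names per second on the hardware described.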

Example: Matching Records Between Two Lists (Master vs New Data)

A common real-world scenario involves matching new records (such as new customer registrations, new product feeds, or new vendor data) against an existing master dataset to identify potential duplicates or fuzzy matches. string_grouper supports this by comparing two lists directly.

import pandas as pd
from string_grouper import match_strings

# Existing master customer list
master_customers = pd.Series([
    "Fresh Mart Superstore",
    "Green Valley Grocers",
    "Daily Needs Market",
    "Quick Stop Convenience"
])

# New incoming customer records
new_customers = pd.Series([
    "Green Valley Grocery",
    "Daily Needz Market",
    "Quick-Stop Convenience",
    "Completely New Store"
])

# Find fuzzy matches between the master list and new list
matches = match_strings(
    master=master_customers,
    duplicates=new_customers,
    min_similarity=0.8,
)

# Inspect the top matches
print(matches.head())
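
A natural follow-up is to separate new records that found a counterpart in the master list from those that did not. The sketch below assumes that `matches` has a `right_index` column holding the index labels of the matched `duplicates` entries, as in the output tables in this README. The `matches` frame here is a hand-written stand-in with hypothetical index pairs, not real string_grouper output.

```python
import pandas as pd

new_customers = pd.Series([
    "Green Valley Grocery",
    "Daily Needz Market",
    "Quick-Stop Convenience",
    "Completely New Store",
])

# Hypothetical stand-in for the DataFrame returned by match_strings:
# each row pairs a master record (left_index) with a new record (right_index).
matches = pd.DataFrame({
    "left_index": [1, 2, 3],
    "right_index": [0, 1, 2],
})

# A new record is "known" if its index label appears among the matches.
matched_mask = new_customers.index.isin(matches["right_index"])
already_known = new_customers[matched_mask]
genuinely_new = new_customers[~matched_mask]

print(genuinely_new.tolist())
```

Records in `genuinely_new` can be appended to the master list, while those in `already_known` are candidates for manual review or merging.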



## Simple Match

```python
import pandas as pd
from string_grouper import match_strings

company_names = 'sec__edgar_company_info.csv'
companies = pd.read_csv(company_names)
# Create all matches:
matches = match_strings(companies['Company Name'])
# Look at only the non-exact matches:
matches[matches['left_Company Name'] != matches['right_Company Name']].head()
```

| | left_index | left_Company Name | similarity | right_Company Name | right_index |
| --- | --- | --- | --- | --- | --- |
| 15 | 14 | 0210, LLC | 0.870291 | 90210 LLC | 4211 |
| 167 | 165 | 1 800 MUTUALS ADVISOR SERIES | 0.931615 | 1 800 MUTUALS ADVISORS SERIES | 166 |
| 168 | 166 | 1 800 MUTUALS ADVISORS SERIES | 0.931615 | 1 800 MUTUALS ADVISOR SERIES | 165 |
| 172 | 168 | 1 800 RADIATOR FRANCHISE INC | 1 | 1-800-RADIATOR FRANCHISE INC. | 201 |
| 178 | 173 | 1 FINANCIAL MARKETPLACE SECURITIES LLC /BD | 0.949364 | 1 FINANCIAL MARKETPLACE SECURITIES, LLC | 174 |
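
Rows 167 and 168 in the output above are the same pair listed in both directions. If only one row per unordered pair is wanted, one option is to keep the rows where `left_index` is smaller than `right_index`. The sketch below rebuilds the numeric columns of the rows shown above by hand. It assumes that the full result lists every non-exact pair in both directions, which those two rows suggest but this README does not state.

```python
import pandas as pd

# Numeric columns of the five rows shown above, rebuilt by hand.
matches = pd.DataFrame(
    {
        "left_index": [14, 165, 166, 168, 173],
        "similarity": [0.870291, 0.931615, 0.931615, 1.0, 0.949364],
        "right_index": [4211, 166, 165, 201, 174],
    },
    index=[15, 167, 168, 172, 178],
)

# Keep each unordered pair once by requiring left_index < right_index.
# This also drops exact self-matches, where both indices are equal.
unique_pairs = matches[matches["left_index"] < matches["right_index"]]
print(unique_pairs)
```

Row 168 is dropped because it mirrors row 167, while the other four rows are kept.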

Group Similar Strings and Find most Common

from string_grouper import group_similar_strings

# Assign each company name a group id and a representative (deduplicated) name
companies[["group-id", "name_deduped"]] = group_similar_strings(companies['Company Name'])
# Count the members of each group and list the ten largest groups
companies.groupby('name_deduped')['Line Number'].count().sort_values(ascending=False).head(10)

| name_deduped | Line Number |
| --- | --- |
| ADVISORS DISCIPLINED TRUST | 1747 |
| NUVEEN TAX EXEMPT UNIT TRUST SERIES 1 | 916 |
| GUGGENHEIM DEFINED PORTFOLIOS, SERIES 1200 | 652 |
| U S TECHNOLOGIES INC | 632 |
| CAPITAL MANAGEMENT LLC | 628 |
| CLAYMORE SECURITIES DEFINED PORTFOLIOS, SERIES 200 | 611 |
| E ACQUISITION CORP | 561 |
| CAPITAL PARTNERS LP | 561 |
| FIRST TRUST COMBINED SERIES 1 | 560 |
| PRINCIPAL LIFE INCOME FUNDINGS TRUST 20 | 544 |

Documentation

The documentation can be found here
