FMWS (Fuzzy Matching With Seeds)

What is it?

A command-line interface that takes in a FASTA file and a sequence, then finds the best matching substrings of each sequence in the FASTA file using fuzzy matching and using Levenshtein distance and Sørensen–Dice coefficient as metrics.

Why?

Fuzzy matching is the process of finding and matching an approximate pattern across strings. The concept started back in 1920, with the many-valued logic proposed by Jan Łukasiewicz, and then followed the evolution of translation memory tools leading up to the 1990's.

Now, fuzzy matching is used in all sorts of applications, including computer assisted translation (CAT) and spell checking. In bioinformatics, fuzzy matching is used for approximate string matching on large DNA sequences. This is what our FMWS tool is meant to facilitate.

Motivation

This project was created for Northern Arizona University's CS 499 class in Bioinformatics (Spring 2019 term). The primary driving force was proposing our own idea and implementation as well as having the flexibility to do as we please. We also proposed this idea to give us the opportunity to personalize our solution.

Main Logic

Parse command-line arguments
Read in FASTA file
Extract k-mers from sequence
Compare each k-mer to the given substring (parallelized)
Store the best substring of each sequence
Sort rankings
Output smallest distances, in order

Use Cases

Different use cases for fuzzy matching range from searching for a list based on relevance to providing recommendations to replace potentially misspelled search words.

When running our program with three different FASTA sequences, a substring of ACTCACCACTGT, and an Intel i7-4700HQ, these were our outcomes.

Sequence (length)	Levenshtein	Sorensen	Time(Parallelized vs Non-parallelized)
E.Coli (4641652)	1	0.909091	2.01 sec vs 5.83 sec
Bacillus Anthracis (5227293)	1	0.909091	2.19 sec vs 6.46 sec
Bacillus Cereus (5411809)	1	0.909091	2.27 sec vs 6.60 sec

Set-up instructions

We use the following libraries: CMake, GoogleTest.

GoogleTest has to be compiled and cannot be easily installed. Because of this, we include a 1.7.0 release of GoogleTest in the code base, for the sake of convenience. No installation has to be done.

Thus, CMake is required. CMake is installed by default in most Linux distros, though we require version 3.4.

Install dependencies

Debian-based systems:

sudo apt install cmake

RedHat-based systems:

dnf install cmake

Set CXX compiler

A CXX compiler must be set somewhere in your environment. It's often easiest to add it to your terminal.

vim ~/.bashrc
export CXX=clang++  # Insert at bottom
:wq

Make

We use CMake to build our code. It produces a unit test executable (tester) and a command-line executable (run)

cd FMWS/build
cmake ..
make  # add -jX where X is twice the number of cores of your CPU

Use

To run the unit tests, call tester like below.

cd FMWS/build
./tester

To use the tool, call run like below.

cd FMWS/build
./run "Path to FASTA file" "some sequence"

Name		Name	Last commit message	Last commit date
Latest commit History 35 Commits
build		build
include		include
lib/googletest		lib/googletest
src		src
test		test
.gitignore		.gitignore
BioInformatics Final Presentation.pdf		BioInformatics Final Presentation.pdf
CMakeLists.txt		CMakeLists.txt
CodeManual.pdf		CodeManual.pdf
Doxyfile		Doxyfile
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

FMWS (Fuzzy Matching With Seeds)

What is it?

Why?

Motivation

Main Logic

Use Cases

Set-up instructions

About

Releases

Packages

Contributors 4

Languages

Zachary-Kramer/FMWS

Folders and files

Latest commit

History

Repository files navigation

FMWS (Fuzzy Matching With Seeds)

What is it?

Why?

Motivation

Main Logic

Use Cases

Set-up instructions

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Contributors 4

Languages

Packages