This project implements an algorithm which for two given articles finds all shortest paths between them. The search engine performs a BFS using tail recursion, starting from article provided as an entry point in the input file.
In each recursion call a queue of list of links (the list of links is representing their path from starting article) to process and a list of results (paths found) are passed. For a given article link taken from the head of a dequeued path the engine checks:
- If results is not empty and length of the path we're currently processing is greater than length of the path in results. If both conditions are met, we stop the search since all shortest paths have been found.
- If it has been visited. If so, we just call the recursive function passing the queue without its first element.
- If it is the endpoint we're looking for. If so, we add it to the results list and go on.
- In other case we read all Wikipedia links inside given article and enqueue them with their path from the starting article and continue the search.
The search begins with only starting article on the queue.
In order for this project to work you need to have:
- Scala version minimum 2.13.8
- sbt version minimum 1.8.3 (https://www.scala-sbt.org/download.html)
- In the directory with
build.sbt
file runsbt -mem 2048
command in your terminal. (flag -mem sets heap size to twich as much as is set by default in order to allow the program to execute a complex tail recursion, which puts a heavy load on the heap) - Once
sbt
is up and running, insidesbt
runcompile
. This will automatically create necessary dependencies, compile files and link them. - Now you just have to run the project. For details, please look below.
During developement of this program on my machine locally I noticed the most time consuming operation is by far fetching HTML data (each fetch was minimum 50ms, even with almost-empty sites). I noticed that on GiHub Codespace HTML fetch time is sometimes even 10x lower. I am assuming then that the problem was caused by my local machine configuration but if you face it too, I recommend using something like GitHub Codespace in order to boost the efficiency.
TL;DR If on your local machine automatic tests run longer than 10s, I recommend running the program on something like GitHub Codespace.
Once you have compiled the project and downloaded necessary requirements, you can run the app inside sbt with command
run <absoluteInputPath> <absoluteOutputPath>
absoluteInputPath
is an absolute path to your input txt file
with sites between which you'd like to know shortest paths. For detailed information
regarding the type of input, please look below.
absoluteOutputPath
is an absolute path to the txt file where you want to store
the output. The path must be valid but the txt file itself does not have to exist.
In that case, the program creates specified .txt
file and places the output
right there. If file already exists, the program just overrides file's content.
The program requires a .txt
file as its input. Each line of this file must be of type
(langCode, srcName, destName)
and ended by a newline character.
langCode
- valid language Wikipedia code (https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes, 639-1 column). For example for english it would been
or for polishpl
.srcName
- valid Wikipedia article title from which you would like to start the search.. You can look it up in the url of your chosen Wikipedia article. For example for an article available at https://pl.wikipedia.org/wiki/Skala_betów thesrcName
would beSkala_betów
.destName
- valid Wikipedia article title to which you would like to get all shortest paths fromsrcName
article.
In case of an incorrect input an appropriate message will be shown.
As an output the program created or overrides .txt
file at location provided by
absoluteOutputPath
argument in run
command. In this file the program for each
(langCode, srcName, destName)
will put into output file a separate line with all
shortest paths found, separated by comma and space and sorted alphabetically.
Automatic tests can be run in sbt
using test
command