A searchable database and command-line interface for words and documents from Project Gutenberg
The simplest way to install and run the project is using docker and docker-compose. To get started, in this directory (containing the docker-compose.yml file) run
docker-compose up --build
or docker-compose up --build -d to run in detached mode.
This will build and start three services. The first is an empty Postgres database that has been initialized with the correct schema, and the second is the project documentation.
Tip: Head over to localhost:8000 in your browser to check out the documentation!
The third and final service is the project CLI, called gutensearch. Once Docker Compose has finished setting up, you can run the container with the command-line interface installed using the following command. When specifying the volume to mount, this assumes you have kept the example data/ directory one level above this directory.
docker-compose run -v "$(pwd)/../data:/data" cli
then load the documents into the database using
gutensearch load --path data/ --multiprocessing
The example data provided has 1000 documents that are ready to be parsed and loaded. Depending on how many documents you are working with, you may need to tune the memory limits in the docker-compose.yml file prior to bringing the services up.
After this step has completed, you should see output similar to this
2020-10-28 01:46:11 [INFO] gutensearch.load - Parsing 1000 documents using 6 cores
2020-10-28 01:46:58 [INFO] gutensearch.load - Temporarily dropping indexes on table: words
2020-10-28 01:46:58 [INFO] gutensearch.load - Writing results to database
2020-10-28 01:47:36 [INFO] gutensearch.load - Finished writing data to database
2020-10-28 01:47:36 [INFO] gutensearch.load - Truncating table: distinct_words
2020-10-28 01:47:36 [INFO] gutensearch.load - Writing new distinct words to database
2020-10-28 01:47:49 [INFO] gutensearch.load - Finished writing distinct words to database
2020-10-28 01:47:49 [INFO] gutensearch.load - Recreating indexes on table: words
2020-10-28 01:48:04 [INFO] gutensearch.load - Committing changes to database
2020-10-28 01:48:04 [INFO] gutensearch.load - Running vacuum analyze on table: words
2020-10-28 01:48:05 [INFO] gutensearch.load - Committing changes to database
Your gutensearch CLI is now set up and ready to use! Head over to the usage section for information on how to get started and what commands are available.
For more information on this command, see the gutensearch load section below. If you run into any issues during installation and setup, please see the troubleshooting section below.
Alternatively, for a more "lightweight" installation where only the Postgres database is containerized, run
docker build -t db -f db.Dockerfile .
to build and tag the Postgres image. To start the database, run
docker run -d --rm --name db -p 5432:5432 --shm-size=1g -e "POSTGRES_DB=postgres" -e "POSTGRES_PASSWORD=postgres" -e "POSTGRES_USER=postgres" db
Then, using your favorite virtual environment tool (virtualenv, venv, conda, etc.), create a new environment and install the gutensearch package. The package requires Python 3.7+. For example, with venv
python3 -m venv venv
source venv/bin/activate
python3 -m pip install --upgrade pip
python3 -m pip install --upgrade --no-cache-dir .
Verify the package has been properly installed by running
gutensearch --help
You should see the following output
usage: gutensearch [-h] {download,load,word,doc} ...
A searchable database for words and documents from Project Gutenberg
positional arguments:
{download,load,word,doc}
download Download documents in a safe and respectful way from
Project Gutenberg
load Parse and load the word counts from documents into the
gutensearch database
word Find the documents where the given word occurs most
frequently
doc Find the most frequently occuring words in the given
document id
optional arguments:
-h, --help show this help message and exit
If installed properly, you can now load the data into the database in a similar way as shown above (again, assuming the example data provided is one level above your current working directory)
gutensearch load --path ../data/ --multiprocessing
You can use the gutensearch CLI to easily download documents, parse/load them into the database, and search for words and documents using the following subcommands
$ gutensearch --help
usage: gutensearch [-h] {download,load,word,doc} ...
A searchable database for words and documents from Project Gutenberg
positional arguments:
{download,load,word,doc}
download Download documents in a safe and respectful way from
Project Gutenberg
load Parse and load the word counts from documents into the
gutensearch database
word Find the documents where the given word occurs most
frequently
doc Find the most frequently occuring words in the given
document id
optional arguments:
-h, --help show this help message and exit
To download files from Project Gutenberg, use gutensearch download. All downloads make use of the Aleph Gutenberg Mirror in order to be respectful to the Project Gutenberg servers, in accordance with their "robot access" guidelines. Please see the complete list of Project Gutenberg Mirrors for more information.
The entire package assumes that document ids are integers and relies on the Gutenberg Index to assign document ids accordingly. Downloaded text files that have an extension such as 7854-8.txt or 7854-0.txt are automatically handled and cleaned during the download process so that the resulting document id is simply 7854, in accordance with the Gutenberg Index.
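As a rough illustration of that normalization, the sketch below extracts the integer id from a downloaded filename (the helper is hypothetical, not the package's actual code):
import re
from pathlib import Path

def document_id(filename):
    # e.g. '7854-8.txt' or '7854-0.txt' -> 7854
    stem = Path(filename).stem          # '7854-8'
    match = re.match(r'(\d+)', stem)    # the leading digits are the Gutenberg id
    if match is None:
        raise ValueError(f'no document id found in {filename!r}')
    return int(match.group(1))

print(document_id('7854-8.txt'))  # 7854
print(document_id('7854-0.txt'))  # 7854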
$ gutensearch download --help
usage: gutensearch download [-h] [--path PATH] [--limit LIMIT] [--delay DELAY]
[--log-level {notset,debug,info,warning,error,critical}]
[--only ONLY | --exclude EXCLUDE | --use-metadata]
optional arguments:
-h, --help show this help message and exit
--path PATH The path to the directory to store the documents
--limit LIMIT Stop the download after a certain number of documents
have been downloaded
--delay DELAY Number of seconds to delay between requests
--log-level {notset,debug,info,warning,error,critical}
Set the level for the logger
--only ONLY Download only the document ids listed in the given
file
--exclude EXCLUDE Download all document ids except those listed in the
given file
--use-metadata Use the .meta.json file to determine which documents
to download
For example, to download the first 1000 documents from Project Gutenberg, run
gutensearch download --limit 1000
This will automatically create a new data/ directory in the current directory and begin downloading documents into it. During the download, metadata is saved every 10 successful downloads and can be inspected in the .meta.json file created at data/.meta.json.
All documents downloaded are saved as {id}.txt. For example, in the data/ directory
$ ls -la | head -15
total 16541544
drwxr-xr-x 21424 k184444 354695482 685568 Oct 27 11:16 .
drwxr-xr-x 3 k184444 354695482 96 Oct 27 10:01 ..
-rw-r--r-- 1 k184444 354695482 3868652 Oct 26 10:40 .meta.json
-rw-r--r-- 1 k184444 354695482 4462487 Oct 23 12:52 10.txt
-rw-r--r-- 1 k184444 354695482 600106 Oct 23 14:09 1000.txt
-rw-r--r-- 1 k184444 354695482 101762 Oct 24 22:20 10000.txt
-rw-r--r-- 1 k184444 354695482 52510 Oct 24 22:20 10001.txt
-rw-r--r-- 1 k184444 354695482 306892 Oct 24 22:20 10002.txt
-rw-r--r-- 1 k184444 354695482 380817 Oct 24 22:20 10003.txt
-rw-r--r-- 1 k184444 354695482 302750 Oct 24 22:20 10004.txt
-rw-r--r-- 1 k184444 354695482 434760 Oct 24 22:20 10005.txt
-rw-r--r-- 1 k184444 354695482 95831 Oct 24 22:20 10006.txt
-rw-r--r-- 1 k184444 354695482 180139 Oct 24 22:21 10007.txt
-rw-r--r-- 1 k184444 354695482 407271 Oct 24 22:21 10008.txt
To change where the data is downloaded (instead of the default data/ path), use
gutensearch download --path files
where files/ is the name of the directory your documents will be saved to. If the directory already exists, it will be used.
If you only want to download a certain subset of documents by their id, create a file containing each integer id on a separate line. For example,
ids.txt
14130
24088
32918
43440
57076
gutensearch download --only ids.txt
This will download only the documents with the ids specified in the file above.
If you want to download all documents except certain ones, you can use the same approach as above, specifying the --exclude flag instead
gutensearch download --exclude ids.txt
which will download all documents from Project Gutenberg except those specified in the file.
Finally, if a download was interrupted or stopped, you can pick back up where it left off using the contents of the .meta.json file. To do so, run
gutensearch download --use-metadata
which will begin downloading any files from the Gutenberg Index that are not present in the metadata file.
Unfortunately, some of the document ids in the Gutenberg Index do not have valid urls, or have a url that does not follow the pattern of the other files. Furthermore, even if the url is valid, there may be no book, because the id may be reserved for the future. All of these cases are handled automatically during the download. For example,
2020-10-23 15:13:38 [INFO] gutensearch.download.download_gutenberg_documents - [1482/None] Saving document to path: data/1763.txt
2020-10-23 15:13:42 [INFO] gutensearch.download.download_gutenberg_documents - [1483/None] Saving document to path: data/1764.txt
2020-10-23 15:13:46 [ERROR] gutensearch.download.get_site_urls - 404 Client Error: Not Found for url: https://aleph.gutenberg.org/1/7/6/1766/
Traceback (most recent call last):
File "/Users/k184444/dev/gutensearch/gutensearch/download.py", line 87, in get_site_urls
response.raise_for_status()
File "/Users/k184444/Library/Caches/pypoetry/virtualenvs/gutensearch-2L6q6X4M-py3.7/lib/python3.7/site-packages/requests/models.py", line 941, in raise_for_status
raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 404 Client Error: Not Found for url: https://aleph.gutenberg.org/1/7/6/1766/
2020-10-23 15:13:46 [INFO] gutensearch.download.download_gutenberg_documents - Skipping document id: 1766
2020-10-23 15:13:49 [ERROR] gutensearch.download.get_site_urls - 404 Client Error: Not Found for url: https://aleph.gutenberg.org/1/7/6/1767/
Traceback (most recent call last):
File "/Users/k184444/dev/gutensearch/gutensearch/download.py", line 87, in get_site_urls
response.raise_for_status()
File "/Users/k184444/Library/Caches/pypoetry/virtualenvs/gutensearch-2L6q6X4M-py3.7/lib/python3.7/site-packages/requests/models.py", line 941, in raise_for_status
raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 404 Client Error: Not Found for url: https://aleph.gutenberg.org/1/7/6/1767/
2020-10-23 15:13:49 [INFO] gutensearch.download.download_gutenberg_documents - Skipping document id: 1767
2020-10-23 15:13:54 [INFO] gutensearch.download.download_gutenberg_documents - [1484/None] Saving document to path: data/1770.txt
2020-10-23 15:13:59 [INFO] gutensearch.download.download_gutenberg_documents - [1485/None] Saving document to path: data/1786.txt
Progress (including possible errors) is logged by the system and can be saved for further analysis or information if desired. You can change the logging level of the download at runtime by setting the value of --log-level to one of {notset, debug, info, warning, error, critical}. Logging is handled by the standard built-in Python logging library.
For example, to only display messages of warning severity or higher, run
gutensearch download --log-level warning
Although logging is helpful for monitoring download progress, the .meta.json file is the simplest source of information about successfully downloaded files. An example snippet of this file is provided below. Each key represents the id of the document.
.meta.json
{
"5389": {
"url": "https://aleph.gutenberg.org/5/3/8/5389/",
"datetime": "2020-10-24 11:54:10",
"filepath": "/Users/k184444/dev/gutensearch/temp/data/5389.txt"
},
"5390": {
"url": "https://aleph.gutenberg.org/5/3/9/5390/",
"datetime": "2020-10-24 11:54:15",
"filepath": "/Users/k184444/dev/gutensearch/temp/data/5390.txt"
},
"5391": {
"url": "https://aleph.gutenberg.org/5/3/9/5391/",
"datetime": "2020-10-24 11:54:20",
"filepath": "/Users/k184444/dev/gutensearch/temp/data/5391.txt"
}
}
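For context, resuming from this file amounts to simple set arithmetic over its keys. A minimal sketch (the full id list below is a placeholder; the real implementation reads it from the Gutenberg Index):
import json
from pathlib import Path

# Keys of .meta.json are the (string) ids of successfully downloaded documents
meta = json.loads(Path('data/.meta.json').read_text())
downloaded = {int(doc_id) for doc_id in meta}

# Placeholder set; in the package these ids come from the Gutenberg Index
all_ids = {14130, 24088, 32918, 43440, 57076}

remaining = sorted(all_ids - downloaded)  # ids still left to download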
Once you have obtained the raw documents (presumably using gutensearch download) you'll want to parse and load their contents into the database. The gutensearch load command provides an easy interface to perform this task.
Warning: Make sure the Postgres service is available before running this command, or connecting to the database will fail. See the installation and setup instructions above for more information.
$ gutensearch load --help
usage: gutensearch load [-h] [--path PATH] [--limit LIMIT] [--multiprocessing]
[--log-level {notset,debug,info,warning,error,critical}]
optional arguments:
-h, --help show this help message and exit
--path PATH The path to the directory containing the documents
--limit LIMIT Only parse and load a limited number of documents
--multiprocessing Perform the parse/load in parallel using multiple
cores
--log-level {notset,debug,info,warning,error,critical}
Set the level for the logger
To parse and load documents that are in the data/ directory, run the following from the current directory
$ gutensearch load
2020-10-27 12:25:05 [INFO] gutensearch.load - Parsing 21421 documents using 1 core
which will begin processing every file in the data/ directory (the default --path) using a single core.
To specify a directory other than the default data/, add the --path argument
$ gutensearch load --path documents
2020-10-27 12:25:05 [INFO] gutensearch.load - Parsing 21421 documents using 1 core
To limit the total number of documents to parse and load, specify a --limit
$ gutensearch load --limit 50
2020-10-27 12:26:44 [INFO] gutensearch.load - Parsing 50 documents using 1 core
Furthermore, for a significant speedup in parsing performance, you can parse multiple documents in parallel by specifying the --multiprocessing flag. This is especially useful when parsing a large number of documents at once. Below is an example of the entire parse/load pipeline for 21,400+ documents.
$ gutensearch load --path data/ --multiprocessing
2020-10-27 00:31:15 [INFO] gutensearch.load - Parsing 21421 documents using 12 cores
2020-10-27 00:40:42 [INFO] gutensearch.load - Temporarily dropping indexes on table: words
2020-10-27 00:40:42 [INFO] gutensearch.load - Writing results to database
2020-10-27 00:51:18 [INFO] gutensearch.load - Finished writing data to database
2020-10-27 00:51:18 [INFO] gutensearch.load - Truncating table: distinct_words
2020-10-27 00:51:18 [INFO] gutensearch.load - Writing new distinct words to database
2020-10-27 00:53:31 [INFO] gutensearch.load - Finished writing distinct words to database
2020-10-27 00:53:31 [INFO] gutensearch.load - Recreating indexes on table: words
2020-10-27 00:58:22 [INFO] gutensearch.load - Committing changes to database
2020-10-27 00:58:22 [INFO] gutensearch.load - Running vacuum analyze on table: words
2020-10-27 00:58:56 [INFO] gutensearch.load - Committing changes to database
In short, the command will identify all .txt files available in the specified directory, parse their contents by cleaning/tokenizing each word, and count unique instances of each token. Then the data is bulk loaded into Postgres, re-creating indexes and running statistics on the table(s) before exiting. For more details on this process, please see the Discussion and Technical Details section below.
With the words and counts of our documents parsed and loaded into the database, we can now perform a variety of interesting searches! The first type of search we can perform is to find the top n documents where a given word appears. There are a variety of options and features available.
$ gutensearch word --help
usage: gutensearch word [-h] [-l LIMIT] [--fuzzy] [-o {csv,tsv,json}] word
positional arguments:
word The word to search for in the database
optional arguments:
-h, --help show this help message and exit
-l LIMIT, --limit LIMIT
Limit the total number of results returned
--fuzzy Allow search to use fuzzy word matching
-o {csv,tsv,json}, --output {csv,tsv,json}
The output format when printing to stdout
For example, to find the 10 documents where the word "fish" shows up most frequently, run
$ gutensearch word fish
word document_id count
fish 3611 3756
fish 18542 1212
fish 9937 590
fish 8419 545
fish 10136 531
fish 683 405
fish 21008 353
fish 18298 309
fish 4219 309
fish 6745 300
By default, the program will limit the results to the top 10 documents found. To change this behavior, set the --limit flag to a different value. For example,
$ gutensearch word fish --limit 5
word document_id count
fish 3611 3756
fish 18542 1212
fish 9937 590
fish 8419 545
fish 10136 531
By default, the results will be printed to stdout in tsv format (tab-separated values). To change this behavior, set the value of the --output flag, choosing from one of three options: {tsv, csv, json}, representing tab-separated values, comma-separated values, or JSON, respectively. Each format is valid on its own, so you can redirect output directly to a file. For example, the same search from above as JSON would be
$ gutensearch word fish --output json
[{'count': 3756, 'document_id': 3611, 'word': 'fish'},
{'count': 1212, 'document_id': 18542, 'word': 'fish'},
{'count': 590, 'document_id': 9937, 'word': 'fish'},
{'count': 545, 'document_id': 8419, 'word': 'fish'},
{'count': 531, 'document_id': 10136, 'word': 'fish'},
{'count': 405, 'document_id': 683, 'word': 'fish'},
{'count': 353, 'document_id': 21008, 'word': 'fish'},
{'count': 309, 'document_id': 18298, 'word': 'fish'},
{'count': 309, 'document_id': 4219, 'word': 'fish'},
{'count': 300, 'document_id': 6745, 'word': 'fish'}]
or you could save results directly to a CSV file using
$ gutensearch word fish --output csv > fish.csv
Furthermore, instead of full words, you can also provide word patterns such as fish% or doctor_. Any SQL string pattern is valid and may be used. For example,
$ gutensearch word fish% -l 5 -o csv
word,document_id,count
fish,3611,3756
fishing,3611,2036
fishermen,3611,1216
fish,18542,1212
fish,9937,590
As we can see, instead of just "fish", the search returned results matching "fish", "fishing", "fishermen", and more.
Finally, you can even execute a search using "fuzzy word matching" for any word that cannot be effectively represented using a word pattern. To enable this behavior, set the --fuzzy flag.
$ gutensearch word aquaintence -o csv --fuzzy
word,document_id,count
aquaintance,19704,2
aquaintance,3292,1
aquaintance,15772,1
aquaintance,13538,1
aquaintance,3141,1
aquaintance,968,1
aquaintance,947,1
aquaintance,10699,1
aquaintance,16014,1
aquaintance,15864,1
As we can see, the search returned results for the "best possible match" for the word "acquaintance", given that the provided query misspelled it (missing the "c" before the "q", and "e" instead of "a"). Fuzzy word matching uses the "ratio score" algorithm, and picks the word in the available corpus of text in the database with the highest ratio to the provided word. For more details, please see the Discussion and Technical Details section below.
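For a rough sense of the "ratio score", here is how the misspelling compares against the matched word using Python's built-in SequenceMatcher, which the fuzzy matching is built on (see the Discussion and Technical Details section below):
>>> from difflib import SequenceMatcher
>>> SequenceMatcher(None, 'aquaintence', 'aquaintance').ratio()
0.9090909090909091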
We've seen how to search all documents for a specific word, but what if we want to do the opposite? To perform a search for the top n most frequently used words in a given document (id), we can use gutensearch doc.
$ gutensearch doc --help
usage: gutensearch doc [-h] [-l LIMIT] [-m MIN_LENGTH] [-o {json,csv,tsv}] id
positional arguments:
id The document id to search for
optional arguments:
-h, --help show this help message and exit
-l LIMIT, --limit LIMIT
Limit the total number of results returned
-m MIN_LENGTH, --min-length MIN_LENGTH
Exclude any words in the search less than a minimum
character length
-o {json,csv,tsv}, --output {json,csv,tsv}
The output format when printing to stdout
For example, to find the top 10 most frequent words in the document with id 8419,
$ gutensearch doc 8419
word document_id count
which 8419 7601
this 8419 7189
with 8419 6922
they 8419 5003
that 8419 4714
river 8419 4495
from 8419 4272
their 8419 3244
them 8419 3150
about 8419 2860
Similar to before, by default the program will limit the results to the top 10 words found. To change this behavior, set the --limit flag to a different value. For example,
$ gutensearch doc 8419 --limit 5
word document_id count
which 8419 7601
this 8419 7189
with 8419 6922
they 8419 5003
that 8419 4714
Once again, the results printed to stdout are in tsv format by default. To change this behavior, use the --output flag, choosing from one of the three options: {tsv, csv, json}
gutensearch doc 8419 --output json
[{'count': 7601, 'document_id': 8419, 'word': 'which'},
{'count': 7189, 'document_id': 8419, 'word': 'this'},
{'count': 6922, 'document_id': 8419, 'word': 'with'},
{'count': 5003, 'document_id': 8419, 'word': 'they'},
{'count': 4714, 'document_id': 8419, 'word': 'that'},
{'count': 4495, 'document_id': 8419, 'word': 'river'},
{'count': 4272, 'document_id': 8419, 'word': 'from'},
{'count': 3244, 'document_id': 8419, 'word': 'their'},
{'count': 3150, 'document_id': 8419, 'word': 'them'},
{'count': 2860, 'document_id': 8419, 'word': 'about'}]
By default, the optional --min-length flag is set to 4 characters, which excludes any words shorter than that minimum from the search. The default of 4 avoids commonly encountered words such as "a" or "the". To increase the minimum character length, simply provide a different (integer) value to --min-length as shown in the example below
$ gutensearch doc 8419 --min-length 12
word document_id count
considerable 8419 274
neighbourhood 8419 229
particularly 8419 136
sufficiently 8419 106
observations 8419 88
disagreeable 8419 84
notwithstanding 8419 53
considerably 8419 49
perpendicular 8419 48
circumstance 8419 46
The following section outlines a few problems you may (but hopefully don't) encounter when installing, setting up, and running the project.
A directory that has been mounted to a container appears empty
Make sure that the data directory you are attempting to mount is not in the same directory as the .dockerignore file. Specifically, the .dockerignore file ignores anything under /data/ to avoid unnecessarily sending large amounts of data to the Docker daemon when building containers.
Docker kills a task/process when loading data into the database
If you are attempting to parse/load a large number of documents at once, you may run into memory issues with Docker. The default memory limit for the cli service in docker-compose.yml is set to 4 gigabytes. However, if you're only building and running the cli image independently, you may need to include a --memory flag during docker run. See the Docker resource constraints documentation for more information. Alternatively, you can simply modify the configuration in docker-compose.yml and then re-run the steps in the installation.
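For illustration, the relevant piece of configuration might look something like this sketch (the service definition shown here is an assumption and the project's actual docker-compose.yml may differ; mem_limit also assumes a Compose file version that honors it):
services:
  cli:
    build: .          # assumed build context
    mem_limit: 4g     # raise this if the parse/load process is being OOM-killed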
The project is essentially divided into four different pieces, each responsible for different tasks in the entire workflow.
- cli.py: Contains the code for the command-line interface implementation and provides the main() function as the entrypoint for the entire interface.
- database.py: Contains any database-related code and functions. Most of these are utilities for connecting to and querying the database. This module also includes two functions for searching for words or documents in the database.
- download.py: Provides the implementations for downloading and saving data from Project Gutenberg, including building up the appropriate URL patterns, downloading a document, and saving documents.
- parse.py: Contains any parsing-related functions, mostly related to parsing the contents of a document, cleaning it, "tokenization", and counting unique instances of words.
The project and code are designed in such a way that a "client" can choose to download all or some of the files (by their unique document id) locally, then parse and load them into the database. Postgres was chosen as the database because it is what I'm most familiar with, and with a few simple indexing strategies it provides excellent performance for the search requirements. Because Postgres supports concurrent reads/writes from multiple clients, it would also be a suitable choice if this program were being executed on multiple machines at once. Although the current implementation has no mechanism for synchronizing which files have been downloaded across multiple machines, this could be added with a distributed task queue such as Celery. Aside from that, the current implementation is suitable for downloading and parsing documents from Project Gutenberg, and loading the tokenized word counts into a single database.
One aspect of the design that needed to be considered was how documents would be identified. Luckily, it turns out that Project Gutenberg provides its own Gutenberg Index. Unfortunately, I could not find a "machine-readable" format (despite a lot of digging around), so this needed to be parsed. However, this was not a big problem, as it could be parsed once (it only takes a few seconds) and then saved locally for any future use. The gutenberg-index.txt file in the source code provides a list of every document id listed on Project Gutenberg. This was the key to making the download implementations simple and straightforward.
It turns out that documents saved on Project Gutenberg (specifically the aleph.gutenberg.org mirror) follow a very specific url pattern. Once I realized that this pattern existed, constructing urls for a given document id became very simple. For example, the document with id 6131 has the following url: https://aleph.gutenberg.org/6/1/3/6131/. If you follow the link, you'll see that this is a directory of several files, including both text and zip files. At this point, I realized a small possible issue that required adding a little bit of complexity to my design. In this particular example, you'll notice there is more than one text file: 6131-0.txt and 6131.txt. As far as I could tell, the contents of these two files were identical (I checked around 10 or so random documents and this proved to be the case for all of them, but I could have missed something, and it may not hold for all documents). Furthermore, some documents only have a single text file (some with the trailing -0 and others without), while other documents have a trailing -8 in their name. For this reason, I decided that my download pattern for a given document id would be as follows:
- Construct the url for a given document id. Ex: given id 6131, return https://aleph.gutenberg.org/6/1/3/6131/
- Make a request to this url and check that it's valid (no 404's are thrown, etc.)
- If valid, find all links on the site (<a> tags) ending with .txt and store them
- If more than one .txt file was found, keep the first one found in this order of precedence: {id}.txt, {id}-0.txt, {id}-8.txt
Once a document was successfully downloaded, it was saved to the target directory with the name {id}.txt, regardless of the name used when it was downloaded.
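To make the URL pattern and the precedence rule concrete, here is a minimal sketch (the function names are illustrative rather than the package's actual API, and the handling of single-digit ids is an assumption about the mirror's layout):
def document_url(doc_id):
    # e.g. 6131 -> https://aleph.gutenberg.org/6/1/3/6131/
    digits = str(doc_id)
    prefix = '/'.join(digits[:-1]) if len(digits) > 1 else '0'
    return f'https://aleph.gutenberg.org/{prefix}/{doc_id}/'

def pick_text_file(doc_id, filenames):
    # Keep the first match in order of precedence: {id}.txt, {id}-0.txt, {id}-8.txt
    for candidate in (f'{doc_id}.txt', f'{doc_id}-0.txt', f'{doc_id}-8.txt'):
        if candidate in filenames:
            return candidate
    return None

print(document_url(6131))                                # https://aleph.gutenberg.org/6/1/3/6131/
print(pick_text_file(6131, ['6131-0.txt', '6131.txt']))  # 6131.txt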
Once one or more documents are available locally (in whatever directory the user wants, data/ by default), the user can load them into the database. But before this is done, each document needs to be parsed and tokenized, and unique word instances must be counted. For this step, I tried to take the simplest approach possible in order to keep the code readable and as performant as possible.
The entire parsing pipeline requires only a single pass over the document text to clean and tokenize it, and then one more pass to count unique words. If you inspect the implementation of gutensearch.parse.lazytokenize, you'll see that this function returns a generator that iterates through each character in the given body of text. Each valid character encountered is added to a word "buffer"; when an "invalid" character is encountered, the buffered word is yielded to the user and the buffer is flushed. Only upper or lower case letter characters are considered valid (ASCII decimal values between 65 and 90, and 97 and 122). Whenever a valid character is stored in the word buffer, it is saved as lower case. A "lazy" strategy was chosen so that, theoretically, a document that could not fit in memory all at once could still be efficiently parsed. Furthermore, this function is pure and produces no side effects, which makes it ideal for use in a parallelized setting.
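Based on that description, a minimal sketch of such a lazy tokenizer might look like this (an illustration of the approach, not the exact implementation of gutensearch.parse.lazytokenize):
def lazytokenize(text):
    # Accumulate valid letters into a buffer; yield the buffered word
    # whenever an invalid character is hit, then flush the buffer.
    word = []
    for char in text:
        if 'A' <= char <= 'Z' or 'a' <= char <= 'z':  # ASCII 65-90 and 97-122
            word.append(char.lower())                 # store as lower case
        elif word:
            yield ''.join(word)
            word.clear()
    if word:                                          # flush any trailing word
        yield ''.join(word)

print(list(lazytokenize("The quick, BROWN fox!")))  # ['the', 'quick', 'brown', 'fox']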
Once a document was parsed and cleaned, I made use of the built-in Python Counter class from the collections module to count unique instances of each word. In conjunction with the lazy tokenizer described above, I implemented the gutensearch.parse.parse_document function to parse, tokenize, and count word instances for a given document. This function can optionally be used with multiprocessing, which is an option provided to the user as part of the gutensearch load command.
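Putting the two together, the counting step is conceptually this simple (again a sketch, reusing the lazytokenize sketch above; the real parse_document also handles reading files and the multiprocessing plumbing):
from collections import Counter

def parse_document(text):
    # One pass to tokenize lazily; Counter consumes the generator to count
    return Counter(lazytokenize(text))

counts = parse_document("the fish and the river and the fish")
print(counts.most_common(2))  # [('the', 3), ('fish', 2)]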
As briefly discussed above, Postgres was chosen for this project because I am familiar with it, and it is high-performance and full-featured. The database contains only two tables, words and distinct_words. The schema for this database can be found in schema.sql in this directory. We'll focus on words first.
The words table consists of three columns: word, document_id, and count. It contains all of the records parsed from the sections described above. Each record contains a single word, the document id it was found in, and the frequency (count) with which it occurred. The key to making searches fast and effective over this table was creating two indexes. The first is an index on words over the column word, which allows for fast, indexed look-up of a specific word. The second is an index on words over the column document_id, which similarly provides fast, indexed look-up of a specific document in the table. Both of these indexes are effective because the cardinality of the columns is relatively small compared to the total number of records in the entire table. For the case analyzed, with 21,421 unique documents containing over 134 million rows, there were roughly 3.4 million unique words. Therefore, we'd expect that, on average, searches for unique documents are faster than searches for unique words. Please see the benchmarks section for more information.
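In SQL terms, the two indexes amount to something like the following sketch using psycopg2 (the index names are illustrative; the authoritative definitions live in schema.sql):
import psycopg2

conn = psycopg2.connect(host='localhost', dbname='postgres',
                        user='postgres', password='postgres')
with conn, conn.cursor() as cur:
    # Fast look-up of a specific word across all documents
    cur.execute('CREATE INDEX IF NOT EXISTS words_word_idx ON words (word)')
    # Fast look-up of all words within a specific document
    cur.execute('CREATE INDEX IF NOT EXISTS words_document_id_idx ON words (document_id)')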
The second table, distinct_words, has a single column, word, and contains every unique (distinct) instance of a word in the words table. The idea behind this table was to provide a pre-computed set that could be used as a corpus for performing fuzzy word matching. In practice, this turned out to be an ineffective approach, as querying the distinct_words table whenever a fuzzy word match was requested (in addition to finding the closest match) was still relatively slow. Although word pattern matching using SQL string patterns still proved to be effective, if true fuzzy word matching were a hard requirement for this project, more work would need to be done to improve this aspect of the performance.
It can be tricky to efficiently load a large number of records into a table at once, especially in a relational database. However, Postgres provides a few helpful tips for performing "bulk loads" efficiently, and I have made use of several of these suggestions in my loading implementation. In short, after all documents have been parsed into words and counts, they are still in memory. In order to effectively write all of the data to the words table described above, I make use of the COPY FROM command, which allows for loading all of the rows in a single command instead of a series of INSERT commands. To do this, psycopg2.cursor.copy_from expects an instance of an IO object. Writing all of the data as a single text file to disk, then reading it back into memory, would have been slow and ineffective. Instead, I made use of the io.StringIO class to incrementally build up an in-memory text buffer. Each record was written as tab-delimited values to the text buffer (as expected by Postgres) and then efficiently written into the words table. Prior to performing this operation, any indexes on words were dropped, then re-created after writing the data. Furthermore, after all of the data had been written, a VACUUM ANALYZE command was dispatched to provide further optimizations and up-to-date statistics for the query planner. This strategy was used to store over 134 million records in around 8.5 minutes, after parsing 21,000+ documents. Please see the benchmarks section below for more information.
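The core of that bulk load can be sketched as follows (simplified: the real implementation also drops and re-creates the indexes around the copy; the connection parameters are the defaults from the docker-compose setup above):
import io
import psycopg2

def bulk_load(conn, rows):
    """Bulk-load (word, document_id, count) tuples via COPY FROM."""
    buffer = io.StringIO()
    for word, document_id, count in rows:
        # Tab-delimited records, as Postgres COPY expects by default
        buffer.write(f'{word}\t{document_id}\t{count}\n')
    buffer.seek(0)  # rewind so copy_from reads from the start
    with conn.cursor() as cur:
        cur.copy_from(buffer, 'words', columns=('word', 'document_id', 'count'))
    conn.commit()

conn = psycopg2.connect(host='localhost', dbname='postgres',
                        user='postgres', password='postgres')
bulk_load(conn, [('fish', 3611, 3756), ('river', 8419, 4495)])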
As mentioned in the database design section above, this project provides a fuzzy word matching feature that can be used when searching for words in the database. I took a simple approach inspired by the blog post from SeatGeek announcing the open-sourcing of their fuzzywuzzy package. I opted not to include fuzzywuzzy as part of my project in order to keep the dependencies as minimal as possible. Instead, I created a custom function (found under gutensearch.parse.closest_match) that makes use of Python's built-in SequenceMatcher object. Given a word and a corpus of words, the function returns the word from the corpus that most closely matches the given word, chosen as the word with the highest "ratio". Any ties are resolved by selecting the first instance of the highest ratio found in the corpus. For more information on the performance of this implementation in practice, please see the benchmarks below.
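A minimal sketch of that idea (illustrative; the actual gutensearch.parse.closest_match may differ in its details):
from difflib import SequenceMatcher

def closest_match(word, corpus):
    # Pick the corpus word with the highest ratio score; max() returns
    # the first of any tied candidates, resolving ties as described above.
    return max(corpus, key=lambda candidate: SequenceMatcher(None, word, candidate).ratio())

print(closest_match('aquaintence', ['acquaintance', 'aquaintance', 'acquire']))  # aquaintance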
This section is mainly focused on the performance of the parsing, loading, and searching components of this project. All figures and benchmarks performed are only meant to be loosely interpreted for instructional use and context. They have been performed on a Macbook Pro (16 inch, 2019) with 2.6 GHz 6-Core Intel Core i7 processors, and 16 GB 2667 MHz DDR4 of RAM.
The first part of gutensearch load includes parsing the contents of every document in the provided directory. From the logs of running gutensearch load --path data/ --multiprocessing on a directory with 21,421 documents (of varying size and length) using all 6 cores (12 threads), the program parsed 134,855,452 records in 567 seconds.
2020-10-27 00:31:15 [INFO] gutensearch.load - Parsing 21421 documents using 12 cores
2020-10-27 00:40:42 [INFO] gutensearch.load - Temporarily dropping indexes on table: words
from gutensearch.database import query
records = query('SELECT COUNT(*) AS count FROM words')
count = records[0].count
print(count)
134855452
This equates to roughly 37.8 documents parsed per second, or 237,840 records parsed per second, on average.
After documents have been parsed in memory, they need to be efficiently bulk-loaded into the database. Once again, from the logs of running gutensearch load --path data/ --multiprocessing on a directory with 21,421 documents (of varying size and length) using all 6 cores (12 threads), the program loaded 134,855,452 records in 504 seconds.
2020-10-27 00:40:42 [INFO] gutensearch.load - Writing results to database
2020-10-27 00:51:18 [INFO] gutensearch.load - Finished writing data to database
This equates to roughly 267,570 records loaded into the table per second, on average.
For a database with over 134 million records, we have the following benchmarks. These were timed in an ipython terminal using the %timeit cell magic, with the setup shown below. These are likely optimistic estimates, since the same word/pattern is repeatedly searched for within a single %timeit block of loops.
>>> from gutensearch.database import search_word
>>> %timeit search_word(...)
| Word | Fuzzy | Limit | Result |
| --- | --- | --- | --- |
| fish | FALSE | 10 | 38.9 ms ± 690 µs per loop (mean ± std. dev. of 7 runs, 10 loops each) |
| acceptance | FALSE | 10 | 31.6 ms ± 12.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) |
| consumable | FALSE | 10 | 9.43 ms ± 656 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) |
| fish% | FALSE | 10 | 19.6 s ± 964 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) |
| doctor_ | FALSE | 10 | 18.4 s ± 533 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) |
| %ing | FALSE | 10 | 31.7 s ± 930 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) |
| aquaintence | TRUE | 10 | 1min 24s ± 1.98 s per loop (mean ± std. dev. of 7 runs, 1 loop each) |
For a database with over 3.4 million unique words, we have the following benchmarks. The first is the typical time taken to query the entire "corpus" of distinct words from the database, and the second measures the performance of a fuzzy word match against that corpus.
>>> from gutensearch.database import query_distinct_words
>>> %timeit query_distinct_words()
3.8 s ± 254 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> from gutensearch.parse import closest_match
>>> from gutensearch.database import query_distinct_words
>>>
>>> corpus = query_distinct_words()
>>> %timeit closest_match('aquaintence', corpus)
1min 25s ± 1.27 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
For a database with over 134 million records, we have the following benchmarks. These were timed in an ipython terminal using the %timeit cell magic, with the setup shown below. These are likely optimistic estimates, since the same document is repeatedly searched for within a single %timeit block of loops.
>>> from gutensearch.database import search_document
>>> %timeit search_document(...)
| Document ID | Min Length | Limit | Result |
| --- | --- | --- | --- |
| 8419 | 4 | 10 | 31.4 ms ± 786 µs per loop (mean ± std. dev. of 7 runs, 10 loops each) |
| 8419 | 8 | 10 | 22 ms ± 1.25 ms per loop (mean ± std. dev. of 7 runs, 10 loops each) |
| 8419 | 12 | 10 | 13.4 ms ± 565 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) |
| 8419 | 16 | 10 | 11.8 ms ± 107 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) |
| 8419 | 20 | 10 | 12.2 ms ± 356 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) |
In summary, the main goals for this project have been met with satisfactory performance (roughly 9-39 ms on average for an exact word match, and 12-31 ms on average for a document id search). Furthermore, the command-line interface provides a variety of features for downloading, parsing, loading, and performing various kinds of searches over the data. However, there is always room for improvement. Depending on the needs of the project, the following are a few possible enhancements and improvements that could be made:
- Improve the performance of fuzzy word matching. This is the slowest part of the current implementation. It would be useful to research the text search and/or fuzzy word matching capabilities available within Postgres. A modification to the data structures or schema design may also be warranted.
- Parse and load a document into the database directly after download, without saving it to disk. This could help improve performance and simplify the overall data acquisition and ETL process.
- An "online" system that constantly receives a list of document ids to download, parses them, and loads them into the database. This could potentially be done using a distributed task queue that provides a queue of documents that need to be processed.