Addresspack

This project processes a standard “AddressBase Premium full (CSV)” into a SQLite database.

Before jumping into AddressBase details, you might want to consider OS Open UPRN. It is a slim option if you only want the coordinates for a given UPRN. It is licensed under the Open Government Licence and offered in two formats: a table-oriented CSV and a GeoPackage.

Notice that AddressBase Core is also offered as a GeoPackage.

Context

AddressBase Premium CSV is a particular beast. The download is a zip file of around 8 GB that contains 352 files with a .csv extension.

Although these files are labelled as "csv", they are not structured as tables. They have to be treated as a single list of comma-separated lines where each line starts with the numeric identifier of the table it belongs to.

So, to recompose the original relational model, you need to read the technical specification and create the expected schema.

If you only want to construct tabular CSV files, you might be able to get away with just using the header files.

At this point, you need to read the given files line by line, parse each line as CSV, promote empty values to NULL, and load all ~352 million records into the expected tables.
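
As a rough illustration of that step, here is a minimal sketch in plain Rust that reads one file line by line and branches on the leading record identifier. The file name is hypothetical, and a real implementation would use a proper CSV parser (for example the csv crate) so that quoted fields containing commas are handled correctly:

use std::fs::File;
use std::io::{BufRead, BufReader};

fn main() -> std::io::Result<()> {
    // Hypothetical file name, purely for illustration.
    let file = File::open("AddressBasePremium_001.csv")?;

    for line in BufReader::new(file).lines() {
        let line = line?;
        // The first comma-separated field is the record (table) identifier.
        match line.split(',').next().unwrap_or("") {
            "10" => { /* header record */ }
            "99" => { /* trailer record */ }
            _ => { /* one of the data records described in the technical specification */ }
        }
    }

    Ok(())
}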

Why SQLite

The journey from what Ordnance Survey gives you to the point where you can run your first query against the AddressBase database can't be described as a good experience.

SQLite is a single-file relational database with a good SQL engine. What would the experience be if, instead of downloading the CSV zip, you could download a SQLite database ready to be used? Perhaps the database could include some metadata and enough documentation to let you understand what each table is for and how the tables relate to each other. Similar to what you get in the technical specification, but co-located with the data.

Well, Addresspack is an exercise to explore this idea.

The bare schema, before loading the full dataset, provides the table_info and column_info tables, which let you explore the basic documentation for each table and its columns.

For example, to find out which columns are available and what they are for in the table with id 10 (the first record of every CSV file starts with 10), you could run:

sqlite3 addresspack.sqlite \
"select table_info.name, column_info.id, column_info.definition \
from column_info join table_info on column_info.table_id = table_info.id \
where table_id = 10;"

Getting the data

The Ordnance Survey website has an orders section where you can buy access to the AddressBase Premium database. The one Addresspack expects is the CSV format with “Download” as the delivery method.

Install

You can download the pre-built package for your platform from the releases page, or you can build and install it with cargo:

cargo install \
  --git https://github.com/arnau/addresspack \
  --tag v0.1.0

Build

Addresspack is a Rust command-line interface so you need Rust to compile this project.

Building it in release mode makes a significant difference performance-wise:

cargo build --release

Then, run the binary from the target/ directory:

./target/release/addresspack load --help

Note: It has been tested with Rust 1.44 but it is likely to work with older versions as well.

Limitations

SQLite only allows a single writer, so Addresspack can't use multiple threads to parallelise the inserts. Although I tried to make the ingestion process fast, it is still very slow: in the order of hours.

By default, a transaction is committed every 10 files (i.e. every 10,000,000 inserts) using SQLite's WAL journal mode. If the process stops (e.g. you kill it), you can run it again and it will resume where it stopped, losing at most the last uncommitted batch of up to 10 million rows.
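
To make the batching idea concrete, here is a minimal sketch of the pattern, assuming the rusqlite crate and a purely hypothetical database file and table; the actual loader in Addresspack is more involved:

use rusqlite::{params, Connection, Result};

fn main() -> Result<()> {
    // Hypothetical database path and table, purely for illustration.
    let mut conn = Connection::open("example.sqlite")?;
    conn.execute_batch("CREATE TABLE IF NOT EXISTS example (id INTEGER, value TEXT);")?;

    // WAL journaling, as described above.
    conn.pragma_update(None, "journal_mode", "WAL")?;

    let batch: Vec<(i64, String)> = vec![(1, "a".into()), (2, "b".into())];

    // One transaction per batch: committing after every insert would be far
    // slower, and committing only once at the very end would lose all progress
    // if the process were killed part way through.
    let tx = conn.transaction()?;
    {
        let mut stmt = tx.prepare("INSERT INTO example (id, value) VALUES (?1, ?2)")?;
        for (id, value) in &batch {
            stmt.execute(params![id, value])?;
        }
    }
    tx.commit()?;

    Ok(())
}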

If you have ideas or want to contribute to make it faster, you are more than welcome to reach out to me!

Another limitation is that Addresspack is only able to process the full version of AddressBase Premium. I haven't looked into what it would mean to load a data update.

Impracticalities

The application_cross_reference table can be impractical to use due to its size:

$ time sqlite3 addresspack.sqlite "select count(*) from application_cross_reference;"
191207473
2.26s user 29.48s system 21% cpu 2:29.76 total

So, if there is a need for a query that requires a full scan, expect it to be slow.

Out of curiosity, this is the same count using xsv after exporting the table to CSV:

$ time xsv count xref.csv
191207473
30.49s user 8.93s system 96% cpu 40.764 total

In any case, for more realistic queries like finding records for a particular xref_key, SQLite is lightning fast whilst other means like xsv or rg are slower.
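
For instance, a lookup by key could be wrapped as in the sketch below, assuming the rusqlite crate and that the column is literally named xref_key as in the prose above (adapt to the actual schema); the key passed in is hypothetical:

use rusqlite::{params, Connection, Result};

// Count the cross references for one xref_key. With an index on xref_key this
// avoids the full table scan shown above.
fn count_for_key(conn: &Connection, key: &str) -> Result<i64> {
    conn.query_row(
        "SELECT count(*) FROM application_cross_reference WHERE xref_key = ?1",
        params![key],
        |row| row.get(0),
    )
}

fn main() -> Result<()> {
    let conn = Connection::open("addresspack.sqlite")?;
    // Hypothetical key, purely for illustration.
    println!("{}", count_for_key(&conn, "some-key")?);
    Ok(())
}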

And of course, SQLite is a single file for the whole database, which was the main driver for this exercise.

Measurements

These are some numbers that might help understand the scale of the task:

file                               size
AB76GB_CSV.zip                     8.13 GB
AB76GB_CSV (unzipped, 352 files)   44 GB
addresspack.sqlite.zip             13.97 GB
addresspack.sqlite                 46.6 GB

id   table                         count
10   header                        352
11   street                        1439574
15   street_descriptor             1555308
21   basic_land_property_unit      39207495
23   application_cross_reference   191207473
24   local_property_identifier     43831026
28   delivery_point_address        29501884
29   metadata                      352
30   successor                     0
31   organisation                  1268612
32   classification                43022571
99   trailer                       352

Licence

The Addresspack codebase is licensed under the MIT licence (see LICENCE) unless explicitly stated at the top of the file.

Files that contain information taken from the AddressBase Premium technical specification have an Ordnance Survey licence and are under © Ordnance Survey Limited 2015.