-
Notifications
You must be signed in to change notification settings - Fork 0
USASpending Database
The USASpending Database is a public database of US Government contracts that is released once a month. The database is published as a PostgreSQL backup and contains dozens of different tables and views. The following steps are required to use the data:
- Download the data (PostgreSQL backup)
- Create a new database in which to load the backup
- Identify items to restore
- Restore the backup
The USASpending database is released once a month. The database is timestamped with the 8th of the month and released a few days later. For example, usaspending-db_20240908.zip
is timestamped to September 8, 2024. The backup is huge and takes up to 8 hours to download. Here are the steps:
mkdir -p /data/downloads/spending/2024-09-08
cd /data/downloads/spending/2024-09-08
wget https://files.usaspending.gov/database_download/usaspending-db_20240908.zip
unzip usaspending-db_20240908.zip
In order to use the data (PostgreSQL backup), you must have a PostgreSQL database of your own. The following instructions assume that you have a PostgreSQL database installed and running locally.
This step is optional but recommended. Ideally, you want to create a folder on a relatively fast drive to store the database.
sudo mkdir -p /data/postgresql/usaspending
sudo chown postgres.postgres /data/postgresql/usaspending
sudo -i -u postgres psql -c "CREATE TABLESPACE usaspending LOCATION '/data/postgresql/usaspending';"
This step is required in order to restore most of the tables and views in the database.
sudo -i -u postgres psql -c "CREATE ROLE api_user WITH CREATEDB CREATEROLE;"
sudo -i -u postgres psql -c "CREATE ROLE root WITH CREATEDB CREATEROLE;"
sudo -i -u postgres psql -c "CREATE ROLE etl_user WITH CREATEDB CREATEROLE;"
There are a number of different ways to create a database in PostgreSQL. This is one example using the command line:
sudo -i -u postgres psql -c "create database usaspending;"
Here's another example using the command line. In this case, we are create a database called usaspending_20240908
using a specific tablespace called usaspending
:
sudo -i -u postgres psql -c "create database usaspending_20240908 tablespace usaspending;"
Once the data is downloaded we need to create a list of database items to restore (tables, views, indexes, etc).
The pg_restore
command is used to list all the items in the backup. Example:
pg_restore --list /data/downloads/spending/2024-09-08
The output can be redirected to a file which will be used to restore the database. Example:
pg_restore --list /data/downloads/spending/2024-09-08 > /data/downloads/spending/restore_2024-09-08.list
Note that you can prune the items you wish to restore. For example, the USASpending database has lots of materialized views that can take some time to restore. You can create a list that excludes these items using the sed
command like this:
pg_restore --list /data/downloads/spending/2024-09-08 | sed '/MATERIALIZED VIEW DATA/d' > /data/downloads/spending/restore_2024-09-08.list
Alternatively, you can restore just a single table like this:
pg_restore --list /data/downloads/spending/2024-09-08 | grep source_procurement_transaction | grep TABLE > /data/downloads/spending/restore_2024-09-08.list
Note when cherry picking tables, you must ensure that you create the corresponding schema or the restore process will fail. Example:
sudo -i -u postgres psql -d usaspending_20240908 -c "create schema raw;"
Once we have identified what we want to restore, we can run the pg_restore
command to restore the backup.
In the following example, we will use the list created above and load the backup into the usaspending_20240908
database:
sudo -i -u postgres pg_restore \
--jobs 16 \
--dbname postgresql://localhost:5432/usaspending_20240908 \
--verbose \
--exit-on-error \
--use-list /data/downloads/spending/restore_2024-09-08.list \
/data/downloads/spending/2024-09-08