
Barman cloud WAL archive performance in Azure


tl;dr

  • WAL upload times from an Azure VM to Azure Blob Storage within the eu-west Azure region are consistent and fast at around 1s to upload a 32MB WAL.
  • WAL upload times between regions become less consistent with geographical distance.
  • Increasing the concurrency and reducing the upload block size improves upload times (although there will still be some outliers).
  • We can and should expose options in Barman Cloud which allow users to configure the concurrency of WAL uploads.

Overview

The following WAL archiving times from Azure AKS pods to Azure Blob Storage have been observed when using barman-cloud-wal-archive:

  • 16MB WAL segment: 2 seconds
  • 32MB WAL segment: 12 seconds
  • 16MB WAL segment: 120 seconds

We need to:

  • Understand where the bottlenecks are.
  • Evaluate options for improving performance.

Initial testing

The simplest possible test is to upload a 16MB WAL segment from a dev laptop to an Azure Blob Storage container. This yields the following overall archive times:

  • 16681ms, 16669ms, 15232ms

This is around 1MB per second, which is not bad for uploads over home broadband. We can use debug logging to get a crude understanding of where Barman is spending its time:

2021-12-06 13:49:57,251 [23232] INFO: A body is sent with the request
2021-12-06 13:50:11,828 [23232] DEBUG: https://barmandev1.blob.core.windows.net:443 "PUT /mike-barman-test/my-backups/test-server/wals/0000000200000000/000000020000000000000074 HTTP/1.1" 201 0

We can use tcpdump to verify that the file is being uploaded during this wait:

13:39:15.352790 IP 192.168.1.128.51379 > 52.239.213.4.https: Flags [.], seq 1846:3286, ack 7597, win 4096, length 1440
...
13:39:29.700507 IP 192.168.1.128.51379 > 52.239.213.4.https: Flags [.], seq 16807126:16808566, ack 7597, win 4096, length 1440
13:39:29.700508 IP 192.168.1.128.51379 > 52.239.213.4.https: Flags [P.], seq 16808566:16808758, ack 7597, win 4096, length 192
13:39:29.838451 IP 192.168.1.128.51379 > 52.239.213.4.https: Flags [.], ack 8086, win 4088, length 0
13:39:29.849756 IP 192.168.1.128.51379 > 52.239.213.4.https: Flags [F.], seq 16808758, ack 8086, win 4096, length 0
13:39:29.868591 IP 192.168.1.128.51379 > 52.239.213.4.https: Flags [.], ack 8087, win 4096, length 0

We can compare the upload time through barman-cloud-wal-archive with the time taken to upload the same WAL via the Azure CLI, which uses the same Python library and unsurprisingly yields a similar time of 15564ms.

At this point it looks a lot like we're simply saturating the available network bandwidth.

More serious testing

The following test environment was created:

  • One Azure VM in region eu-west.
  • One Azure storage account in region eu-west.
  • One Azure storage account in region us-east.
  • One Azure storage account in region us-west.

The following method was used to simulate uploading WAL segments:

  1. Generate a 32MB file of random data from /dev/urandom.
  2. Archive the WAL using barman-cloud-wal-archive.
  3. Record the total time taken by the command.
  4. Delete the WAL in cloud storage.
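A rough sketch of this loop (not the exact script used for these tests), assuming barman-cloud-wal-archive is on the PATH, AZURE_STORAGE_CONNECTION_STRING is set, and using placeholder container and server names:

# Rough timing harness; a sketch, not the exact script used for these tests.
import os
import subprocess
import time

from azure.storage.blob import ContainerClient

CONTAINER_URL = "https://barmandev1.blob.core.windows.net/mike-barman-test"  # placeholder
SERVER_NAME = "test-server"  # placeholder
WAL_NAME = "000000020000000000000074"  # any valid WAL file name will do

# 1. Generate a 32MB file of random data (os.urandom reads from /dev/urandom on Linux).
with open(WAL_NAME, "wb") as wal:
    wal.write(os.urandom(32 * 1024 * 1024))

# 2 & 3. Archive the WAL and record the total time taken by the command.
start = time.monotonic()
subprocess.run(
    ["barman-cloud-wal-archive", "--cloud-provider", "azure-blob-storage",
     CONTAINER_URL, SERVER_NAME, WAL_NAME],
    check=True,
)
print(f"archive took {(time.monotonic() - start) * 1000:.0f}ms")

# 4. Delete the WAL in cloud storage so the next run starts from a clean slate.
container = ContainerClient.from_connection_string(
    os.environ["AZURE_STORAGE_CONNECTION_STRING"], "mike-barman-test"
)
container.delete_blob(f"{SERVER_NAME}/wals/{WAL_NAME[:16]}/{WAL_NAME}")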

Baseline WAL archive times

A sample of twenty 32MB WAL segments was uploaded to each region. The archive times to each region are shown in the following plots:

Baseline WAL archive times

We can make the following observations:

  1. Upload times increase with geographical distance.
  2. Upload times become less consistent with distance.
  3. There appears to be a time-dependent component to increases in upload times between eu-west and us-west.

Optimising WAL archive times

Pass the WAL segment length to the Azure client

The first thing we can try in order to improve upload times is to supply the WAL segment length to the Azure client. According to the Azure documentation this "should be supplied for optimal performance", so we should do it regardless of whether it turns out to help here.
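As a minimal sketch of what supplying the length looks like when calling azure-storage-blob directly (placeholder container, server and WAL names; this is not Barman's own code):

# Supplying the blob length up front; a sketch using azure-storage-blob directly.
import os

from azure.storage.blob import ContainerClient

container = ContainerClient.from_connection_string(
    os.environ["AZURE_STORAGE_CONNECTION_STRING"], "mike-barman-test"  # placeholder
)

wal_path = "000000020000000000000074"
with open(wal_path, "rb") as wal:
    container.upload_blob(
        name=f"test-server/wals/{wal_path[:16]}/{wal_path}",
        data=wal,
        length=os.path.getsize(wal_path),  # "should be supplied for optimal performance"
        overwrite=True,
    )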

The following plots show the archive times to each region with the length of the WAL segment being supplied to the Azure client:

Length-optimized WAL archive times

It's hard to tell through the noise, but there doesn't seem to be much change here. We can also see time-dependent effects on the archive time from eu-west to both us-east and us-west.

Increase concurrency

Although at first glance the Azure client appears to require a minimum blob size of 64MB before it will upload concurrently, it is possible to change this limit when creating a container or blob client. We can therefore play with the concurrency and the number of chunks to be uploaded concurrently, in the hope that this translates into parallelism and therefore better upload speed.

The following parameters are relevant:

  • For the ContainerClient:
    • max_block_size: The maximum chunk size used when uploading a blob in chunks; defaults to 4MB.
    • max_single_put_size: The size threshold above which the client switches to automatic chunking and concurrent upload; defaults to 64MB.
  • For the upload_blob call:
    • max_concurrency: The maximum number of parallel connections to use when uploading a blob in chunks.

We can therefore drop max_single_put_size to anything less than the WAL segment size and then set max_block_size and max_concurrency to improve our chances of uploading blob chunks in parallel.
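A minimal sketch of this configuration using azure-storage-blob directly, assuming a 32MB WAL and placeholder names; the specific values are exactly the knobs being varied in the tests below:

# Forcing chunked, concurrent upload of a 32MB WAL by lowering max_single_put_size.
import os

from azure.storage.blob import ContainerClient

WAL_SIZE = 32 * 1024 * 1024

container = ContainerClient.from_connection_string(
    os.environ["AZURE_STORAGE_CONNECTION_STRING"],
    "mike-barman-test",                   # placeholder container
    max_single_put_size=4 * 1024 * 1024,  # anything below the WAL size forces chunking
    max_block_size=4 * 1024 * 1024,       # 32MB / 4MB = 8 chunks
)

wal_path = "000000020000000000000074"
with open(wal_path, "rb") as wal:
    container.upload_blob(
        name=f"test-server/wals/{wal_path[:16]}/{wal_path}",
        data=wal,
        length=WAL_SIZE,
        max_concurrency=8,  # up to one connection per chunk in this configuration
        overwrite=True,
    )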

Before we do any further tests we need to consider what we already know about network behavior between Azure regions: there is a lot of variation in the baseline upload time, and that variation appears to be time-dependent. We can try to mitigate this in two ways:

  1. Increasing our sample size.
  2. Performing requests for each different configuration in a round-robin manner, so that the chances of one configuration making all its requests at a more favorable time than another configuration are reduced.

We therefore make 100 requests for each concurrency setting and use a script which changes the concurrency after each request.
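A sketch of the round-robin idea, where archive_one_wal is a hypothetical helper wrapping the upload and returning the elapsed time in milliseconds:

# Cycle through the concurrency settings so each configuration's requests are
# spread across the whole test window rather than clustered in time.
from itertools import cycle, islice

CONCURRENCY_VALUES = (1, 2, 4, 8, 16)
RUNS_PER_VALUE = 100

total_runs = RUNS_PER_VALUE * len(CONCURRENCY_VALUES)
for concurrency in islice(cycle(CONCURRENCY_VALUES), total_runs):
    elapsed_ms = archive_one_wal(max_concurrency=concurrency)  # hypothetical helper
    print(f"{concurrency},{elapsed_ms}")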

Histograms for WAL archive times to each region with concurrency values 1, 2, 4, 8 and 16 are shown below. The following should be noted:

  1. The in-region WAL archive times are much faster, so the bucket ranges are much smaller than for the inter-region plots.
  2. There are a small number of outliers (the largest being around 200000ms), but to make interpretation easier anything over 40000ms has been omitted.

eu-west to eu-west

Archive times are already low between an eu-west VM and an eu-west storage container. Increasing the concurrency arguably improves things a little, but the difference is not significant.

eu-west to us-east

Between eu-west and us-east we see that a concurrency of 2 doesn't yield much of an improvement; however, values of 4 and higher result in more WAL archives completing in under 5000ms. We also see the number of requests taking >10000ms decrease as concurrency increases.

eu-west to us-west

Between eu-west and us-west the initial distribution, with concurrency 1, is much flatter, with many values in excess of 10000ms. A concurrency of 2 does not offer much of an improvement, but as concurrency increases further we see more WAL archive operations completing in under 5000ms, with concurrency 16 giving the highest volume of sub-5000ms requests.

Reduce block size / increase chunk count

For the previous tests a max_block_size value which yielded the same number of chunks as max_concurrency was used, meaning each concurrent connection would upload exactly one chunk. This ties the overall upload time to the slowest single request, so it should be possible to improve performance further by reducing max_block_size so that the number of chunks is a multiple of max_concurrency.
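The arithmetic for a 32MB WAL, as a quick sketch (the concurrency value here is just an example):

# block count == concurrency * 2: halve the block size so each connection
# uploads two chunks instead of one.
WAL_SIZE = 32 * 1024 * 1024
max_concurrency = 8          # example value
chunks_per_connection = 2    # block count == concurrency * 2

max_block_size = WAL_SIZE // (max_concurrency * chunks_per_connection)
print(max_block_size // (1024 * 1024))  # 2 (MB) -> 16 chunks shared by 8 connections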

The following histograms show the original plots alongside the same tests repeated with twice as many chunks as max_concurrency.

eu-west to eu-west

Increasing the block count seems to have reduced the slight improvement that higher concurrency values yielded during the first round of tests.

block count == concurrency | block count == concurrency * 2
eu-west to us-east

Between eu-west and us-east the higher block count appears to have yielded a small reduction in archiving times, with the difference being most noticeable at concurrency 16.

block count == concurrency | block count == concurrency * 2
eu-west to us-west

Between eu-west and us-west the difference is also small; there is an increase in the number of sub-5000ms requests at concurrency 16.

block count == concurrency | block count == concurrency * 2

Conclusions

Performance of WAL archiving between Azure regions can be improved by forcing the Azure client to use concurrent upload. Further (marginal) gains can be had by reducing the value of max_block_size.

Further things to test:

  1. Increasing the connection pool size. Tests were performed with the default value of 10, which means the concurrency==16 tests may have under-performed.
  2. Further decreasing max_block_size.

That said, we have learnt enough here that we can justify exposing these parameters to Barman Cloud users such that they can optimize concurrency and chunk size for their specific use cases.