Distributing scans

Robinhood can scan only a part of the namespace when a path is given as an argument to --scan. This feature can be used to parallelize a namespace scan across multiple filesystem clients, which helps when scanning speed is limited by the filesystem call throughput of a single client.

Make your usual robinhood config available to all nodes that will run the scan. In that config, make sure the database host is designated by its hostname (i.e. not "localhost"), since "localhost" would resolve to each scanning node itself.
Make sure the database is accessible from all those nodes. To check this, you can use the rbh-config helper:

rbh-config test_db <db_name> <password>
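To run this check on every scanning node in one pass, a small wrapper along the following lines may help. It is only a sketch: it assumes password-less ssh to each node, that rbh-config test_db exits with a non-zero status when the connection fails (not verified here), and that the database name and password contain no characters special to the remote shell. The function name check_db_access and the RSH variable (used to substitute another remote-shell command) are hypothetical names introduced for illustration.

```shell
#!/bin/sh
# check_db_access DB_NAME PASSWORD HOST...
# Runs "rbh-config test_db" on each host and reports the result.
# ASSUMPTION: rbh-config returns non-zero when the database is unreachable.
check_db_access() {
    db=$1
    pass=$2
    shift 2
    failed=0
    for host in "$@"; do
        # RSH is a hypothetical override; by default plain ssh is used.
        if ${RSH:-ssh} "$host" rbh-config test_db "$db" "$pass" >/dev/null 2>&1; then
            echo "$host: OK"
        else
            echo "$host: FAILED"
            failed=1
        fi
    done
    return "$failed"
}
```

For example: check_db_access <db_name> <password> client1 client2 client3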

Determine a partitioning of the filesystem's top-level directories that is balanced in terms of entry count, splitting a large directory into its subdirectories where needed, and assign each part to a robinhood scanning command. For example:

client1: /fs/dir1 and /fs/dir2
client2: /fs/dir3/A
client3: /fs/dir3/B
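If entry counts are not known in advance, one possible way to estimate them and derive a partitioning is sketched below. This is an illustration, not part of robinhood: count_entries and assign are hypothetical helper names, the counting walks the tree with find (which is itself a traversal, so it is best run on a quiet system or replaced by counts you already have), and the assignment is a simple greedy heuristic that gives each directory, largest first, to the least-loaded client. Directory names containing whitespace or newlines are not handled.

```shell
#!/bin/sh
# count_entries ROOT
# Prints "<entry count> <directory>" for each directory directly under ROOT,
# sorted by decreasing entry count.
count_entries() {
    root=$1
    for dir in "$root"/*/; do
        [ -d "$dir" ] || continue
        # One line per entry found, the directory itself included.
        n=$(find "$dir" -xdev | wc -l | tr -d ' ')
        printf '%d %s\n' "$n" "${dir%/}"
    done | sort -rn
}

# assign N
# Reads "<count> <directory>" lines (largest first) on stdin and assigns
# each directory to whichever of N clients currently has the lowest total.
assign() {
    awk -v n="$1" '
    {
        best = 1
        for (i = 2; i <= n; i++)
            if (load[i] + 0 < load[best] + 0)
                best = i
        load[best] += $1
        print "client" best ": " $2
    }'
}
```

For example, count_entries /fs | assign 3 prints one "clientN: directory" line per top-level directory. If a single directory dominates the counts, run count_entries on that directory to split it further, as done for /fs/dir3 above.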

Run robinhood commands accordingly:

client1: robinhood --scan=/fs/dir1 --no-gc ; robinhood --scan=/fs/dir2 --no-gc
client2: robinhood --scan=/fs/dir3/A --no-gc
client3: robinhood --scan=/fs/dir3/B --no-gc
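The commands above can also be started from a single node. The following is a minimal sketch under several assumptions: password-less ssh to hosts named client1, client2 and client3, robinhood available in the remote PATH, and each node finding its robinhood config without extra options. The function names scan_cmd and launch_scans are hypothetical.

```shell
#!/bin/sh
# scan_cmd PATH...
# Prints the robinhood partial-scan commands for the given paths,
# chained with ';' so they run one after another on the same client.
scan_cmd() {
    cmd=
    for path in "$@"; do
        cmd="${cmd:+$cmd ; }robinhood --scan=$path --no-gc"
    done
    printf '%s\n' "$cmd"
}

# launch_scans
# Starts the scans of the example partitioning in parallel, one ssh
# session per client, then waits for all of them to finish.
# ASSUMPTION: client1..client3 are reachable over password-less ssh.
launch_scans() {
    ssh client1 "$(scan_cmd /fs/dir1 /fs/dir2)" &
    ssh client2 "$(scan_cmd /fs/dir3/A)" &
    ssh client3 "$(scan_cmd /fs/dir3/B)" &
    wait
}
```

Calling launch_scans runs the three partial scans concurrently; paths assigned to the same client are scanned sequentially.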

Note: Specifying --no-gc is very important for performance in this use case. If it is not specified, robinhood tries to clean up entries that were previously located in this part of the namespace and that have not been seen during the scan. This cleanup is VERY expensive for partial scans, as it requires building and matching the path of every entry in the DB. It is much more efficient for whole-filesystem scans, so it is recommended to keep it enabled only for those.
