Distributing scans

Thomas Leibovici edited this page Mar 4, 2015 · 3 revisions

Robinhood can scan only a part of the namespace when a path is given as an argument to --scan. This feature can be used to parallelize the namespace scan across multiple filesystem clients, which is useful when scanning speed is limited by the rate of filesystem calls a single client can issue.

  • Make your usual robinhood config available to all nodes that will run the scan. Make sure the database host is designated by its hostname (i.e. not "localhost").
  • Make sure the database is accessible from all those nodes. To check this, you can use the rbh-config helper:

rbh-config test_db <db_name> <password>

  • Determine a balanced partitioning (in terms of entry count) of filesystem top-level directories to be distributed to the robinhood scanning commands. For example:

client1: /fs/dir1 and /fs/dir2
client2: /fs/dir3/A
client3: /fs/dir3/B
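
One way to arrive at such a partitioning, sketched here in Python, is a greedy split: sort the directories by entry count and always hand the next largest one to the client with the smallest total so far. This is only an illustration and not part of robinhood. The function name `partition_dirs` is hypothetical, and the entry counts are invented for the example. In practice they would come from a previous scan or a filesystem-level estimate.

```python
import heapq


def partition_dirs(entry_counts, n_clients):
    """Greedy split: give the largest remaining directory to the least-loaded client.

    entry_counts maps a directory path to its (estimated) number of entries.
    Returns a list with one list of paths per client.
    """
    # Heap items are (entries assigned so far, client index, assigned paths).
    # The client index is unique, so the path lists are never compared.
    heap = [(0, i, []) for i in range(n_clients)]
    heapq.heapify(heap)
    by_size = sorted(entry_counts.items(), key=lambda kv: kv[1], reverse=True)
    for path, count in by_size:
        load, idx, paths = heapq.heappop(heap)
        paths.append(path)
        heapq.heappush(heap, (load + count, idx, paths))
    # Return the assignments ordered by client index.
    return [paths for _, _, paths in sorted(heap, key=lambda item: item[1])]


# Hypothetical entry counts, for illustration only.
counts = {
    "/fs/dir1": 400_000,
    "/fs/dir2": 600_000,
    "/fs/dir3/A": 1_000_000,
    "/fs/dir3/B": 950_000,
}
for i, paths in enumerate(partition_dirs(counts, 3), start=1):
    print(f"client{i}: {' and '.join(paths)}")
```

With these made-up counts, the two smaller directories end up on the same client while each of the two large ones gets a client of its own. A greedy split is not guaranteed to be optimal, but it is usually close enough for this purpose.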

  • Run robinhood commands accordingly:

client1: robinhood --scan=/fs/dir1 --no-gc ; robinhood --scan=/fs/dir2 --no-gc
client2: robinhood --scan=/fs/dir3/A --no-gc
client3: robinhood --scan=/fs/dir3/B --no-gc
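
If the assignment is kept in a script, the command lines can be generated rather than typed by hand. The following Python sketch shows one way to do that. It only builds and prints the strings, and it uses no robinhood options beyond --scan and --no-gc as shown above. The helper name `scan_commands` and the client names are hypothetical.

```python
import shlex


def scan_commands(assignment):
    """Build one shell command line per client.

    assignment maps a client name to the list of directories it should scan.
    A client with several directories scans them one after the other.
    """
    lines = {}
    for client, dirs in assignment.items():
        # Quote each path so unusual characters survive the shell.
        cmds = [f"robinhood --scan={shlex.quote(d)} --no-gc" for d in dirs]
        lines[client] = " ; ".join(cmds)
    return lines


# Same partitioning as in the example above.
assignment = {
    "client1": ["/fs/dir1", "/fs/dir2"],
    "client2": ["/fs/dir3/A"],
    "client3": ["/fs/dir3/B"],
}
for client, line in scan_commands(assignment).items():
    print(f"{client}: {line}")
```

How the generated lines are then run on each node (a remote shell, a batch scheduler, or by hand) is left open here, since it depends on the site.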

Note: Specifying --no-gc is very important for performance in this use case. Without it, robinhood tries to clean up entries that were previously located in the scanned part of the namespace and that were not seen during the scan. This cleanup is VERY expensive for partial scans, as it requires building and matching the path of every entry in the DB. It is much more efficient for whole-filesystem scans, so it is recommended to keep it enabled only for those.
