Fix nvmf module and clean up some docs #48

Merged
merged 3 commits into from
Oct 17, 2024
1 change: 1 addition & 0 deletions Dockerfile
@@ -16,6 +16,7 @@ COPY meson.* /app/
COPY src /app/src
COPY subprojects /app/subprojects
COPY test /app/test
COPY tools /app/tools

RUN make install-deps \
&& make release
2 changes: 2 additions & 0 deletions Makefile
@@ -33,3 +33,5 @@ install-deps:
# LSVD deps
sudo apt install -y meson mold libfmt-dev librados-dev \
libjemalloc-dev libradospp-dev pkg-config uuid-dev ceph-common
# to make my life a little easier
sudo apt install -y gdb fish
240 changes: 84 additions & 156 deletions README.md
@@ -13,14 +13,6 @@ Note that although individual disk performance is important, the main goal is to
be able to support higher aggregate client IOPS against a given backend OSD
pool.

## what's here

This builds `liblsvd.so`, which provides most of the basic RBD API; you can use
`LD_PRELOAD` to use this in place of RBD with `fio`, KVM/QEMU, and a few other
tools. It also includes some tests and tools described below.

The repository also includes scripts to set up an SPDK NVMe-oF target.

## Stability

This is NOT production-ready code; it still occasionally crashes, and some
@@ -30,66 +22,105 @@ It is able to install and boot Ubuntu 22.04 (see `qemu/`) and is stable under
most of our tests, but there are likely regressions around crash recovery and
other less well-trodden paths.

## Build
## How to run

This project uses `meson` to manage the build system. Run `make setup` to
generate the build files, then run `meson compile` in either `build-rel` or
`build-dbg` to build the release or debug versions of the code.
Note that the examples here assume the fish shell, a local NVMe cache device at
`/dev/nvme0n1`, and ceph config files available in `/etc/ceph`.

A makefile is also offered for convenience; `make` builds the debug version
by default.
```
echo 4096 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
sudo docker run --net host -v /dev/hugepages:/dev/hugepages -v /etc/ceph:/etc/ceph -v /var/tmp:/var/tmp -v /dev/shm:/dev/shm -v /mnt/nvme0:/lsvd -i -t --privileged --entrypoint /usr/bin/fish ghcr.io/cci-moc/lsvd-rbd:main
```

## Configuration
If you run into an error, you might need to rebuild the image:

LSVD is not yet merged into the Ceph configuration framework, and uses its own
system. It reads from a configuration file (`lsvd.conf` or
`/usr/local/etc/lsvd.conf`) or from environment variables of the form
`LSVD_<NAME>`, where NAME is the upper-case version of the config file variable.
Default values can be found in `config.h`

Parameters are:

- `batch_size`, `LSVD_BATCH_SIZE`: size of objects written to the backend, in bytes (K/M recognized as 1024, 1024\*1024). Default: 8MiB
- `wcache_batch`: write cache batching (see below)
- `wcache_chunk`: maximum size of an atomic write, in bytes. Larger writes will be split and may be non-atomic.
- `rcache_dir` - directory used for read cache file and GC temporary files. Note that `imgtool` can format a partition for cache and symlink it into this directory, although the performance improvement seems limited.
- `wcache_dir` - directory used for write cache file
- `xlate_window`: max writes (i.e. objects) in flight to the backend. Note that this value is coupled to the size of the write cache, which must be big enough to hold all outstanding writes in case of a crash.
- `hard_sync` (untested): "flush" forces all batched writes to the backend.
- `backend`: "file" or "rados" (default rados). The "file" backend is for testing only
- `cache_size` (bytes, K/M/G): total size of the cache file. Currently split 1/3 write, 2/3 read. Ignored if the cache file already exists.
- `ckpt_interval` N: limits the number of objects to be examined during crash recovery by flushing metadata every N objects.
- `flush_msec`: timeout for flushing batched writes
- `gc_threshold` (percent): described below

Typically the only parameters that need to be set are `cache_dir` and
`cache_size`. Parameters may be added or removed as we tune things and/or
figure out how to optimize at runtime instead of bothering the user for a value.
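
The naming and size conventions above can be sketched in a few lines of Python. This is only an illustration, not the code in `config.h`: the helper names `parse_size`, `env_name` and `lookup` are made up for the example, and letting the environment variable win over the config file is an assumption that the text above does not confirm.

```python
import os

# Binary multipliers as described above: K = 1024, M = 1024*1024.
# G = 1024**3 is inferred from the "K/M/G" note on cache_size.
_SUFFIXES = {"K": 1024, "M": 1024**2, "G": 1024**3}


def parse_size(text):
    """Parse strings such as '8M', '512K' or '4096' into a byte count."""
    text = text.strip().upper()
    if text and text[-1] in _SUFFIXES:
        return int(text[:-1]) * _SUFFIXES[text[-1]]
    return int(text)


def env_name(option):
    """Map a config-file option name to its LSVD_<NAME> environment variable."""
    return "LSVD_" + option.upper()


def lookup(option, file_values, environ=os.environ):
    """Return the raw string for an option.

    ASSUMPTION: the environment takes priority over the config file.
    """
    return environ.get(env_name(option), file_values.get(option))
```

For example, `parse_size(lookup("batch_size", {"batch_size": "8M"}, {}))` yields the 8 MiB default quoted in the list.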

## Using LSVD with fio and QEMU

First create a volume:
```
build$ sudo imgtool create poolname imgname --size=20g
git clone https://github.com/cci-moc/lsvd-rbd.git
cd lsvd-rbd
docker build -t lsvd-rbd .
sudo docker run --net host -v /dev/hugepages:/dev/hugepages -v /etc/ceph:/etc/ceph -v /var/tmp:/var/tmp -v /dev/shm:/dev/shm -v /mnt/nvme0:/lsvd -i -t --privileged --entrypoint /usr/bin/fish lsvd-rbd
```

Then you can start a SPDK NVMe-oF gateway:
To start the gateway:

```
./qemu/qemu-gateway.sh pool imgname
./build-rel/lsvd_tgt
```

Then connect to the NVMe-oF gateway:
The target will start listening for RPC commands on `/var/tmp/spdk.sock`.

To create an lsvd image on the backend:

```
nvme connect -t tcp -n nqn.2016-06.io.spdk:cnode1 -a
#./imgtool create <pool> <imgname> --size 100g
./imgtool create lsvd-ssd benchtest1 --size 100g
```

You should now have a plain NVMe device, which you can use just like any other
NVMe device.
To configure nvmf:

```
cd subprojects/spdk/scripts
./rpc.py nvmf_create_transport -t TCP -u 16384 -m 8 -c 8192
./rpc.py nvmf_create_subsystem nqn.2016-06.io.spdk:cnode1 -a -s SPDK00000000000001 -d SPDK_Controller1
./rpc.py nvmf_subsystem_add_listener nqn.2016-06.io.spdk:cnode1 -t tcp -a 0.0.0.0 -s 9922
```

To mount images on the gateway:

```
export PYTHONPATH=/app/src/
./rpc.py --plugin rpc_plugin bdev_lsvd_create lsvd-ssd benchtest1 -c '{"rcache_dir":"/lsvd","wlog_dir":"/lsvd"}'
./rpc.py nvmf_subsystem_add_ns nqn.2016-06.io.spdk:cnode1 benchtest1
```

To gracefully shut down the gateway:

```
./rpc.py --plugin rpc_plugin bdev_lsvd_delete benchtest1
./rpc.py spdk_kill_instance SIGTERM
docker kill <container id>
```

## Mount a client

Fill in the appropriate IP address:

```
modprobe nvme-fabrics
nvme disconnect -n nqn.2016-06.io.spdk:cnode1
export gw_ip=${gw_ip:-192.168.52.109}
nvme connect -t tcp --traddr $gw_ip -s 9922 -n nqn.2016-06.io.spdk:cnode1 -o normal
sleep 2
nvme list
# pick the first column (device path) of the line that mentions SPDK
dev_name=$(nvme list | perl -lane 'print $F[0] if /SPDK/')
printf 'Using device %s\n' "$dev_name"
```

## Build

This project uses `meson` to manage the build system. Run `make setup` to
generate the build files, then run `meson compile` in either `build-rel` or
`build-dbg` to build the release or debug versions of the code.

A makefile is also offered for convenience; `make` builds the debug version
by default.

## Configuration

LSVD is configured using a JSON file. When creating an image, we will
try to read the following paths and parse them for configuration options:

Do not use multiple fio jobs on the same image - currently there's no protection
and they'll stomp all over each other. RBD performs horribly in that case, but
AFAIK it doesn't compromise correctness.
- Default built-in configuration
- `/usr/local/etc/lsvd.json`
- `./lsvd.json`
- user supplied path

The file read last has highest priority.

We first try to parse the user-supplied value as a JSON object; if that fails,
we treat it as a path and read the configuration from that file.

An example configuration file is provided in `docs/example_config.json`.
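
The lookup order described above can be sketched as follows. This is a minimal Python illustration, not the project's implementation: the function names and the values in `DEFAULTS` are hypothetical, and only the two search paths and the option names `rcache_dir` and `wlog_dir` come from this README.

```python
import json
from pathlib import Path

# Hypothetical built-in defaults; the real ones live in the LSVD source.
DEFAULTS = {"rcache_dir": "/tmp/lsvd", "wlog_dir": "/tmp/lsvd"}


def read_json_file(path):
    """Return the parsed JSON object from path, or {} if the file is missing."""
    try:
        return json.loads(Path(path).read_text())
    except FileNotFoundError:
        return {}


def parse_user_supplied(arg):
    """Try arg as an inline JSON object first, then fall back to a file path."""
    try:
        parsed = json.loads(arg)
        if isinstance(parsed, dict):
            return parsed
    except json.JSONDecodeError:
        pass
    return read_json_file(arg)


def load_config(user_arg=None,
                search_paths=("/usr/local/etc/lsvd.json", "./lsvd.json")):
    """Merge the sources in order; whatever is read last overrides earlier values."""
    config = dict(DEFAULTS)
    for path in search_paths:
        config.update(read_json_file(path))
    if user_arg:
        config.update(parse_user_supplied(user_arg))
    return config
```

With this sketch, `load_config('{"rcache_dir":"/lsvd"}')` overrides only `rcache_dir` and leaves every other option at whatever the earlier sources provided.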

## Image and object names

@@ -172,106 +203,3 @@ Allowed options:
```

Other tools live in the `tools` subdirectory - see the README there for more details.

## Usage

### Running SPDK target

You might need to enable hugepages:
```
sudo sh -c 'echo 4096 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages'
```

Now we start the target, with or without `LD_PRELOAD`, potentially under the debugger. Run `spdk_tgt --help` for more options. In particular, the RPC socket defaults to `/var/tmp/spdk.sock`, but a different one can be specified, which might allow running multiple instances of SPDK. The `rpc.py` command also has a `--help` option, whose output is about 500 lines long.

```
SPDK=/mnt/nvme/ceph-nvmeof/spdk
sudo LD_PRELOAD=$PWD/liblsvd.so $SPDK/build/bin/spdk_tgt
```

Here's a simple setup - the first two steps are handled in the ceph-nvmeof python code, and it may be worth looking through the code more to see what options they use.

```
sudo $SPDK/scripts/rpc.py nvmf_create_transport -t TCP -u 16384 -m 8 -c 8192
sudo $SPDK/scripts/rpc.py bdev_rbd_register_cluster rbd_cluster
sudo $SPDK/scripts/rpc.py bdev_rbd_create rbd rbd/fio-target 4096 -c rbd_cluster
sudo $SPDK/scripts/rpc.py nvmf_create_subsystem nqn.2016-06.io.spdk:cnode1 -a -s SPDK00000000000001 -d SPDK_Controller1
sudo $SPDK/scripts/rpc.py nvmf_subsystem_add_ns nqn.2016-06.io.spdk:cnode1 Ceph0
sudo $SPDK/scripts/rpc.py nvmf_subsystem_add_listener nqn.2016-06.io.spdk:cnode1 -t tcp -a 10.1.0.8 -s 5001
```

Note also that you can create a ramdisk test, by (1) creating a ramdisk with brd, and (2) creating another bdev / namespace with `bdev_aio_create`. With the version of SPDK I have, it does 4KB random read/write at about 100K IOPS, or at least it did, a month or two ago, on the HP machines.

Finally, I’m not totally convinced that the options I used are the best ones - the -u/-m/-c options for `nvmf_create_transport` were blindly copied from a doc page. I’m a little more convinced that specifying a 4KB block size in `bdev_rbd_create` is a good idea.

## Tests

There are two tests included: `lsvd_rnd_test` and `lsvd_crash_test`.
They do random writes of various sizes, with random data, and each 512-byte sector is "stamped" with its LBA and a sequence number for the write.
CRCs are saved for each sector, and after a bunch of writes we read everything back and verify that the CRCs match.
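
The stamp-and-verify idea can be sketched in Python as below. This is not the test code itself: the 16-byte little-endian stamp layout, the use of `zlib.crc32`, and the dictionary standing in for the volume are all assumptions made for the illustration.

```python
import os
import struct
import zlib

SECTOR = 512  # bytes per sector, as in the description above


def stamped_sector(lba, seq):
    """Random sector whose first 16 bytes hold the LBA and write sequence number."""
    header = struct.pack("<QQ", lba, seq)  # assumed stamp layout
    return header + os.urandom(SECTOR - len(header))


def do_write(image, crcs, lba, nsectors, seq):
    """Write nsectors stamped sectors starting at lba and save each sector's CRC."""
    for i in range(nsectors):
        data = stamped_sector(lba + i, seq)
        image[lba + i] = data
        crcs[lba + i] = zlib.crc32(data)


def verify(image, crcs):
    """Return the LBAs whose current contents do not match the saved CRC."""
    return [lba for lba, crc in crcs.items() if zlib.crc32(image[lba]) != crc]
```

Here `image` and `crcs` are plain dictionaries keyed by LBA. An empty list from `verify` means every sector read back as written.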

### `lsvd_rnd_test`

```
build$ bin/lsvd_rnd_test --help
Usage: lsvd_rnd_test [OPTION...] RUNS

-c, --close close and re-open
-d, --cache-dir=DIR cache directory
-D, --delay add random backend delays
-k, --keep keep data between tests
-l, --len=N run length
-O, --rados use RADOS
-p, --prefix=PREFIX object prefix
-r, --reads=FRAC fraction reads (0.0-1.0)
-R, --reverse reverse NVMe completion order
-s, --seed=S use this seed (one run)
-v, --verbose print LBAs and CRCs
-w, --window=W write window
-x, --existing don't delete existing cache
-z, --size=S volume size (e.g. 1G, 100M)
-Z, --cache-size=N cache size (K/M/G)
-?, --help Give this help list
--usage Give a short usage message
```

Unlike the normal library, it defaults to storing objects on the filesystem; the image name is just the path to the superblock object (the --prefix argument), and other objects live in the same directory.
If you use this, you probably want to use the `--delay` flag, to have object read/write requests subject to random delays.
It creates a volume of --size bytes, does --len random writes of random lengths, and then reads it all back and checks CRCs.
It can do multiple runs; if you don't specify --keep it will delete and recreate the volume between runs.
The --close flag causes it to close and re-open the image between runs; otherwise it stays open.

### `lsvd_crash_test`

This is pretty similar, except that it does the writes in a subprocess which kills itself with `_exit` rather than finishing gracefully, and it has an option to delete the cache before restarting.

This one needs to be run with the file backend, because some of the test options crash the writer, recover the image to read and verify it, then restore it back to its crashed state before starting the writer up again.

It uses the write sequence numbers to figure out which writes made it to disk before the crash: it scans all the sectors to find the highest sequence number stamp, then verifies that the image matches what you would get by applying all writes up to and including that sequence number.
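
A rough Python sketch of that check follows. The stamp layout (LBA and sequence number packed into the first 16 bytes), the `(seq, lba, sectors)` write-log format, and the function names are all hypothetical; the real test is not written this way.

```python
import struct

SECTOR = 512


def stamp(lba, seq):
    """Build a sector carrying only its stamp (assumed 16-byte layout), zero padded."""
    return struct.pack("<QQ", lba, seq).ljust(SECTOR, b"\0")


def highest_seq(image):
    """Scan every sector stamp and return the largest sequence number (0 if empty)."""
    return max((struct.unpack("<QQ", s[:16])[1] for s in image.values()), default=0)


def expected_image(write_log, last_seq):
    """Replay logged writes in sequence order, up to and including last_seq."""
    image = {}
    for seq, lba, sectors in sorted(write_log, key=lambda w: w[0]):
        if seq > last_seq:
            break
        for i, data in enumerate(sectors):
            image[lba + i] = data
    return image


def check_recovered(recovered, write_log):
    """True if the recovered image equals a replay up to its highest stamp."""
    return recovered == expected_image(write_log, highest_seq(recovered))
```

The check fails when a later write survived the crash but an earlier one was lost, since the replay then contains sectors that the recovered image lacks.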

```
build$ bin/lsvd_crash_test --help
Usage: lsvd_crash_test [OPTION...] RUNS

-2, --seed2 seed-generating seed
-d, --cache-dir=DIR cache directory
-D, --delay add random backend delays
-k, --keep keep data between tests
-l, --len=N run length
-L, --lose-writes=N delete some of last N cache writes
-n, --no-wipe don't clear image between runs
-o, --lose-objs=N delete some of last N objects
-p, --prefix=PREFIX object prefix
-r, --reads=FRAC fraction reads (0.0-1.0)
-R, --reverse reverse NVMe completion order
-s, --seed=S use this seed (one run)
-S, --sleep child sleeps for debug attach
-v, --verbose print LBAs and CRCs
-w, --window=W write window
-W, --wipe-cache delete cache on restart
-x, --existing don't delete existing cache
-z, --size=S volume size (e.g. 1G, 100M)
-Z, --cache-size=N cache size (K/M/G)
-?, --help Give this help list
--usage Give a short usage message
```
14 changes: 0 additions & 14 deletions docs/configuration.md

This file was deleted.

11 changes: 0 additions & 11 deletions docs/install.md

This file was deleted.
