DeepNVMe example scripts #914

Merged
merged 15 commits into from
Aug 21, 2024
116 changes: 116 additions & 0 deletions deepnvme/file_access/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,116 @@
# Using DeepNVMe for simple file reads and writes involving CPU/GPU tensors

This folder provides example code that illustrates how to use DeepNVMe for simple file operations that move raw data bytes between persistent storage and CPU/GPU tensors. For each file operation, we provide an implementation using Python I/O functionality, and DeepNVMe implementations using a CPU bounce buffer (aio) and NVIDIA Magnum IO<sup>TM</sup> GPUDirect® Storage (GDS), as appropriate.

The following table is a mapping of file operations to the corresponding Python and DeepNVMe implementations.


File Operation | Python | DeepNVMe (aio) | DeepNVMe (GDS)
|---|---|---|---|
Load CPU tensor from file | py_load_cpu_tensor.py | aio_load_cpu_tensor.py | - |
Load GPU tensor from file | py_load_gpu_tensor.py | aio_load_gpu_tensor.py | gds_load_gpu_tensor.py |
Store CPU tensor to file | py_store_cpu_tensor.py | aio_store_cpu_tensor.py | - |
Store GPU tensor to file | py_store_gpu_tensor.py | aio_store_gpu_tensor.py | gds_store_gpu_tensor.py |

The Python implementations are the scripts with the `py_` prefix, while the DeepNVMe implementations are those with the `aio_` and `gds_` prefixes.

## Requirements
Ensure your environment is properly configured to run these examples. First, install DeepSpeed version >= 0.15.0. Next, ensure that the DeepNVMe operators are available in the DeepSpeed installation. The `async_io` operator is required for any DeepNVMe functionality, while the `gds` operator is required only for GDS functionality. You can confirm the availability of each operator by inspecting the output of `ds_report` and checking that its compatible status is <span style="color:green">[OKAY]</span>. Below is a snippet of `ds_report` output showing availability of both `async_io` and `gds` operators.

<div align="center">
<img src="./media/deepnvme_ops_report.png" style="width:6.5in;height:3.42153in" />
</div>
<div align="center">
ds_report output showing availability of DeepNVMe operators (async_io and gds) in a DeepSpeed installation.
</div>
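
The same check can be approximated from Python. The sketch below is a minimal illustration, not part of the example scripts: it assumes that the op builders expose an `is_compatible()` method (the helper name `deepnvme_op_status` is hypothetical), and it reports `None` whenever DeepSpeed or a builder cannot be found, so treat `ds_report` as the authoritative source.

```python
import importlib


def deepnvme_op_status():
    """Return {'async_io': bool|None, 'gds': bool|None}; None means the check could not run."""
    status = {"async_io": None, "gds": None}
    try:
        op_builder = importlib.import_module("deepspeed.ops.op_builder")
    except Exception:
        return status  # DeepSpeed is not installed or failed to import
    for op_name, builder_name in (("async_io", "AsyncIOBuilder"), ("gds", "GDSBuilder")):
        builder_cls = getattr(op_builder, builder_name, None)
        if builder_cls is None:
            continue  # builder not present in this DeepSpeed version
        try:
            # Assumption: is_compatible() reflects the "compatible" column of ds_report.
            status[op_name] = bool(builder_cls().is_compatible())
        except Exception:
            status[op_name] = None
    return status


if __name__ == "__main__":
    for op, ok in deepnvme_op_status().items():
        print(f"{op}: {'OKAY' if ok else 'unavailable or unknown'}")
```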


If the `async_io` operator is unavailable, you will need to install the appropriate `libaio` library binaries for your Linux flavor. For example, Ubuntu users will need to run `apt install libaio-dev`. In general, you should carefully inspect `ds_report` output for helpful tips such as the following:

```bash
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
```

To enable the `gds` operator, you will need to install NVIDIA GDS by consulting the appropriate guide for [bare-metal systems](https://docs.nvidia.com/gpudirect-storage/troubleshooting-guide/index.html) or Azure VMs (coming soon).

## Tensor Load Examples
The tensor load example scripts share a common command-line interface, which is illustrated below using `py_load_cpu_tensor.py`.
```bash
$ python py_load_cpu_tensor.py --help
usage: py_load_cpu_tensor.py [-h] --input_file INPUT_FILE [--loop LOOP] [--validate]

options:
-h, --help show this help message and exit
--input_file INPUT_FILE
File on NVMe device that will read as input.
--loop LOOP The number of times to repeat the operation (default 3).
--validate Run validation step that compares tensor value against Python file read
```
Before running these example scripts, ensure that the input file exists on an NVMe device. The `--validate` option is relevant only to the DeepNVMe implementations. This option provides minimal correctness checking by comparing against a tensor loaded using Python. We also provide a bash script `run_load_tensor.sh`, which runs all the example tensor load scripts.
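
If you do not already have a suitable input file, one way to create it is sketched below. This is an illustrative helper, not part of the example scripts: the name `make_input_file`, the chunk size, and the demo path are all assumptions, and you should point the path at a folder on your NVMe device.

```python
import os
import tempfile


def make_input_file(path, size_bytes, chunk_bytes=16 * 1024 * 1024):
    """Write size_bytes of random data to path in chunks and return the path."""
    remaining = size_bytes
    with open(path, "wb") as f:
        while remaining > 0:
            n = min(chunk_bytes, remaining)
            f.write(os.urandom(n))  # random bytes, so validation compares real content
            remaining -= n
    return path


if __name__ == "__main__":
    # Hypothetical location; replace with a folder on your NVMe device.
    demo_path = os.path.join(tempfile.gettempdir(), "deepnvme_input_1MB.bin")
    make_input_file(demo_path, 1024 * 1024)
    print(demo_path, os.path.getsize(demo_path))
```

The resulting file can then be passed to any of the load scripts through `--input_file`.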


## Tensor Store Examples
The tensor store examples share a common command-line interface, which is illustrated below using `py_store_cpu_tensor.py`.
```bash
$ python py_store_cpu_tensor.py --help
usage: py_store_cpu_tensor.py [-h] --nvme_folder NVME_FOLDER [--mb_size MB_SIZE] [--loop LOOP] [--validate]

options:
-h, --help show this help message and exit
--nvme_folder NVME_FOLDER
NVMe folder for file write.
--mb_size MB_SIZE Size of tensor to save in MB (default 1024).
--loop LOOP The number of times to repeat the operation (default 3).
--validate Run validation step that compares tensor value against Python file read

```
Before running these examples ensure that the output folder exists on an NVMe device and that you have write permission. The `--validate` option is relevant only to the DeepNVMe implementations. This option provides minimal correctness checking by comparing the output file against that created using Python. We also provide a bash script `run_store_tensor.sh`, which runs all the example tensor store scripts.
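
The validation step amounts to a byte-for-byte comparison of two files. A minimal sketch of that comparison using the standard `filecmp` module is shown below; the helper name `files_match` and the temporary files are illustrative assumptions.

```python
import filecmp
import os
import tempfile


def files_match(path_a, path_b):
    """Return True only if both files have identical contents."""
    filecmp.clear_cache()  # avoid reusing a stale cached comparison
    # shallow=False forces a content comparison instead of trusting os.stat() data.
    return filecmp.cmp(path_a, path_b, shallow=False)


if __name__ == "__main__":
    tmp = tempfile.mkdtemp()
    ref = os.path.join(tmp, "ref.bin")
    out = os.path.join(tmp, "out.bin")
    payload = os.urandom(4096)
    for p in (ref, out):
        with open(p, "wb") as f:
            f.write(payload)
    print(f"Validation success = {files_match(ref, out)}")
```

Passing `shallow=False` matters here because two files of the same size and modification time could otherwise be reported as equal without their contents being read.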


## Performance Advisory
Although this folder is primarily meant to help with integrating DeepNVMe into your Deep Learning applications, the example scripts also print the read and write throughput they achieve. We expect you will observe some performance advantage of DeepNVMe over Python. Note, however, that better performance can likely be realized by tuning DeepNVMe for your environment. Such tuning should yield configuration values better suited to your system.

For reference, DeepNVMe configuration using hard-coded constants for `aio_` implementations is as follows:

```python
aio_handle = AsyncIOBuilder().load().aio_handle(1024**2, 128, True, True, 1)
```

The corresponding DeepNVMe configuration for `gds_` implementations is as follows:

```python
gds_handle = GDSBuilder().load().gds_handle(1024**2, 128, True, True, 1)
```
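
When tuning, it can help to give these five positional values names. The sketch below is only an illustration: the field names are assumed labels (block size, queue depth, single-submit, overlap-events, and a parallelism count), so verify the order and meaning against the handle constructor in your installed DeepSpeed version before relying on them.

```python
from dataclasses import astuple, dataclass


@dataclass(frozen=True)
class DeepNVMeHandleConfig:
    # Field names are assumed labels for the positional arguments, not confirmed API names.
    block_size: int = 1024**2
    queue_depth: int = 128
    single_submit: bool = True
    overlap_events: bool = True
    parallelism: int = 1

    def as_args(self):
        """Return the values in the positional order used by the snippets above."""
        return astuple(self)


if __name__ == "__main__":
    cfg = DeepNVMeHandleConfig()
    print(cfg.as_args())
    # Hypothetical use, assuming DeepSpeed is installed:
    #   aio_handle = AsyncIOBuilder().load().aio_handle(*cfg.as_args())
```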

Despite the above caveat, we provide some performance numbers to help set the right expectations. The experiments were conducted on an Azure [NC80adis_H100_v5](https://learn.microsoft.com/en-us/azure/virtual-machines/ncads-h100-v5) series virtual machine (VM). This VM includes two 3.5TB local NVMe devices (labelled Microsoft NVMe Direct Disk v2) that we combined into a single RAID-0 volume. The software environment included Ubuntu 22.04.4 LTS, Linux kernel 6.5.0-26-generic, PyTorch 2.4, and CUDA 12.4. We ran experiments of 1GB data transfers using the unmodified scripts, i.e., without DeepNVMe tuning, and present the throughput results in the tables below. In summary, we observed that DeepNVMe significantly accelerates I/O operations compared to Python. DeepNVMe is 8-16X faster for loading tensor data, and 11-19X faster for writing tensor data.

Load 1GB CPU tensor (1GB file read) | GB/sec | Speedup over Python |
|---|---|---|
py_load_cpu_tensor.py | 1.5 | - |
aio_load_cpu_tensor.py | 12.3 | 8X |

Load 1GB GPU tensor (1GB file read) | GB/sec | Speedup over Python |
|---|---|---|
py_load_gpu_tensor.py | 0.7| - |
aio_load_gpu_tensor.py | 9.9 | 14X |
gds_load_gpu_tensor.py | 11.1 | 16X |


Store 1GB CPU tensor (1GB file write) | GB/sec | Speedup over Python |
|---|---|---|
py_store_cpu_tensor.py | 0.7 | - |
aio_store_cpu_tensor.py | 8.1 | 11X |


Store 1GB GPU tensor (1GB file write) | GB/sec | Speedup over Python |
|---|---|---|
py_store_gpu_tensor.py | 0.5 | - |
aio_store_gpu_tensor.py | 8.3 | 18X |
gds_store_gpu_tensor.py | 8.6 | 19X |
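
The GB/sec figures are derived by the scripts as total bytes moved divided by total elapsed time. The sketch below reproduces that calculation in isolation. It assumes `GIGA_UNIT` is `1024**3` (the scripts import it from `utils.py`, which is not shown here), and it uses an in-memory copy as a stand-in for a real file operation; `measure_gbs` is a hypothetical helper name.

```python
import functools
import timeit

GIGA_UNIT = 1024**3  # assumed value; the scripts import GIGA_UNIT from utils.py


def measure_gbs(op, nbytes, loops=3):
    """Run op() `loops` times and return (seconds per run, GB/sec)."""
    total_secs = timeit.Timer(op).timeit(loops)
    gbs = (loops * nbytes) / GIGA_UNIT / total_secs
    return total_secs / loops, gbs


if __name__ == "__main__":
    payload = bytes(64 * 1024 * 1024)
    # Stand-in operation: an in-memory copy instead of a real file read or write.
    secs, gbs = measure_gbs(functools.partial(bytearray, payload), len(payload))
    print(f"{len(payload) / GIGA_UNIT} GB, {secs} secs, {gbs:5.2f} GB/sec")
```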



# Conclusion
We hope you find this document and example scripts useful for integrating DeepNVMe into your applications.
31 changes: 31 additions & 0 deletions deepnvme/file_access/aio_load_cpu_tensor.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,31 @@
import torch
import os, timeit, functools
from deepspeed.ops.op_builder import AsyncIOBuilder
from utils import parse_read_arguments, GIGA_UNIT

def file_read(inp_f, handle, bounce_buffer):
    handle.sync_pread(bounce_buffer, inp_f)
    return bounce_buffer.cpu()

def main():
    args = parse_read_arguments()
    input_file = args.input_file
    file_sz = os.path.getsize(input_file)
    cnt = args.loop

    aio_handle = AsyncIOBuilder().load().aio_handle()
    bounce_buffer = torch.empty(os.path.getsize(input_file), dtype=torch.uint8).pin_memory()

    t = timeit.Timer(functools.partial(file_read, input_file, aio_handle, bounce_buffer))
    aio_t = t.timeit(cnt)
    aio_gbs = (cnt*file_sz)/GIGA_UNIT/aio_t
    print(f'aio load_cpu: {file_sz/GIGA_UNIT} GB, {aio_t/cnt} secs, {aio_gbs:5.2f} GB/sec')

    if args.validate:
        from py_load_cpu_tensor import file_read as py_file_read
        aio_tensor = file_read(input_file, aio_handle, bounce_buffer)
        py_tensor = py_file_read(input_file)
        print(f'Validation success = {aio_tensor.equal(py_tensor)}')

if __name__ == "__main__":
    main()
32 changes: 32 additions & 0 deletions deepnvme/file_access/aio_load_gpu_tensor.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,32 @@
import torch
import os, timeit, functools
from deepspeed.ops.op_builder import AsyncIOBuilder
from utils import parse_read_arguments, GIGA_UNIT

def file_read(inp_f, handle, bounce_buffer):
    handle.sync_pread(bounce_buffer, inp_f)
    return bounce_buffer.cuda()


def main():
    args = parse_read_arguments()
    input_file = args.input_file
    file_sz = os.path.getsize(input_file)
    cnt = args.loop

    aio_handle = AsyncIOBuilder().load().aio_handle()
    bounce_buffer = torch.empty(os.path.getsize(input_file), dtype=torch.uint8).pin_memory()

    t = timeit.Timer(functools.partial(file_read, input_file, aio_handle, bounce_buffer))
    aio_t = t.timeit(cnt)
    aio_gbs = (cnt*file_sz)/GIGA_UNIT/aio_t
    print(f'aio load_gpu: {file_sz/GIGA_UNIT} GB, {aio_t/cnt} secs, {aio_gbs:5.2f} GB/sec')

    if args.validate:
        from py_load_cpu_tensor import file_read as py_file_read
        aio_tensor = file_read(input_file, aio_handle, bounce_buffer).cpu()
        py_tensor = py_file_read(input_file)
        print(f'Validation success = {aio_tensor.equal(py_tensor)}')

if __name__ == "__main__":
    main()
40 changes: 40 additions & 0 deletions deepnvme/file_access/aio_store_cpu_tensor.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,40 @@
import torch
import os, timeit, functools, pathlib
from deepspeed.ops.op_builder import AsyncIOBuilder
from utils import parse_write_arguments, GIGA_UNIT

def file_write(out_f, tensor, handle, bounce_buffer):
    bounce_buffer.copy_(tensor)
    handle.sync_pwrite(bounce_buffer, out_f)

def main():
    args = parse_write_arguments()
    cnt = args.loop
    output_file = os.path.join(args.nvme_folder, f'test_ouput_{args.mb_size}MB.pt')
    pathlib.Path(output_file).unlink(missing_ok=True)
    file_sz = args.mb_size*(1024**2)
    app_tensor = torch.empty(file_sz, dtype=torch.uint8, device='cpu', requires_grad=False)

    aio_handle = AsyncIOBuilder().load().aio_handle()
    bounce_buffer = torch.empty(file_sz, dtype=torch.uint8, requires_grad=False).pin_memory()


    t = timeit.Timer(functools.partial(file_write, output_file, app_tensor, aio_handle, bounce_buffer))

    aio_t = t.timeit(cnt)
    aio_gbs = (cnt*file_sz)/GIGA_UNIT/aio_t
    print(f'aio store_cpu: {file_sz/GIGA_UNIT} GB, {aio_t/cnt} secs, {aio_gbs:5.2f} GB/sec')

    if args.validate:
        import tempfile, filecmp
        from py_store_cpu_tensor import file_write as py_file_write
        py_ref_file = os.path.join(tempfile.gettempdir(), os.path.basename(output_file))
        py_file_write(py_ref_file, app_tensor)
        filecmp.clear_cache()
        print(f'Validation success = {filecmp.cmp(py_ref_file, output_file, shallow=False) }')

    pathlib.Path(output_file).unlink(missing_ok=True)


if __name__ == "__main__":
    main()
40 changes: 40 additions & 0 deletions deepnvme/file_access/aio_store_gpu_tensor.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,40 @@
import torch
import os, timeit, functools, pathlib
from deepspeed.ops.op_builder import AsyncIOBuilder
from utils import parse_write_arguments, GIGA_UNIT

def file_write(out_f, tensor, handle, bounce_buffer):
    bounce_buffer.copy_(tensor)
    handle.sync_pwrite(bounce_buffer, out_f)

def main():
    args = parse_write_arguments()
    cnt = args.loop
    output_file = os.path.join(args.nvme_folder, f'test_ouput_{args.mb_size}MB.pt')
    pathlib.Path(output_file).unlink(missing_ok=True)
    file_sz = args.mb_size*(1024**2)
    app_tensor = torch.empty(file_sz, dtype=torch.uint8, device='cuda', requires_grad=False)

    aio_handle = AsyncIOBuilder().load().aio_handle()
    bounce_buffer = torch.empty(file_sz, dtype=torch.uint8, requires_grad=False).pin_memory()


    t = timeit.Timer(functools.partial(file_write, output_file, app_tensor, aio_handle, bounce_buffer))

    aio_t = t.timeit(cnt)
    aio_gbs = (cnt*file_sz)/GIGA_UNIT/aio_t
    print(f'aio store_gpu: {file_sz/GIGA_UNIT} GB, {aio_t/cnt} secs, {aio_gbs:5.2f} GB/sec')

    if args.validate:
        import tempfile, filecmp
        from py_store_cpu_tensor import file_write as py_file_write
        py_ref_file = os.path.join(tempfile.gettempdir(), os.path.basename(output_file))
        py_file_write(py_ref_file, app_tensor)
        filecmp.clear_cache()
        print(f'Validation success = {filecmp.cmp(py_ref_file, output_file, shallow=False) }')

    pathlib.Path(output_file).unlink(missing_ok=True)


if __name__ == "__main__":
    main()
33 changes: 33 additions & 0 deletions deepnvme/file_access/gds_load_gpu_tensor.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,33 @@
import torch
import os, timeit, functools
from utils import parse_read_arguments, GIGA_UNIT
from deepspeed.ops.op_builder import GDSBuilder

def file_read(inp_f, handle, gpu_buffer):
    handle.sync_pread(gpu_buffer, inp_f)
    return gpu_buffer.cuda()

def main():
    args = parse_read_arguments()
    input_file = args.input_file
    file_sz = os.path.getsize(input_file)
    cnt = args.loop

    gds_handle = GDSBuilder().load().gds_handle()
    gds_buffer = gds_handle.new_pinned_device_tensor(file_sz, torch.empty(0, dtype=torch.uint8, device='cuda', requires_grad=False))

    t = timeit.Timer(functools.partial(file_read, input_file, gds_handle, gds_buffer))
    gds_t = t.timeit(cnt)
    gds_gbs = (cnt*file_sz)/GIGA_UNIT/gds_t
    print(f'gds load_gpu: {file_sz/GIGA_UNIT} GB, {gds_t/cnt} secs, {gds_gbs:5.2f} GB/sec')

    if args.validate:
        from py_load_cpu_tensor import file_read as py_file_read
        aio_tensor = file_read(input_file, gds_handle, gds_buffer).cpu()
        py_tensor = py_file_read(input_file)
        print(f'Validation success = {aio_tensor.equal(py_tensor)}')

    gds_handle.free_pinned_device_tensor(gds_buffer)

if __name__ == "__main__":
    main()
39 changes: 39 additions & 0 deletions deepnvme/file_access/gds_store_gpu_tensor.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,39 @@
import torch
import os, timeit, functools, pathlib
from deepspeed.ops.op_builder import GDSBuilder
from utils import parse_write_arguments, GIGA_UNIT

def file_write(out_f, tensor, handle, gpu_buffer):
    gpu_buffer.copy_(tensor)
    handle.sync_pwrite(gpu_buffer, out_f)

def main():
    args = parse_write_arguments()
    cnt = args.loop
    output_file = os.path.join(args.nvme_folder, f'test_ouput_{args.mb_size}MB.pt')
    pathlib.Path(output_file).unlink(missing_ok=True)
    file_sz = args.mb_size*(1024**2)
    app_tensor = torch.empty(file_sz, dtype=torch.uint8, device='cuda', requires_grad=False)

    gds_handle = GDSBuilder().load().gds_handle()
    gds_buffer = gds_handle.new_pinned_device_tensor(file_sz, torch.empty(0, dtype=torch.uint8, device='cuda', requires_grad=False))

    t = timeit.Timer(functools.partial(file_write, output_file, app_tensor, gds_handle, gds_buffer))

    gds_t = t.timeit(cnt)
    gds_gbs = (cnt*file_sz)/GIGA_UNIT/gds_t
    print(f'gds store_gpu: {file_sz/GIGA_UNIT} GB, {gds_t/cnt} secs, {gds_gbs:5.2f} GB/sec')

    if args.validate:
        import tempfile, filecmp
        from py_store_cpu_tensor import file_write as py_file_write
        py_ref_file = os.path.join(tempfile.gettempdir(), os.path.basename(output_file))
        py_file_write(py_ref_file, app_tensor)
        filecmp.clear_cache()
        print(f'Validation success = {filecmp.cmp(py_ref_file, output_file, shallow=False) }')

    gds_handle.free_pinned_device_tensor(gds_buffer)
    pathlib.Path(output_file).unlink(missing_ok=True)

if __name__ == "__main__":
    main()
22 changes: 22 additions & 0 deletions deepnvme/file_access/py_load_cpu_tensor.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,22 @@
import torch
import os, timeit, functools
from utils import parse_read_arguments, GIGA_UNIT

def file_read(inp_f):
    with open(inp_f, 'rb') as f:
        tensor = torch.frombuffer(f.read(), dtype=torch.uint8)
    return tensor

def main():
    args = parse_read_arguments()
    input_file = args.input_file
    file_sz = os.path.getsize(input_file)
    cnt = args.loop

    t = timeit.Timer(functools.partial(file_read, input_file))
    py_t = t.timeit(cnt)
    py_gbs = (cnt*file_sz)/GIGA_UNIT/py_t
    print(f'py load_cpu: {file_sz/GIGA_UNIT} GB, {py_t/cnt} secs, {py_gbs:5.2f} GB/sec')

if __name__ == "__main__":
    main()
22 changes: 22 additions & 0 deletions deepnvme/file_access/py_load_gpu_tensor.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,22 @@
import torch
import os, timeit, functools
from utils import parse_read_arguments, GIGA_UNIT

def file_read(inp_f):
    with open(inp_f, 'rb') as f:
        tensor = torch.frombuffer(f.read(), dtype=torch.uint8)
    return tensor.cuda()

def main():
    args = parse_read_arguments()
    input_file = args.input_file
    file_sz = os.path.getsize(input_file)
    cnt = args.loop

    t = timeit.Timer(functools.partial(file_read, input_file))
    py_t = t.timeit(cnt)
    py_gbs = (cnt*file_sz)/GIGA_UNIT/py_t
    print(f'py load_gpu: {file_sz/GIGA_UNIT} GB, {py_t/cnt} secs, {py_gbs:5.2f} GB/sec')

if __name__ == "__main__":
    main()