DeepNVMe example scripts #914 — Merged (15 commits, Aug 21, 2024)
108 changes: 108 additions & 0 deletions deepnvme/file_access/README.md
# Using DeepNVMe for simple file reads and writes involving CPU/GPU tensors

The purpose of this folder is to provide example code illustrating how to use DeepNVMe for simple file operations that move raw data bytes between persistent storage and CPU/GPU tensors. For each file operation, we provide an implementation using Python I/O functionality, and a DeepNVMe implementation using a CPU bounce buffer (aio) and NVIDIA GPUDirect Storage (GDS) as appropriate.

The following table is a mapping of file operations to the corresponding Python and DeepNVMe implementations.


File Operation | Python | DeepNVMe (aio) | DeepNVMe (GDS)
|---|---|---|---|
Load CPU tensor from file | py_load_cpu_tensor.py | aio_load_cpu_tensor.py | - |
Load GPU tensor from file | py_load_gpu_tensor.py | aio_load_gpu_tensor.py | gds_load_gpu_tensor.py |
Store CPU tensor to file | py_store_cpu_tensor.py | aio_store_cpu_tensor.py | - |
Store GPU tensor to file | py_store_gpu_tensor.py | aio_store_gpu_tensor.py | gds_store_gpu_tensor.py |

The Python implementations are the scripts with the `py_` prefix, while the DeepNVMe implementations are those with the `aio_` and `gds_` prefixes.

## Requirements
Ensure your environment is properly configured to run these examples. First, install DeepSpeed version >= 0.15.0. Next, ensure that the DeepNVMe operators are available in the DeepSpeed installation. The `async_io` operator is required for any DeepNVMe functionality, while the `gds` operator is required only for GDS functionality. You can confirm the availability of each operator by inspecting the output of `ds_report` and checking that its compatibility status is <span style="color:green">[OKAY]</span>. Below is a snippet of `ds_report` output showing the availability of both the `async_io` and `gds` operators.

<div align="center">
<img src="./media/deepnvme_ops_report.png" style="width:6.5in;height:3.42153in" />
</div>

<div align="center">
ds_report output showing availability of DeepNVMe operators (async_io and gds) in a DeepSpeed installation.
</div>
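Alternatively, operator availability can be checked programmatically rather than by parsing `ds_report` output. The following is a hedged sketch that assumes the DeepSpeed op builders expose `is_compatible()`; it degrades gracefully when DeepSpeed itself is not installed:

```python
# Check DeepNVMe operator availability without parsing ds_report output.
# Assumes DeepSpeed >= 0.15.0 and that the op builders expose is_compatible().
try:
    from deepspeed.ops.op_builder import AsyncIOBuilder, GDSBuilder
    status = {
        'async_io': AsyncIOBuilder().is_compatible(),
        'gds': GDSBuilder().is_compatible(),
    }
except ImportError:
    status = None  # DeepSpeed not installed: pip install deepspeed, then re-check

print(status)
```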



## Tensor Load Examples
The tensor load example scripts share a common command-line interface, which is illustrated below using `py_load_cpu_tensor.py`.
```bash
$ python py_load_cpu_tensor.py --help
usage: py_load_cpu_tensor.py [-h] --input_file INPUT_FILE [--loop LOOP] [--validate]

options:
-h, --help show this help message and exit
--input_file INPUT_FILE
File on NVMe device that will be read as input.
--loop LOOP The number of times to repeat the operation (default 3).
--validate Run validation step that compares tensor value against Python file read
```
Before running these example scripts, ensure that the input file exists on an NVMe device. The `--validate` option is relevant only to the DeepNVMe implementations; it provides minimal correctness checking by comparing against a tensor loaded using Python. We also provide a bash script, `run_load_tensor.sh`, which runs all the example tensor load scripts.
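All the scripts share the same timing methodology: a `timeit.Timer` over a `functools.partial` of the read function. A stdlib-only sketch of that pattern (no torch or DeepSpeed; the scratch file below is a stand-in for a real NVMe-resident input):

```python
import functools, os, tempfile, timeit

GIGA_UNIT = 1024**3  # bytes per GB, assumed to match the scripts' GIGA_UNIT constant

def file_read(inp_f):
    # Plain Python read, analogous to py_load_cpu_tensor.py's file_read.
    with open(inp_f, 'rb') as f:
        return f.read()

# Scratch input file (4 MB of zeros) standing in for the --input_file argument.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b'\0' * (4 * 1024**2))
    input_file = tmp.name

file_sz = os.path.getsize(input_file)
cnt = 3  # mirrors the --loop default
t = timeit.Timer(functools.partial(file_read, input_file))
total_secs = t.timeit(cnt)
gbs = (cnt * file_sz) / GIGA_UNIT / total_secs
print(f'py load: {file_sz/GIGA_UNIT} GB, {total_secs/cnt} secs, {gbs:5.2f} GB/sec')
os.unlink(input_file)
```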


## Tensor Store Examples
The tensor store examples share a command-line interface, which is illustrated below using `py_store_cpu_tensor.py`.
```bash
$ python py_store_cpu_tensor.py --help
usage: py_store_cpu_tensor.py [-h] --nvme_folder NVME_FOLDER [--mb_size MB_SIZE] [--loop LOOP] [--validate]

options:
-h, --help show this help message and exit
--nvme_folder NVME_FOLDER
NVMe folder for file write.
--mb_size MB_SIZE Size of tensor to save in MB (default 1024).
--loop LOOP The number of times to repeat the operation (default 3).
--validate Run validation step that compares tensor value against Python file read

```
Before running these examples, ensure that the output folder exists on an NVMe device and that you have write permission. The `--validate` option is relevant only to the DeepNVMe implementations; it provides minimal correctness checking by comparing the output file against one created using Python. We also provide a bash script, `run_store_tensor.sh`, which runs all the example tensor store scripts.
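For the store scripts, the `--validate` path reduces to writing a reference file with plain Python and byte-comparing the two files with `filecmp`. A stdlib-only sketch (the file names are hypothetical stand-ins):

```python
import filecmp, os, shutil, tempfile

def file_write(out_f, data):
    # Plain Python write, analogous to py_store_cpu_tensor.py's file_write.
    with open(out_f, 'wb') as f:
        f.write(data)

payload = b'x' * (1024**2)  # 1 MB of stand-in tensor bytes
tmpdir = tempfile.mkdtemp()
deepnvme_file = os.path.join(tmpdir, 'test_output_1MB.pt')  # stand-in for the DeepNVMe output
py_ref_file = os.path.join(tmpdir, 'py_ref_1MB.pt')         # stand-in for the Python reference

file_write(deepnvme_file, payload)
file_write(py_ref_file, payload)
filecmp.clear_cache()
ok = filecmp.cmp(py_ref_file, deepnvme_file, shallow=False)
print(f'Validation success = {ok}')
shutil.rmtree(tmpdir)
```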


## Performance Advisory
Although this folder is primarily meant to help with integrating DeepNVMe into your Deep Learning applications, the example scripts also report read and write throughput, so you can observe the performance advantage of DeepNVMe over Python. Note, however, that better performance can likely be realized by tuning DeepNVMe for your environment; such tuning will ideally yield configuration values better suited to your hardware.

For reference, the `aio_` implementations use the following hard-coded DeepNVMe configuration:

```python
# Positional arguments (in order): block size (1 MB), queue depth (128),
# single_submit, overlap_events, and degree of I/O parallelism.
aio_handle = AsyncIOBuilder().load().aio_handle(1024**2, 128, True, True, 1)
```

The corresponding DeepNVMe configuration for `gds_` implementations is as follows:

```python
# Same positional arguments as the aio handle above.
gds_handle = GDSBuilder().load().gds_handle(1024**2, 128, True, True, 1)
```

Despite the above caveat, some reference performance numbers are useful for setting expectations. The experiments were conducted on an Azure [NC80adis_H100_v5](https://learn.microsoft.com/en-us/azure/virtual-machines/ncads-h100-v5) series virtual machine (VM). This VM includes two 3.5TB local NVMe devices (labelled Microsoft NVMe Direct Disk v2) that we combined into a single RAID-0 volume. The software environment included Ubuntu 22.04.4 LTS, Linux kernel 6.5.0-26-generic, PyTorch 2.4, and CUDA 12.4. We ran experiments of 1GB data transfers using the unmodified scripts, i.e., without DeepNVMe tuning, and present the throughput results in the tables below. In summary, we observed that DeepNVMe significantly accelerates I/O operations compared to Python: 8X-16X faster for loading tensor data, and 11X-19X faster for storing tensor data.

Load 1GB CPU tensor (1GB file read) | GB/sec | Speedup over Python |
|---|---|---|
py_load_cpu_tensor.py | 1.5 | - |
aio_load_cpu_tensor.py | 12.3 | 8X |

Load 1GB GPU tensor (1GB file read) | GB/sec | Speedup over Python |
|---|---|---|
py_load_gpu_tensor.py | 0.7 | - |
aio_load_gpu_tensor.py | 9.9 | 14X |
gds_load_gpu_tensor.py | 11.1 | 16X |


Store 1GB CPU tensor (1GB file write) | GB/sec | Speedup over Python |
|---|---|---|
py_store_cpu_tensor.py | 0.7 | - |
aio_store_cpu_tensor.py | 8.1 | 11X |


Store 1GB GPU tensor (1GB file write) | GB/sec | Speedup over Python |
|---|---|---|
py_store_gpu_tensor.py | 0.5 | - |
aio_store_gpu_tensor.py | 8.3 | 18X |
gds_store_gpu_tensor.py | 8.6 | 19X |



# Conclusion
We hope you find this document and example scripts useful for integrating DeepNVMe into your applications.
31 changes: 31 additions & 0 deletions deepnvme/file_access/aio_load_cpu_tensor.py
import torch
import os, timeit, functools
from deepspeed.ops.op_builder import AsyncIOBuilder
from utils import parse_read_arguments, GIGA_UNIT

def file_read(inp_f, h, bounce_buffer):
h.sync_pread(bounce_buffer, inp_f)
return bounce_buffer.cpu()

def main():
args = parse_read_arguments()
input_file = args.input_file
file_sz = os.path.getsize(input_file)
cnt = args.loop

aio_handle = AsyncIOBuilder().load().aio_handle(1024**2, 128, True, True, 1)
bounce_buffer = torch.empty(os.path.getsize(input_file), dtype=torch.uint8).pin_memory()

t = timeit.Timer(functools.partial(file_read, input_file, aio_handle, bounce_buffer))
aio_t = t.timeit(cnt)
aio_gbs = (cnt*file_sz)/GIGA_UNIT/aio_t
print(f'aio load_cpu: {file_sz/GIGA_UNIT} GB, {aio_t/cnt} secs, {aio_gbs:5.2f} GB/sec')

if args.validate:
from py_load_cpu_tensor import file_read as py_file_read
aio_tensor = file_read(input_file, aio_handle, bounce_buffer)
py_tensor = py_file_read(input_file)
print(f'Validation success = {aio_tensor.equal(py_tensor)}')

if __name__ == "__main__":
main()
32 changes: 32 additions & 0 deletions deepnvme/file_access/aio_load_gpu_tensor.py
import torch
import os, timeit, functools
from deepspeed.ops.op_builder import AsyncIOBuilder
from utils import parse_read_arguments, GIGA_UNIT

def file_read(inp_f, h, bounce_buffer):
h.sync_pread(bounce_buffer, inp_f)
return bounce_buffer.cuda()


def main():
args = parse_read_arguments()
input_file = args.input_file
file_sz = os.path.getsize(input_file)
cnt = args.loop

aio_handle = AsyncIOBuilder().load().aio_handle(1024**2, 128, True, True, 1)
bounce_buffer = torch.empty(os.path.getsize(input_file), dtype=torch.uint8).pin_memory()

t = timeit.Timer(functools.partial(file_read, input_file, aio_handle, bounce_buffer))
aio_t = t.timeit(cnt)
aio_gbs = (cnt*file_sz)/GIGA_UNIT/aio_t
print(f'aio load_gpu: {file_sz/GIGA_UNIT} GB, {aio_t/cnt} secs, {aio_gbs:5.2f} GB/sec')

if args.validate:
from py_load_cpu_tensor import file_read as py_file_read
aio_tensor = file_read(input_file, aio_handle, bounce_buffer).cpu()
py_tensor = py_file_read(input_file)
print(f'Validation success = {aio_tensor.equal(py_tensor)}')

if __name__ == "__main__":
main()
40 changes: 40 additions & 0 deletions deepnvme/file_access/aio_store_cpu_tensor.py
import torch
import os, timeit, functools, pathlib
from deepspeed.ops.op_builder import AsyncIOBuilder
from utils import parse_write_arguments, GIGA_UNIT

def file_write(out_f, t, h, bounce_buffer):
bounce_buffer.copy_(t)
h.sync_pwrite(bounce_buffer, out_f)

def main():
args = parse_write_arguments()
cnt = args.loop
output_file = os.path.join(args.nvme_folder, f'test_output_{args.mb_size}MB.pt')
pathlib.Path(output_file).unlink(missing_ok=True)
file_sz = args.mb_size*(1024**2)
app_tensor = torch.empty(file_sz, dtype=torch.uint8, device='cpu', requires_grad=False)

aio_handle = AsyncIOBuilder().load().aio_handle(1024**2, 128, True, True, 1)
bounce_buffer = torch.empty(file_sz, dtype=torch.uint8, requires_grad=False).pin_memory()


t = timeit.Timer(functools.partial(file_write, output_file, app_tensor, aio_handle, bounce_buffer))

aio_t = t.timeit(cnt)
aio_gbs = (cnt*file_sz)/GIGA_UNIT/aio_t
print(f'aio store_cpu: {file_sz/GIGA_UNIT} GB, {aio_t/cnt} secs, {aio_gbs:5.2f} GB/sec')

if args.validate:
import tempfile, filecmp
from py_store_cpu_tensor import file_write as py_file_write
py_ref_file = os.path.join(tempfile.gettempdir(), os.path.basename(output_file))
py_file_write(py_ref_file, app_tensor)
filecmp.clear_cache()
print(f'Validation success = {filecmp.cmp(py_ref_file, output_file, shallow=False)}')

pathlib.Path(output_file).unlink(missing_ok=True)


if __name__ == "__main__":
main()
40 changes: 40 additions & 0 deletions deepnvme/file_access/aio_store_gpu_tensor.py
import torch
import os, timeit, functools, pathlib
from deepspeed.ops.op_builder import AsyncIOBuilder
from utils import parse_write_arguments, GIGA_UNIT

def file_write(out_f, t, h, bounce_buffer):
bounce_buffer.copy_(t)
h.sync_pwrite(bounce_buffer, out_f)

def main():
args = parse_write_arguments()
cnt = args.loop
output_file = os.path.join(args.nvme_folder, f'test_output_{args.mb_size}MB.pt')
pathlib.Path(output_file).unlink(missing_ok=True)
file_sz = args.mb_size*(1024**2)
app_tensor = torch.empty(file_sz, dtype=torch.uint8, device='cuda', requires_grad=False)

aio_handle = AsyncIOBuilder().load().aio_handle(1024**2, 128, True, True, 1)
bounce_buffer = torch.empty(file_sz, dtype=torch.uint8, requires_grad=False).pin_memory()


t = timeit.Timer(functools.partial(file_write, output_file, app_tensor, aio_handle, bounce_buffer))

aio_t = t.timeit(cnt)
aio_gbs = (cnt*file_sz)/GIGA_UNIT/aio_t
print(f'aio store_gpu: {file_sz/GIGA_UNIT} GB, {aio_t/cnt} secs, {aio_gbs:5.2f} GB/sec')

if args.validate:
import tempfile, filecmp
from py_store_cpu_tensor import file_write as py_file_write
py_ref_file = os.path.join(tempfile.gettempdir(), os.path.basename(output_file))
py_file_write(py_ref_file, app_tensor)
filecmp.clear_cache()
print(f'Validation success = {filecmp.cmp(py_ref_file, output_file, shallow=False)}')

pathlib.Path(output_file).unlink(missing_ok=True)


if __name__ == "__main__":
main()
33 changes: 33 additions & 0 deletions deepnvme/file_access/gds_load_gpu_tensor.py
import torch
import os, timeit, functools
from utils import parse_read_arguments, GIGA_UNIT
from deepspeed.ops.op_builder import GDSBuilder

def file_read(inp_f, h, gpu_buffer):
h.sync_pread(gpu_buffer, inp_f)
return gpu_buffer.cuda()

def main():
args = parse_read_arguments()
input_file = args.input_file
file_sz = os.path.getsize(input_file)
cnt = args.loop

gds_handle = GDSBuilder().load().gds_handle(1024**2, 128, True, True, 1)
gds_buffer = gds_handle.new_pinned_device_tensor(file_sz, torch.empty(0, dtype=torch.uint8, device='cuda', requires_grad=False))

t = timeit.Timer(functools.partial(file_read, input_file, gds_handle, gds_buffer))
gds_t = t.timeit(cnt)
gds_gbs = (cnt*file_sz)/GIGA_UNIT/gds_t
print(f'gds load_gpu: {file_sz/GIGA_UNIT} GB, {gds_t/cnt} secs, {gds_gbs:5.2f} GB/sec')

if args.validate:
from py_load_cpu_tensor import file_read as py_file_read
aio_tensor = file_read(input_file, gds_handle, gds_buffer).cpu()
py_tensor = py_file_read(input_file)
print(f'Validation success = {aio_tensor.equal(py_tensor)}')

gds_handle.free_pinned_device_tensor(gds_buffer)

if __name__ == "__main__":
main()
39 changes: 39 additions & 0 deletions deepnvme/file_access/gds_store_gpu_tensor.py
import torch
import os, timeit, functools, pathlib
from deepspeed.ops.op_builder import GDSBuilder
from utils import parse_write_arguments, GIGA_UNIT

def file_write(out_f, t, h, gpu_buffer):
gpu_buffer.copy_(t)
h.sync_pwrite(gpu_buffer, out_f)

def main():
args = parse_write_arguments()
cnt = args.loop
output_file = os.path.join(args.nvme_folder, f'test_output_{args.mb_size}MB.pt')
pathlib.Path(output_file).unlink(missing_ok=True)
file_sz = args.mb_size*(1024**2)
app_tensor = torch.empty(file_sz, dtype=torch.uint8, device='cuda', requires_grad=False)

gds_handle = GDSBuilder().load().gds_handle(1024**2, 128, True, True, 1)
gds_buffer = gds_handle.new_pinned_device_tensor(file_sz, torch.empty(0, dtype=torch.uint8, device='cuda', requires_grad=False))

t = timeit.Timer(functools.partial(file_write, output_file, app_tensor, gds_handle, gds_buffer))

gds_t = t.timeit(cnt)
gds_gbs = (cnt*file_sz)/GIGA_UNIT/gds_t
print(f'gds store_gpu: {file_sz/GIGA_UNIT} GB, {gds_t/cnt} secs, {gds_gbs:5.2f} GB/sec')

if args.validate:
import tempfile, filecmp
from py_store_cpu_tensor import file_write as py_file_write
py_ref_file = os.path.join(tempfile.gettempdir(), os.path.basename(output_file))
py_file_write(py_ref_file, app_tensor)
filecmp.clear_cache()
print(f'Validation success = {filecmp.cmp(py_ref_file, output_file, shallow=False)}')

gds_handle.free_pinned_device_tensor(gds_buffer)
pathlib.Path(output_file).unlink(missing_ok=True)

if __name__ == "__main__":
main()
22 changes: 22 additions & 0 deletions deepnvme/file_access/py_load_cpu_tensor.py
import torch
import os, timeit, functools
from utils import parse_read_arguments, GIGA_UNIT

def file_read(inp_f):
with open(inp_f, 'rb') as f:
t = torch.frombuffer(f.read(), dtype=torch.uint8)
return t

def main():
args = parse_read_arguments()
input_file = args.input_file
file_sz = os.path.getsize(input_file)
cnt = args.loop

t = timeit.Timer(functools.partial(file_read, input_file))
py_t = t.timeit(cnt)
py_gbs = (cnt*file_sz)/GIGA_UNIT/py_t
print(f'py load_cpu: {file_sz/GIGA_UNIT} GB, {py_t/cnt} secs, {py_gbs:5.2f} GB/sec')

if __name__ == "__main__":
main()
22 changes: 22 additions & 0 deletions deepnvme/file_access/py_load_gpu_tensor.py
import torch
import os, timeit, functools
from utils import parse_read_arguments, GIGA_UNIT

def file_read(inp_f):
with open(inp_f, 'rb') as f:
t = torch.frombuffer(f.read(), dtype=torch.uint8)
return t.cuda()

def main():
args = parse_read_arguments()
input_file = args.input_file
file_sz = os.path.getsize(input_file)
cnt = args.loop

t = timeit.Timer(functools.partial(file_read, input_file))
py_t = t.timeit(cnt)
py_gbs = (cnt*file_sz)/GIGA_UNIT/py_t
print(f'py load_gpu: {file_sz/GIGA_UNIT} GB, {py_t/cnt} secs, {py_gbs:5.2f} GB/sec')

if __name__ == "__main__":
main()