kiya00 committed Dec 12, 2024
1 parent 9b22a9e commit 9d2bb2e
Showing 1 changed file with 19 additions and 17 deletions.
36 changes: 19 additions & 17 deletions notebooks/hello_world_thunderfx.ipynb
@@ -6,9 +6,9 @@
"source": [
"## \"Hello, World!\" ThunderFX\n",
"\n",
- "In this tutorial, we’ll explore how to use ThunderFX to accelerate PyTorch program.\n",
+ "In this tutorial, we’ll explore how to use ThunderFX to accelerate a PyTorch program.\n",
"\n",
- "We’ll cover the basics of ThunderFX, demonstrate how to apply it to PyTorch functions and models, and evaluate its performance in both inference and gradient calculations."
+ "We’ll cover the basics of ThunderFX, demonstrate how to apply it to PyTorch functions and models, and evaluate its performance in both inference (forward-only) and training (forward and backward)."
]
},
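Before the model benchmark, the basic invocation pattern is worth sketching. This is a hedged sketch, not a cell from the notebook: it assumes `ThunderCompiler` from `thunder.dynamo` serves as a `torch.compile` backend (as in the lightning-thunder repository), and it guards the imports so the sketch degrades gracefully when `torch` or `thunder` is not installed.

```python
# Hedged sketch of basic ThunderFX usage. Assumes lightning-thunder provides
# thunder.dynamo.ThunderCompiler as a torch.compile backend; the imports are
# guarded so the sketch still runs where torch/thunder are unavailable.
try:
    import torch
    from thunder.dynamo import ThunderCompiler
    deps_available = True
except ImportError:
    # torch / lightning-thunder not installed; skip the real work below.
    deps_available = False

def compile_and_run():
    if not deps_available:
        return "torch/thunder unavailable"

    def foo(x):
        return torch.sin(x) + torch.cos(x)

    backend = ThunderCompiler()                 # receives FX graphs from Dynamo
    compiled_foo = torch.compile(foo, backend=backend)
    return compiled_foo(torch.randn(4))         # first call triggers compilation

result = compile_and_run()
```

The same pattern applies to whole `nn.Module`s, which is how the Llama3 model below is compiled.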
{
@@ -83,7 +83,7 @@
"\n",
"Next, let’s evaluate how ThunderFX improves performance on a real-world model. We'll use the Llama3 model as an example and compare the execution time for both inference and gradient calculations.\n",
"\n",
- "We begin by loading and configuring a lightweight version of the Llama3 model:"
+ "We begin by loading and configuring a smaller version of the Llama3 model:"
]
},
{
@@ -139,7 +139,7 @@
"batch_dim = 8\n",
"\n",
"torch.set_default_dtype(torch.bfloat16)\n",
- "make = partial(make_tensor, low=0, high=255, device='cuda', dtype=torch.int64, requires_grad=False)\n",
+ "make = partial(make_tensor, low=0, high=255, device='cuda', dtype=torch.int64)\n",
"\n",
"with torch.device('cuda'):\n",
" model = GPT(cfg)\n",
@@ -180,7 +180,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "Note: ThunderFX compiles the model into optimized kernels as it executes. This means the first run may take longer due to the compilation process, but subsequent runs will benefit from significant speedups.\n",
+ "Note: ThunderFX compiles the model into optimized kernels as it executes. Compiling these kernels can take seconds or even minutes for larger models, but each kernel only has to be compiled once, and subsequent runs will benefit from it.\n",
"\n",
"To evaluate ThunderFX’s inference performance, we compare the execution time of the compiled model versus the standard PyTorch model:"
]
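The `%timeit` summaries that follow can be reproduced outside IPython with the standard library. A minimal sketch — the `benchmark` helper, its workload, and the run/loop counts are illustrative, not taken from the notebook:

```python
# Stdlib analogue of %timeit's "mean ± std. dev. of 7 runs, 10 loops each":
# time a callable over several runs and summarize per-loop milliseconds.
import statistics
import timeit

def benchmark(fn, runs=7, loops=10):
    """Return (mean, std dev) of per-loop time in ms across `runs` runs."""
    per_loop_ms = [
        timeit.timeit(fn, number=loops) / loops * 1e3 for _ in range(runs)
    ]
    return statistics.mean(per_loop_ms), statistics.stdev(per_loop_ms)

# Toy CPU workload standing in for a model forward pass.
mean_ms, std_ms = benchmark(lambda: sum(x * x for x in range(10_000)))
print(f"{mean_ms:.3g} ms ± {std_ms:.2g} ms per loop "
      f"(mean ± std. dev. of 7 runs, 10 loops each)")
```

For CUDA workloads the timed callable must end with `torch.cuda.synchronize()`, as the notebook's cells do, so that asynchronous kernels are included in the measurement.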
@@ -195,9 +195,9 @@
"output_type": "stream",
"text": [
"ThunderFX Inference Time:\n",
- "142 ms ± 1.47 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)\n",
+ "136 ms ± 1.06 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)\n",
"Torch Eager Inference Time:\n",
- "159 ms ± 1.49 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)\n"
+ "152 ms ± 1.2 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)\n"
]
}
],
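From the updated means in the diff (136 ms with ThunderFX vs 152 ms eager), the inference speedup works out to roughly 1.12x. A quick check of the arithmetic:

```python
# Speedup implied by the reported inference means (new numbers in the diff).
eager_ms = 152.0       # Torch eager mean per loop
thunderfx_ms = 136.0   # ThunderFX mean per loop

speedup = eager_ms / thunderfx_ms
reduction_pct = (eager_ms - thunderfx_ms) / eager_ms * 100
print(f"speedup: {speedup:.2f}x, runtime reduced by {reduction_pct:.1f}%")
```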
@@ -219,7 +219,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "Similarly, let’s measure the performance improvement for gradient calculations:"
+ "Similarly, let’s measure the performance improvement for training:"
]
},
{
@@ -231,18 +231,18 @@
"name": "stdout",
"output_type": "stream",
"text": [
- "ThunderFX Gradient Calculation Time:\n",
- "441 ms ± 10.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n",
- "Torch Eager Gradient Calculation Time:\n",
- "480 ms ± 2.94 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
+ "ThunderFX Training Time:\n",
+ "427 ms ± 7.35 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n",
+ "Torch Eager Training Time:\n",
+ "465 ms ± 1.65 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
]
}
],
"source": [
- "print(\"ThunderFX Gradient Calculation Time:\")\n",
- "%timeit r = compiled_model(x); torch.autograd.grad(r.sum(), model.parameters()); torch.cuda.synchronize()\n",
- "print(\"Torch Eager Gradient Calculation Time:\")\n",
- "%timeit r = model(x); torch.autograd.grad(r.sum(), model.parameters()); torch.cuda.synchronize()"
+ "print(\"ThunderFX Training Time:\")\n",
+ "%timeit r = compiled_model(x); r.sum().backward(); torch.cuda.synchronize()\n",
+ "print(\"Torch Eager Training Time:\")\n",
+ "%timeit r = model(x); r.sum().backward(); torch.cuda.synchronize()"
]
},
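The training numbers reported above (427 ms with ThunderFX vs 465 ms eager) imply a more modest speedup of about 1.09x, since the backward pass adds work that is not all kernel-bound. Checking the arithmetic:

```python
# Speedup implied by the reported training means (new numbers in the diff).
eager_ms = 465.0       # Torch eager mean per loop (forward + backward)
thunderfx_ms = 427.0   # ThunderFX mean per loop (forward + backward)

speedup = eager_ms / thunderfx_ms
reduction_pct = (eager_ms - thunderfx_ms) / eager_ms * 100
print(f"speedup: {speedup:.2f}x, runtime reduced by {reduction_pct:.1f}%")
```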
{
@@ -251,7 +251,9 @@
"source": [
"#### Conclusion\n",
"\n",
- "ThunderFX provides an efficient way to accelerate PyTorch programs, particularly for GPU workloads. By compiling functions and models, it reduces runtime for both inference and gradient computations. This tutorial demonstrated its usage and performance benefits using both simple functions and a real-world model."
+ "ThunderFX can accelerate PyTorch programs, particularly CUDA programs, by compiling optimized kernels specific to the program you're running. It can accelerate both inference (forward-only) and training (forward and backward) computations.\n",
+ "\n",
+ "For more information about Thunder and ThunderFX in particular, see https://github.com/Lightning-AI/lightning-thunder/tree/main/notebooks."
]
}
],
