kiya00 committed Dec 12, 2024
1 parent 9b22a9e commit 9d2bb2e
Showing 1 changed file with 19 additions and 17 deletions.
36 changes: 19 additions & 17 deletions notebooks/hello_world_thunderfx.ipynb
@@ -6,9 +6,9 @@
"source": [
"## \"Hello, World!\" ThunderFX\n",
"\n",
- "In this tutorial, we’ll explore how to use ThunderFX to accelerate PyTorch program.\n",
+ "In this tutorial, we’ll explore how to use ThunderFX to accelerate a PyTorch program.\n",
"\n",
- "We’ll cover the basics of ThunderFX, demonstrate how to apply it to PyTorch functions and models, and evaluate its performance in both inference and gradient calculations."
+ "We’ll cover the basics of ThunderFX, demonstrate how to apply it to PyTorch functions and models, and evaluate its performance in both inference (forward-only) and training (forward and backward)."
]
},
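Before the model benchmark, the basic invocation pattern is worth sketching. This is a hedged sketch, not a cell from the notebook: it assumes `ThunderCompiler` from `thunder.dynamo` serves as a `torch.compile` backend (as in the lightning-thunder repository), and it guards the imports so the sketch degrades gracefully when `torch` or `thunder` is not installed.

```python
# Hedged sketch of basic ThunderFX usage. Assumes lightning-thunder provides
# thunder.dynamo.ThunderCompiler as a torch.compile backend; the imports are
# guarded so the sketch still runs where torch/thunder are unavailable.
try:
    import torch
    from thunder.dynamo import ThunderCompiler
    deps_available = True
except ImportError:
    # torch / lightning-thunder not installed; skip the real work below.
    deps_available = False

def compile_and_run():
    if not deps_available:
        return "torch/thunder unavailable"

    def foo(x):
        return torch.sin(x) + torch.cos(x)

    backend = ThunderCompiler()                 # receives FX graphs from Dynamo
    compiled_foo = torch.compile(foo, backend=backend)
    return compiled_foo(torch.randn(4))         # first call triggers compilation

result = compile_and_run()
```

The same pattern applies to whole `nn.Module`s, which is how the Llama3 model below is compiled.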
{
@@ -83,7 +83,7 @@
"\n",
"Next, let’s evaluate how ThunderFX improves performance on a real-world model. We'll use the Llama3 model as an example and compare the execution time for both inference and gradient calculations.\n",
"\n",
- "We begin by loading and configuring a lightweight version of the Llama3 model:"
+ "We begin by loading and configuring a smaller version of the Llama3 model:"
]
},
{
@@ -139,7 +139,7 @@
"batch_dim = 8\n",
"\n",
"torch.set_default_dtype(torch.bfloat16)\n",
- "make = partial(make_tensor, low=0, high=255, device='cuda', dtype=torch.int64, requires_grad=False)\n",
+ "make = partial(make_tensor, low=0, high=255, device='cuda', dtype=torch.int64)\n",
"\n",
"with torch.device('cuda'):\n",
" model = GPT(cfg)\n",
@@ -180,7 +180,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "Note: ThunderFX compiles the model into optimized kernels as it executes. This means the first run may take longer due to the compilation process, but subsequent runs will benefit from significant speedups.\n",
+ "Note: ThunderFX compiles the model into optimized kernels as it executes. Compiling these kernels can take seconds or even minutes for larger models, but each kernel only has to be compiled once, and subsequent runs will benefit from it.\n",
"\n",
"To evaluate ThunderFX’s inference performance, we compare the execution time of the compiled model versus the standard PyTorch model:"
]
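The `%timeit` summaries that follow can be reproduced outside IPython with the standard library. A minimal sketch — the `benchmark` helper, its workload, and the run/loop counts are illustrative, not taken from the notebook:

```python
# Stdlib analogue of %timeit's "mean ± std. dev. of 7 runs, 10 loops each":
# time a callable over several runs and summarize per-loop milliseconds.
import statistics
import timeit

def benchmark(fn, runs=7, loops=10):
    """Return (mean, std dev) of per-loop time in ms across `runs` runs."""
    per_loop_ms = [
        timeit.timeit(fn, number=loops) / loops * 1e3 for _ in range(runs)
    ]
    return statistics.mean(per_loop_ms), statistics.stdev(per_loop_ms)

# Toy CPU workload standing in for a model forward pass.
mean_ms, std_ms = benchmark(lambda: sum(x * x for x in range(10_000)))
print(f"{mean_ms:.3g} ms ± {std_ms:.2g} ms per loop "
      f"(mean ± std. dev. of 7 runs, 10 loops each)")
```

For CUDA workloads the timed callable must end with `torch.cuda.synchronize()`, as the notebook's cells do, so that asynchronous kernels are included in the measurement.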
@@ -195,9 +195,9 @@
"output_type": "stream",
"text": [
"ThunderFX Inference Time:\n",
- "142 ms ± 1.47 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)\n",
+ "136 ms ± 1.06 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)\n",
"Torch Eager Inference Time:\n",
- "159 ms ± 1.49 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)\n"
+ "152 ms ± 1.2 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)\n"
]
}
],
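From the updated means in the diff (136 ms with ThunderFX vs 152 ms eager), the inference speedup works out to roughly 1.12x. A quick check of the arithmetic:

```python
# Speedup implied by the reported inference means (new numbers in the diff).
eager_ms = 152.0       # Torch eager mean per loop
thunderfx_ms = 136.0   # ThunderFX mean per loop

speedup = eager_ms / thunderfx_ms
reduction_pct = (eager_ms - thunderfx_ms) / eager_ms * 100
print(f"speedup: {speedup:.2f}x, runtime reduced by {reduction_pct:.1f}%")
```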
@@ -219,7 +219,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "Similarly, let’s measure the performance improvement for gradient calculations:"
+ "Similarly, let’s measure the performance improvement for training:"
]
},
{
@@ -231,18 +231,18 @@
"name": "stdout",
"output_type": "stream",
"text": [
- "ThunderFX Gradient Calculation Time:\n",
- "441 ms ± 10.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n",
- "Torch Eager Gradient Calculation Time:\n",
- "480 ms ± 2.94 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
+ "ThunderFX Training Time:\n",
+ "427 ms ± 7.35 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n",
+ "Torch Eager Training Time:\n",
+ "465 ms ± 1.65 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
]
}
],
"source": [
- "print(\"ThunderFX Gradient Calculation Time:\")\n",
- "%timeit r = compiled_model(x); torch.autograd.grad(r.sum(), model.parameters()); torch.cuda.synchronize()\n",
- "print(\"Torch Eager Gradient Calculation Time:\")\n",
- "%timeit r = model(x); torch.autograd.grad(r.sum(), model.parameters()); torch.cuda.synchronize()"
+ "print(\"ThunderFX Training Time:\")\n",
+ "%timeit r = compiled_model(x); r.sum().backward(); torch.cuda.synchronize()\n",
+ "print(\"Torch Eager Training Time:\")\n",
+ "%timeit r = model(x); r.sum().backward(); torch.cuda.synchronize()"
]
},
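The training numbers reported above (427 ms with ThunderFX vs 465 ms eager) imply a more modest speedup of about 1.09x, since the backward pass adds work that is not all kernel-bound. Checking the arithmetic:

```python
# Speedup implied by the reported training means (new numbers in the diff).
eager_ms = 465.0       # Torch eager mean per loop (forward + backward)
thunderfx_ms = 427.0   # ThunderFX mean per loop (forward + backward)

speedup = eager_ms / thunderfx_ms
reduction_pct = (eager_ms - thunderfx_ms) / eager_ms * 100
print(f"speedup: {speedup:.2f}x, runtime reduced by {reduction_pct:.1f}%")
```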
{
@@ -251,7 +251,9 @@
"source": [
"#### Conclusion\n",
"\n",
- "ThunderFX provides an efficient way to accelerate PyTorch programs, particularly for GPU workloads. By compiling functions and models, it reduces runtime for both inference and gradient computations. This tutorial demonstrated its usage and performance benefits using both simple functions and a real-world model."
+ "ThunderFX can accelerate PyTorch programs, particularly CUDA programs, by compiling optimized kernels specific to the program you're running. It can accelerate both inference (forward-only) and training (forward and backward) computations.\n",
+ "\n",
+ "For more information about Thunder and ThunderFX in particular, see https://github.com/Lightning-AI/lightning-thunder/tree/main/notebooks."
]
}
],
