differences for PR #18

actions-user · actions-user · commit 12bf74138d2f · 2025-11-04T22:02:29.000Z
diff --git a/md5sum.txt b/md5sum.txt
@@ -13,7 +13,7 @@
 "episodes/optimisation-using-python.md" "0ca314825c3f60c0021d5c1780a06010" "site/built/optimisation-using-python.md" "2025-11-02"
 "episodes/optimisation-data-structures-algorithms.md" "0eca1d92564b03b4980a6b2b2c125d8a" "site/built/optimisation-data-structures-algorithms.md" "2025-11-02"
 "episodes/long-break1.md" "19a5c42e45032003c36ad8f413f44528" "site/built/long-break1.md" "2024-03-28"
-"episodes/optimisation-numpy.md" "d506aafdf7ffc9d0f2ba1582fb774899" "site/built/optimisation-numpy.md" "2025-07-06"
+"episodes/optimisation-numpy.md" "699825d965db94da38d2a67ce435e184" "site/built/optimisation-numpy.md" "2025-11-04"
 "episodes/optimisation-use-latest.md" "23898ec5fdcf9a712ed346fb82c0baf7" "site/built/optimisation-use-latest.md" "2025-03-15"
 "episodes/optimisation-latency.md" "f5b0f79195bc682fe657dd14709de081" "site/built/optimisation-latency.md" "2025-07-06"
 "episodes/optimisation-conclusion.md" "e5b7a72db80358823699bca3c1b19cdf" "site/built/optimisation-conclusion.md" "2025-07-06"
diff --git a/optimisation-numpy.md b/optimisation-numpy.md
@@ -420,11 +420,11 @@ def pandas_apply():
     df = genDataFrame()
     return df.apply(pythagoras, axis=1)
 
-repeats = 100
+repeats = 1000
 gentime = timeit(genDataFrame, number=repeats)
-print(f"for_range: {timeit(for_range, number=repeats)*10-gentime:.2f}ms")
-print(f"for_iterrows: {timeit(for_iterrows, number=repeats)*10-gentime:.2f}ms")
-print(f"pandas_apply: {timeit(pandas_apply, number=repeats)*10-gentime:.2f}ms")
+print(f"for_range: {timeit(for_range, number=repeats // 20)*20-gentime:.2f}ms")  # run 1/20th of the repeats and scale by 20, otherwise it takes too long
+print(f"for_iterrows: {timeit(for_iterrows, number=repeats // 20)*20-gentime:.2f}ms")
+print(f"pandas_apply: {timeit(pandas_apply, number=repeats // 20)*20-gentime:.2f}ms")
 ```
 
 `apply()` is 4x faster than the two `for` approaches, as it avoids the Python `for` loop.
@@ -438,11 +438,56 @@ pandas_apply: 390.49ms
 
 However, rows don't exist in memory as arrays (columns do!), so `apply()` does not take advantage of NumPy's vectorisation. You may be able to go a step further and avoid explicitly operating on rows entirely by passing only the required columns to NumPy.
 
+::::::::::::::::::::::::::::::::::::: challenge
+
+We can extract the individual columns of the data frame. These are of the type `pandas.Series`, which supports array broadcasting, just like a NumPy array.
+Instead of using the `pythagoras(row)` function, can you write a vectorised version of this calculation?
+
 ```python
 def vectorize():
     df = genDataFrame()
-    return pandas.Series(numpy.sqrt(numpy.square(df["f_vertical"]) + numpy.square(df["f_horizontal"])))
-    
+    vertical = df["f_vertical"]
+    horizontal = df["f_horizontal"]
+
+    # Your code goes here
+
+    return pandas.Series(result)
+```
+
+Once you've done that, measure its performance by running:
+```python
+repeats = 1000
+gentime = timeit(genDataFrame, number=repeats)
+print(f"vectorize: {timeit(vectorize, number=repeats)-gentime:.2f}ms")
+```
+What result do you find? Does this match your expectations?
+
+:::::::::::::::::::::::: hint
+
+Remember that most common mathematical operations work element-wise when applied to NumPy arrays.
+For example:
+
+```python
+ar = numpy.array([1, 2, 3])
+
+print(ar**2)  # [1 4 9]
+print(ar + ar)  # [2 4 6]
+```
+
+:::::::::::::::::::::::::::::::::
+
+:::::::::::::::::::::::: solution
+
+```python
+def vectorize():
+    df = genDataFrame()
+    vertical = df["f_vertical"]
+    horizontal = df["f_horizontal"]
+
+    result = numpy.sqrt(vertical**2 + horizontal**2)
+
+    return pandas.Series(result)
+
 print(f"vectorize: {timeit(vectorize, number=repeats)-gentime:.2f}ms")
 ```
 
@@ -452,6 +497,11 @@ print(f"vectorize: {timeit(vectorize, number=repeats)-gentime:.2f}ms")
 vectorize: 1.48ms
 ```
 
+:::::::::::::::::::::::::::::::::
+
+:::::::::::::::::::::::::::::::::::::::::::::::
+
+
 It won't always be possible to take full advantage of vectorisation; for example, your calculation may contain conditional logic.
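
To illustrate why conditional logic gets in the way, here is a minimal sketch. The data and the `clipped_magnitude` function are hypothetical and not part of the lesson; only the column names are reused. A plain Python `if` needs a single true/false value, so it fails when handed a whole column. Simple conditions like this one can sometimes be rewritten with a boolean mask (here via `numpy.where`), but more involved branching may not translate as cleanly.

```python
import numpy
import pandas

# Hypothetical example data; the column names mirror those used above.
df = pandas.DataFrame({
    "f_vertical": [3.0, -5.0, 8.0],
    "f_horizontal": [4.0, 12.0, -15.0],
})

def clipped_magnitude(vertical, horizontal):
    # Scalar logic: only valid for one pair of values at a time.
    if vertical < 0:
        return 0.0
    return (vertical**2 + horizontal**2) ** 0.5

# Passing whole columns fails, because `if` cannot reduce a Series
# to a single True/False value.
try:
    clipped_magnitude(df["f_vertical"], df["f_horizontal"])
    raised = False
except ValueError:
    raised = True

# One possible vectorised rewrite: compute the magnitude for every row,
# then use a boolean mask to choose between 0.0 and the computed value.
magnitude = numpy.sqrt(df["f_vertical"]**2 + df["f_horizontal"]**2)
result = pandas.Series(numpy.where(df["f_vertical"] < 0, 0.0, magnitude))
```

Note that the masked version evaluates the magnitude for every row, including those that end up replaced by `0.0`, so it trades some redundant arithmetic for the absence of a Python-level loop.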
 
 An alternative approach is to convert your DataFrame to a Python dictionary using `to_dict(orient='index')`. This creates a nested dictionary in which each key of the outer dictionary is a row index and each value is an inner dictionary holding that row's column values. This can then be processed via a list comprehension: