Commit 12bf741

differences for PR #18
1 parent 83febe4 commit 12bf741

2 files changed: +57 -7 lines changed

md5sum.txt

Lines changed: 1 addition & 1 deletion
@@ -13,7 +13,7 @@
 "episodes/optimisation-using-python.md" "0ca314825c3f60c0021d5c1780a06010" "site/built/optimisation-using-python.md" "2025-11-02"
 "episodes/optimisation-data-structures-algorithms.md" "0eca1d92564b03b4980a6b2b2c125d8a" "site/built/optimisation-data-structures-algorithms.md" "2025-11-02"
 "episodes/long-break1.md" "19a5c42e45032003c36ad8f413f44528" "site/built/long-break1.md" "2024-03-28"
-"episodes/optimisation-numpy.md" "d506aafdf7ffc9d0f2ba1582fb774899" "site/built/optimisation-numpy.md" "2025-07-06"
+"episodes/optimisation-numpy.md" "699825d965db94da38d2a67ce435e184" "site/built/optimisation-numpy.md" "2025-11-04"
 "episodes/optimisation-use-latest.md" "23898ec5fdcf9a712ed346fb82c0baf7" "site/built/optimisation-use-latest.md" "2025-03-15"
 "episodes/optimisation-latency.md" "f5b0f79195bc682fe657dd14709de081" "site/built/optimisation-latency.md" "2025-07-06"
 "episodes/optimisation-conclusion.md" "e5b7a72db80358823699bca3c1b19cdf" "site/built/optimisation-conclusion.md" "2025-07-06"

optimisation-numpy.md

Lines changed: 56 additions & 6 deletions
@@ -420,11 +420,11 @@ def pandas_apply():
     df = genDataFrame()
     return df.apply(pythagoras, axis=1)

-repeats = 100
+repeats = 1000
 gentime = timeit(genDataFrame, number=repeats)
-print(f"for_range: {timeit(for_range, number=repeats)*10-gentime:.2f}ms")
-print(f"for_iterrows: {timeit(for_iterrows, number=repeats)*10-gentime:.2f}ms")
-print(f"pandas_apply: {timeit(pandas_apply, number=repeats)*10-gentime:.2f}ms")
+print(f"for_range: {timeit(for_range, number=int(repeats/20))*20-gentime:.2f}ms") # scale with factor 20, otherwise it takes too long
+print(f"for_iterrows: {timeit(for_iterrows, number=int(repeats/20))*20-gentime:.2f}ms")
+print(f"pandas_apply: {timeit(pandas_apply, number=int(repeats/20))*20-gentime:.2f}ms")
 ```

 `apply()` is 4x faster than the two `for` approaches, as it avoids the Python `for` loop.
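
A note on the units in the `print` lines above: `timeit` returns the total time in seconds for `number` calls, so a total over 1000 calls reads directly as mean milliseconds per call, and 50 calls scaled by 20 approximate the same figure. The following is only a minimal sketch of that arithmetic. `work()` is a hypothetical stand-in for the lesson's functions, and the `gentime` subtraction is left out for brevity.

```python
from timeit import timeit


def work():
    # Hypothetical workload standing in for for_range() and friends.
    return sum(i * i for i in range(1000))


repeats = 1000
scale = 20  # run 1/20th of the repeats, then scale the total back up

# timeit returns the total time in seconds for `number` calls.
full_total_s = timeit(work, number=repeats)
scaled_total_s = timeit(work, number=repeats // scale) * scale

# Total seconds over 1000 calls is numerically the mean milliseconds per call:
# (total_s / 1000 calls) * 1000 ms/s == total_s
ms_per_call = full_total_s / repeats * 1000

print(f"full run:   {full_total_s:.2f}ms per call")
print(f"scaled run: {scaled_total_s:.2f}ms per call (estimate)")
```

The scaled figure is an estimate from fewer samples, so it will likely be noisier than the full run. The sketch uses `repeats // scale` rather than `int(repeats/20)`; the two agree whenever `repeats` is a non-negative integer.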
@@ -438,11 +438,56 @@ pandas_apply: 390.49ms

 However, rows don't exist in memory as arrays (columns do!), so `apply()` does not take advantage of NumPy's vectorisation. You may be able to go a step further and avoid explicitly operating on rows entirely by passing only the required columns to NumPy.

+::::::::::::::::::::::::::::::::::::: challenge
+
+We can extract the individual columns of the data frame. These are of the type `pandas.Series`, which supports array broadcasting, just like a NumPy array.
+Instead of using the `pythagoras(row)` function, can you write a vectorised version of this calculation?
+
 ```python
 def vectorize():
     df = genDataFrame()
-    return pandas.Series(numpy.sqrt(numpy.square(df["f_vertical"]) + numpy.square(df["f_horizontal"])))
-
+    vertical = df["f_vertical"]
+    horizontal = df["f_horizontal"]
+
+    # Your code goes here
+
+    return pandas.Series(results)
+```
+
+Once you’ve done that, measure your performance by running
+```python
+repeats = 1000
+gentime = timeit(genDataFrame, number=repeats)
+print(f"vectorize: {timeit(vectorize, number=repeats)-gentime:.2f}ms")
+```
+What result do you find? Does this match your expectations?
+
+:::::::::::::::::::::::: hint
+
+Remember that most common mathematical operations work element-wise when applied to NumPy arrays.
+For example:
+
+```python
+ar = np.array([1,2,3])
+
+print(ar**2) # array([1, 4, 9])
+print(ar + ar) # array([2, 4, 6])
+```
+
+:::::::::::::::::::::::::::::::::
+
+:::::::::::::::::::::::: solution
+
+```python
+def vectorize():
+    df = genDataFrame()
+    vertical = df["f_vertical"]
+    horizontal = df["f_horizontal"]
+
+    result = numpy.sqrt(vertical**2 + horizontal**2)
+
+    return pandas.Series(result)
+
 print(f"vectorize: {timeit(vectorize, number=repeats)-gentime:.2f}ms")
 ```

@@ -452,6 +497,11 @@ print(f"vectorize: {timeit(vectorize, number=repeats)-gentime:.2f}ms")
 vectorize: 1.48ms
 ```

+:::::::::::::::::::::::::::::::::
+
+:::::::::::::::::::::::::::::::::::::::::::::::
+
+
 It won't always be possible to take full advantage of vectorisation, for example you may have conditional logic.

 An alternate approach is converting your DataFrame to a Python dictionary using `to_dict(orient='index')`. This creates a nested dictionary, where each row of the outer dictionary is an internal dictionary. This can then be processed via list-comprehension:
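
The list comprehension that this last sentence introduces falls outside the hunk. The following is only a minimal sketch of the idea, not the lesson's own code. It assumes a hypothetical `gen_data_frame()` in place of the lesson's `genDataFrame()`: the column names are taken from the diff, while the row count, seed, and random values are invented for illustration.

```python
import numpy
import pandas


def gen_data_frame(n=1000, seed=12):
    # Hypothetical stand-in for the lesson's genDataFrame(); size and data are assumptions.
    rng = numpy.random.default_rng(seed)
    return pandas.DataFrame({
        "f_vertical": rng.random(n),
        "f_horizontal": rng.random(n),
    })


def pythagoras_dict(row):
    # `row` is a plain dict here, e.g. {"f_vertical": 0.3, "f_horizontal": 0.4}.
    return (row["f_vertical"] ** 2 + row["f_horizontal"] ** 2) ** 0.5


def to_dict_comprehension(df):
    # orient="index" yields {index_label: {column_name: value, ...}, ...}.
    rows = df.to_dict(orient="index")
    values = [pythagoras_dict(row) for row in rows.values()]
    return pandas.Series(values, index=list(rows.keys()))


def vectorised(df):
    # Column-wise version, used here only to cross-check the result.
    return numpy.sqrt(df["f_vertical"] ** 2 + df["f_horizontal"] ** 2)


df = gen_data_frame()
result = to_dict_comprehension(df)
print(result.head())
```

Because each row becomes an ordinary dictionary, `pythagoras_dict` can hold arbitrary Python logic, including conditionals, at the cost of leaving NumPy's vectorised path. The `vectorised` helper is only there to confirm that both approaches agree on the same data.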
