**episodes/optimisation-numpy.md** (62 changes: 56 additions & 6 deletions)
```diff
@@ -420,11 +420,11 @@ def pandas_apply():
     df = genDataFrame()
     return df.apply(pythagoras, axis=1)
 
-repeats = 100
+repeats = 1000
 gentime = timeit(genDataFrame, number=repeats)
-print(f"for_range: {timeit(for_range, number=repeats)*10-gentime:.2f}ms")
-print(f"for_iterrows: {timeit(for_iterrows, number=repeats)*10-gentime:.2f}ms")
-print(f"pandas_apply: {timeit(pandas_apply, number=repeats)*10-gentime:.2f}ms")
+print(f"for_range: {timeit(for_range, number=int(repeats/20))*20-gentime:.2f}ms") # scale with factor 20, otherwise it takes too long
+print(f"for_iterrows: {timeit(for_iterrows, number=int(repeats/20))*20-gentime:.2f}ms")
+print(f"pandas_apply: {timeit(pandas_apply, number=int(repeats/20))*20-gentime:.2f}ms")
```

> **Collaborator** (on the `for_range` line): Isn't the scaling kind of redundant complexity here? Just repeat it 50 times and have lower absolute times?

`apply()` is 4x faster than the two `for` approaches, as it avoids the Python `for` loop.

```
pandas_apply: 390.49ms
```

However, rows don't exist in memory as arrays (columns do!), so `apply()` does not take advantage of NumPy's vectorisation. You may be able to go a step further and avoid explicitly operating on rows entirely by passing only the required columns to NumPy.

::::::::::::::::::::::::::::::::::::: challenge

We can extract the individual columns of the data frame. These are of type `pandas.Series`, which supports array broadcasting just like a NumPy array.
Instead of using the `pythagoras(row)` function, can you write a vectorised version of this calculation?

```python
def vectorize():
    df = genDataFrame()

    vertical = df["f_vertical"]
    horizontal = df["f_horizontal"]

    # Your code goes here

    return pandas.Series(result)
```

Once you've done that, measure the performance of your implementation by running:
```python
repeats = 1000
gentime = timeit(genDataFrame, number=repeats)
print(f"vectorize: {timeit(vectorize, number=repeats)-gentime:.2f}ms")
```
What result do you find? Does this match your expectations?

:::::::::::::::::::::::: hint

Remember that most common mathematical operations work element-wise when applied to NumPy arrays.
For example:

```python
ar = numpy.array([1, 2, 3])

print(ar**2)    # array([1, 4, 9])
print(ar + ar)  # array([2, 4, 6])
```

:::::::::::::::::::::::::::::::::

:::::::::::::::::::::::: solution

```python
def vectorize():
    df = genDataFrame()
    vertical = df["f_vertical"]
    horizontal = df["f_horizontal"]

    # Square, sum and root whole columns at once; no Python-level loop
    result = numpy.sqrt(vertical**2 + horizontal**2)

    return pandas.Series(result)

print(f"vectorize: {timeit(vectorize, number=repeats)-gentime:.2f}ms")
```

```
vectorize: 1.48ms
```

:::::::::::::::::::::::::::::::::

:::::::::::::::::::::::::::::::::::::::::::::::


It won't always be possible to take full advantage of vectorisation; for example, you may have conditional logic.
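Simple element-wise conditions can often still be vectorised with `numpy.where()`, which picks between two values per element. Below is a minimal sketch, reusing `genDataFrame()` from above; the threshold rule is invented purely for illustration:

```python
df = genDataFrame()

# numpy.where() evaluates the condition element-wise across whole columns:
# keep each distance where it exceeds the threshold, otherwise substitute 0.0
distance = numpy.sqrt(df["f_vertical"]**2 + df["f_horizontal"]**2)
clipped = numpy.where(distance > 1.0, distance, 0.0)
```

More complex branching, however, may leave you processing rows individually.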

An alternate approach is converting your DataFrame to a Python dictionary using `to_dict(orient='index')`. This creates a nested dictionary, where each key of the outer dictionary is a row index and each inner dictionary maps column names to that row's values. This can then be processed via list comprehension:
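A minimal sketch of this pattern, reusing the Pythagoras calculation from above (the function name `dict_comprehension` and the exact comprehension are illustrative, not the episode's own listing):

```python
def dict_comprehension():
    df = genDataFrame()
    # Outer dictionary: {row index: {column name: value}}
    rows = df.to_dict(orient='index')
    # Process each inner (row) dictionary with a list comprehension
    result = [(row["f_vertical"]**2 + row["f_horizontal"]**2)**0.5
              for row in rows.values()]
    return pandas.Series(result)
```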