
Conversation

@JostMigenda (Collaborator)

As noted in #9, there are currently very few exercises in the optimisation part of the course. In particular, there are none on array broadcasting, despite it being one of the most powerful performance optimisations covered.

This turns one of the existing examples (vectorising over a pandas.Series) into an exercise. It also includes minor changes to make the existing example more legible.
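
For context, here is a minimal sketch of the comparison the new exercise is built around; the Series contents, the arithmetic and the repeat count are illustrative only, not the lesson's actual code:

import numpy as np
import pandas as pd
from timeit import timeit

s = pd.Series(np.arange(10_000))

def loop_over_series():
    # Element-wise Python loop: one interpreter round trip per element.
    out = pd.Series(np.zeros(len(s)))
    for i in range(len(s)):
        out[i] = s[i] * 2 + 1
    return out

def vectorised():
    # Single vectorised expression: the work happens in compiled NumPy/pandas code.
    return s * 2 + 1

repeats = 10
print(f"loop:       {timeit(loop_over_series, number=repeats):.2f}s total")
print(f"vectorised: {timeit(vectorised, number=repeats):.2f}s total")

On typical hardware the vectorised version is faster by several orders of magnitude, which is the gap the exercise asks learners to measure.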

@JostMigenda requested a review from Robadob on November 4, 2025 at 22:01
github-actions bot commented Nov 4, 2025

Thank you!

Thank you for your pull request 😃

🤖 This automated message can help you check the rendered files in your submission for clarity. If you have any questions, please feel free to open an issue in {sandpaper}.

If you have files that automatically render output (e.g. R Markdown), then you should check for the following:

  • 🎯 correct output
  • 🖼️ correct figures
  • ❓ new warnings
  • ‼️ new errors

Rendered Changes

🔍 Inspect the changes: https://github.com/carpentries-incubator/pando-python/compare/md-outputs..md-outputs-PR-18

The following changes were observed in the rendered markdown documents:

 md5sum.txt            |  2 +-
 optimisation-numpy.md | 62 ++++++++++++++++++++++++++++++++++++++++++++++-----
 2 files changed, 57 insertions(+), 7 deletions(-)
What does this mean?

If you have source files that require output and figures to be generated (e.g. R Markdown), then it is important to make sure the generated figures and output are reproducible.

This output provides a way for you to inspect the output in a diff-friendly manner so that it's easy to see the changes that occur due to new software versions or randomisation.

⏱️ Updated at 2025-11-04 22:01:55 +0000

github-actions bot pushed a commit that referenced this pull request Nov 4, 2025
@Robadob (Collaborator) left a comment


I really don't like the redundant "divide repeats by 20, then multiply the time by 20".
IIRC, the 1000 repeats is a lazy gimmick of mine to avoid dividing the result to present an average; it's probably easier to drop the gimmick and refer to it as total time.

Otherwise the new exercise looks good, though it makes me tempted to extend the vectorisation coverage to include np.where() and other conditional operations, to demonstrate that it's more flexible than often assumed (see the sketch after this comment).

Approved in principle, though I'd get rid of the scaling you added in practice.

Not looking to increase your workload, so #19 for the future.
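
To illustrate the kind of conditional vectorisation mentioned above, here is a sketch (not part of this PR or the lesson) using np.where(): both branches are evaluated array-wide and the boolean condition selects per element, so even branching logic can stay vectorised:

import numpy as np

values = np.random.default_rng(1).random(100_000)

def conditional_loop():
    # Per-element branch in a Python loop.
    out = np.empty_like(values)
    for i, v in enumerate(values):
        out[i] = v * 2 if v > 0.5 else v / 2
    return out

def conditional_vectorised():
    # np.where() computes both branches for the whole array,
    # then picks one value per element based on the condition.
    return np.where(values > 0.5, values * 2, values / 2)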

print(f"for_range: {timeit(for_range, number=repeats)*10-gentime:.2f}ms")
print(f"for_iterrows: {timeit(for_iterrows, number=repeats)*10-gentime:.2f}ms")
print(f"pandas_apply: {timeit(pandas_apply, number=repeats)*10-gentime:.2f}ms")
print(f"for_range: {timeit(for_range, number=int(repeats/20))*20-gentime:.2f}ms") # scale with factor 20, otherwise it takes too long
A collaborator commented on the timing lines quoted above:

Isn't the scaling kind of redundant complexity here? Just repeat it 50 times and have lower absolute times?
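
A sketch of the suggested simplification, with stand-in function bodies (the real for_range, for_iterrows and pandas_apply are defined in the lesson material): a single fixed repeat count for every variant, and the result reported as a total time with no scaling factors:

from timeit import timeit
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": np.arange(5_000)})

# Placeholder bodies; the lesson's functions do the real work.
def for_range():
    total = 0
    for i in range(len(df)):
        total += df["a"][i]
    return total

def for_iterrows():
    total = 0
    for _, row in df.iterrows():
        total += row["a"]
    return total

def pandas_apply():
    return df["a"].apply(lambda x: x * 2).sum()

repeats = 50  # small enough that even the slowest loop variant stays affordable
for fn in (for_range, for_iterrows, pandas_apply):
    print(f"{fn.__name__}: {timeit(fn, number=repeats):.2f}s total")

With one repeat count the printed numbers are directly comparable, and a per-call average can still be recovered by dividing by repeats if wanted.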

