Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

gh-100239: specialize long tail of binary operations #128722

Merged
merged 20 commits into from
Jan 16, 2025

Conversation

iritkatriel
Copy link
Member

@iritkatriel iritkatriel commented Jan 10, 2025

This implements part of #100239: the four arithmetic ops between int, float and float, int.

Microbenchmarks:

Old:

>>> timeit("for i in range(10000):\n\tb = a+i", number=100000, setup="a = 1.0")
37.0931663562078
>>> timeit("for i in range(10000):\n\tb = i+a", number=100000, setup="a = 1.0")
40.84421204589307

New:

>>> timeit("for i in range(10000):\n\tb = a+i", number=100000, setup="a = 1.0")
31.263000949984416
>>> timeit("for i in range(10000):\n\tb = i+a", number=100000, setup="a = 1.0")
31.243564788019285

So performance is 20-30% better, and also more uniform (old is 10% slower for int+float compared to float+int).

Full benchmarks don't show an overall speedup, but they do show better specialisation stats for BINARY_OP:
https://github.com/faster-cpython/benchmarking-public/tree/main/results/bm-20250110-3.14.0a3+-7264e37#readme

@Fidget-Spinner
Copy link
Member

Nice! The arithmetic benchmarks show a good speedup in the provided link:

spectral_norm	107 ms	96.3 ms: 1.11x faster
pyflate	480 ms	437 ms: 1.10x faster
chaos	61.1 ms	57.9 ms: 1.06x faster

Copy link
Contributor

@eendebakpt eendebakpt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This already looks quite good. To get a feeling for the interfaces I added a few more specializations. See iritkatriel/cpython@gh-100239...eendebakpt:cpython:gh-100239-list-tuple-add

Adding more specializations is quite easy, but if we end up adding more we will need some more macros or tooling (such as for example the TRY_BINARY_SPECIALIZATION in the branch above). Fine to leave that to a followup PR though.

Python/specialize.c Outdated Show resolved Hide resolved
Python/specialize.c Outdated Show resolved Hide resolved
Python/specialize.c Outdated Show resolved Hide resolved
Python/specialize.c Outdated Show resolved Hide resolved
Include/internal/pycore_code.h Show resolved Hide resolved
Python/specialize.c Outdated Show resolved Hide resolved
Copy link
Member

@markshannon markshannon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks promising. We might want to add some filtering based on the class before calling the guard function when specializing. Calling a long chain of guard functions could be expensive.

Looking at the stats, it seems that this doesn't make that much difference to the number of BINARY_OPs that are specialized.
Unfortunately the stats don't tell us which class pairs to add, but I think str % str and str % tuple would be worth a look.
Or we could enhance the stats to give us cls/cls/operator triples, at least for those classes with a small version number?

Include/internal/pycore_code.h Show resolved Hide resolved
Include/internal/pycore_code.h Outdated Show resolved Hide resolved
Include/internal/pycore_code.h Outdated Show resolved Hide resolved
Python/bytecodes.c Outdated Show resolved Hide resolved
Python/bytecodes.c Outdated Show resolved Hide resolved
Python/bytecodes.c Outdated Show resolved Hide resolved
Python/specialize.c Outdated Show resolved Hide resolved
Python/specialize.c Show resolved Hide resolved
Copy link
Member

@markshannon markshannon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple of suggestions, but nothing blocking.

OOI what was causing the earlier test failures?

Include/internal/pycore_code.h Outdated Show resolved Hide resolved
Python/bytecodes.c Outdated Show resolved Hide resolved
@iritkatriel
Copy link
Member Author

OOI what was causing the earlier test failures?

This was missing: #128892

The test assumes some valid opcode is invalid.

@iritkatriel iritkatriel merged commit 3893a92 into python:main Jan 16, 2025
65 checks passed
@iritkatriel
Copy link
Member Author

I repeated the benchmarks with the multiply bug (that prevented specialisation) fixed:
https://github.com/faster-cpython/benchmarking-public/tree/main/results/bm-20250116-3.14.0a4+-3893a92

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants