Writeback conflicts #9

Open
rygorous opened this issue Aug 3, 2021 · 0 comments
Labels
enhancement New feature or request

Comments

@rygorous
Contributor

rygorous commented Aug 3, 2021

This is a case where uiCA predictions for SKL seem to be pretty far off. Pretty much all tools I know of get this one wrong, despite it only using reg-reg operations.

Test case: https://bit.ly/3jlvOOJ

uiCA predicts a throughput of 4c/iteration; the throughput actually observed on a Skylake laptop (i7-6560U) is 6c/iteration. If you take out one instruction on the non-PSADBW critical path (say, comment out the paddd xmm2, xmm3), this does run at 4c/iteration on real HW, and uiCA agrees.

The actual computation here is nonsense; I was just trying to come up with a small repro.

The case this sets up is two vector instructions with different latencies on the same port (p5 in this case) that would have to finish in the same cycle. They can't: the vector RF and bypass network can accept one result per port per cycle, no more, as far as I know. I do not know what the exact criteria are, nor why the penalty here is two cycles and not one. I do not know how often this occurs in practice, but I do know that I have hit cases in the past where this seems to be a factor.
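
To make the collision condition concrete, here is a toy sketch in Python. It is not how uiCA or the hardware works; it only flags the situation described above, where two uops issued to the same port would deliver their results in the same cycle. The uop names, issue cycles, and latencies are hypothetical and are not taken from the linked test case, and the sketch does not attempt to model the size of the penalty, since the exact criteria are unknown.

```python
from collections import defaultdict


def writeback_conflicts(uops):
    """Return {(port, cycle): [names]} for every port/cycle pair in which
    more than one result would have to be written back.

    Each uop is a tuple (name, port, issue_cycle, latency). The result of a
    uop is assumed to be due in cycle issue_cycle + latency.
    """
    finishing = defaultdict(list)
    for name, port, issue_cycle, latency in uops:
        finishing[(port, issue_cycle + latency)].append(name)
    # Keep only the slots that more than one uop wants to use.
    return {slot: names for slot, names in finishing.items() if len(names) > 1}


# Hypothetical schedule (assumption, for illustration only).
schedule = [
    ("lat3_op", 5, 0, 3),   # issued to p5 in cycle 0, result due in cycle 3
    ("lat1_op", 5, 2, 1),   # issued to p5 in cycle 2, result also due in cycle 3
    ("other_op", 0, 2, 1),  # same finish cycle but on p0, so no clash
]

conflicts = writeback_conflicts(schedule)
print(conflicts)  # {(5, 3): ['lat3_op', 'lat1_op']}
```

A simulator that wanted to account for this would presumably have to delay one of the two colliding uops, but by how much, and under which conditions, is exactly the open question here.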

@andreas-abel andreas-abel added the enhancement New feature or request label Aug 3, 2021