Issue Description
In attention_partial::launcher::wait_for_kv, the code polls the Bar counters for the K and V blocks written by the preceding RMS_QKV_MatVecRopeAppend op:
    // K
    while (Bar[{layer_idx,
                OPCODE_RMS_QKV_MatVecRopeAppend - 1,
                num_attention_heads + kv_head_idx}] < 4) { … }
    // V
    while (Bar[{layer_idx,
                OPCODE_RMS_QKV_MatVecRopeAppend - 1,
                num_attention_heads + num_kv_heads + kv_head_idx}] < 4) { … }
There is no analogous polling for the corresponding Q block(s).
Because Q, K, and V are all produced by the same upstream opcode, I’d expect PartialAttention to wait on all three. What ensures that Q is already available (or is otherwise not required) when PartialAttention begins?
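To make the question concrete, here is a minimal sketch of the barrier slot layout I am inferring from the two checks above. The K and V offsets come straight from the snippet. Everything about Q is an assumption on my part: that Q heads occupy slots [0, num_attention_heads), and that under grouped-query attention the Q heads sharing a given kv_head_idx are contiguous. HeadConfig and the helper names are hypothetical and do not exist in the repository.

```cpp
#include <cstddef>

// Hypothetical helper type; not taken from the repository.
struct HeadConfig {
    std::size_t num_attention_heads;  // number of Q heads
    std::size_t num_kv_heads;         // number of K (and V) heads
};

// Slot polled for K in wait_for_kv (offset taken from the snippet above).
constexpr std::size_t k_slot(HeadConfig c, std::size_t kv_head_idx) {
    return c.num_attention_heads + kv_head_idx;
}

// Slot polled for V in wait_for_kv (offset taken from the snippet above).
constexpr std::size_t v_slot(HeadConfig c, std::size_t kv_head_idx) {
    return c.num_attention_heads + c.num_kv_heads + kv_head_idx;
}

// ASSUMPTION: Q heads live in slots [0, num_attention_heads) and the Q heads
// that share one KV head are contiguous. This is a guess about the layout.
constexpr std::size_t q_slot_begin(HeadConfig c, std::size_t kv_head_idx) {
    return kv_head_idx * (c.num_attention_heads / c.num_kv_heads);
}

constexpr std::size_t q_slot_end(HeadConfig c, std::size_t kv_head_idx) {
    return q_slot_begin(c, kv_head_idx) +
           c.num_attention_heads / c.num_kv_heads;
}

// Worked example with 32 Q heads and 8 KV heads, kv_head_idx = 3.
static_assert(k_slot({32, 8}, 3) == 35, "K slot follows the Q range");
static_assert(v_slot({32, 8}, 3) == 43, "V slot follows the K range");
static_assert(q_slot_begin({32, 8}, 3) == 12, "first Q slot for KV head 3");
static_assert(q_slot_end({32, 8}, 3) == 16, "one past the last Q slot");
```

If that layout is right, a Q check analogous to the K and V loops would poll the slots in [q_slot_begin, q_slot_end) for the same layer_idx and opcode index. I don't know whether the same threshold of 4 would apply to Q, or whether the barrier is actually indexed per head at all, so please correct the sketch if the real layout differs.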