Update jobbole#114 90 分钟学习现代微处理器.md
white-rabit authored Jul 8, 2019
Instruction Dependencies & Latencies
----------------------------
指令依赖与延迟
----------------------------

How far can pipelining and multiple issue be taken? If a 5-stage pipeline is 5 times faster, why not build a 20-stage superpipeline? If 4-issue superscalar is good, why not go for 8-issue? For that matter, why not build a processor with a 50-stage pipeline which issues 20 instructions per cycle?

流水线和多发射技术能走多远?如果 5 级流水线能把速度提升 5 倍,为什么不构建 20 级超级流水线呢?如果 4 发射超标量效果不错,为什么不采用 8 发射呢?既然如此,为什么不干脆构建一个拥有 50 级流水线、每周期发射 20 条指令的处理器呢?

Well, consider the following two instructions...

好,考虑下面这两条指令…

a = b * c;
d = a + 1;

The second instruction _depends_ on the first – the processor can't execute the second instruction until after the first has completed calculating its result. This is a serious problem, because instructions that depend on each other cannot be executed in parallel. Thus, multiple issue is impossible in this case.

第二条指令*依赖*于第一条指令 —— 在第一条指令计算出结果之前,处理器无法执行第二条指令。这是一个严重的问题,因为相互依赖的指令不能并行执行。因此,在这种情况下无法实现多发射。

If the first instruction was a simple integer addition then this might still be okay in a pipelined _single-issue_ processor, because integer addition is quick and the result of the first instruction would be available just in time to feed it back into the next instruction (using bypasses). However in the case of a multiply, which will take several cycles to complete, there is no way the result of the first instruction will be available when the second instruction reaches the execute stage just one cycle later. So, the processor will need to stall the execution of the second instruction until its data is available, inserting a _bubble_ into the pipeline where no work gets done.

如果第一条指令是简单的整数加法,那么在流水线化的*单发射*处理器中可能还不成问题。因为整数加法很快,第一条指令的结果恰好能及时送回(通过旁路)供下一条指令使用。然而乘法需要好几个周期才能完成,当第二条指令仅一个周期后到达执行阶段时,第一条指令的结果不可能已经就绪。因此,处理器必须暂停第二条指令的执行,直到它所需的数据可用为止,也就是在流水线中插入一个什么都不做的*气泡*。
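
This stall can be sketched with a toy scheduling model. It is an illustration only – the 3-cycle multiply and 1-cycle add latencies below are assumptions, not real hardware values:

这种停顿可以用一个简化的调度模型来示意。这仅作说明 —— 下面假设乘法延迟为 3 个周期、加法为 1 个周期,并非真实硬件数值:

```python
# Toy single-issue pipeline model: one instruction issues per cycle, but an
# instruction stalls until all of its source operands are available.
LATENCY = {"mul": 3, "add": 1}   # assumed latencies, for illustration only

def schedule(program):
    """program: list of (op, dest, sources) tuples.
    Returns ({register: cycle its value is ready}, total bubble cycles)."""
    ready = {}
    issue_cycle = 0
    bubbles = 0
    for op, dest, sources in program:
        operands_ready = max((ready.get(s, 0) for s in sources), default=0)
        stall = max(0, operands_ready - issue_cycle)
        bubbles += stall            # cycles in which no work gets done
        issue_cycle += stall
        ready[dest] = issue_cycle + LATENCY[op]
        issue_cycle += 1
    return ready, bubbles

# a = b * c ; d = a + 1  -- the add must wait for the multiply's result,
# so two bubble cycles appear between the two instructions.
ready, bubbles = schedule([("mul", "a", ["b", "c"]),
                           ("add", "d", ["a"])])
```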

It can be confusing when the word "latency" is used for related, but different, meanings. Here, I'm talking about the latency as seen by a compiler. Some hardware engineers may think of latency as the number of cycles required for execution (the number of pipeline stages). So a hardware engineer might say the instructions in a simple integer pipeline have a latency of 5 but a throughput of 1, whereas from a compiler's point of view they have a latency of 1 because their results are available for use in the very next cycle. The compiler view is the more common, and is generally used even in hardware manuals.

用“延迟”一词表示相关但不同的含义时,可能会造成困惑。这里我说的是编译器所看到的延迟。一些硬件工程师可能把延迟理解为指令执行所需的周期数(即流水线级数)。因此,硬件工程师可能会说,简单整数流水线中的指令延迟为 5、吞吐量为 1;而从编译器的角度来看,它们的延迟为 1,因为其结果在紧接着的下一个周期就可以使用。编译器的视角更为常见,连硬件手册中一般也采用这种说法。

The number of cycles between when an instruction reaches the execute stage and when its result is available for use by other instructions is called the instruction's _latency_. The deeper the pipeline, the more stages and thus the longer the latency. So a very deep pipeline is not much more effective than a short one, because a deep one just gets filled up with bubbles thanks to all those nasty instructions depending on each other.

指令的延迟指一条指令从到达执行阶段起至它的结果可供其他指令使用为止所经过的周期数。流水线越深,阶段越多,延迟也就越久。因此,一条很深的流水线并不比一条较短的流水线更有效,因为指令间烦人的依赖关系使得深层流水线之中充满气泡。

From a compiler's point of view, typical latencies in modern processors range from a single cycle for integer operations, to around 3-6 cycles for floating-point addition and the same or perhaps slightly longer for multiplication, through to over a dozen cycles for integer division.

从编译器的角度来看,现代处理器的典型延迟为:整数运算 1 个周期,浮点加法约 3 - 6 个周期,乘法与之相同或略长,而整数除法则要十几个周期以上。

Latencies for memory loads are particularly troublesome, in part because they tend to occur early within code sequences, which makes it difficult to fill their delays with useful instructions, and equally importantly because they are somewhat unpredictable – the load latency varies a lot depending on whether the access is a cache hit or not (we'll get to caches later).

内存加载的延迟尤其麻烦,部分原因是它们往往出现在代码序列的早期,很难用有用的指令来填充这段延迟;同样重要的是,它们还有些不可预测 —— 加载延迟在很大程度上取决于访问是否命中缓存,因而变化很大(缓存稍后详谈)。
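
As a rough sketch of that unpredictability, the average load latency can be modelled as a weighted sum of the hit and miss cases. The 3-cycle hit and 20-cycle miss figures below are assumptions for illustration, not numbers from the article:

作为粗略的示意,平均加载延迟可以建模为命中与未命中两种情况的加权和。下面 3 周期命中、20 周期未命中的数字仅为假设,并非来自原文:

```python
def avg_load_latency(hit_rate, hit_cycles=3, miss_cycles=20):
    """Expected load latency given a cache hit rate (all numbers assumed)."""
    return hit_rate * hit_cycles + (1 - hit_rate) * miss_cycles

# A small drop in hit rate moves the average a lot, which is part of what
# makes load latency hard for a compiler or scheduler to plan around.
latency_hi = avg_load_latency(0.99)   # close to the hit latency
latency_lo = avg_load_latency(0.90)   # noticeably worse
```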


Branches & Branch Prediction
----------------------------
分支与分支预测
----------------------------

Another key problem for pipelining is branches. Consider the following code sequence...
流水线的另一个关键问题是分支。例如,以下代码序列:


if (a > 7) {
    b = c;
} else {
    b = d;
}

...which compiles into something like...
它会被编译成类似这样的代码…

    cmp a, 7    ; a > 7 ?
    ble L1
    mov c, b    ; b = c
    br L2
L1: mov d, b    ; b = d
L2: ...

Now consider a pipelined processor executing this code sequence. By the time the conditional branch at line 2 reaches the execute stage in the pipeline, the processor must have already fetched and decoded the next couple of instructions. But _which_ instructions? Should it fetch and decode the _if_ branch (lines 3 and 4) or the _else_ branch (line 5)? It won't really know until the conditional branch gets to the execute stage, but in a deeply pipelined processor that might be several cycles away. And it can't afford to just wait – the processor encounters a branch every six instructions on average, and if it was to wait several cycles at every branch then most of the performance gained by using pipelining in the first place would be lost.

想象一下流水线处理器执行以上代码序列的过程。当第 2 行的条件分支到达流水线的执行阶段时,处理器必定已经取出并解码了后面的几条指令。但是是*哪些*指令呢?是取出并解码 *if* 分支(第 3 行和第 4 行)还是 *else* 分支(第 5 行)呢?在条件分支到达执行阶段之前,处理器无法真正知道答案;而在深度流水线的处理器中,这可能还要再等好几个周期。处理器也耗不起这种等待 —— 平均每六条指令就会遇到一个分支,如果在每个分支上都等上几个周期,那么当初采用流水线所获得的性能提升就会损失殆尽。

So the processor must make a _guess_. The processor will then fetch down the path it guessed and _speculatively_ begin executing those instructions. Of course, it won't be able to actually commit (writeback) those instructions until the outcome of the branch is known. Worse, if the guess is wrong the instructions will have to be cancelled, and those cycles will have been wasted. But if the guess is correct, the processor will be able to continue on at full speed.

所以处理器必须做出*猜测*,然后沿着猜测的路径取指,并*推测性*地开始执行这些指令。当然,在分支结果揭晓之前,它无法真正提交(回写)这些指令。更糟的是,如果猜错了,这些指令就必须被取消,那些周期也就浪费掉了。但如果猜对了,处理器就能继续全速运行。

The key question is _how_ the processor should make the guess. Two alternatives spring to mind. First, the _compiler_ might be able to mark the branch to tell the processor which way to go. This is called _static branch prediction_. It would be ideal if there was a bit in the instruction format in which to encode the prediction, but for older architectures this is not an option, so a convention can be used instead, such as backward branches are predicted to be taken while forward branches are predicted not-taken. More importantly, however, this approach requires the compiler to be quite smart in order for it to make the correct guess, which is easy for loops but might be difficult for other branches.

问题的关键在于处理器*如何*进行猜测。有两种方案可供选择。其一,*编译器*或许能够标记分支,告诉处理器该走哪条路。这称为*静态分支预测*。理想的做法是在指令格式中留出 1 比特来编码预测结果,但对较老的体系结构来说这不可行,因此可以改用一种约定,例如把向后分支预测为采取、向前分支预测为不采取。更重要的是,这种方法要求编译器足够聪明才能做出正确的猜测,这对循环来说很容易,但对其他分支可能就很困难了。
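
The backward-taken / forward-not-taken convention mentioned above can be sketched in a few lines (the addresses are hypothetical examples):

上面提到的“向后采取、向前不采取”约定可以用几行代码来示意(其中的地址仅为举例):

```python
def static_predict_taken(branch_pc, target_pc):
    """Static prediction convention: a backward branch (target earlier than
    the branch itself, typical of loop back edges) is predicted taken;
    a forward branch is predicted not-taken."""
    return target_pc < branch_pc

# A loop's closing branch jumps backward, so it is predicted taken:
loop_branch = static_predict_taken(0x1040, 0x1000)
# An if/else skip jumps forward, so it is predicted not-taken:
skip_branch = static_predict_taken(0x1040, 0x1080)
```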

The other alternative is to have the processor make the guess _at runtime_. Normally, this is done by using an on-chip _branch prediction table_ containing the addresses of recent branches and a bit indicating whether each branch was taken or not last time. In reality, most processors actually use two bits, so that a single not-taken occurrence doesn't reverse a generally taken prediction (important for loop back edges). Of course, this dynamic branch prediction table takes up valuable space on the processor chip, but branch prediction is so important that it's well worth it.

另一种选择是让处理器在*运行时*进行猜测。通常这是通过片上的*分支预测表*来实现的,表中记录着最近执行过的分支的地址,以及 1 比特标记,指示每个分支上一次是否被采取。实际上,大多数处理器使用 2 比特,这样单次的未采取不会推翻一个通常为采取的预测(这对循环回边非常重要)。当然,这个动态分支预测表会占用处理器芯片上的宝贵空间,但分支预测实在太重要了,这点代价非常值得。
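
A two-bit saturating counter of the kind described can be sketched as follows; note how the loop's single not-taken exit causes only one misprediction per pass, because one not-taken outcome does not flip a strongly-taken counter:

上文描述的 2 比特饱和计数器可以这样示意;注意循环每一轮只在退出时预测失误一次,因为单次未采取并不会推翻“强采取”状态:

```python
class TwoBitCounter:
    """2-bit saturating counter: states 0-1 predict not-taken, 2-3 taken."""
    def __init__(self, state=2):          # start weakly taken (an assumption)
        self.state = state
    def predict(self):
        return self.state >= 2            # True means "predict taken"
    def update(self, taken):
        self.state = min(3, self.state + 1) if taken else max(0, self.state - 1)

# A loop that runs 3 iterations then exits: taken, taken, taken, not-taken.
history = [True, True, True, False] * 3
counter = TwoBitCounter()
mispredicts = 0
for taken in history:
    if counter.predict() != taken:
        mispredicts += 1
    counter.update(taken)
# Only the loop-exit branch mispredicts: one miss per pass, 3 in total.
```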

Unfortunately, even the best branch prediction techniques are sometimes wrong, and with a deep pipeline many instructions might need to be cancelled. This is called the _mispredict penalty_. The Pentium Pro/II/III was a good example – it had a 12-stage pipeline and thus a mispredict penalty of 10-15 cycles. Even with a clever dynamic branch predictor that correctly predicted an impressive 90% of the time, this high mispredict penalty meant about 30% of the Pentium Pro/II/III's performance was lost due to mispredictions. Put another way, one third of the time the Pentium Pro/II/III was not doing useful work, but instead was saying "oops, wrong way".

不幸的是,即使最好的分支预测技术有时也会出错,而在深度流水线中,这可能导致许多指令被取消。这被称为*预测失误惩罚*。奔腾 Pro/II/III 就是一个很好的例子 —— 它拥有 12 级流水线,预测失误惩罚因此高达 10 - 15 个周期。即使动态分支预测器聪明到能达到 90% 的预测正确率,如此高的预测失误惩罚仍使奔腾 Pro/II/III 约 30% 的性能损失在预测失误上。换句话说,有三分之一的时间,奔腾 Pro/II/III 不是在做有用的工作,而是在说“糟糕,走错路了”。
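
The ~30% figure can be checked with back-of-envelope arithmetic. This is only a sketch: the base CPI of 0.5 assumed below for a 3-issue processor is my assumption, not a number from the text.

约 30% 这个数字可以用简单的算术粗略验证。这只是示意性计算:下面为 3 发射处理器假设的基础 CPI 0.5 是我的假设,并非来自原文。

```python
def mispredict_loss(base_cpi, branch_freq, miss_rate, penalty):
    """Fraction of execution time lost to branch mispredictions."""
    extra = branch_freq * miss_rate * penalty   # extra cycles per instruction
    return extra / (base_cpi + extra)

# One branch per 6 instructions, 10% mispredicted, ~12.5-cycle penalty:
loss = mispredict_loss(base_cpi=0.5, branch_freq=1/6, miss_rate=0.10,
                       penalty=12.5)
# loss comes out near 0.29 -- roughly the 30% quoted above
```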

Modern processors devote ever more hardware to branch prediction in an attempt to raise the prediction accuracy even further, and reduce this cost. Many record each branch's direction not just in isolation, but in the context of the couple of branches leading up to it, which is called a _two-level adaptive_ predictor. Some keep a more global branch history, rather than a separate history for each individual branch, in an attempt to detect any correlations between branches even if they're relatively far away in the code. That's called a _gshare_ or _gselect_ predictor. The most advanced modern processors often implement _several_ branch predictors and select between them based on which one seems to be working best for each individual branch!

现代处理器投入越来越多的硬件用于分支预测,力求进一步提高预测正确率、降低这种代价。许多处理器不只是孤立地记录每个分支的方向,还结合通往该分支的前几个分支的上下文来记录,这称为*两级自适应*预测器。有些处理器保存一份更全局的分支历史,而不是为每个分支单独保存历史,以便发现分支之间的关联,即使它们在代码中相隔较远。这称为 *gshare* 或 *gselect* 预测器。最先进的现代处理器往往实现了*多个*分支预测器,并针对每个分支,选择看起来表现最好的那一个!
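
A gshare-style predictor can be sketched by XORing the branch address with a global history register to index a table of 2-bit counters. The table size, history length, and branch address below are arbitrary illustrative choices:

gshare 风格的预测器可以这样示意:将分支地址与全局历史寄存器异或后,作为 2 比特计数器表的索引。下面的表大小、历史长度和分支地址均为随意选取的示例:

```python
class Gshare:
    """gshare sketch: branch PC XOR global history indexes 2-bit counters."""
    def __init__(self, bits=10):
        self.mask = (1 << bits) - 1
        self.table = [2] * (1 << bits)    # all counters start weakly taken
        self.history = 0                  # recent outcomes, newest in the LSB
    def _index(self, pc):
        return (pc ^ self.history) & self.mask
    def predict(self, pc):
        return self.table[self._index(pc)] >= 2
    def update(self, pc, taken):
        i = self._index(pc)
        self.table[i] = (min(3, self.table[i] + 1) if taken
                         else max(0, self.table[i] - 1))
        self.history = ((self.history << 1) | int(taken)) & self.mask

# A branch that strictly alternates taken/not-taken defeats a lone 2-bit
# counter, but gshare learns it from the history context: after a short
# warm-up it predicts the alternating pattern perfectly.
bp = Gshare()
misses = []
for taken in [True, False] * 100:
    misses.append(bp.predict(0x40) != taken)
    bp.update(0x40, taken)
```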


Nonetheless, even the very best modern processors with the best, smartest branch predictors only reach a prediction accuracy of about 95%, and still lose quite a lot of performance due to branch mispredictions. The bottom line is simple – very deep pipelines naturally suffer from _diminishing returns_, because the deeper the pipeline, the further into the future you must try to predict, the more likely you'll be wrong, and the greater the mispredict penalty when you are.

尽管如此,即便是配备最好、最聪明的分支预测器的最优秀的现代处理器,预测正确率也只能达到 95% 左右,仍然会因为分支预测失误损失相当多的性能。结论很简单 —— 过深的流水线天然受制于*收益递减*:流水线越深,就需要预测越远的将来,出错的可能性就越大,而一旦出错,预测失误的惩罚也就越大。
