diff --git "a/translation/#114 90 \345\210\206\351\222\237\345\255\246\344\271\240\347\216\260\344\273\243\345\276\256\345\244\204\347\220\206\345\231\250.md" "b/translation/#114 90 \345\210\206\351\222\237\345\255\246\344\271\240\347\216\260\344\273\243\345\276\256\345\244\204\347\220\206\345\231\250.md" index 70ccc55..da94840 100644 --- "a/translation/#114 90 \345\210\206\351\222\237\345\255\246\344\271\240\347\216\260\344\273\243\345\276\256\345\244\204\347\220\206\345\231\250.md" +++ "b/translation/#114 90 \345\210\206\351\222\237\345\255\246\344\271\240\347\216\260\344\273\243\345\276\256\345\244\204\347\220\206\345\231\250.md" @@ -8,8 +8,8 @@ Table of Contents ----------------- 1. [More Than Just Megahertz 不仅仅是兆赫](#morethanjustmegahertz) -2. [Pipelining & Instruction-Level Parallelism ](#pipeliningandinstructionlevelparallelism) -3. [Deeper Pipelines – Superpipelining](#deeperpipelinessuperpipelining) +2. [Pipelining & Instruction-Level Parallelism 流水线技术与指令级并行性](#pipeliningandinstructionlevelparallelism) +3. [Deeper Pipelines – Superpipelining 更深的流水线 —— 超级流水线](#deeperpipelinessuperpipelining) 4. [Multiple Issue – Superscalar](#multipleissuesuperscalar) 5. [Explicit Parallelism – VLIW](#explicitparallelismvliw) 6. [Instruction Dependencies & Latencies](#instructiondependenciesandlatencies) @@ -53,7 +53,6 @@ More Than Just Megahertz The first issue that must be cleared up is the difference between clock speed and a processor's performance. _They are not the same thing._ Look at the results for processors of a few years ago (the late 1990s)... - 第一个必须澄清的问题是时钟速度和处理器性能之间的区别。它们并不是同一回事。看看几年前(90年代末)处理器的表现 | | | SPECINT95 | SPECFP95 | @@ -66,13 +65,17 @@ The first issue that must be cleared up is the difference between clock speed an | 135 MHz | POWER2 | 6.2 | 17.6 | -
Table 1 – Processor performance circa 1997
+
Table 1 – Processor performance circa 1997
+ +
表 1 - 1997 年左右的处理器性能
A 200 MHz MIPS R10000, a 300 MHz UltraSPARC and a 400 MHz Alpha 21164 were all about the same speed at running most programs, yet they differed by a factor of two in clock speed. A 300 MHz Pentium II was also about the same speed for many things, yet it was about half that speed for floating-point code such as scientific number crunching. A PowerPC G3 at that same 300 MHz was somewhat faster than the others for normal integer code, but still far slower than the top three for floating-point. At the other extreme, an IBM POWER2 processor at just 135 MHz matched the 400 MHz Alpha 21164 in floating-point speed, yet was only half as fast for normal integer programs. + 200 MHz 的 MIPS R10000、300 MHz 的 UltraSPARC 和 400 MHz 的 Alpha 21164 处理器在运行大多数程序时,速度大致相同,但它们的时钟速度相差 2 倍。300 MHz 的奔腾 II 处理器在处理很多任务时也是大致相同的速度,但是对于浮点代码,比如科学数值计算,它的速度大约只有一半。同样 300 MHz 的 PowerPC G3 处理器在处理普通的整数代码方面略快于其他处理器,但浮点运算的速度仍比排在前三的处理器慢得多。另一个极端是只有 135 MHz 的 IBM POWER2 处理器:它的浮点运算速度与 400 MHz 的 Alpha 21164 不相上下,但运行普通的整数程序时,速度只有后者的一半。 How can this be? Obviously, there's more to it than just clock speed – it's all about how much work gets done in each clock cycle. Which leads to... + 这怎么可能呢?显然,时钟速度并不是全部,关键在于每个时钟周期完成了多少工作。这就引出了… Pipelining & Instruction-Level Parallelism @@ -82,53 +85,106 @@ Pipelining & Instruction-Level Parallelism Instructions are executed one after the other inside the processor, right? Well, that makes it easy to understand, but that's not really what happens. In fact, that hasn't happened since the middle of the 1980s. Instead, several instructions are all _partially executing_ at the same time. +指令在处理器内部一个接一个地执行,对吗?嗯,这样说很容易理解,但事实并非如此。实际上,自从 20 世纪 80 年代中期以来,这种情况就不再发生了。相反,多条指令是同时_部分执行_的。 + Consider how an instruction is executed – first it is fetched, then decoded, then executed by the appropriate functional unit, and finally the result is written into place. With this scheme, a simple processor might take 4 cycles per instruction (CPI = 4)... 
+考虑一下指令是如何执行的 —— 首先获取指令,然后对其进行解码,接着由适当的功能单元执行,最后将结果写入适当的位置。使用这个方案,一个简单的处理器处理一条指令可能需要 4 个周期(CPI = 4)…… + ![](sequential2.png) Figure 1 – The instruction flow of a sequential processor. +图1 —— 顺序处理器的指令流 + Modern processors overlap these stages in a _pipeline_, like an assembly line. While one instruction is executing, the next instruction is being decoded, and the one after that is being fetched... +现代处理器将这些阶段在_流水线_上重叠,就像生产线一样。当一条指令正在执行时,下一条指令正在被解码,而再下一条指令正在被获取…… + ![](pipelined2.png) Figure 2 – The instruction flow of a pipelined processor. +图2 —— 流水线处理器的指令流 + Now the processor is completing 1 instruction every cycle (CPI = 1). This is a four-fold speedup without changing the clock speed at all. Not bad, huh? +现在,处理器每个周期完成 1 条指令(CPI = 1)。这相当于四倍的加速,而时钟速度完全没有改变。还不错,是吧? + From the hardware point of view, each pipeline stage consists of some combinatorial logic and possibly access to a register set and/or some form of high-speed cache memory. The pipeline stages are separated by latches. A common clock signal synchronizes the latches between each stage, so that all the latches capture the results produced by the pipeline stages at the same time. In effect, the clock "pumps" instructions down the pipeline. +从硬件的角度来看,每个流水线阶段都包含一些组合逻辑,还可能涉及到访问寄存器集和/或某种形式的高速缓存存储器。各流水线阶段之间由锁存器分隔。各阶段之间的锁存器通过公共时钟信号同步,以便所有锁存器能够同时捕获流水线各阶段产生的结果。实际上,时钟扮演着将指令沿流水线向下“泵”送的角色。 + At the beginning of each clock cycle, the data and control information for a partially processed instruction is held in a pipeline latch, and this information forms the inputs to the logic circuits of the next pipeline stage. During the clock cycle, the signals propagate through the combinatorial logic of the stage, producing an output just in time to be captured by the next pipeline latch at the end of the clock cycle... +在每个时钟周期的开始,经过部分处理的指令的相关数据和控制信息被保存在流水线锁存器中。该信息将作为输入进入下一个流水线阶段的逻辑电路。在时钟周期期间,信号在该阶段的组合逻辑中传播,并及时产生输出,以便输出信息能在时钟周期结束时被下一个流水线锁存器捕获…… + ![](pipelinedmicroarch2.png) Figure 3 – A pipelined microarchitecture. 
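
The CPI figures quoted above (4 for the sequential design, 1 for the pipelined one) and the idea of the clock "pumping" instructions from latch to latch can be checked with a small cycle-count model. This is only a sketch under simplifying assumptions that are not part of the original article: exactly four stages, one new instruction fetched per cycle, and no stalls or dependencies. The names `run_sequential` and `run_pipelined` are hypothetical helpers made up for this illustration.

```python
from collections import deque

# Assumed 4-stage pipeline, matching the fetch/decode/execute/writeback
# description in the text.
STAGES = ("fetch", "decode", "execute", "writeback")


def run_sequential(program):
    """Cycles needed when each instruction finishes all stages before the next starts."""
    return len(program) * len(STAGES)


def run_pipelined(program):
    """Clock an idealized in-order pipeline with no stalls.

    Returns (cycles, trace), where trace[c][s] is the instruction that
    occupies stage s during cycle c (None means the stage is empty).
    """
    pending = deque(program)
    in_stage = [None] * len(STAGES)
    cycles = 0
    trace = []
    # Keep clocking while something still has to be fetched or has not yet
    # reached the final (writeback) stage.
    while pending or any(slot is not None for slot in in_stage[:-1]):
        # Clock edge: all latches capture at the same time, so every
        # instruction moves forward by exactly one stage.
        incoming = pending.popleft() if pending else None
        in_stage = [incoming] + in_stage[:-1]
        cycles += 1
        trace.append(tuple(in_stage))
    return cycles, trace


if __name__ == "__main__":
    program = [f"i{n}" for n in range(1000)]
    seq = run_sequential(program)
    pipe, _ = run_pipelined(program)
    print(f"sequential: {seq} cycles, CPI = {seq / len(program):.3f}")
    print(f"pipelined:  {pipe} cycles, CPI = {pipe / len(program):.3f}")
```

For 1000 instructions the model gives 4000 cycles sequentially and 1003 cycles pipelined: the first instruction still needs four cycles to travel through the pipeline (latency), but after that one instruction completes every cycle, so the speedup approaches the four-fold figure as the program gets longer.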
+图3 —— 流水线微架构 + Since the result from each instruction is available after the execute stage has completed, the next instruction ought to be able to use that value immediately, rather than waiting for that result to be committed to its destination register in the writeback stage. To allow this, forwarding lines called _bypasses_ are added, going backwards along the pipeline... +因为每条指令的结果在执行阶段完成之后就已经可用,所以下一条指令应该能够立即使用该值,而无需等待该结果在回写阶段被提交至目标寄存器。为了实现这一点,架构中加入了称为_旁路_的转发线路,它们沿着流水线反向延伸…… ![](pipelinedbypasses2.png) Figure 4 – A pipelined microarchitecture with bypasses. +图4 —— 带旁路的流水线微架构 + Although the pipeline stages look simple, it is important to remember the _execute_ stage in particular is really made up of several different groups of logic (several sets of gates), making up different _functional units_ for each type of operation the processor must be able to perform... +虽然流水线各阶段看起来很简单,但重要的是要记住:尤其是_执行_阶段,实际上是由几个不同的逻辑组(几组逻辑门)组成的,它们构成了不同的_功能单元_,分别对应处理器必须能够执行的每一类操作...... + ![](pipelinedfunctionalunits2.png) Figure 5 – A pipelined microarchitecture in more detail. +图5 —— 更详细的流水线微架构 + The early RISC processors, such as IBM's 801 research prototype, the MIPS R2000 (based on the Stanford MIPS machine) and the original SPARC (derived from the Berkeley RISC project), all implemented a simple 5-stage pipeline not unlike the one shown above. At the same time, the mainstream 80386, 68030 and VAX CISC processors worked largely sequentially – it's much easier to pipeline a RISC because its _reduced instruction set_ means the instructions are mostly simple register-to-register operations, unlike the complex instruction sets of x86, 68k or VAX. As a result, a pipelined SPARC running at 20 MHz was way faster than a sequential 386 running at 33 MHz. Every processor since then has been pipelined, at least to some extent. A good summary of the original RISC research projects can be found in the [1985 CACM article](http://dl.acm.org/citation.cfm?id=214917) by David Patterson. 
+早期的 RISC 处理器,如 IBM 的 801 研究原型、MIPS R2000(基于斯坦福 MIPS 机器)和最初的 SPARC(源自伯克利 RISC 项目),都实现了与上图类似的简单 5 级流水线。同一时期,主流的 80386、68030 和 VAX 等 CISC 处理器基本上是顺序工作的 —— 将 RISC 流水线化要容易得多,因为其_精简的指令集_意味着大多数指令都是简单的寄存器到寄存器的操作,而不像 x86、68k 或 VAX 那样拥有复杂的指令集。因此,20 MHz 的流水线 SPARC 比 33 MHz 的顺序 386 运行速度快得多。从那时起,每个处理器都被流水线化了,至少在某种程度上是如此。在 1985 年由 David Patterson 撰写的 [CACM 文章](http://dl.acm.org/citation.cfm?id=214917) 中可以找到对最初的 RISC 研究项目的一个很好的总结。 + + Deeper Pipelines – Superpipelining ---------------------------------- - +更深的流水线 —— 超级流水线 +---------------------------------- Since the clock speed is limited by (among other things) the length of the longest, slowest stage in the pipeline, the logic gates that make up each stage can be _subdivided_, especially the longer ones, converting the pipeline into a deeper super-pipeline with a larger number of shorter stages. Then the whole processor can be run at a _higher clock speed!_ Of course, each instruction will now take more cycles to complete (latency), but the processor will still be completing 1 instruction per cycle (throughput), and there will be more cycles per second, so the processor will complete more instructions per second (actual performance)... +鉴于时钟速度受到(除了其他原因之外)流水线中最长、最慢的阶段的长度的限制,组成每个阶段的逻辑门可以被_细分_,尤其是那些较长的阶段,从而将流水线转换为具有更多、更短阶段的更深的超级流水线。这样,整个处理器就能够以_更高的时钟速度_运行!当然,每条指令现在需要更多的周期才能完成(延迟),但是处理器仍然每周期完成一条指令(吞吐量),并且每秒会有更多的周期,所以处理器每秒将完成更多的指令(实际性能)…… + ![](superpipelined2.png) Figure 6 – The instruction flow of a superpipelined processor. +图6 —— 超级流水线处理器指令流 + The Alpha architects in particular liked this idea, which is why the early Alphas had deep pipelines and ran at such high clock speeds for their era. Today, modern processors strive to keep the number of gate delays down to just a handful for each pipeline stage, about 12-25 gates deep (not total!) plus another 3-5 for the latch itself, and most have quite deep pipelines... 
+Alpha 的架构师特别喜欢这个想法。这就是为什么早期的 Alpha 拥有很深的流水线,并在它们那个时代有如此高的时钟速度。如今,现代处理器努力将每个流水线阶段的门延迟控制在很少的几级,大约 12-25 级门深(不是门的总数!),再加上锁存器本身的 3-5 级。大部分处理器都有相当深的流水线… + +| 流水线深度 | 处理器 | +| :-----: | :----: | +| 6 | UltraSPARC T1 | +| 7 | PowerPC G4e | +| 8 | UltraSPARC T2/T3, Cortex-A9 | +| 10 | Athlon, Scorpion | +| 11 | Krait | +| 12 | Pentium Pro/II/III, Athlon 64/Phenom, Apple A6 | +| 13 | Denver | +| 14 | UltraSPARC III/IV, Core 2, Apple A7/A8 | +| 14/19 | Core i*2/i*3 Sandy/Ivy Bridge, Core i*4/i*5 Haswell/Broadwell | +| 15 | Cortex-A15/A57 | +| 16 | PowerPC G5, Core i*1 Nehalem | +| 18 | Bulldozer/Piledriver, Steamroller | +| 20 | Pentium 4 | +| 31 | Pentium 4E Prescott | + + Pipeline Depth Processors