diff --git "a/translation/#114 90 \345\210\206\351\222\237\345\255\246\344\271\240\347\216\260\344\273\243\345\276\256\345\244\204\347\220\206\345\231\250.md" "b/translation/#114 90 \345\210\206\351\222\237\345\255\246\344\271\240\347\216\260\344\273\243\345\276\256\345\244\204\347\220\206\345\231\250.md" new file mode 100644 index 0000000..0340c0e --- /dev/null +++ "b/translation/#114 90 \345\210\206\351\222\237\345\255\246\344\271\240\347\216\260\344\273\243\345\276\256\345\244\204\347\220\206\345\231\250.md" @@ -0,0 +1,412 @@ +# Modern Microprocessors —— A 90-Minute Guide! +# 现代微处理器 之 90分钟学习指南 (上) + + +Table of Contents +----------------- +目录 +----------------- + +1. [More Than Just Megahertz 不仅仅是兆赫](#morethanjustmegahertz) +2. [Pipelining & Instruction-Level Parallelism 流水线技术与指令级并行性](#pipeliningandinstructionlevelparallelism) +3. [Deeper Pipelines – Superpipelining 更深的管道 - 超级流水线](#deeperpipelinessuperpipelining) +4. [Multiple Issue – Superscalar 多发技术 - 超标量体系结构](#multipleissuesuperscalar) +5. [Explicit Parallelism – VLIW 显式并行 - VLIW](#explicitparallelismvliw) +6. [Instruction Dependencies & Latencies 指令依赖与延迟](#instructiondependenciesandlatencies) +7. [Branches & Branch Prediction 分支与分支预测](#branchesandbranchprediction) + +* * * + +WARNING: This article is meant to be informal and fun! + +Okay, so you're a CS graduate and you did a hardware course as part of your degree, but perhaps that was a few years ago now and you haven't really kept up with the details of processor designs since then. + +In particular, you might not be aware of some key topics that developed rapidly in recent times... + +* pipelining (superscalar, OOO, VLIW, branch prediction, predication) +* multi-core and simultaneous multi-threading (SMT, hyper-threading) +* SIMD vector instructions (MMX/SSE/AVX, AltiVec, NEON) +* caches and the memory hierarchy + +Fear not! This article will get you up to speed _fast_. 
In no time, you'll be discussing the finer points of in-order vs out-of-order, hyper-threading, multi-core and cache organization like a pro. + +But be prepared – this article is brief and to-the-point. It pulls no punches and the pace is pretty fierce (really). Let's get into it... + +注意:这将是一篇有趣的科普文章 + +也许你曾经是计算机系的毕业生,并且学习过硬件的相关课程,然而几年过去了,你可能并没有保持对处理器设计细节的关注。 + +特别是以下这些最近飞速发展的关键话题,也许你并不了解: +* 流水线技术 (超标量体系,乱序执行,极长指令字,分支预测,预测) +* 多核技术和同步多线程 (SMT,超线程) +* SIMD 矢量指令 (MMX/SSE/AVX, AltiVec, NEON) +* 缓存和内存体系 + +不用担心!本文会帮你*迅速*赶上。很快的,你将可以像专家一样谈论各种技术细节,例如:有序与无序、超线程、多核处理器、缓存操作等 + +请做好准备 - 这是一篇简明扼要的文章。它将是毫无保留的,快节奏的。让我们开始吧…… + + +More Than Just Megahertz +------------------------ +不仅仅是兆赫 +------------------------ + +The first issue that must be cleared up is the difference between clock speed and a processor's performance. _They are not the same thing._ Look at the results for processors of a few years ago (the late 1990s)... + +第一个必须澄清的问题是时钟速度和处理器性能之间的区别。*它们并不是同一回事。* 看看几年前(90年代末)处理器的表现 + +| | | *SPECint95* | *SPECfp95* | +| :-----: | :----: | :----: | :----: | +| 195 MHz | MIPS R10000 | 11.0 | 17.0| +| 400 MHz | Alpha 21164 | 12.3 | 17.2| +| 300 MHz | UltraSPARC | 12.1 | 15.5| +| 300 MHz | Pentium II | 11.6 | 8.8| +| 300 MHz | PowerPC G3 | 14.8 | 11.4| +| 135 MHz | POWER2 | 6.2 | 17.6 | + + +Table 1 – Processor performance circa 1997 + + +表 1 —— 1997 年左右的处理器性能 + +A 200 MHz MIPS R10000, a 300 MHz UltraSPARC and a 400 MHz Alpha 21164 were all about the same speed at running most programs, yet they differed by a factor of two in clock speed. A 300 MHz Pentium II was also about the same speed for many things, yet it was about half that speed for floating-point code such as scientific number crunching. A PowerPC G3 at that same 300 MHz was somewhat faster than the others for normal integer code, but still far slower than the top three for floating-point. 
At the other extreme, an IBM POWER2 processor at just 135 MHz matched the 400 MHz Alpha 21164 in floating-point speed, yet was only half as fast for normal integer programs. + +200 MHz 的 MIPS R10000、300 MHz 的 UltraSPARC 和 400 MHz 的 Alpha 21164 处理器在运行大多数程序时,速度大致相同。但它们在时钟速度上相差2倍。300 MHz 的奔腾 II 处理器在处理大部分事情上也是大致相同的速度。但是对于浮点代码,比如科学数字运算,它的速度大约只有一半。同样 300 MHz 的 PowerPC G3 处理器在处理正常的整数代码方面略快于其他处理器。但浮点运算的速度,仍比排在前三的处理器慢得多。另一个极端是只有 135 MHz 的 IBM POWER2 处理器。它拥有与 400 MHz 的 Alpha 21164 不相上下的浮点运算速度。但运行普通的整数程序,速度只有一半。 + +How can this be? Obviously, there's more to it than just clock speed – it's all about how much work gets done in each clock cycle. Which leads to... + +这怎么可能呢?显然,时钟速度并不是全部,关键在于每个时钟周期完成了多少工作。这就引出了…… + +Pipelining & Instruction-Level Parallelism +------------------------------------------ +流水线技术(Pipelining)与指令级并行性 +------------------------------------------ + +Instructions are executed one after the other inside the processor, right? Well, that makes it easy to understand, but that's not really what happens. In fact, that hasn't happened since the middle of the 1980s. Instead, several instructions are all _partially executing_ at the same time. + +指令在处理器内部一个接一个地执行,对吗?嗯,这很容易理解,但事实并非如此。实际上,自从20世纪80年代中期以来,这种情况就不再发生了。相反,多条指令同时处于*部分执行*的状态。 + +Consider how an instruction is executed – first it is fetched, then decoded, then executed by the appropriate functional unit, and finally the result is written into place. With this scheme, a simple processor might take 4 cycles per instruction (CPI = 4)... + +考虑一下指令是如何执行的 —— 首先获取指令,然后对其进行解码,接着由适当的功能单元执行,最后将结果写入适当的位置。使用这个方案,一个简单的处理器处理一个指令可能需要 4 个周期(CPI = 4)…… + +![](http://www.lighterra.com/papers/modernmicroprocessors/sequential2.png) + +Figure 1 – The instruction flow of a sequential processor. + +图1 —— 顺序处理器的指令流 + +Modern processors overlap these stages in a _pipeline_, like an assembly line. While one instruction is executing, the next instruction is being decoded, and the one after that is being fetched... 
+ +现代处理器将这些阶段在*流水线*上重叠,就像生产线一样。当一条指令正在执行时,下一条指令正在被解码,而后一条指令正在被获取…… + +![](http://www.lighterra.com/papers/modernmicroprocessors/pipelined2.png) + +Figure 2 – The instruction flow of a pipelined processor. + +图2 —— 流水线处理器的指令流 + +Now the processor is completing 1 instruction every cycle (CPI = 1). This is a four-fold speedup without changing the clock speed at all. Not bad, huh? + +现在,处理器每个周期完成 1 条指令(CPI = 1)。这相当于四倍的加速,而完全不需要改变时钟速度。还不错,是吧? + +From the hardware point of view, each pipeline stage consists of some combinatorial logic and possibly access to a register set and/or some form of high-speed cache memory. The pipeline stages are separated by latches. A common clock signal synchronizes the latches between each stage, so that all the latches capture the results produced by the pipeline stages at the same time. In effect, the clock "pumps" instructions down the pipeline. + +从硬件的角度来看,每个流水线阶段都包含一些组合逻辑,还可能涉及到访问寄存器集和/或某种形式的高速缓存存储器。各流水线阶段之间由锁存器隔开。各阶段之间的锁存器通过公共时钟信号同步,以便能够同时捕获流水线各阶段产生的结果。实际上,时钟扮演着沿流水线向下“泵”送指令的角色。 + +At the beginning of each clock cycle, the data and control information for a partially processed instruction is held in a pipeline latch, and this information forms the inputs to the logic circuits of the next pipeline stage. During the clock cycle, the signals propagate through the combinatorial logic of the stage, producing an output just in time to be captured by the next pipeline latch at the end of the clock cycle... + +在每个时钟周期的开始,经过部分处理的指令的相关数据和控制信息被保存在流水线锁存器中。该信息将作为输入信息进入下一个流水线阶段的逻辑电路中。在时钟周期期间,信号在该阶段的组合逻辑中传播,并及时产生输出,以便输出信息能在时钟周期结束时,被下一个流水线锁存器捕获…… + +![](http://www.lighterra.com/papers/modernmicroprocessors/pipelinedmicroarch2.png) + +Figure 3 – A pipelined microarchitecture. + +图3 —— 流水线微架构 + +Since the result from each instruction is available after the execute stage has completed, the next instruction ought to be able to use that value immediately, rather than waiting for that result to be committed to its destination register in the writeback stage. 
To allow this, forwarding lines called _bypasses_ are added, going backwards along the pipeline... + +因为每个指令的结果在完成执行阶段之后是可用的,所以下一个指令应该能够立即使用该值,而无需等待该结果在回写阶段被提交至目标寄存器。为了实现这一点,架构中加入了被称为*旁路*的转发线路,它们沿着流水线反向回传结果…… + +![](http://www.lighterra.com/papers/modernmicroprocessors/pipelinedbypasses2.png) + +Figure 4 – A pipelined microarchitecture with bypasses. + +图4 —— 带旁路的流水线微架构 + +Although the pipeline stages look simple, it is important to remember the _execute_ stage in particular is really made up of several different groups of logic (several sets of gates), making up different _functional units_ for each type of operation the processor must be able to perform... + +虽然流水线各阶段看起来很简单,但重要的是要理解: *执行*阶段实际上是由不同的逻辑组(几组逻辑门)组成,它们构成了不同的*功能单元*,分别对应处理器必须能够执行的每一类操作...... + +![](http://www.lighterra.com/papers/modernmicroprocessors/pipelinedfunctionalunits2.png) + +Figure 5 – A pipelined microarchitecture in more detail. + +图5 —— 更详细的流水线微架构 + +The early RISC processors, such as IBM's 801 research prototype, the MIPS R2000 (based on the Stanford MIPS machine) and the original SPARC (derived from the Berkeley RISC project), all implemented a simple 5-stage pipeline not unlike the one shown above. At the same time, the mainstream 80386, 68030 and VAX CISC processors worked largely sequentially – it's much easier to pipeline a RISC because its _reduced instruction set_ means the instructions are mostly simple register-to-register operations, unlike the complex instruction sets of x86, 68k or VAX. As a result, a pipelined SPARC running at 20 MHz was way faster than a sequential 386 running at 33 MHz. Every processor since then has been pipelined, at least to some extent. A good summary of the original RISC research projects can be found in the [1985 CACM article](http://dl.acm.org/citation.cfm?id=214917) by David Patterson. 
+ +早期的 RISC 处理器,如 IBM 的 801 研究原型、MIPS R2000(基于斯坦福 MIPS 机器)和原始的 SPARC (源自伯克利 RISC 项目),实现了简单的,与上图并无不同的 5 级流水线。同时,主流的 80386、68030 和 VAX CISC 处理器基本上是顺序工作的 —— 流水线化 RISC 更容易。因为其*简化的指令集*意味着大多指令都是简单的寄存器到寄存器的操作,而不像 x86、68k 或 VAX,他们拥有复杂的指令集。这使得 20 MHz 的流水线 SPARC 比 33 MHz 的顺序 386 运行速度快得多。从那时起,每个处理器都被流水线化了,至少在某种程度上是如此。1985 年由 David Patterson 撰写的 [CACM 文章](http://dl.acm.org/citation.cfm?id=214917) 对原始 RISC 研究项目进行了不错的总结。 + + +Deeper Pipelines – Superpipelining +---------------------------------- +更深的管道 —— 超级流水线 +---------------------------------- +Since the clock speed is limited by (among other things) the length of the longest, slowest stage in the pipeline, the logic gates that make up each stage can be _subdivided_, especially the longer ones, converting the pipeline into a deeper super-pipeline with a larger number of shorter stages. Then the whole processor can be run at a _higher clock speed!_ Of course, each instruction will now take more cycles to complete (latency), but the processor will still be completing 1 instruction per cycle (throughput), and there will be more cycles per second, so the processor will complete more instructions per second (actual performance)... + +鉴于时钟速度受到(除了其他原因之外)流水线中最长、最慢的阶段的长度限制,组成每个阶段的逻辑门可以被*细分*,尤其是那些较长的逻辑门,从而将流水线转换为具有更多更短阶段的深层超级流水线。这样,整个处理器就能够以*更高的时钟速度*运行!当然,每个指令会需要更多的周期来完成(延迟),但是处理器仍然保持每周期完成一个指令(吞吐量),并且每秒会有更多的周期,所以处理器每秒将完成更多的指令(实际性能)…… + +![](http://www.lighterra.com/papers/modernmicroprocessors/superpipelined2.png) + +Figure 6 – The instruction flow of a superpipelined processor. + +图6 —— 超级流水线处理器指令流 + +The Alpha architects in particular liked this idea, which is why the early Alphas had deep pipelines and ran at such high clock speeds for their era. Today, modern processors strive to keep the number of gate delays down to just a handful for each pipeline stage, about 12-25 gates deep (not total!) plus another 3-5 for the latch itself, and most have quite deep pipelines... 
+ +Alpha 的架构师特别喜欢这个想法。这就是为什么早期的 Alpha 拥有深层管道,并在他们那个时代有如此高的时钟速度。如今,现代处理器努力将各阶段的门延迟保持在少数,大约 12-25 个门深(并非全部!)再加上 3-5 个闩锁本身。大部分处理器都有相当深的管道… + +| Pipeline Depth 流水线深度 | Processors 处理器 | +| :-----: | :----: | +| 6 | UltraSPARC T1 | +| 7 | PowerPC G4e | +| 8 | UltraSPARC T2/T3, Cortex-A9 | +| 10 | Athlon, Scorpion | +| 11 | Krait | +| 12 | Pentium Pro/II/III, Athlon 64/Phenom, Apple A6 | +| 13 | Denver | +| 14 | UltraSPARC III/IV, Core 2, Apple A7/A8 | +| 14/19 | Core i×2/i×3 Sandy/Ivy Bridge, Core i×4/i×5 Haswell/Broadwell | +| 15 | Cortex-A15/A57 | +| 16 | PowerPC G5, Core i×1 Nehalem | +| 18 | Bulldozer/Piledriver, Steamroller | +| 20 | Pentium 4 | +| 31 | Pentium 4E Prescott | + +Table 2 – Pipeline depths of common processors. + +表2 —— 普通处理器的流水线深度 + +The x86 processors generally have deeper pipelines than the RISCs (of comparable era) because they need to do extra work to decode the complex x86 instructions (more on this later). UltraSPARC T1/T2/T3 Niagara are a recent exception to the deep-pipeline trend – just 6 for UltraSPARC T1 and 8 for T2/T3 to keep those cores as small as possible (more on this later, too). + +x86 处理器通常比(同时代)RISC 具有更深的流水线,因为它们需要额外的工作来解码复杂的 x86 指令(稍后将详细介绍)。UltraSPARC T1/T2/T3 Niagara 是最近深层管道趋势的一个例外 —— 为了保持这些核尽可能小,UltraSPARC T1 只有 6 层,T2、T3 只有 8 层(稍后将详细介绍)。 + + +Multiple Issue – Superscalar +---------------------------- +多发技术 - 超标量体系结构 +---------------------------- + +Since the execute stage of the pipeline is really a bunch of different _functional units_, each doing its own task, it seems tempting to try to execute multiple instructions _in parallel_, each in its own functional unit. To do this, the fetch and decode/dispatch stages must be enhanced so they can decode multiple instructions in parallel and send them out to the "execution resources"... 
+ +流水线的执行阶段实际上是一组不同的*功能单元*各自执行自己的任务,由此一个诱人的设想产生了,即多个命令各自在自己的功能单元中*同时*执行。为此,必须增强提取阶段及解码、分派阶段,以便它们能够并行解码多个指令,并将它们发送到“执行资源”…… + +![](http://www.lighterra.com/papers/modernmicroprocessors/superscalarmicroarch2.png) + +Figure 7 – A superscalar microarchitecture. + +图7 —— 超标量微体系结构 + +Of course, now that there are independent pipelines for each functional unit, they can even have different numbers of stages. This allows the simpler instructions to complete more quickly, reducing _latency_ (which we'll get to soon). Since such processors have many different pipeline depths, it's normal to refer to the depth of a processor's pipeline when executing _integer_ instructions, which is usually the shortest of the possible pipeline paths, with the memory and floating-point pipelines implied as having a few additional stages. Thus, a processor with a "10-stage pipeline" would use 10 stages for executing integer instructions, perhaps 12 or 13 stages for memory instructions, and maybe 14 or 15 stages for floating-point. There are also a bunch of bypasses within and between the various pipelines, but these have been left out of the diagram for simplicity. + +当然,既然每个功能单元都有了独立的管道,那么它们甚至可以具有不同的阶段数。这将能使简单的指令更快地完成,从而减少*延迟*(我们将很快讨论这个问题)。由于这些处理器具有许多不同的流水线深度,所以在提到这些处理器的流水线深度时,通常指处理器在执行*整型*指令时的深度,因为整数指令通常是可能的流水线路径中最短的,而内存和浮点流水线则可能有一些附加的阶段。因此,具有 ”10 级流水线“ 的处理器将使用 10 个阶段来执行整型指令,可能用 12 或 13 个阶段用于存储器指令,14 或 15 个阶段用于浮点指令。在各个管道内和管道之间也有一些旁路,但是为了简单起见,在图中省略了这些旁路。 + + +In the above example, the processor could potentially issue 3 different instructions per cycle – for example 1 integer, 1 floating-point and 1 memory instruction. Even more functional units could be added, so that the processor might be able to execute 2 integer instructions per cycle, or 2 floating-point instructions, or whatever the target applications could best use. 
+ +在上面的例子中,处理器可能每个周期发出 3 个不同的指令 —— 例如 1 个整型、1 个浮点型和 1 个内存指令。甚至还可以添加更多的功能单元,以便处理器能够在一个周期内执行 2 个整型指令,或 2 个浮点指令,或者最适合目标应用程序的任何指令。 + + +On a superscalar processor, the instruction flow looks something like... + +在超标量处理器上,指令流看起来像… + + +![](http://www.lighterra.com/papers/modernmicroprocessors/superscalar2.png) + +Figure 8 – The instruction flow of a superscalar processor. +图8 —— 超标量处理器指令流 + +This is great! There are now 3 instructions completing every cycle (CPI = 0.33, or IPC = 3, also written as ILP = 3 for _instruction-level parallelism_). The number of instructions able to be issued, executed or completed per cycle is called a processor's _width_. + +这太棒了!现在每个周期有 3 条指令完成(CPI = 0.33,或 IPC = 3,也可写成 ILP = 3,ILP 代表*指令级并行*)。每个周期能够发出、执行或完成的指令数称为处理器的*宽度*。 + +Note that the issue width is less than the number of functional units – this is typical. There must be more functional units because different code sequences have different mixes of instructions. The idea is to execute 3 instructions per cycle, but those instructions are not always going to be 1 integer, 1 floating-point and 1 memory operation, so more than 3 functional units are required. + +注意,分发宽度小于功能单元的个数 —— 这是很常见的。不同的代码序列具有不同的指令组合,因此需要更多的功能单元。我们希望每个周期可以运行 3 个指令,但是这些指令并不总是一个整数、一个浮点和一个内存操作,所以往往需要多于 3 个的功能单元。 + +The IBM POWER1 processor, the predecessor of PowerPC, was the first mainstream superscalar processor. Most of the RISCs went superscalar soon after (SuperSPARC, Alpha 21064). Intel even managed to build a superscalar x86 – the original Pentium – however the complex x86 instruction set was a real problem for them (more on this later). + +IBM POWER1 处理器,PowerPC 的前身,是第一个主流的超标量处理器。大部分 RISC 也紧随其后进入超标量时代(SuperSPARC, Alpha 21064)。英特尔甚至设法构建了一个超标量 x86 —— 初代奔腾 —— 但是复杂的 x86 指令集对他们来说确实是个问题(稍后将详细介绍)。 + +Of course, there's nothing stopping a processor from having both a deep pipeline and multiple instruction issue, so it can be both superpipelined and superscalar at the same time... 
+ +当然,没有任何事情可以阻止处理器同时具有深流水线和多指令分发,因此处理器可以同时是超流水线和超标量的…… + + +![](http://www.lighterra.com/papers/modernmicroprocessors/superpipelinedsuperscalar2.png) + +Figure 9 – The instruction flow of a superpipelined-superscalar processor. + +图9 —— 超流水线、超标量处理器指令流 + +Today, virtually every processor is a superpipelined-superscalar, so they're just called superscalar for short. Strictly speaking, superpipelining is just pipelining with a deeper pipe anyway. + +如今,实际上每个处理器都是超流水线-超标量的,因此它们被简称为超标量。严格来说,超级流水线只是用更深的管道进行流水线操作。 + +The widths of modern processors vary considerably... + +现代处理器的宽度相差很大…… + +| Issue Width 分发宽度 | Processors 处理器 | +| :-----: | :----: | +| 1 | UltraSPARC T1 | +| 2 | UltraSPARC T2/T3, Scorpion, Cortex-A9 | +| 3 | Pentium Pro/II/III/M, Pentium 4, Krait, Apple A6, Cortex-A15/A57 | +| 4 | UltraSPARC III/IV, PowerPC G4e | +| 4/8 | Bulldozer/Piledriver, Steamroller | +| 5 | PowerPC G5 | +| 6 | Athlon, Athlon 64/Phenom, Core 2, Core i×1 Nehalem, Core i×2/i×3 Sandy/Ivy Bridge, Apple A7/A8 | +| 7 | Denver | +| 8 | Core i×4/i×5 Haswell/Broadwell | + +Table 3 – Issue widths of common processors. + +表3 —— 常见处理器的分发宽度 + +The exact number and type of functional units in each processor depends on its target market. Some processors have more floating-point execution resources (IBM's POWER line), others are more integer-biased (Pentium Pro/II/III/M), some devote much of their resources to SIMD vector instructions (PowerPC G4/G4e), while most try to take the "balanced" middle ground. + +每个处理器中功能单元的确切数量和类别取决于它的目标市场。一些处理器具有更多的浮点处理资源(IBM 的 POWER 系列),有些更偏重整型(Pentium Pro/II/III/M),还有一些则把大量资源投入到 SIMD 矢量指令(PowerPC G4/G4e)中,而大多数处理器则尝试采用“平衡”化的配置。 + +Explicit Parallelism – VLIW +--------------------------- +显式并行 - VLIW +--------------------------- + +In cases where backward compatibility is not an issue, it is possible for the _instruction set_ itself to be designed to _explicitly_ group instructions to be executed in parallel. 
This approach eliminates the need for complex dependency-checking logic in the dispatch stage, which should make the processor easier to design, smaller, and easier to ramp up the clock speed over time (at least in theory). + +在无需考虑向后兼容的情况下,*指令集*本身可以设计成*显式*地将并行执行的指令分组。这种方法消除了在调度阶段对复杂的依赖性进行检查的需要。这应该使处理器更容易设计、更小,并且更容易随着时间推移提高时钟速度(至少在理论上)。 + +In this style of processor, the "instructions" are really _groups_ of little sub-instructions, and thus the instructions themselves are very long, often 128 bits or more, hence the name VLIW – _very long instruction word_. Each instruction contains information for multiple parallel operations. + +在这样的处理器中,“指令”实际上是小的子指令的*集合*,因此指令本身非常长,一般为 128 比特或更多,因此得名为 VLIW —— *超长指令字*。每个指令包含多个并行操作的信息。 + +A VLIW processor's instruction flow is much like a superscalar, except the decode/dispatch stage is much simpler and only occurs for each group of sub-instructions... +VLIW 处理器的指令流非常类似于超标量,只是解码/分派阶段要简单得多,并且每组子指令只进行一次... + +![](http://www.lighterra.com/papers/modernmicroprocessors/vliw2.png) + +Figure 10 – The instruction flow of a VLIW processor. + +图10 —— VLIW 处理器指令流 + +Other than the simplification of the dispatch logic, VLIW processors are much like superscalar processors. This is especially so from a compiler's point of view (more on this later). + +除了调度逻辑的简化之外,VLIW 处理器非常类似于超标量处理器。对于编译器来说,它们尤为相似(稍后详谈)。 + +It is worth noting, however, that most VLIW designs are _not interlocked_. This means they do not check for dependencies between instructions, and often have no way of stalling instructions other than to stall the whole processor on a cache miss. As a result, the compiler needs to insert the appropriate number of cycles between dependent instructions, even if there are no instructions to fill the gap, by using _nops_ (no-operations, pronounced "no ops") if necessary. 
This complicates the compiler somewhat, because it is doing something that a superscalar processor normally does at runtime, however the extra code in the compiler is minimal and it saves precious resources on the processor chip. + +不过值得注意的是,大多数 VLIW 设计是*非互锁*的。这意味着它们不检查指令之间的依赖关系,并且除了在缓存未命中时暂停整个处理器之外,通常没有办法暂停单条指令。因此,编译器需要在互相依赖的指令之间插入适当数量的周期。即使没有可用于填补空隙的指令,也要在必要时插入*空操作*(nop)。这让编译器有些复杂,因为它做的是超标量处理器通常在运行时做的事情。但是编译器中需要的额外代码非常少,并且节省了处理器芯片上的宝贵资源。 + + +No VLIW designs have yet been commercially successful as mainstream CPUs, however Intel's IA-64 architecture, which is still in production in the form of the Itanium processors, was once intended to be the replacement for x86. Intel chose to call IA-64 an "EPIC" design, for "explicitly parallel instruction computing", but it was essentially a VLIW with clever grouping (to allow long-term compatibility) and predication (see below). The programmable shaders in graphics processors (GPUs) are sometimes VLIW designs, as are many digital signal processors (DSPs), and there was also Transmeta (see the x86 section, coming up soon). + +目前还没有任何 VLIW 设计作为主流 CPU 获得商业上的成功。然而英特尔的 IA-64 构架(目前仍然以 Itanium 处理器的形式生产)曾经被计划用来取代 x86。英特尔选择称 IA-64 为 “EPIC” 设计,意为 “显式并行指令计算”。但它本质上是一个具有巧妙分组(以实现长期兼容性)和谓词执行(predication,见下文)的 VLIW。图形处理器(GPU)中的可编程着色器有时是 VLIW 设计,许多数字信号处理器(DSP)也是如此,还有 Transmeta(参见稍后的 x86 部分)。 + +Instruction Dependencies & Latencies +------------------------------------ +指令依赖与延迟 +------------------------------------ + +How far can pipelining and multiple issue be taken? If a 5-stage pipeline is 5 times faster, why not build a 20-stage superpipeline? If 4-issue superscalar is good, why not go for 8-issue? For that matter, why not build a processor with a 50-stage pipeline which issues 20 instructions per cycle? + +流水线和多发技术能走多远?如果 5 级流水线能提升 5 倍的速度,为什么不建造 20 级超级流水线呢?如果 4 并发超标量能有好的效果,为什么不采用 8 并发呢?如此说来,为什么不构建一个拥有 50 级流水线,每个周期处理 20 个指令的处理器呢? + +Well, consider the following two instructions... 
+ +考虑以下两个指令… + + a = b * c; + d = a + 1; + +The second instruction _depends_ on the first – the processor can't execute the second instruction until after the first has completed calculating its result. This is a serious problem, because instructions that depend on each other cannot be executed in parallel. Thus, multiple issue is impossible in this case. + +第二条指令*依赖*于第一条指令 —— 在第一条指令计算出结果之前,处理器无法执行第二条指令。这是一个严重的问题,因为相互依赖的指令不能并行执行。因此,在这种情况下,多发技术并不适用。 + +If the first instruction was a simple integer addition then this might still be okay in a pipelined _single-issue_ processor, because integer addition is quick and the result of the first instruction would be available just in time to feed it back into the next instruction (using bypasses). However in the case of a multiply, which will take several cycles to complete, there is no way the result of the first instruction will be available when the second instruction reaches the execute stage just one cycle later. So, the processor will need to stall the execution of the second instruction until its data is available, inserting a _bubble_ into the pipeline where no work gets done. + +如果第一条指令是简单的整数加法,那么在流水线*单发*处理器中并不会有问题。因为整数加法很快,第一条指令的结果能够及时返回,以供下一条指令使用(采用旁路)。然而,乘法需要几个周期才能完成。当第二条指令在仅一个周期后到达执行阶段时,第一条指令的结果还未返回。因此,处理器需要暂停第二条指令的执行,直到它需要的数据可用。此时,流水线会在没有指令的地方插入一个气泡。 + +It can be confusing when the word "latency" is used for related, but different, meanings. Here, I'm talking about the latency as seen by a compiler. Some hardware engineers may think of latency as the number of cycles required for execution (the number of pipeline stages). So a hardware engineer might say the instructions in a simple integer pipeline have a latency of 5 but a throughput of 1, whereas from a compiler's point of view they have a latency of 1 because their results are available for use in the very next cycle. The compiler view is the more common, and is generally used even in hardware manuals. 
+ +使用“延迟”来表示一个与之相关但略有不同的含义,也许会带来一些困惑。在这里,我指的是编译器中的延迟。一些硬件工程师可能认为延迟是指令执行所需的周期数(流水线阶段数)。因此,硬件工程师可能会说,指令在简单的整型流水线中延迟为 5,吞吐量为 1。而从编译器的角度来看,它们的延迟为 1,因为它们的结果可在下一个周期中使用。从编译器的角度来描述更为常见,甚至在硬件手册中也普遍使用。 + +The number of cycles between when an instruction reaches the execute stage and when its result is available for use by other instructions is called the instruction's _latency_. The deeper the pipeline, the more stages and thus the longer the latency. So a very deep pipeline is not much more effective than a short one, because a deep one just gets filled up with bubbles thanks to all those nasty instructions depending on each other. + +指令的延迟指一条指令从到达执行阶段起至它的结果可供其他指令使用为止所经过的周期数。流水线越深,阶段越多,延迟也就越久。因此,一条很深的流水线并不比一条较短的流水线更有效,因为指令间烦人的依赖关系使得深层流水线之中充满气泡。 + +From a compiler's point of view, typical latencies in modern processors range from a single cycle for integer operations, to around 3-6 cycles for floating-point addition and the same or perhaps slightly longer for multiplication, through to over a dozen cycles for integer division. + +从编译器的角度来看,现代处理器的典型延迟范围一般包括整数运算的单个周期,浮点加法的大约 3 - 6 个周期,乘法大约相同或稍长的周期,以及整数除法的十几个周期。 + +Latencies for memory loads are particularly troublesome, in part because they tend to occur early within code sequences, which makes it difficult to fill their delays with useful instructions, and equally importantly because they are somewhat unpredictable – the load latency varies a lot depending on whether the access is a cache hit or not (we'll get to caches later). + +内存加载的延迟特别麻烦,一部分原因是因为它们往往出现在代码序列的早期,因而很难用有用的指令来填充这段延时。同样重要的是,它们还有些不可预测 —— 负载的延迟很大程度上取决于访问是否为缓存命中,因而变化很大(稍后详谈缓存)。 + + +Branches & Branch Prediction +---------------------------- +分支与分支预测 +---------------------------- + +Another key problem for pipelining is branches. Consider the following code sequence... +流水线的另一个关键问题是分支。例如,以下代码序列: + + if (a > 7) { + b = c; + } else { + b = d; + } + +...which compiles into something like... + +将被编译为: + + cmp a, 7 ; a > 7 ? 
+ ble L1 + mov c, b ; b = c + br L2 + L1: mov d, b ; b = d + L2: ... + +Now consider a pipelined processor executing this code sequence. By the time the conditional branch at line 2 reaches the execute stage in the pipeline, the processor must have already fetched and decoded the next couple of instructions. But _which_ instructions? Should it fetch and decode the _if_ branch (lines 3 and 4) or the _else_ branch (line 5)? It won't really know until the conditional branch gets to the execute stage, but in a deeply pipelined processor that might be several cycles away. And it can't afford to just wait – the processor encounters a branch every six instructions on average, and if it was to wait several cycles at every branch then most of the performance gained by using pipelining in the first place would be lost. + +想象一下流水线处理器处理以上代码序列的过程。当第 2 行的条件分支到达流水线中的执行阶段时,处理器必须已经获取并解码了后面的一些指令。*哪些*指令呢?是应该提取并解码 *if* 分支(第 3 行和第 4 行)呢还是 *else* 分支(第 5 行)呢?这个问题,直到条件分支到达执行阶段,才能真正决定。但是在一个深度流水线的处理器中,这可能已经过去了几个周期。这样等待的代价是不能接受的 —— 处理器平均每六条指令就会遇到一个分支,如果它在每个分支上都等待几个周期,那么大多数情况下,首选流水线的优势将不复存在。 + +So the processor must make a _guess_. The processor will then fetch down the path it guessed and _speculatively_ begin executing those instructions. Of course, it won't be able to actually commit (writeback) those instructions until the outcome of the branch is known. Worse, if the guess is wrong the instructions will have to be cancelled, and those cycles will have been wasted. But if the guess is correct, the processor will be able to continue on at full speed. + +所以处理器必须做出*猜测*, 然后提取它所猜测的路径,并*试探性*地开始执行这些指令。当然,在得到条件分支的结果之前,它将无法实际提交(回写)那些指令。更糟糕的是,如果猜错了分支,那么那些指令就需要被取消,相当于浪费了那些周期。但如果猜测是正确的,处理器就可以保持全速运行。 + +The key question is _how_ the processor should make the guess. Two alternatives spring to mind. First, the _compiler_ might be able to mark the branch to tell the processor which way to go. This is called _static branch prediction_. 
It would be ideal if there was a bit in the instruction format in which to encode the prediction, but for older architectures this is not an option, so a convention can be used instead, such as backward branches are predicted to be taken while forward branches are predicted not-taken. More importantly, however, this approach requires the compiler to be quite smart in order for it to make the correct guess, which is easy for loops but might be difficult for other branches. + +问题的关键在于处理器*如何*进行猜测。有两种方式可供选择:首先,*编译器*或许能够标记分支,以此告诉处理器该选择哪条分支。这称为*静态分支预测*。最理想的是在指令中用 1 比特的数据储存预测结果。但这并不适用于比较古老的体系构架。因此,惯常的做法是把向后分支预测为被采取的分支,把向前分支预测为不采取的分支。然而,更重要的是,这种方法要求编译器非常聪明,以便进行正确的猜测,这对于循环分支来说很容易,但对于其他分支可能很困难。 + +The other alternative is to have the processor make the guess _at runtime_. Normally, this is done by using an on-chip _branch prediction table_ containing the addresses of recent branches and a bit indicating whether each branch was taken or not last time. In reality, most processors actually use two bits, so that a single not-taken occurrence doesn't reverse a generally taken prediction (important for loop back edges). Of course, this dynamic branch prediction table takes up valuable space on the processor chip, but branch prediction is so important that it's well worth it. + +另一种选择是让处理器在*运行时*进行猜测。通常情况下,这是通过芯片上内置的*分支预测表*来完成的。分支预测表中包含最近执行的分支地址,以及 1 比特指示每个分支上次是否被采取的标记。实际上,大多数处理器使用 2 比特,因此单次未被采取的情况不会翻转一个通常被采取的预测(这对于循环回边非常重要)。当然,这个动态的分支预测表占用了处理器芯片上的宝贵空间,但是分支预测非常重要,这是非常值得的。 + +Unfortunately, even the best branch prediction techniques are sometimes wrong, and with a deep pipeline many instructions might need to be cancelled. This is called the _mispredict penalty_. The Pentium Pro/II/III was a good example – it had a 12-stage pipeline and thus a mispredict penalty of 10-15 cycles. 
Even with a clever dynamic branch predictor that correctly predicted an impressive 90% of the time, this high mispredict penalty meant about 30% of the Pentium Pro/II/III's performance was lost due to mispredictions. Put another way, one third of the time the Pentium Pro/II/III was not doing useful work, but instead was saying "oops, wrong way". + +不幸的是,即使是最好的分支预测技术也可能预测错误。对于深层流水线来说,这将造成许多指令的取消。这被称为*预测失误惩罚*。以奔腾 Pro/II/III 为例,它拥有 12 级的流水线,因此预测失误惩罚是 10 - 15 个周期。即使采用正确率 90% 的动态分支预测器,如此高的预测失误惩罚也会造成约 30% 的性能损失。换句话说,三分之一的时间,奔腾 Pro/II/III 没有做有用的工作,而是在做错误的尝试。 + +Modern processors devote ever more hardware to branch prediction in an attempt to raise the prediction accuracy even further, and reduce this cost. Many record each branch's direction not just in isolation, but in the context of the couple of branches leading up to it, which is called a _two-level adaptive_ predictor. Some keep a more global branch history, rather than a separate history for each individual branch, in an attempt to detect any correlations between branches even if they're relatively far away in the code. That's called a _gshare_ or _gselect_ predictor. The most advanced modern processors often implement _several_ branch predictors and select between them based on which one seems to be working best for each individual branch! + +现代处理器投入了越来越多的硬件来进行分支预测,期望可以进一步提高预测的正确率,以便减少这种损失。许多处理器不只孤立地记录每个分支的方向,还结合此前几个分支的走向来记录。这种设计被称为*两级自适应预测器*。还有些处理器为了发现分支之间的相互关联(即使它们在代码中相隔比较远),保存了更全局的分支历史,而不是单独记录每个分支的历史。这叫做 *gshare* 或 *gselect* 预测器。最先进的现代处理器通常内置*多个*分支预测器,并针对每个单独的分支,在它们之间选择表现最好的那个。 + + +Nonetheless, even the very best modern processors with the best, smartest branch predictors only reach a prediction accuracy of about 95%, and still lose quite a lot of performance due to branch mispredictions. 
The bottom line is simple – very deep pipelines naturally suffer from _diminishing returns_, because the deeper the pipeline, the further into the future you must try to predict, the more likely you'll be wrong, and the greater the mispredict penalty when you are. + +尽管如此,即使是拥有最好、最智能的分支预测器的最优秀的现代处理器,也只能达到 95% 左右的预测精度,依然会因为分支预测失误,损失相当多的性能。底线很简单 —— 太深的流水线一般会受到*收益递减*的影响。因为流水线越深,需要预测出的指令就会越多,因而出错的可能性会越大,出错时,预测失误的惩罚也就越大。
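
The diminishing-returns argument above can be checked with a rough back-of-the-envelope model. The sketch below is only an illustration, not a description of any real processor: it takes the branch frequency (one in six instructions), the prediction accuracies (90% and 95%) and a 12-cycle penalty (inside the 10-15 cycle range) from the text, while the base CPI of 0.5 and the 30-cycle penalty for a "much deeper" pipeline are assumptions chosen purely for the example. The function names are hypothetical.

```python
def effective_cpi(base_cpi, branch_freq, mispredict_rate, penalty_cycles):
    """Average cycles per instruction once misprediction stalls are added.

    Each instruction is a branch with probability branch_freq, each branch
    is mispredicted with probability mispredict_rate, and every
    misprediction wastes penalty_cycles cycles.
    """
    return base_cpi + branch_freq * mispredict_rate * penalty_cycles


def fraction_lost(base_cpi, branch_freq, mispredict_rate, penalty_cycles):
    """Share of all cycles spent recovering from mispredicted branches."""
    total = effective_cpi(base_cpi, branch_freq, mispredict_rate, penalty_cycles)
    return (total - base_cpi) / total


if __name__ == "__main__":
    BRANCH_FREQ = 1 / 6  # from the text: a branch every six instructions
    BASE_CPI = 0.5       # ASSUMPTION: two instructions per cycle when nothing stalls

    scenarios = [
        # (label, prediction accuracy, mispredict penalty in cycles)
        ("12-cycle penalty, 90% accurate", 0.90, 12),
        ("12-cycle penalty, 95% accurate", 0.95, 12),
        ("30-cycle penalty, 95% accurate", 0.95, 30),  # ASSUMED deeper pipeline
    ]
    for label, accuracy, penalty in scenarios:
        lost = fraction_lost(BASE_CPI, BRANCH_FREQ, 1 - accuracy, penalty)
        print(f"{label}: {lost:.0%} of cycles lost")
```

With these assumed inputs the first scenario loses a little under 30% of its cycles, which is in line with the Pentium Pro/II/III figure quoted earlier, and the last scenario shows a better predictor being cancelled out by a longer penalty. The result is sensitive to the assumed base CPI, so the numbers are indicative only.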